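Finding protein motifs with the EM (Expectation Maximization) algorithm
This is a course assignment: given 2000 protein sequences, each of length 50, use the EM algorithm to find the motif they contain (the motif length is assumed to be 7).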
Expectation Maximization (EM) Algorithm
EM is a two-stage iterative process. An initial guess is made as to the location and size of a sequence pattern (a motif or domain) in each sequence in a set of related sequences. These regions are aligned to create a “trial” alignment for the set of sequences. From the trial alignment, the residue composition of each column is calculated and used to create a position-specific scoring matrix (PSSM).
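As a rough illustration of turning a trial alignment into a PSSM, here is a minimal Python sketch. The names `build_pssm`, `AMINO_ACIDS` and `pseudocount` are my own assumptions, not part of the assignment. The sketch stores plain column probabilities rather than log-odds scores, and it assumes the sequences contain only the 20 standard residues.

```python
from collections import Counter

# Assumption: only the 20 standard amino acids occur in the data.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(segments, pseudocount=1.0):
    """Return one dict per alignment column, mapping residue -> probability.

    `segments` is the trial alignment: one equal-length window per sequence.
    The pseudocount keeps unseen residues from getting probability zero.
    """
    width = len(segments[0])
    total = len(segments) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(width):
        counts = Counter(seg[col] for seg in segments)
        pssm.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return pssm

# Hypothetical trial alignment: one guessed length-7 window per sequence.
trial = ["ACDEFGH", "ACDEFGH", "ACDKFGH"]
pssm = build_pssm(trial)
```

With the toy alignment above, column 0 is dominated by `A`, while column 3 splits its weight between `E` and `K`.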
Step 1. Expectation
Using the values in the PSSM, the probability of finding the pattern at every possible position in each sequence is calculated.
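The expectation step can be sketched in Python as follows. It assumes exactly one motif occurrence per sequence and a fixed background distribution for residues outside the motif; both are my assumptions, since the text does not state them. The function name `position_posteriors` and the two-letter toy alphabet are illustrative only. For the assignment data (length 50, motif length 7) each sequence would have 44 candidate start positions.

```python
def position_posteriors(seq, pssm, background):
    """Return P(motif starts at position i | seq) for every valid start i.

    For each window, the motif probability is divided by the background
    probability, and the ratios are normalised so they sum to 1.
    """
    width = len(pssm)
    ratios = []
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for j in range(width):
            aa = seq[start + j]
            ratio *= pssm[j][aa] / background[aa]
        ratios.append(ratio)
    total = sum(ratios)
    return [r / total for r in ratios]

# Toy example over a two-letter alphabet with a motif of width 2.
toy_pssm = [{"A": 0.9, "C": 0.1}, {"A": 0.1, "C": 0.9}]
toy_background = {"A": 0.5, "C": 0.5}
post = position_posteriors("AACA", toy_pssm, toy_background)
```

Here the window `AC` starting at index 1 matches the toy PSSM best, so it receives most of the probability mass.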
Step 2. Maximization
The probabilities from step 1 are used to weight the values in the PSSM, essentially providing new information about the likely location of the pattern in each sequence. The values in the PSSM are updated using these weights.
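One way to write the maximization step in Python: every candidate window adds its residues to the column counts, weighted by the probability from step 1. The name `update_pssm`, the pseudocount and the toy inputs are assumptions for illustration.

```python
def update_pssm(seqs, posteriors, width, alphabet, pseudocount=1.0):
    """Re-estimate column probabilities from posterior-weighted counts.

    `posteriors[k][i]` is the probability that the motif in `seqs[k]`
    starts at position i.
    """
    counts = [{aa: pseudocount for aa in alphabet} for _ in range(width)]
    for seq, post in zip(seqs, posteriors):
        for start, weight in enumerate(post):
            for j in range(width):
                counts[j][seq[start + j]] += weight
    pssm = []
    for col in counts:
        total = sum(col.values())
        pssm.append({aa: c / total for aa, c in col.items()})
    return pssm

# Toy example: all the weight sits on the window "AC" starting at index 1.
new_pssm = update_pssm(["AACA"], [[0.0, 1.0, 0.0]], width=2, alphabet="AC")
```

In the toy call, column 0 shifts toward `A` and column 1 toward `C`, because the only weighted window is `AC`.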
Steps 1 and 2 are repeated until the values in the PSSM don’t change with continued iterations.
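Putting the pieces together, the whole loop might look like the Python sketch below. It is a simplified illustration under several assumptions of mine: one motif occurrence per sequence, a uniform background, a random initial start position in each sequence, and a stopping threshold `tol` on the largest change in any PSSM entry. All function and parameter names are hypothetical. EM only finds a local optimum, so in practice one would try several seeds.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: standard residues only

def e_step(seqs, pssm, background):
    """Posterior probability of each start position in each sequence."""
    width = len(pssm)
    posteriors = []
    for seq in seqs:
        ratios = []
        for start in range(len(seq) - width + 1):
            ratio = 1.0
            for j in range(width):
                aa = seq[start + j]
                ratio *= pssm[j][aa] / background[aa]
            ratios.append(ratio)
        total = sum(ratios)
        posteriors.append([r / total for r in ratios])
    return posteriors

def m_step(seqs, posteriors, width, pseudocount):
    """Column probabilities from posterior-weighted residue counts."""
    counts = [{aa: pseudocount for aa in AMINO_ACIDS} for _ in range(width)]
    for seq, post in zip(seqs, posteriors):
        for start, weight in enumerate(post):
            for j in range(width):
                counts[j][seq[start + j]] += weight
    pssm = []
    for col in counts:
        total = sum(col.values())
        pssm.append({aa: c / total for aa, c in col.items()})
    return pssm

def run_em(seqs, width=7, max_iter=200, tol=1e-6, pseudocount=1.0, seed=0):
    rng = random.Random(seed)
    background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    # Initial guess: one random start per sequence, encoded as 0/1 weights.
    posteriors = []
    for seq in seqs:
        n_starts = len(seq) - width + 1
        guess = rng.randrange(n_starts)
        posteriors.append([1.0 if s == guess else 0.0 for s in range(n_starts)])
    pssm = m_step(seqs, posteriors, width, pseudocount)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        posteriors = e_step(seqs, pssm, background)
        new_pssm = m_step(seqs, posteriors, width, pseudocount)
        change = max(abs(new_pssm[j][aa] - pssm[j][aa])
                     for j in range(width) for aa in AMINO_ACIDS)
        pssm = new_pssm
        if change < tol:  # PSSM no longer changes: stop iterating
            break
    return pssm, posteriors, iteration

# Degenerate toy input: every sequence is exactly one length-7 window.
pssm, posteriors, n_iter = run_em(["ACDEFGH"] * 5, width=7)
```

After convergence, the most likely motif position in sequence `k` is the index of the largest value in `posteriors[k]`. Pure Python will be slow on 2000 sequences of length 50; vectorising the two steps is a natural next step.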