Finding Protein Motifs with the EM (Expectation Maximization) Algorithm


Expectation Maximization (EM) Algorithm
EM is a two-stage iterative process. An initial guess is made as to the location and size of a sequence pattern (a motif or domain) in each sequence in a set of related sequences. These regions are aligned to create a “trial” alignment for the set of sequences. Using the trial alignment, the residue composition of each column in the alignment is first calculated and used to create a PSSM.
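As a rough illustration of how a PSSM can be derived from a trial alignment, the sketch below counts residues per column and converts them to log-odds scores. The function name build_pssm, the pseudocount of 1, the uniform background, and the toy segments are all assumptions made for this example; they are not taken from the text above.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def build_pssm(segments, pseudocount=1.0, background=None):
    """Return one dict per alignment column mapping residue -> log2-odds score.

    Assumes all segments have the same length and use only the 20 standard
    residues. The pseudocount and uniform background are illustrative choices.
    """
    width = len(segments[0])
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    total = len(segments) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(width):
        # Residue composition of this column of the trial alignment.
        counts = Counter(seg[col] for seg in segments)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background[aa])
        pssm.append(column)
    return pssm

# Hypothetical trial alignment of four 7-residue segments.
trial = ["ACDEFGH", "ACDEFGH", "ACDKFGH", "ACDEFGW"]
pssm = build_pssm(trial)
```

Residues that dominate a column get positive scores, and residues never seen in that column get negative scores rather than minus infinity, which is the reason for the pseudocount.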
Step 1. Expectation
Using the values in the PSSM, the probability of finding the pattern at every possible position in each sequence is calculated.
Step 2. Maximization
The probabilities from step 1 are used to weight the values in the PSSM, essentially providing new information about the likely location of the pattern in each sequence. The values in the PSSM are updated using these weights.
Steps 1 and 2 are repeated until the values in the PSSM don’t change with continued iterations.
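The two steps can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: exactly one motif occurrence per sequence, a uniform background, a column-frequency profile in place of a log-odds PSSM, and hypothetical names (e_step, m_step, run_em) and toy sequences that do not come from the text.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency

def e_step(seqs, profile, width):
    """Expectation: probability of the pattern starting at each position."""
    posteriors = []
    for seq in seqs:
        scores = []
        for start in range(len(seq) - width + 1):
            ratio = 1.0
            for col in range(width):
                ratio *= profile[col][seq[start + col]] / BACKGROUND
            scores.append(ratio)
        total = sum(scores)
        posteriors.append([s / total for s in scores])
    return posteriors

def m_step(seqs, posteriors, width, pseudocount=0.1):
    """Maximization: rebuild column frequencies from probability-weighted counts."""
    counts = [{aa: pseudocount for aa in AMINO_ACIDS} for _ in range(width)]
    for seq, post in zip(seqs, posteriors):
        for start, weight in enumerate(post):
            for col in range(width):
                counts[col][seq[start + col]] += weight
    profile = []
    for column in counts:
        total = sum(column.values())
        profile.append({aa: c / total for aa, c in column.items()})
    return profile

def run_em(seqs, init_starts, width, max_iter=200, tol=1e-6):
    # The initial guess is encoded as weight 1 on each guessed start position.
    one_hot = [[1.0 if s == start else 0.0 for s in range(len(seq) - width + 1)]
               for seq, start in zip(seqs, init_starts)]
    profile = m_step(seqs, one_hot, width)
    for _ in range(max_iter):
        posteriors = e_step(seqs, profile, width)
        new_profile = m_step(seqs, posteriors, width)
        change = max(abs(new_profile[c][aa] - profile[c][aa])
                     for c in range(width) for aa in AMINO_ACIDS)
        profile = new_profile
        if change < tol:  # stop once the profile no longer changes
            break
    return profile, posteriors

# Toy data: "WCHK" is planted in every sequence; two initial guesses are wrong.
seqs = ["AAGWCHKLLT", "TWCHKGAPSV", "PLVSEWCHKD", "GDWCHKTTNA"]
profile, posteriors = run_em(seqs, init_starts=[3, 1, 0, 0], width=4)
starts = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
```

Because every position contributes in proportion to its probability, a poor initial guess in some sequences can still be corrected by the signal from the others.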


2. From the 2000 * 7 matrix, generate a 20 * 7 matrix (there are 20 amino acid types) using the following steps:

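3. Use the 20 * 7 weight scoring matrix to score every possible motif position in each protein sequence (for example, with the length-50 sequences in this problem, the motif can start anywhere from position 0 to 43). From each sequence, take the highest-scoring 7-residue segment and assemble these segments into a new 2000 * 7 matrix.
4. Repeat steps 2 and 3 until convergence (as I understand it, this means the 7-residue segments chosen from the proteins are highly similar to one another). The resulting segments are the likely motif sequences.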
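Steps 2 to 4 can be sketched as the loop below. It is a simplified version that keeps only the single best segment per sequence instead of weighting every position. The log-odds scoring, the pseudocount of 1, the uniform background, the function names, and the small toy input (four sequences, width 4 rather than 2000 sequences and width 7) are assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ROW = {aa: i for i, aa in enumerate(AMINO_ACIDS)}  # residue -> matrix row

def weight_matrix(segments, pseudocount=1.0):
    """Build a 20 x width log-odds matrix from N equal-length segments."""
    width = len(segments[0])
    total = len(segments) + pseudocount * len(AMINO_ACIDS)
    counts = [[pseudocount] * width for _ in AMINO_ACIDS]
    for seg in segments:
        for col, aa in enumerate(seg):
            counts[ROW[aa]][col] += 1.0
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    return [[math.log2(c / total / background) for c in row] for row in counts]

def best_start(seq, matrix, width):
    """Score every window (starts 0..len(seq)-width) and return the best start."""
    def score(start):
        return sum(matrix[ROW[seq[start + col]]][col] for col in range(width))
    return max(range(len(seq) - width + 1), key=score)

def find_motif(seqs, starts, width, max_iter=50):
    for _ in range(max_iter):
        segments = [seq[s:s + width] for seq, s in zip(seqs, starts)]
        matrix = weight_matrix(segments)                                 # step 2
        new_starts = [best_start(seq, matrix, width) for seq in seqs]   # step 3
        if new_starts == starts:                                         # step 4
            break
        starts = new_starts
    return starts, [seq[s:s + width] for seq, s in zip(seqs, starts)]

# Toy input: "WCHK" is planted in every sequence; two initial starts are wrong.
seqs = ["AAGWCHKLLT", "TWCHKGAPSV", "PLVSEWCHKD", "GDWCHKTTNA"]
final_starts, final_segments = find_motif(seqs, starts=[3, 1, 0, 0], width=4)
```

Here convergence is taken to mean that the selected start positions stop changing between iterations, which is one possible reading of the stopping rule in step 4. With length-50 sequences and width 7, best_start would scan starts 0 through 43, matching the range given in step 3.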
