Peking University Bioinformatics, Week 3: Sequence Databases and BLAST
Sequence databases
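- GenBank is the DNA sequence database built by the US National Center for Biotechnology Information (NCBI). It collects sequence data from public sources.
- The SRA (Sequence Read Archive) database stores raw data from next-generation sequencing platforms, including 454, Illumina, SOLiD, IonTorrent, Helicos, and Complete Genomics. Besides raw sequence data, SRA now also stores alignment information of raw reads against a reference genome.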
A first look at the BLAST algorithm
BLAST Ideas: Seeding-and-extending
- Find matches (seeds) between the query and the subject.
- Extend each seed into a High Scoring Segment Pair (HSP).
  - Run the Smith-Waterman algorithm on the specified region only.
- Assess the reliability of the alignment.
Seeding:
For a given word length w (usually 3 for proteins and 11 for nucleotides), slice the query sequence into multiple continuous "seed words".
Speedup: Index the database
The database is pre-indexed so that all positions of a given seed in the database can be located quickly.
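The pre-indexing idea can be sketched as a dictionary from each length-w word to the positions where it occurs. This is a minimal illustration, not how any particular BLAST release stores its index; the function name `build_word_index` and the toy sequences are assumptions made for the example.

```python
from collections import defaultdict

def build_word_index(sequences, w):
    """Map every length-w word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - w + 1):
            index[seq[i:i + w]].append((seq_id, i))
    return index

# Toy database (hypothetical sequences, for illustration only).
database = {"subj1": "ACGTACGTGA", "subj2": "TTACGTA"}
index = build_word_index(database, w=4)

# Looking up a seed is now a single dictionary access.
print(index.get("ACGT", []))  # [('subj1', 0), ('subj1', 4), ('subj2', 2)]
```

Building the index costs one pass over the database, after which each seed lookup no longer depends on the database size.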
E-value: How likely a match is to arise by chance
- The E-value is the number of alignments with a given score that are expected to occur at random in the database that has been searched.
  - e.g. if E = 10, 10 matches with scores this high are expected to be found by chance.
BLAST in detail
Why BLAST?
- "Homology is the central concept for all of biology."
- BLAST is the tool most frequently used for calculating sequence similarity by searching the databases.
- If you work with one or a few proteins or genes, it can tell you about their conservation, active sites, structure, and regulation in other organisms, etc.
What does BLAST do?
- Identity: the occurrence of exactly the same nucleotide or amino acid in the same position in aligned sequences.
- Similarity: a measure of the sameness or difference of the sequences.
- Homology: defined in terms of shared ancestors. Homologous sequences are often similar. Sequence regions that are homologous are also called conserved regions.
Dynamic programming: highly accurate, but slow
FASTA, BLAST: not fully accurate, but fast
0 Filtering
- To prevent the production of large numbers of statistically significant but biologically uninteresting results.
- Low-complexity regions and repeats, e.g.
  - (CA)n
  - KLKLKLKLKLKL
- Cover these regions with the following letters:
  - Ns (for nucleotide residues)
  - Xs (for amino acid residues)
- -F flag: filter the query sequence
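As a rough illustration of masking, the sketch below covers windows with low Shannon entropy with a mask letter. This is a toy stand-in, not the filter BLAST actually applies (BLAST relies on dedicated programs such as DUST for nucleotides and SEG for proteins); the window size, the entropy threshold, and the function names are all assumptions chosen for the example.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (in bits) of the letter composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def mask_low_complexity(seq, window=12, min_entropy=1.5, mask_char="N"):
    """Replace every window whose entropy is below min_entropy with mask_char.

    window and min_entropy are illustrative values, not BLAST defaults.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)

print(mask_low_complexity("CACACACACACA"))                 # NNNNNNNNNNNN
print(mask_low_complexity("KLKLKLKLKLKL", mask_char="X"))  # XXXXXXXXXXXX
print(mask_low_complexity("ACGTACGTACGT"))                 # ACGTACGTACGT
```

The two-letter repeats have an entropy of 1 bit and are masked, while the window that uses all four nucleotides evenly (2 bits) is left alone.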
1 Seeding
- Make a w-letter word list of the query sequence
  - Usually 3 for amino acid sequences, and
  - 11 for nucleotide sequences
- For a query sequence with n letters, the number of words is n - w + 1
- -W flag: word size
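The word list and the n - w + 1 count can be checked with a few lines of Python. The function name `word_list` and the sample peptide are made up for the illustration.

```python
def word_list(query, w):
    """Return all overlapping words of length w, in query order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

query = "PQGEFG"  # hypothetical 6-letter protein fragment
words = word_list(query, w=3)
print(words)       # ['PQG', 'QGE', 'GEF', 'EFG']
print(len(words))  # 4, i.e. n - w + 1 = 6 - 3 + 1
```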
2 Search word hits
- Scoring matrix
  - For amino acids, BLOSUM or PAM
  - For DNA words, a match is scored as +5 and a mismatch as -4, or as +2 and -3
- No gaps are allowed
- The words whose scores are greater than the threshold T remain in the list of possible matching words
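A small sketch of the thresholding step: score every candidate word against a query word without gaps and keep those that reach T. To stay self-contained it uses the +5/-4 DNA scores mentioned above instead of a BLOSUM or PAM matrix, and it treats the threshold as inclusive; both choices, along with the function names, are assumptions made for the illustration.

```python
from itertools import product

def word_score(a, b, match=5, mismatch=-4):
    """Ungapped score of two equal-length words, position by position."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))

def neighborhood(query_word, threshold, alphabet="ACGT"):
    """All words over the alphabet whose score against query_word is >= threshold."""
    hits = {}
    for letters in product(alphabet, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score >= threshold:
            hits[candidate] = score
    return hits

neighbors = neighborhood("ACG", threshold=6)
print(len(neighbors))    # 10: the exact word plus the 9 single-mismatch words
print(neighbors["ACG"])  # 15
print(neighbors["ACT"])  # 6
```

Raising the threshold to 15 would keep only the exact word, which shows how T trades sensitivity for speed. Enumerating every candidate word is only practical for short words, so the example uses w = 3.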
3 Scanning
- Hash table: direct addressing method
- Deterministic finite automaton / finite state machine: much faster
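The hash-table variant of scanning can be sketched with a Python dictionary: store the query words once, then slide along the subject sequence and look up each word. This sketch matches exact words only and does not model the finite-automaton approach; the function names and the toy sequences are assumptions made for the example.

```python
from collections import defaultdict

def build_query_table(query, w):
    """Hash table from each query word to its offsets in the query."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table

def scan_subject(table, subject, w):
    """Slide along the subject and report (query offset, subject offset) word hits."""
    hits = []
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits

query = "GATTACA"
subject = "TTGATTAGG"
table = build_query_table(query, w=4)
hits = scan_subject(table, subject, w=4)
print(hits)  # [(0, 2), (1, 3)]
```

Both hits lie on the same diagonal (subject offset minus query offset is 2), which is the kind of evidence the extension step builds on.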
4 Extending -> HSP
- Cutoff score S
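A minimal sketch of ungapped extension: starting from a word hit, grow the alignment in both directions while tracking the best score, and keep the result as an HSP only if it reaches the cutoff score S. The stopping rule used here (stop once the running score falls more than `x_drop` below the best score seen) and the +5/-4 scores are assumptions for the illustration, as are all names and sequences.

```python
def extend_hit(query, subject, q_pos, s_pos, w, match=5, mismatch=-4, x_drop=10):
    """Extend a word hit without gaps; return (score, query span, subject span).

    Spans are half-open (start, end) index pairs. x_drop is an assumed
    stopping rule, not a value taken from the notes.
    """
    def pair(a, b):
        return match if a == b else mismatch

    best = sum(pair(query[q_pos + k], subject[s_pos + k]) for k in range(w))

    # Extend to the right of the seed.
    running, q_end, s_end = best, q_pos + w, s_pos + w
    i, j = q_end, s_end
    while i < len(query) and j < len(subject):
        running += pair(query[i], subject[j])
        i, j = i + 1, j + 1
        if running > best:
            best, q_end, s_end = running, i, j
        elif best - running > x_drop:
            break

    # Extend to the left of the seed.
    running, q_start, s_start = best, q_pos, s_pos
    i, j = q_pos - 1, s_pos - 1
    while i >= 0 and j >= 0:
        running += pair(query[i], subject[j])
        if running > best:
            best, q_start, s_start = running, i, j
        elif best - running > x_drop:
            break
        i, j = i - 1, j - 1

    return best, (q_start, q_end), (s_start, s_end)

query, subject = "CCGATTACAGG", "TTGATTACATT"
score, q_span, s_span = extend_hit(query, subject, q_pos=2, s_pos=2, w=4)
cutoff_s = 30  # hypothetical cutoff score S
print(score, query[q_span[0]:q_span[1]], score >= cutoff_s)  # 35 GATTACA True
```

The 4-letter seed GATT grows to the 7-letter match GATTACA (7 x 5 = 35), and the flanking mismatches are trimmed because they never improve the best score.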
5 Significance evaluation
- Raw scores: have little meaning without detailed knowledge of the scoring system used.
- Bit scores: obtained by normalizing a raw score with a formula, so that scores from different scoring systems become comparable.
- E values: correspond to a given bit score
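The relationship between the three quantities can be written down using Karlin-Altschul statistics: the bit score is S' = (lambda * S - ln K) / ln 2, and the E-value is E = m * n * 2^(-S'), which equals K * m * n * e^(-lambda * S). Here m is the query length, n is the database length, and lambda and K depend on the scoring system. The sketch below assumes these formulas; the numeric values of lambda, K, and the lengths are illustrative only.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalize a raw score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value_from_bits(bits, query_len, db_len):
    """E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)

def e_value_from_raw(raw_score, query_len, db_len, lam, k):
    """E = K * m * n * exp(-lambda * S), the same quantity from the raw score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Illustrative parameters (assumed, not taken from a specific search).
lam, k = 0.267, 0.041
m, n = 250, 50_000_000

bits = bit_score(100, lam, k)
e_bits = e_value_from_bits(bits, m, n)
e_raw = e_value_from_raw(100, m, n, lam, k)
print(round(bits, 1), e_bits)
```

Both routes give the same E-value, which is why a bit score can be interpreted without knowing which scoring matrix produced it.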