Sequence databases

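GenBank is the DNA sequence database maintained by the National Center for Biotechnology Information (NCBI) in the United States; it collects sequence data from public sources.
The SRA (Sequence Read Archive) stores raw data from next-generation sequencing platforms, including 454, Illumina, SOLiD, IonTorrent, Helicos, and CompleteGenomics. In addition to raw sequence data, SRA now also stores alignments of raw reads to a reference genome.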

A first look at the BLAST algorithm

BLAST ideas: seeding and extending

Find matches (seeds) between the query and the subject.
Extend each seed into a High-scoring Segment Pair (HSP).
– Run the Smith-Waterman algorithm on the specified region only.
Assess the reliability of the alignment.

Seeding:
For a given word length w (usually 3 for proteins and 11 for nucleotides), slice the query sequence into multiple contiguous "seed words".
Speedup: index the database.
The database is pre-indexed so that all positions of a given seed in the database can be located quickly.
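
The indexing idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the "database" is a single string, seeds must match exactly, and the names build_index and find_seeds are made up for this example rather than taken from BLAST itself.

```python
from collections import defaultdict

def build_index(subject, w):
    """Map every w-letter word in the subject to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(subject) - w + 1):
        index[subject[i:i + w]].append(i)
    return index

def find_seeds(query, index, w):
    """Return (query_pos, subject_pos) pairs where a query word occurs in the subject."""
    hits = []
    for q in range(len(query) - w + 1):
        # One dictionary lookup replaces a scan of the whole subject.
        for s in index.get(query[q:q + w], []):
            hits.append((q, s))
    return hits

subject = "GATTACAGATTACA"
index = build_index(subject, w=4)
seeds = find_seeds("TTACAG", index, w=4)
print(seeds)  # [(0, 2), (0, 9), (1, 3), (1, 10), (2, 4)]
```

The index is built once and reused for every query, which is where the speedup comes from.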

E-value: how likely a match is to arise by chance
• The number of alignments with a given score (or better) expected to occur at random in the database that was searched.
– e.g. if E = 10, 10 matches with scores this high are expected to be found by chance.
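
For reference, the usual Karlin-Altschul estimate is E = K * m * n * exp(-lambda * S), where m is the query length, n the database length and S the raw score. The sketch below evaluates it; the lambda and K values are illustrative assumptions (they depend on the scoring system), and the function names are invented for this example.

```python
import math

# Illustrative parameters only; real values depend on the scoring matrix and gap costs.
LAMBDA = 0.318
K = 0.134

def expected_hits(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Karlin-Altschul estimate: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

def p_value(e_value):
    """Probability of seeing at least one such hit by chance: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)

e = expected_hits(raw_score=50, query_len=100, db_len=1_000_000)
print(f"E = {e:.3g}, P = {p_value(e):.3g}")
```

Note that E grows linearly with database size, so the same score is less convincing in a larger database.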

Why BLAST?

"Homology is the central concept for all of biology."
BLAST is the tool most frequently used for calculating sequence similarity by searching the databases.
If you work with one or a few proteins or genes, it can tell you about their conservation, active sites, structure, and regulation in other organisms, etc.
What does BLAST do?
Identity: the occurrence of exactly the same nucleotide or amino acid at the same position in aligned sequences.
Similarity: a measure of how alike or different the sequences are.
Homology: defined in terms of shared ancestry. Homologous sequences are often similar. Sequence regions that are homologous are also called conserved regions.
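
Identity is the easiest of these three to compute. A minimal sketch, assuming the two sequences are already aligned (same length, "-" for gaps) and that the denominator is the full alignment length; other conventions for the denominator exist. The function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns holding the same residue in both sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    # A column counts as identical only if both residues match and neither is a gap.
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)

identity = percent_identity("ACGT-A", "ACCTGA")
print(round(identity, 1))  # 66.7
```

Homology, by contrast, is an inference about ancestry and cannot be read off an alignment as a percentage.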
Dynamic programming: highly accurate, but slow.
FASTA, BLAST: heuristic, so not guaranteed to find the optimal alignment, but fast.
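
To make the cost of dynamic programming concrete, here is a sketch of Smith-Waterman local alignment scoring. It fills a table with one cell per pair of positions, so the time grows with the product of the two lengths. The +5/-4 match/mismatch scores are a common nucleotide scheme; the linear gap penalty of -6 is an assumption chosen for illustration.

```python
def smith_waterman_score(a, b, match=5, mismatch=-4, gap=-6):
    """Best local alignment score; O(len(a) * len(b)) time, O(len(b)) memory."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below 0: a local alignment may start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

score = smith_waterman_score("TTACGTAA", "GGACGTCC")
print(score)  # 20, from the shared segment ACGT
```

Running this against every sequence in a large database is what heuristic tools avoid by first narrowing the search to regions around seeds.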

The query is filtered to prevent the production of large numbers of statistically significant but biologically uninteresting results.
Low-complexity regions and repeats, e.g.
(CA)n
KLKLKLKLKLKL
Mask these regions with the following letters:
Ns (for nucleotide residues)
Xs (for amino acid residues)
-F flag: filter the query sequence
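
The masking step can be illustrated with a toy filter. This sketch only catches short tandem repeats (a unit of one or two letters repeated at least four times) using a regular expression; the filters used by NCBI BLAST (DUST for nucleotides, SEG for proteins) rely on complexity measures instead, so treat this as a stand-in. The name mask_repeats is hypothetical.

```python
import re

# A unit of 1 or 2 letters followed by at least 3 more copies of itself.
_REPEAT = re.compile(r"(.{1,2}?)\1{3,}")

def mask_repeats(seq, mask_char):
    """Replace every short tandem repeat with a run of mask_char of equal length."""
    return _REPEAT.sub(lambda m: mask_char * len(m.group(0)), seq)

masked_dna = mask_repeats("ATGCACACACACACGTT", "N")
masked_protein = mask_repeats("MKLKLKLKLKLKLA", "X")
print(masked_dna)      # ATGNNNNNNNNNNCGTT
print(masked_protein)  # MXXXXXXXXXXXXA
```

Masked positions keep the sequence length intact, so coordinates in the results still refer to the original query.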

Make a w-letter word list of the query sequence
- Usually 3 for amino acid sequences, and
- 11 for nucleotide sequences
For a query sequence with n letters, the number of words is n - w + 1
-W flag: word size
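
A quick check of the n - w + 1 count, as a sketch; the function name and the example peptide are arbitrary.

```python
def query_words(query, w):
    """All overlapping w-letter words of the query, in order of position."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

words = query_words("PQGEFG", 3)
print(words)  # ['PQG', 'QGE', 'GEF', 'EFG']
```

Here n = 6 and w = 3, giving 6 - 3 + 1 = 4 words.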

Scoring matrix
- For amino acids, BLOSUM or PAM
- For DNA words, a match is scored as +5 and a mismatch as -4, or as +2 and -3
No gaps are allowed
The words whose scores are greater than the threshold T remain in the list of possible matching words
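
The thresholding step can be sketched by brute force: score every possible word of the same length against a query word, without gaps, and keep those that score above T. The example uses the +5/-4 DNA scheme and a 3-letter word to keep the enumeration small; for proteins, a BLOSUM or PAM lookup would take the place of word_score. Function names and the choice T = 5 are assumptions for illustration.

```python
from itertools import product

def word_score(a, b, match=5, mismatch=-4):
    """Ungapped score of two equal-length words, position by position."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))

def neighborhood(word, threshold, alphabet="ACGT"):
    """All words of the same length whose score against `word` exceeds the threshold."""
    candidates = ("".join(p) for p in product(alphabet, repeat=len(word)))
    return [c for c in candidates if word_score(word, c) > threshold]

neighbors = neighborhood("ACG", threshold=5)
print(len(neighbors))  # 10: the word itself plus 9 single-mismatch words
```

With these scores an exact match gives 15, one mismatch gives 6, and two mismatches give -3, so T = 5 keeps only words with at most one mismatch.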

Hash table: direct addressing method
Deterministic finite automaton / finite state machine: much faster
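
Direct addressing can be sketched for DNA words by encoding each base in two bits, so a w-letter word becomes an integer that indexes straight into a table with 4^w slots. This is an illustrative assumption about the layout, not BLAST's actual data structure, and the names are hypothetical; the finite-automaton variant is not shown.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(word):
    """Read a DNA word as a base-4 number; the result is its slot in the table."""
    value = 0
    for base in word:
        value = value * 4 + CODE[base]
    return value

def build_table(subject, w):
    """Table with one slot per possible word, holding that word's start positions."""
    table = [[] for _ in range(4 ** w)]
    for i in range(len(subject) - w + 1):
        table[encode(subject[i:i + w])].append(i)
    return table

table = build_table("ACGTACGA", w=3)
positions = table[encode("ACG")]
print(positions)  # [0, 4]
```

The table size grows as 4^w, which is manageable for nucleotide word sizes such as 11 but not for long words.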

Raw scores: have little meaning without detailed knowledge of the scoring system used.
Bit scores: raw scores normalized by a formula, so that they can be compared across scoring systems.
E values: correspond to a given bit score.
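
The standard normalization is S' = (lambda * S - ln K) / ln 2, and the E value then follows from the bit score alone as E = m * n * 2^(-S'), with m the query length and n the database length. A sketch, with lambda and K set to illustrative values that in practice depend on the scoring system:

```python
import math

def bit_score(raw_score, lam, k):
    """Normalize a raw score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value_from_bits(bits, query_len, db_len):
    """E = m * n * 2^(-S'); the scoring parameters are already folded into S'."""
    return query_len * db_len * 2.0 ** (-bits)

# Illustrative parameters only (assumed, not taken from a specific BLAST run).
bits = bit_score(raw_score=50, lam=0.318, k=0.134)
e = e_value_from_bits(bits, query_len=100, db_len=1_000_000)
print(f"bit score = {bits:.1f}, E = {e:.3g}")
```

Because lambda and K are absorbed into the bit score, two searches with different scoring matrices can be compared on bit scores even though their raw scores cannot.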