BMS8110 Review (1): Lecture 1 - Introduction to Bioinformatics

Wiki definition:

  • Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data.
  • As an interdisciplinary field of science, bioinformatics combines computer science, statistics, mathematics, and engineering to analyze and interpret biological data.

National Institutes of Health (NIH) definition:

  • Bioinformatics is "research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral, or health data, including those to acquire, store, organize, analyze, or visualize such data."

Fields of Bioinformatics:

  • Sequence analysis
  1. Genome annotation
  2. Computational evolutionary biology
  3. Comparative genomics
  4. Cancer genomics
  • Structural bioinformatics
  • Network and systems biology
  • High-throughput image analysis
  • Literature analysis / Text mining
  • Databases
  • Software and tools

The role and contribution of computational biology has often been misunderstood and undervalued!

All modern biology is computational biology:

  • Computational thinking and computational methods are so central to the quest of understanding life that today all biology is computational biology.
  • Computational biology:
  1. brings order into our understanding of life
  2. lets you see the big picture
  3. provides an atlas of life
  4. turns ideas into hypotheses

Computational Biology vs. Bioinformatics:

  • Computational Biology
  1. Is science
  2. Aims to discover biological mechanisms
  3. Treats bioinformatics as tools (just like pipettes, qPCR, Western blots, etc., to wet-lab biologists)
  4. Needs deeper biological/biomedical/clinical knowledge
  • Bioinformatics
  1. Is engineering or CS
  2. Aims to develop better models/algorithms
  3. Treats biological questions as case studies to demonstrate "better performance"
  4. Needs more CS/maths/stats background

Ronald Fisher: a biologist and a statistician (correlation does not imply causation)

National Center for Biotechnology Information (NCBI)

Major milestones of NCBI: 1990 BLAST; 1992 GenBank; 1996 OMIM; 1999 Human Genome; 1999 Suite of Genomic Resources; 2000 PubMed Central; 2000 GEO; 2002 WGS; 2003 Entrez Gene; 2004 PubChem; 2006 dbGaP; 2008 1000 Genomes Project; 2010 dbVar; 2013 ClinVar

European Bioinformatics Institute (EBI)

Improve the efficiency of your research with public data

  • Testing your hypothesis using published data before your own experiments
  1. Supportive -> Proceed with your own experiments
  2. Discouraging -> Adjust/Give up your hypothesis
  • Validating your results using one or multiple independent data set(s)
  1. Supportive -> Strengthen your manuscript
  2. Discouraging -> Repeat your experiments

GEO: Gene Expression Omnibus

ENCODE: Encyclopedia of DNA Elements

TCGA: The Cancer Genome Atlas

ICGC: International Cancer Genome Consortium

All models are wrong, some are useful.

We like the idea of using simple statistics to solve real, important problems.

We aren't fans of unnecessary complication -- that just leads to lies, damn lies and something else.

Basic Types of Analyses:

  • Summarize data
  • Test for difference between groups
  • Analyze rates and proportions
  • Test for trends

Types of data: Interval data; Nominal or categorical data; Ordinal data

Mean, median and mode
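
As a quick illustration of the three measures of central tendency, here is a minimal sketch using Python's standard `statistics` module. The numbers are hypothetical, not from the lecture; the single large value (21.0) is included to show that the mean is pulled towards outliers while the median and mode are not.

```python
import statistics

# Hypothetical measurements (e.g. expression values of one gene in seven samples).
values = [2.0, 3.0, 3.0, 5.0, 7.0, 8.0, 21.0]

mean_value = statistics.mean(values)      # arithmetic average: sum / count
median_value = statistics.median(values)  # middle value of the sorted data
mode_value = statistics.mode(values)      # most frequent value

print(mean_value, median_value, mode_value)  # 7.0 5.0 3.0
```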

Variance, standard deviation, standard error of the mean
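
The three measures of spread are linked: the standard deviation is the square root of the variance, and the standard error of the mean (SEM) is the standard deviation divided by the square root of the sample size. A small sketch with hypothetical data, assuming the sample (n - 1) versions of variance and standard deviation are the ones intended:

```python
import math
import statistics

# Hypothetical sample of eight measurements.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

sample_variance = statistics.variance(values)  # divides by n - 1
sample_sd = statistics.stdev(values)           # square root of the sample variance
sem = sample_sd / math.sqrt(len(values))       # standard error of the mean

print(sample_variance, sample_sd, sem)
```

The SD describes how spread out the individual observations are, while the SEM describes how precisely the sample mean is estimated, so the SEM shrinks as more samples are collected.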

Hypothesis Test

  • In statistics, a result is called statistically significant if it is unlikely to have occurred by chance.
  • Assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the actually observed one?
  • This probability is known as the p-value.
  • What counts as "at least as extreme" depends on how the hypothesis is tested (for example, one-sided versus two-sided).
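
To make the p-value definition concrete, here is a small sketch using a coin-flip example that is not from the lecture: the null hypothesis is a fair coin, and 9 heads are observed in 10 flips. The function name `binomial_pmf` and the data are illustrative assumptions. It also shows how the answer changes with the definition of "extreme".

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n trials under the null hypothesis."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_flips, observed_heads = 10, 9

# One-sided: "extreme" means at least as many heads as observed.
one_sided_p = sum(binomial_pmf(k, n_flips) for k in range(observed_heads, n_flips + 1))

# Two-sided: "extreme" means any outcome no more probable than the observed one.
p_observed = binomial_pmf(observed_heads, n_flips)
two_sided_p = sum(
    binomial_pmf(k, n_flips)
    for k in range(n_flips + 1)
    if binomial_pmf(k, n_flips) <= p_observed * (1 + 1e-9)  # tolerance for float ties
)

print(one_sided_p, two_sided_p)  # 11/1024 and 22/1024
```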

How to test for differences between groups: Normal distribution; Student's t-distribution; Student's t-test
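
A minimal sketch of the two-sample Student's t statistic, assuming equal variances in the two groups (the classic pooled-variance form). The function name and the data are hypothetical. Only the statistic and degrees of freedom are computed here; turning them into a p-value needs the t-distribution, which is not in Python's standard library (a package such as SciPy provides `scipy.stats.ttest_ind` for the full test).

```python
import math
import statistics

def student_t_statistic(group_a, group_b):
    """Return (t, degrees of freedom) for a pooled-variance two-sample t-test."""
    n_a, n_b = len(group_a), len(group_b)
    dof = n_a + n_b - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / dof
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error
    return t, dof

# Hypothetical measurements for two groups.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treated = [3.0, 4.0, 5.0, 6.0, 7.0]

t_stat, dof = student_t_statistic(treated, control)
print(t_stat, dof)
```

For these numbers t is about 2 with 8 degrees of freedom, which falls short of the two-sided 5% critical value of roughly 2.31, so the difference would not be called significant at that level.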

How to test for associations: Fisher's Exact test for association
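
Fisher's exact test works on a 2x2 contingency table and sums hypergeometric probabilities of tables with the same margins. The sketch below is a hand-rolled two-sided version under the common "sum all tables no more probable than the observed one" convention; the function name and the counts are hypothetical. In practice a library routine such as `scipy.stats.fisher_exact` would normally be used.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def table_probability(x):
        # Hypergeometric probability that the top-left cell equals x
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = table_probability(a)
    return sum(
        table_probability(x)
        for x in range(lowest, highest + 1)
        if table_probability(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )

# Hypothetical counts: rows = mutated / wild-type, columns = responder / non-responder.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(p_value)  # 34/70
```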

How to test for overrepresentations: Hypergeometric test for overrepresentation - the basis for Gene Ontology analysis; Gene Ontology analysis
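
The overrepresentation question can be phrased as: if a gene list of a given size is drawn at random from the background, how likely is it to contain at least the observed number of genes annotated to a term? A minimal sketch of that upper-tail hypergeometric probability follows. The function name, parameter names, and counts are illustrative assumptions, and real Gene Ontology tools additionally correct for testing many terms.

```python
from math import comb

def overrepresentation_p(background_size, annotated_in_background, list_size, annotated_in_list):
    """P(X >= annotated_in_list) when list_size genes are drawn without replacement."""
    upper = min(annotated_in_background, list_size)
    total_ways = comb(background_size, list_size)
    return sum(
        comb(annotated_in_background, x)
        * comb(background_size - annotated_in_background, list_size - x)
        / total_ways
        for x in range(annotated_in_list, upper + 1)
    )

# Hypothetical numbers: 20 background genes, 5 annotated to the term,
# a list of 4 genes of interest, 3 of which carry the annotation.
p_value = overrepresentation_p(20, 5, 4, 3)
print(p_value)  # 155/4845, about 0.032
```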

How to test for trends:

Pearson Product-Moment Correlation Coefficient measures the strength of a linear relationship, but it has some limitations:

  • Only represents linear relationships
  • Does not distinguish between slopes of linear relationships
  • Cannot reflect nonlinear relationships
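
The limitations listed above can be checked with a few lines of code. This is a sketch with made-up data and a hand-written `pearson_r` helper (an illustrative name): two lines with very different slopes both give r = 1, while a perfect but nonlinear (parabolic) relationship gives r = 0.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
shallow_line = [2 * v for v in x]   # slope 2
steep_line = [10 * v for v in x]    # slope 10
parabola = [v ** 2 for v in x]      # y is fully determined by x, but not linearly

r_shallow = pearson_r(x, shallow_line)
r_steep = pearson_r(x, steep_line)
r_parabola = pearson_r(x, parabola)

print(r_shallow, r_steep, r_parabola)
```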

Spearman Rank Correlation Coefficient:

  • Defined as the Pearson correlation coefficient between the ranked variables
  • Used when the two variables being compared are monotonically related, even if their relationship is not linear
  • Less sensitive than the Pearson correlation to strong outliers
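
Following the definition above, Spearman's coefficient can be sketched as "rank both variables, then apply Pearson". The helper names and the data are hypothetical; ties receive the average of the ranks they span, which is the usual convention. The example uses an exponential relationship, which is monotonic but not linear.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranked variables."""
    return pearson_r(average_ranks(x), average_ranks(y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [math.exp(v) for v in x]  # monotonic, nonlinear

rho = spearman_rho(x, y)
r = pearson_r(x, y)
print(rho, r)
```

Here Spearman's rho is 1 because the ordering of the two variables agrees perfectly, while Pearson's r is noticeably below 1 because the points do not lie on a straight line.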

Data visualization is increasingly important for biological research