Paper reading (44): Machine learning methods for metabolic pathway prediction

Paper title: Machine learning methods for metabolic pathway prediction

Scholar citations: 149

Pages: 14

Publication date: 2010.01

Journal: BMC Bioinformatics

Authors: Joseph M Dale, Liviu Popescu, Peter D Karp

Abstract:

Background

A key challenge in systems biology is the reconstruction of an organism's metabolic network from its genome sequence. One strategy for addressing this problem is to predict which metabolic pathways, from a reference database of known pathways, are present in the organism, based on the annotated genome of the organism.

Results

To quantitatively validate methods for pathway prediction, we developed a large "gold standard" dataset of 5,610 pathway instances known to be present or absent in curated metabolic pathway databases for six organisms. We defined a collection of 123 pathway features, whose information content we evaluated with respect to the gold standard. Feature data were used as input to an extensive collection of machine learning (ML) methods, including naïve Bayes, decision trees, and logistic regression, together with feature selection and ensemble methods. We compared the ML methods to the previous PathoLogic algorithm for pathway prediction using the gold standard dataset. We found that ML-based prediction methods can match the performance of the PathoLogic algorithm. PathoLogic achieved an accuracy of 91% and an F-measure of 0.786. The ML-based prediction methods achieved accuracy as high as 91.2% and F-measure as high as 0.787. The ML-based methods output a probability for each predicted pathway, whereas PathoLogic does not, which provides more information to the user and facilitates filtering of predicted pathways.

Conclusions

ML methods for pathway prediction perform as well as existing methods, and have qualitative advantages in terms of extensibility, tunability, and explainability. More advanced prediction methods and/or more sophisticated input features may improve the performance of ML methods. However, pathway prediction performance appears to be limited largely by the ability to correctly match enzymes to the reactions they catalyze based on genome annotations.

Conclusions:

  • metabolic pathway prediction from an annotated genome
  • General machine learning methods; this is a 2010 paper, so all of them are traditional machine learning methods.
  • In particular, the fraction of reactions in the pathway with enzymes was the single most informative numeric feature.
  • The predictors often also included other features - newly developed in this work - which are less informative individually, but can contribute to prediction performance in the context of an automatically trained ML predictor.
  • The machine learning approach decomposes the problem into three essential steps: (1) procuring labeled training data; (2) developing a modular library of useful features; (3) applying a general prediction algorithm. 
  • Each of these steps can be optimized in the future to yield continued improvements in pathway prediction performance. 
  • the resulting pathway predictions can be tuned by the user to suit different preferences for sensitivity versus specificity. 
  • the structure and parameters of the model are accessible, and can be used to explain predictions to users of the pathway prediction software.
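
As a rough illustration of the "modular library of features" idea, here is a minimal Python sketch of how the most informative feature, the fraction of reactions in a pathway with enzymes, could be computed. The data layout (a pathway as a list of reaction IDs, the annotated genome as a set of reactions with an assigned enzyme), the reaction IDs, and the second feature are all assumptions made for illustration; none of this is the paper's actual code.

```python
def fraction_reactions_with_enzymes(pathway, reactions_with_enzymes):
    """Fraction of the pathway's reactions that have at least one enzyme."""
    if not pathway:
        return 0.0
    present = sum(1 for rxn in pathway if rxn in reactions_with_enzymes)
    return present / len(pathway)


def all_reactions_present(pathway, reactions_with_enzymes):
    """1.0 if every reaction in the pathway has an enzyme, else 0.0 (hypothetical feature)."""
    complete = bool(pathway) and all(rxn in reactions_with_enzymes for rxn in pathway)
    return 1.0 if complete else 0.0


# The "library" is just a name -> function mapping, so features can be
# added or removed without touching the prediction algorithm.
FEATURES = {
    "fraction-reactions-with-enzymes": fraction_reactions_with_enzymes,
    "all-reactions-present": all_reactions_present,
}


def extract_features(pathway, reactions_with_enzymes):
    """Evaluate every registered feature for one pathway."""
    return {name: fn(pathway, reactions_with_enzymes) for name, fn in FEATURES.items()}


# Hypothetical toy input: 3 of the 4 pathway reactions have an enzyme.
pathway = ["RXN-1", "RXN-2", "RXN-3", "RXN-4"]
enzymes_found = {"RXN-1", "RXN-3", "RXN-4", "RXN-9"}
features = extract_features(pathway, enzymes_found)
print(features)
```

Keeping features as independent functions is one way to get the extensibility the paper argues for: a new feature is one more entry in the mapping.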

Discussion:

  • machine learning methods perform as well as PathoLogic
  • ML methods offer a tradeoff between sensitivity and specificity (precision and recall) by providing estimates of the probability that each pathway is present in an organism, rather than simply binary present/absent calls.
  • The main cause of false negative classifications (failing to predict pathways that do occur in the organism) was the inability of the enzyme matcher component of Pathway Tools (which is shared by PathoLogic and the ML methods) to find enzymes catalyzing some reactions in the pathway.
  • Another factor contributing to prediction errors is the existence of promiscuous reactions. 
  • In decision trees, the most informative features typically appeared very close to the root of the tree.
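
The tradeoff made possible by probability outputs can be sketched as follows. The probabilities and labels are made-up toy values, and `precision_recall` is a hypothetical helper, not something taken from the paper or from Pathway Tools.

```python
def precision_recall(probs, labels, threshold):
    """Call a pathway 'present' when its probability is >= threshold."""
    predicted = [p >= threshold for p in probs]
    tp = sum(1 for y_hat, y in zip(predicted, labels) if y_hat and y)
    fp = sum(1 for y_hat, y in zip(predicted, labels) if y_hat and not y)
    fn = sum(1 for y_hat, y in zip(predicted, labels) if not y_hat and y)
    # Convention (an assumption): report 1.0 when a ratio is undefined.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Toy predicted probabilities and gold-standard labels (True = pathway present).
probs = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
labels = [True, True, False, True, False, False]

# A low threshold favors recall; a high threshold favors precision.
for threshold in (0.25, 0.5, 0.9):
    precision, recall = precision_recall(probs, labels, threshold)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, raising the threshold from 0.25 to 0.9 moves precision from 0.60 to 1.00 while recall drops from 1.00 to 0.33, which is the kind of user-controlled filtering that a Boolean predictor cannot provide.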

Background:

  • methods are needed for computational characterization of metabolic networks.
  • (1) The reactome prediction problem; (2) The pathway prediction problem
  • PathoLogic: for automatically constructing a Pathway/Genome Database (PGDB) describing the metabolic network of an organism, meaning the metabolic reactions catalyzed by enzymes in the organism and their organization into pathways.
  • Prediction of pathways is hard for three reasons.
  1. Errors and omissions in genome annotations introduce noise into the primary source of evidence for pathways, namely, the set of metabolic enzymes in the genome.
  2. Enzymes that catalyze "promiscuous" reactions -- reactions that participate in multiple pathways -- are ambiguous in supporting the presence of more than one pathway. In the version of MetaCyc used for this work, 4,558 reactions participate in pathways. Of these, 779 reactions (17%) appear in at least two pathways.
  3. Groups of variant pathways in MetaCyc (pathways that carry out the same biological function) often share several reactions, making it difficult to distinguish which variant is present.
  • The PathoLogic algorithm suffers from several limitations: it makes more false positive pathway predictions; it is difficult to maintain and extend; it offers limited explanations; and it is limited to Boolean predictions, with only a coarse measure of prediction confidence.
  • We applied several commonly used machine learning (ML) algorithms to the pathway prediction problem. The best resulting ML-based algorithm achieved a small improvement over the performance of PathoLogic.
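
To make the "promiscuous reaction" statistic above concrete, the following sketch counts reactions that appear in at least two pathways. The pathway and reaction IDs are invented toy data; only the figures 779 and 4,558 come from the notes.

```python
from collections import Counter


def promiscuous_reactions(pathways):
    """Return the set of reactions that participate in two or more pathways."""
    # set(...) guards against a reaction being listed twice in one pathway.
    counts = Counter(rxn for reactions in pathways.values() for rxn in set(reactions))
    return {rxn for rxn, n in counts.items() if n >= 2}


# Hypothetical toy pathway database: pathway ID -> list of reaction IDs.
pathways = {
    "PWY-A": ["RXN-1", "RXN-2", "RXN-3"],
    "PWY-B": ["RXN-3", "RXN-4"],
    "PWY-C": ["RXN-4", "RXN-5", "RXN-1"],
}

shared = promiscuous_reactions(pathways)
all_reactions = {rxn for reactions in pathways.values() for rxn in reactions}
print(sorted(shared), len(shared) / len(all_reactions))

# The figure quoted above: 779 of 4,558 pathway reactions are promiscuous.
reported_fraction = 779 / 4558
print(f"{reported_fraction:.1%}")
```

An enzyme for `RXN-3` in this toy example is evidence for both `PWY-A` and `PWY-B`, which is exactly the ambiguity described in point 2.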

Paper structure:

1. Background

2. Methods

2.1 Construction of a Gold Standard Pathway Collection

2.2 Evidence Gathering and the PathoLogic Algorithm

2.3 Feature Extraction and Processing

2.4 Performance Evaluation

2.5 Training and Prediction

3. Results

3.1 Feature Performance

3.2 Predictor Performance

4. Discussion

5. Conclusion

Excerpts from the main text:

1. Biological Problem: What biological problems have been solved in this paper?

  • metabolic pathway prediction from an annotated genome

2. Main discoveries: What are the main discoveries in this paper?

  • We found that ML-based prediction methods can match the performance of the PathoLogic algorithm.
  • PathoLogic achieved an accuracy of 91% and an F-measure of 0.786. The ML-based prediction methods achieved accuracy as high as 91.2% and F-measure as high as 0.787. 
  • The ML-based methods output a probability for each predicted pathway, whereas PathoLogic does not, which provides more information to the user and facilitates filtering of predicted pathways.
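
For reference, the two metrics quoted above can be computed from confusion-matrix counts as in the sketch below. The counts are hypothetical and chosen only to show how the formulas behave; they are not the paper's actual confusion matrix.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all pathway instances that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall; true negatives are ignored."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: many absent pathways (tn) and relatively few present ones.
tp, fp, tn, fn = 80, 20, 880, 20

acc = accuracy(tp, fp, tn, fn)
f1 = f_measure(tp, fp, fn)
print(f"accuracy={acc:.3f} F-measure={f1:.3f}")
```

Because accuracy counts true negatives and the F-measure does not, the two numbers can differ substantially when absent pathways dominate, which is one reason to report both.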

3. ML (Machine Learning) Methods: What are the ML methods applied in this paper?

  • Machine learning (ML) methods, including naïve Bayes, decision trees, and logistic regression, together with feature selection and ensemble methods (bagging, random forest).
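
As a small, self-contained illustration of one of the listed methods, here is a plain-Python logistic regression trained by batch gradient descent on a single toy feature (the fraction of reactions with enzymes). The training data, learning rate, and epoch count are assumptions for illustration; this is not the authors' implementation or their training setup.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that the pathway is present."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Toy training set: one feature per pathway, label 1 = present, 0 = absent.
X = [[0.0], [0.2], [0.3], [0.5], [0.7], [0.8], [1.0]]
y = [0, 0, 0, 0, 1, 1, 1]

w, b = train_logistic_regression(X, y)
print("weight:", w[0], "bias:", b)
print("P(present | fraction=0.9):", predict_proba(w, b, [0.9]))
print("P(present | fraction=0.1):", predict_proba(w, b, [0.1]))
```

The learned weight and bias are directly inspectable, and the output is a probability rather than a yes/no call, which matches the two qualitative advantages the paper emphasizes.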

4. ML Advantages: Why are these ML methods better than the traditional methods in these biological problems?

  • traditional methods: PathoLogic 
  • have qualitative advantages in terms of extensibility, tunability, and explainability.

5. Biological Significance: What is the biological significance of these ML methods’ results?

  • the resulting pathway predictions can be tuned by the user to suit different preferences for sensitivity versus specificity
  • the structure and parameters of the model are accessible, and can be used to explain predictions to users of the pathway prediction software.

6. Prospect: What are the potential applications of these machine learning methods in biological science?

  • More advanced prediction methods and/or more sophisticated input features may improve the performance of ML methods. 
  • pathway prediction performance appears to be limited largely by the ability to correctly match enzymes to the reactions they catalyze based on genome annotations.
  • The machine learning approach decomposes the problem into three essential steps: (1) procuring labeled training data; (2) developing a modular library of useful features; (3) applying a general prediction algorithm.  Each of these steps can be optimized in the future to yield continued improvements in pathway prediction performance. 

7. My Question (Optional)

  • 2010~2019: What is the best method for metabolic pathway prediction now?