Positive-Unlabeled Learning for Disease Gene Identification

szypanther — Thu, 30 Aug 2012 05:41:33 +0000

Background: Identifying disease genes from the human genome is an important but challenging task in biomedical research. Machine learning methods can be applied to discover new disease genes based on the known ones. Existing machine learning methods typically use the known disease genes as the positive training set P and the unknown genes as the negative training set N (a confirmed non-disease gene set does not exist) to build classifiers that identify new disease genes among the unknown genes. However, such classifiers are actually built from a noisy negative set N, as there can be unknown disease genes in N itself. As a result, the classifiers do not perform as well as they could.

Results: Instead of treating the unknown genes as negative examples in N, we treat them as an unlabeled set U. We design a novel Positive-Unlabeled (PU) learning algorithm, PUDI (PU learning for Disease gene Identification), to build a classifier using P and U. We first partition U into four sets: the reliable negative set RN, the likely positive set LP, the likely negative set LN, and the weak negative set WN. Weighted support vector machines are then used to build a multi-level classifier based on these four sets and the positive training set P to identify disease genes. Our experimental results demonstrate that the proposed PUDI algorithm significantly outperforms the existing methods.
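The partition-then-weight idea can be illustrated with a small sketch. It is not the PUDI implementation: the abstract does not say how U is split or how the sets are weighted, so the centroid-based scoring, the quartile cut-offs, the ordering of LN before WN, the per-set weights, and all function names below are assumptions made for illustration. A single weighted linear SVM trained by subgradient descent stands in for the multi-level classifier.

```python
import random

def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def partition_unlabeled(P, U):
    """Split U into RN, LN, WN, LP by how positive-like each gene looks.

    Assumption: genes are scored by projection onto the direction from the
    unlabeled centroid to the positive centroid, then cut into quartiles.
    """
    direction = [a - b for a, b in zip(centroid(P), centroid(U))]
    order = sorted(range(len(U)), key=lambda i: dot(direction, U[i]))
    q = max(1, len(U) // 4)
    names = ["RN", "LN", "WN", "LP"]  # least to most positive-like
    parts = {name: [] for name in names}
    for rank, i in enumerate(order):
        parts[names[min(rank // q, 3)]].append(U[i])
    return parts

def train_weighted_svm(X, y, weights, epochs=100, lr=0.05, lam=0.01, seed=0):
    """Linear SVM with per-example weights on the hinge loss (subgradient descent)."""
    rng = random.Random(seed)
    theta, bias = [0.0] * len(X[0]), 0.0
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            margin = y[i] * (dot(theta, X[i]) + bias)
            theta = [t * (1 - lr * lam) for t in theta]  # L2 shrinkage
            if margin < 1:  # example violates the margin
                step = lr * weights[i] * y[i]
                theta = [t + step * x for t, x in zip(theta, X[i])]
                bias += step
    return theta, bias

def predict(theta, bias, x):
    return 1 if dot(theta, x) + bias >= 0 else -1

# Toy data: 20 known positives; U hides 10 positives among 30 negatives.
rng = random.Random(0)

def cloud(center, n):
    return [[rng.gauss(center, 0.5), rng.gauss(center, 0.5)] for _ in range(n)]

P = cloud(2.0, 20)
U = cloud(2.0, 10) + cloud(-2.0, 30)
parts = partition_unlabeled(P, U)

# Hypothetical weights: trust P and RN fully, the other sets less.
training_sets = [(P, 1, 1.0), (parts["LP"], 1, 0.5), (parts["RN"], -1, 1.0),
                 (parts["LN"], -1, 0.6), (parts["WN"], -1, 0.3)]
X, y, w = [], [], []
for rows, label, weight in training_sets:
    X += rows
    y += [label] * len(rows)
    w += [weight] * len(rows)

theta, bias = train_weighted_svm(X, y, w)
print(predict(theta, bias, [2.0, 2.0]), predict(theta, bias, [-2.0, -2.0]))
```

The point of the weights is that a mistake on a gene from P or RN costs more than a mistake on a gene from LP, LN, or WN, so label noise in the less certain sets pulls the decision boundary less.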

Conclusion: The proposed PUDI algorithm is able to identify disease genes more accurately by treating the unknown data more appropriately as an unlabeled set U instead of a negative set N. Given that many machine learning problems in biomedical research involve positive and unlabeled data rather than negative data, it is possible that the machine learning methods for these problems can be further improved by adopting PU learning methods, as we have done here for disease gene identification.

Availability: The executable program and data are available at: http://www1.i2r.a-star.edu.sg/~xlli/PUDI/PUDI.html

FacPad: Bayesian Sparse Factor Modeling for the Inference of Pathways Responsive to Drug Treatment

szypanther — Thu, 30 Aug 2012 05:36:13 +0000

Motivation: It is well recognized that the effects of drugs go far beyond targeting individual proteins and instead influence the complex interactions among many relevant biological pathways. Genome-wide expression profiling before and after drug treatment has become a powerful approach for capturing a global snapshot of the cellular response to drugs, as well as for understanding drugs’ mechanisms of action. It is therefore of great interest to analyze this type of transcriptomic profiling data to identify pathways responsive to different drugs. However, few computational tools exist for this task.

Results: We have developed FacPad, a Bayesian sparse factor model for the inference of pathways responsive to drug treatments. This model represents biological pathways as latent factors, and aims to describe the variation among drug-induced gene expression alterations in terms of a much smaller number of latent factors. We applied this model to the Connectivity Map dataset (build 02), and demonstrated that FacPad is able to identify many drug-pathway associations, some of which have been validated in the literature. Although this method was originally designed for the analysis of drug-induced transcriptional alteration data, it can be naturally applied to many other settings beyond polypharmacology.
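To make "pathways as latent factors" concrete, a generic sparse factor model can be written as follows. This is a sketch of the general model class, not FacPad's exact specification: the symbols Y, W, X, E and the spike-and-slab prior are assumptions chosen for illustration, and the abstract does not state the priors actually used.

```latex
% Y: G x J matrix of expression changes (genes x drug treatments)
% W: G x K sparse loading matrix (genes x pathways)
% X: K x J factor matrix (pathway activity per drug treatment), K << G
% E: G x J noise matrix
\[
  Y = W X + E, \qquad e_{gj} \sim \mathcal{N}(0, \sigma_g^2)
\]
% Sparsity: each loading is exactly zero with probability 1 - \pi_k
\[
  w_{gk} \sim (1 - \pi_k)\,\delta_0 + \pi_k\,\mathcal{N}(0, \tau_k^2)
\]
```

Under this reading, a large entry of X for pathway k and drug j would indicate that the pathway responds to that drug, and the zeros in W restrict each pathway to a limited set of genes.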

Availability: The R package “FacPad” is publicly available at: http://cran.open-source-solution.org/web/packages/FacPad/

Contact: hongyu.zhao@yale.edu
