Yuan, Y.,
Failmezger, H.,
Rueda, OM.,
Ali, HR.,
Gräf, S.,
Chin, SF.,
Schwarz, RF.,
Curtis, C.,
Dunning, MJ.,
Bardwell, H.,
et al.
(2012)
Quantitative image analysis of cellular heterogeneity in breast tumors complements genomic profiling. Sci Transl Med, Vol.4(157),
pp.157ra143-,
Solid tumors are heterogeneous tissues composed of a mixture of cancer and normal cells, which complicates the interpretation of their molecular profiles. Furthermore, tissue architecture is generally not reflected in molecular assays, rendering this rich information underused. To address these challenges, we developed a computational approach based on standard hematoxylin and eosin-stained tissue sections and demonstrated its power in a discovery and validation cohort of 323 and 241 breast tumors, respectively. To deconvolute cellular heterogeneity and detect subtle genomic aberrations, we introduced an algorithm based on tumor cellularity to increase the comparability of copy number profiles between samples. We next devised a predictor for survival in estrogen receptor-negative breast cancer that integrated both image-based and gene expression analyses and significantly outperformed classifiers that use single data types, such as microarray expression signatures. Image processing also allowed us to describe and validate an independent prognostic factor based on quantitative analysis of spatial patterns between stromal cells, which are not detectable by molecular assays. Our quantitative, image-based method could benefit any large-scale cancer study by refining and complementing molecular assays of tumor samples.
Curtis, C.,
Shah, SP.,
Chin, SF.,
Turashvili, G.,
Rueda, OM.,
Dunning, MJ.,
Speed, D.,
Lynch, AG.,
Samarajiwa, S.,
Yuan, Y.,
et al.
(2012)
The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, Vol.486(7403),
pp.346-352,
The elucidation of breast cancer subgroups and their molecular drivers requires integrated views of the genome and transcriptome from representative numbers of patients. We present an integrated analysis of copy number and gene expression in a discovery and validation set of 997 and 995 primary breast tumours, respectively, with long-term clinical follow-up. Inherited variants (copy number variants and single nucleotide polymorphisms) and acquired somatic copy number aberrations (CNAs) were associated with expression in ~40% of genes, with the landscape dominated by cis- and trans-acting CNAs. By delineating expression outlier genes driven in cis by CNAs, we identified putative cancer genes, including deletions in PPP2R2A, MTAP and MAP2K4. Unsupervised analysis of paired DNA–RNA profiles revealed novel subgroups with distinct clinical outcomes, which reproduced in the validation cohort. These include a high-risk, oestrogen-receptor-positive 11q13/14 cis-acting subgroup and a favourable prognosis subgroup devoid of CNAs. Trans-acting aberration hotspots were found to modulate subgroup-specific gene networks, including a TCR deletion-mediated adaptive immune response in the ‘CNA-devoid’ subgroup and a basal-specific chromosome 5 deletion-associated mitotic network. Our results provide a novel molecular stratification of the breast cancer population, derived from the impact of somatic CNAs on the transcriptome.
Yuan, Y.,
Curtis, C.,
Caldas, C. &
Markowetz, F.
(2012)
A sparse regulatory network of copy-number driven gene expression reveals putative breast cancer oncogenes. IEEE/ACM Trans Comput Biol Bioinform, Vol.9(4),
pp.947-954,
Copy number aberrations are recognized to be important in cancer as they may localize to regions harboring oncogenes or tumor suppressors. Such genomic alterations mediate phenotypic changes through their impact on expression. Both cis- and trans-acting alterations are important since they may help to elucidate putative cancer genes. However, amidst numerous passenger genes, trans-effects are less well studied due to the computational difficulty in detecting weak and sparse signals in the data, and yet may influence multiple genes on a global scale. We propose an integrative approach to learn a sparse interaction network of DNA copy-number regions with their downstream transcriptional targets in breast cancer. With respect to goodness of fit on both simulated and real data, the performance of sparse network inference is no worse than other state-of-the-art models but with the advantage of simultaneous feature selection and efficiency. The DNA-RNA interaction network helps to distinguish copy-number driven expression alterations from those that are copy-number independent. Further, our approach yields a quantitative copy-number dependency score, which distinguishes cis- versus trans-effects. When applied to a breast cancer data set, numerous expression profiles were impacted by cis-acting copy-number alterations, including several known oncogenes such as GRB7, ERBB2, and LSM1. Several trans-acting alterations were also identified, impacting genes such as ADAM2 and BAGE, which warrant further investigation. Availability: An R package named lol is available from www.markowetzlab.org/software/lol.html.
Yuan, Y.,
Li, C-T. &
Windram, O.
(2011)
Directed Partial Correlation: Inferring Large-Scale Gene Regulatory Network through Induced Topology Disruptions. PLOS ONE, Vol.6(4),
pp.e16835-,
ISSN: 1932-6203
Yuan, Y.,
Savage, RS. &
Markowetz, F.
(2011)
Patient-specific data fusion defines prognostic cancer subtypes. PLoS Comput Biol, Vol.7(10),
pp.e1002227-,
Different data types can offer complementary perspectives on the same biological phenomenon. In cancer studies, for example, data on copy number alterations indicate losses and amplifications of genomic regions in tumours, while transcriptomic data point to the impact of genomic and environmental events on the internal wiring of the cell. Fusing different data provides a more comprehensive model of the cancer cell than that offered by any single type. However, biological signals in different patients exhibit diverse degrees of concordance due to cancer heterogeneity and inherent noise in the measurements. This is a particularly important issue in cancer subtype discovery, where personalised strategies to guide therapy are of vital importance. We present a nonparametric Bayesian model for discovering prognostic cancer subtypes by integrating gene expression and copy number variation data. Our model is constructed from a hierarchy of Dirichlet Processes and addresses three key challenges in data fusion: (i) To separate concordant from discordant signals, (ii) to select informative features, (iii) to estimate the number of disease subtypes. Concordance of signals is assessed individually for each patient, giving us an additional level of insight into the underlying disease structure. We exemplify the power of our model in prostate cancer and breast cancer and show that it outperforms competing methods. In the prostate cancer data, we identify an entirely new subtype with extremely poor survival outcome and show how other analyses fail to detect it. In the breast cancer data, we find subtypes with superior prognostic value by using the concordant results. These discoveries were crucially dependent on our model's ability to distinguish concordant and discordant signals within each patient sample, and would otherwise have been missed. We therefore demonstrate the importance of taking a patient-specific approach, using highly-flexible nonparametric Bayesian methods.
Yuan, Y.,
Rueda, OM.,
Curtis, C. &
Markowetz, F.
(2011)
Penalized regression elucidates aberration hotspots mediating subtype-specific transcriptional responses in breast cancer. Bioinformatics, Vol.27(19),
pp.2679-2685,
Copy number alterations (CNAs) associated with cancer are known to contribute to genomic instability and gene deregulation. Integrating CNAs with gene expression helps to elucidate the mechanisms by which CNAs act and to identify the transcriptional downstream targets of CNAs. Such analyses can help to sort functional driver events from the many accompanying passenger alterations. However, the way CNAs affect gene expression can vary in different cellular contexts, for example between different subtypes of the same cancer. Thus, it is important to develop computational approaches capable of inferring differential connectivity of regulatory networks in different cellular contexts.
Li, CT.,
Yuan, Y. &
Wilson, R.
(2008)
An unsupervised conditional random fields approach for clustering gene expression time series. Bioinformatics, Vol.24(21),
pp.2467-2473,
MOTIVATION: There is a growing interest in extracting statistical patterns from gene expression time-series data, in which a key challenge is the development of stable and accurate probabilistic models. Currently popular models, however, would be computationally prohibitive unless some independence assumptions are made to describe large-scale data. We propose an unsupervised conditional random fields (CRF) model to overcome this problem by progressively infusing information into the labelling process through a small variable voting pool. RESULTS: An unsupervised CRF model is proposed for efficient analysis of gene expression time series and is successfully applied to gene class discovery and class prediction. The proposed model treats each time series as a random field and assigns an optimal cluster label to each time series, so as to partition the time series into clusters without a priori knowledge about the number of clusters and the initial centroids. Another advantage of the proposed method is the relaxation of independence assumptions.
Yuan, Y.,
Li, CT. &
Wilson, R.
(2008)
Partial mixture model for tight clustering of gene expression time-course. BMC Bioinformatics, Vol.9,
pp.287-,
Tight clustering arose recently from a desire to obtain tighter and potentially more informative clusters in gene expression studies. Scattered genes with relatively loose correlations should be excluded from the clusters. However, in the literature there is little work dedicated to this area of research. On the other hand, there has been extensive use of maximum likelihood techniques for model parameter estimation. By contrast, the minimum distance estimator has been largely ignored.
Yuan, Y. &
Li, CT.
(2008)
A Bayes random field approach for integrative large-scale regulatory network analysis. J Integr Bioinform, Vol.5(2),
We present a Bayes-Random Fields framework which is capable of integrating unlimited data sources for discovering relevant network architecture of large-scale networks. The random field potential function is designed to impose a cluster constraint, teamed with a full Bayesian approach for incorporating heterogeneous data sets. The probabilistic nature of our framework facilitates robust analysis in order to minimize the influence of noise inherent in the data on the inferred structure in a seamless and coherent manner. This is later proved in its applications to both large-scale synthetic data sets and Saccharomyces cerevisiae data sets. The analytical and experimental results reveal the varied characteristics of different types of data and reflect their discriminative ability in terms of identifying direct gene interactions.
Li, C-T. &
Yuan, Y.
(2006)
Digital watermarking scheme exploiting nondeterministic dependence for image authentication. Optical Engineering, Vol.45(12),
ISSN: 0091-3286