I work at the intersection of machine learning methods and computational biology problems. Many of the projects I am involved with have both components, resulting in overlaps in the descriptions below.






Computational Biology View

Structural variation for clonal evolution in cancer

Cancer arises through DNA mutations in the genome. As populations of cancer cells divide, they continue to mutate and pass their mutations on to their daughter cells. Subpopulations of similar cells (called clones) can be identified from the proportions of mutations in bulk-sequenced tissue. Tracking these clones gives us valuable information about the tumour’s evolution. For example, we may identify mutations in certain clones that confer resistance to particular treatments, allowing us to target treatment most effectively. Most approaches that identify clones look at single-nucleotide mutations, the most common mutation type observed in cancers. Large rearrangements of DNA, called structural variation (SV), also occur and can have significant effects on the tumour genome; however, we currently lack approaches to characterise the SVs specific to particular clones.

To address this, we developed a probabilistic model, called SVclone, that uses the SVs detected in a tumour sequencing sample to identify clones. We applied our approach to a cohort of over 2,600 sequenced tumour samples across 38 tumour types and found distinct patterns across different cancer types. We identified one particular subset of samples with clones containing a characteristic set of rearrangements. Patients with these clones had reduced overall survival, suggesting that finding such patterns may help identify clinically relevant cancer subtypes, ultimately leading to better monitoring and treatment.

Biomarker discovery in genomics data


One of the main aims of genome-wide association studies (GWAS) is to identify DNA features that, if not causal, are at least statistically significantly associated with increased risk of various diseases or traits, or with increased benefit from specific treatments. However, an increased risk of developing complex diseases such as cancer or diabetes is caused by mutations in combinations of genes, rather than in any one gene. These genetic variations may reduce resistance to infections or trigger auto-immune reactions that lead to disease. We study these combinations of genomic locations, called epistatic interactions, which are jointly strongly associated with disease. This can be framed as a feature selection problem in machine learning, with challenges such as dependent features, high-dimensional low-sample-size data, and uncertain data.
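A minimal sketch of why marginal (single-locus) tests can miss epistatic interactions, using synthetic data: the phenotype below follows an XOR pattern over two loci, so neither locus carries a marginal signal, but adding an interaction feature makes the association detectable. The encoding and thresholds are illustrative assumptions, not any specific GWAS pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
# binary carrier status at two loci (illustrative dominant encoding)
a = rng.integers(0, 2, size=n)
b = rng.integers(0, 2, size=n)
# purely epistatic phenotype: disease status follows an XOR pattern,
# so neither locus shows a marginal effect on its own
y = a ^ b

# main effects only: no linear model can capture XOR
acc_marginal = cross_val_score(LogisticRegression(), np.c_[a, b], y, cv=5).mean()
# adding the pairwise interaction term makes the pattern linearly separable
acc_joint = cross_val_score(LogisticRegression(), np.c_[a, b, a * b], y, cv=5).mean()
```

The gap between the two cross-validated accuracies (near chance versus near perfect) is the signature of a purely epistatic interaction.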

Medical Imaging for Cancer Pathology

We consider an automated processing pipeline for tissue microarray analysis of renal cell carcinoma. It consists of several consecutive tasks, each of which can be mapped to a machine learning challenge. We investigate tasks such as nuclei detection and segmentation, nuclei classification, and staining estimation. This work is done in collaboration with the Peter MacCallum Cancer Centre in Melbourne.
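The nuclei detection step can be sketched in its simplest form as thresholding followed by connected-component labelling. The synthetic image and threshold below are assumptions for illustration; real pipelines use far more robust segmentation.

```python
import numpy as np
from scipy import ndimage

# synthetic grayscale "tissue" image with three bright nucleus-like blobs
img = np.zeros((64, 64))
for cy, cx in [(10, 12), (30, 40), (50, 20)]:
    yy, xx = np.ogrid[:64, :64]
    img += np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / 8.0)

mask = img > 0.5                        # global intensity threshold
labels, n_nuclei = ndimage.label(mask)  # connected components = nucleus candidates
centroids = ndimage.center_of_mass(mask, labels, range(1, n_nuclei + 1))
```

Each connected component becomes a candidate nucleus whose centroid can be passed on to the downstream classification and staining-estimation steps.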



Experimental Design for Systems Biology

Our long-term goal is to determine the nonlinear dynamical system underlying glucose signalling in yeast. To this end, we developed a computational method that proposes the biological experiments expected to maximise the information gained. This work is done under the umbrella of the YeastX project.
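The information-gain criterion can be sketched with a toy Bayesian experimental design problem: given a handful of candidate response models and a uniform prior, pick the measurement that minimises the expected posterior entropy. The three dose-response curves and the binary-outcome likelihood are invented for illustration.

```python
import numpy as np

doses = np.linspace(0, 1, 11)                                 # candidate experiments
models = [lambda d: d, lambda d: d ** 2, lambda d: np.sqrt(d)]  # P(outcome=1 | model)
prior = np.full(3, 1 / 3)                                     # uniform over models

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def expected_posterior_entropy(dose):
    # marginalise over the binary outcome of measuring at this dose
    p1 = np.array([m(dose) for m in models])
    h = 0.0
    for outcome_prob in (p1, 1 - p1):
        marginal = (prior * outcome_prob).sum()
        if marginal > 0:
            posterior = prior * outcome_prob / marginal
            h += marginal * entropy(posterior)
    return h

# the most informative experiment minimises the expected remaining uncertainty
best_dose = min(doses, key=expected_posterior_entropy)
```

Measuring at dose 0 or 1 is useless here (all models agree on the outcome), so the criterion automatically selects an intermediate dose where the models disagree most.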





Machine Learning View

Machine Learning guided Experimental Design

Machine learning is often used to analyse large datasets in science and society, with the goal of increasing our knowledge about underlying processes. However, the process of scientific discovery is an interplay between going from data to knowledge and going from knowledge to data.

This research direction explores the second (often forgotten) view: how can we use a machine learning predictor, which captures the knowledge we have, to decide what, where, and when to measure? We will see how machine learning ideas such as active learning, bandits, choice theory, and design of experiments (ABCDE) can be used to significantly improve the impact of particular measurements by maximising the information gained.
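Of the ABCDE ideas above, active learning is the easiest to sketch: uncertainty sampling repeatedly "measures" the pool point the current predictor is least sure about. The one-dimensional threshold problem and seed-set construction are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# pool of unlabelled points; hidden ground truth is a threshold at x = 0
X_pool = rng.uniform(-1, 1, size=(200, 1))
y_pool = (X_pool[:, 0] > 0).astype(int)

# seed set: the two extremes of each class, then query by uncertainty
order = np.argsort(X_pool[:, 0])
labelled = [int(order[0]), int(order[1]), int(order[-2]), int(order[-1])]

for _ in range(10):
    clf = LogisticRegression().fit(X_pool[labelled], y_pool[labelled])
    proba = clf.predict_proba(X_pool)[:, 1]
    ranked = np.argsort(np.abs(proba - 0.5))          # closest to 0.5 first
    nxt = next(int(i) for i in ranked if int(i) not in labelled)
    labelled.append(nxt)                              # "measure" this point next

accuracy = clf.score(X_pool, y_pool)
```

The queried points cluster around the decision boundary, which is exactly where each additional measurement is most informative.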

Latent variable estimation


Cancer arises through DNA mutations in the genome. As populations of cancer cells divide, they continue to mutate and pass their mutations on to their daughter cells. Groups of these cells, which we term ‘clones’, can be identified by their mutational proportions in bulk-sequenced tissue. Tracking clones gives us valuable information about the tumour’s evolution. For example, we may identify mutations in certain clones that confer resistance to particular treatments, allowing us to target treatment most effectively. Whole-genome sequencing technology allows us to obtain DNA sequences from bulk tumour cells and computationally reconstruct the clones from the observed mutation frequencies. Most approaches use the frequencies of single-nucleotide variants (SNVs) to perform this inference, applying machine learning techniques to handle the latent variables involved. Currently, methods are lacking to perform clonal inference using structural variation (SV), an important type of variation in cancer that has been understudied in the area of tumour evolution.

SVs differ from SNVs in several ways that must be accounted for: i) fewer SV data points are available for inference (SVs typically number in the hundreds in cancer samples, while SNVs may number in the thousands); ii) an SV may itself cause a copy-number change, skewing its own frequency calculation; and iii) each SV event affects two ends of the genome, while an SNV affects only a single location. To address these differences, we developed an approach, called SVclone, to identify clones from SV data. We comprehensively filter SVs to minimise false positives, extract read data around each variant, perform copy-number adjustment, and incorporate information from both SV ends into the model. SVclone uses a bespoke Bayesian mixture model, implemented using variational inference to approximate posterior distributions over the unknown parameters.
By applying the approach to a large set of tumour samples (over 2,600 across 38 tumour types), we identified distinct patterns of clonality across tumour types, including a subtype that correlated with reduced survival. SVclone can be used to track SV and SNV clonal frequencies in whole-genome tumour samples, and to estimate the number of clones from either data type, allowing the landscape of SV evolution in tumours to be better understood. A similar approach may also prove useful for related deconvolution problems, such as those arising in population genetics.
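The variational mixture-model idea can be sketched on synthetic variant allele frequencies. This is not SVclone's actual model (which handles SV-specific corrections); it only illustrates how a variationally fitted mixture with a Dirichlet-process prior recovers the number of clones by pruning unused components.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
# synthetic variant allele frequencies from two clusters: a clonal cluster
# near VAF 0.25 and a subclonal cluster near VAF 0.10 (illustrative values)
vaf = np.concatenate([
    rng.normal(0.25, 0.02, 150),
    rng.normal(0.10, 0.02, 60),
]).reshape(-1, 1)

# mixture fitted by variational inference; the Dirichlet-process prior
# drives the weights of superfluous components towards zero
bgm = BayesianGaussianMixture(
    n_components=8,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(vaf)
n_clones = int((bgm.weights_ > 0.05).sum())
```

Starting from eight candidate components, the effective number of clusters after fitting matches the two clones that generated the data.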

Learning the kernel

A kernel captures the similarity between objects, and its choice is highly important for the success of the classifier. When it is not known which kernel is best, one can attempt to learn the similarity from data.

Multiple kernel learning (MKL) is a way of optimising kernel weights while training the SVM. In other words, from a number of different kernels, we choose a small number of good ones for the task at hand. In addition to yielding good classification accuracies, MKL can also be useful for identifying relevant and meaningful features. We recently derived the structured output learning version of MKL in "Multiclass multiple kernel learning", and related several slightly different formulations. When there is no explicit finite set of candidate kernels, the idea of "hyperkernels" comes into play, i.e. a kernel on kernels. We have also had some success in learning the similarities between outputs (learning output kernels) by optimising over the whole space of positive semidefinite matrices.
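The core idea of combining kernels can be sketched with a grid search over a convex combination of two base kernels, scored by cross-validation with a precomputed-kernel SVM. This is a toy stand-in for MKL, which optimises the kernel weights jointly with the SVM rather than by grid search; the dataset and kernels are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
K1 = rbf_kernel(X, gamma=1.0)
K2 = polynomial_kernel(X, degree=2)

best_beta, best_acc = 0.0, -1.0
for beta in np.linspace(0, 1, 11):
    K = beta * K1 + (1 - beta) * K2    # convex combination stays PSD
    acc = cross_val_score(SVC(kernel="precomputed"), K, y, cv=5).mean()
    if acc > best_acc:
        best_beta, best_acc = float(beta), float(acc)
```

Because any convex combination of positive semidefinite matrices is itself positive semidefinite, every candidate in the loop is a valid kernel.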

Structured Output Spaces

In many applications, the classifier has to make a prediction that is more complex than the single value produced in binary classification or regression. Further, the label is structured in the sense that there are dependencies and correlations between its individual parts. One approach is to use our prior knowledge of the problem to define a structure on the set of labels, using tools such as graphical models. We have applied this to problems such as gene finding (mGene), spliced alignment (PALMA), and image denoising.

There are currently two main approaches to discriminatively learning such models: conditional random fields and structured output support vector machines. We have shown that they are essentially the same model with different regularisation (entropy versus margin maximisation), and have derived efficient approximations for training such models on intractable graphical models (tree approximation).
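At prediction time, both model families above rely on the same inference step. For a chain-structured label, that step is Viterbi decoding, sketched below with invented node (emission) and edge (transition) scores.

```python
import numpy as np

# node[t, l]: score of label l at position t; edge[l, l']: transition score
node = np.array([[2.0, 0.5],
                 [0.5, 1.5],
                 [1.0, 1.0]])
edge = np.array([[1.0, -1.0],
                 [-1.0, 1.0]])

T, L = node.shape
score = node[0].copy()
back = np.zeros((T, L), dtype=int)
for t in range(1, T):
    # candidate scores for every (previous label, current label) pair
    cand = score[:, None] + edge + node[t][None, :]
    back[t] = cand.argmax(axis=0)
    score = cand.max(axis=0)

# backtrack from the best final label to recover the highest-scoring sequence
path = [int(score.argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(back[t][path[-1]]))
path.reverse()
```

The dynamic program costs O(T L^2) rather than the O(L^T) of enumerating all label sequences, which is what makes exact inference tractable on chains and trees.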


Older projects

Dynamical Systems Modeling in Neuroscience

Measurements of signals in neural tissue present a particularly challenging scenario for traditional machine learning, since the data are high-dimensional and the number of subjects is very small. We propose to model each subject by a corresponding dynamical system, called a dynamic causal model, which is a neurophysiologically motivated model of brain activity. Our generative embedding approach has resulted in interpretable models and significantly more accurate predictions of a spectrum disorder (aphasia) in human subjects from fMRI data. We also analysed whisker stimulation and auditory oddball detection from electrical activity in the mouse model. This work is done in collaboration with the University of Zurich and the Wellcome Trust Centre for Neuroimaging, University College London.
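The generative-embedding idea can be sketched with a much simpler generative model than a DCM: fit a linear dynamical system x[t+1] = A x[t] to each subject's time series by least squares, then use the entries of the estimated A as that subject's feature vector. The two "subject" matrices and noise level are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(A, T=500):
    # noisy linear dynamical system started from a fixed initial state
    x = np.zeros((T, 2))
    x[0] = [1.0, 0.0]
    for t in range(T - 1):
        x[t + 1] = A @ x[t] + rng.normal(0, 0.05, 2)
    return x

def embed(x):
    # least-squares fit of x[t+1] ~ A x[t]; vec(A) is the subject's embedding
    A_hat, *_ = np.linalg.lstsq(x[:-1], x[1:], rcond=None)
    return A_hat.T.ravel()

A_patient = np.array([[0.9, 0.1], [0.0, 0.8]])
A_control = np.array([[0.8, -0.1], [0.1, 0.9]])
features = np.array([embed(simulate(A)) for A in (A_patient, A_control)])
```

Each subject is now represented by four interpretable numbers (the coupling strengths) instead of a long raw time series, which is what makes classification feasible with very few subjects.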

Indefinite Kernels

One of the key requirements in Support Vector Machines (SVMs) is the positive definiteness of the kernel. It turns out that most of the theory still works when one relaxes positive definiteness to merely symmetric, "indefinite" kernels. The corresponding concept is a reproducing kernel Krein space.
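A small numerical sketch of the issue: the sigmoid (tanh) similarity is symmetric but in general not positive semidefinite, which can be seen directly from the eigenvalues of its Gram matrix. The "spectrum flip" shown afterwards is one common pragmatic correction, separate from the Krein-space treatment described above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
# sigmoid similarity: symmetric, but not a valid Mercer kernel in general
K = np.tanh(X @ X.T - 1.0)

eigvals, eigvecs = np.linalg.eigh(K)
indefinite = bool(eigvals.min() < -1e-10)   # negative eigenvalues => indefinite

# "flip" heuristic: replace each eigenvalue by its absolute value,
# producing a PSD matrix with the same eigenvectors
K_flip = (eigvecs * np.abs(eigvals)) @ eigvecs.T
```

A Krein-space formulation instead keeps the negative part of the spectrum and works with the indefinite inner product directly, rather than repairing the matrix.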

Gene Finding

Today, next-generation technologies have rendered genome sequencing an almost routine process, allowing individual scientists to obtain the sequences of their favourite organisms. The task of annotating new genomes may therefore move partly into the domain of individual researchers or laboratories. Consequently, labour-intensive procedures such as manual annotation by experts, albeit presumably the most precise, are not always affordable, and highly automated computational methods are called upon to fill the gap. We developed mGene, a complete discriminative gene finder.

Alternative Splicing

Alternative splicing has been linked to the complexity of higher organisms. However, the incidence of alternative splicing is still unclear, and the mechanisms are not well understood. Using information from publicly available genome and EST databases, we constructed a database of transcription units, summarised as splicegraphs, for several model organisms. From the splicegraphs, we identify exon skipping, intron retention, and alternative 5' and 3' splicing, and use these transcript-confirmed events to train an SVM to predict novel events.
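The splicegraph construction can be sketched in a few lines: each transcript is a chain of exons, merging the chains gives a graph, and a "bypass" edge that jumps over an intermediate exon reveals an exon-skipping event. The exon names and transcripts are made up for illustration.

```python
# two isoforms of a hypothetical transcription unit
transcripts = [
    ["e1", "e2", "e3", "e4"],   # full-length isoform
    ["e1", "e3", "e4"],         # isoform skipping e2
]

# merge the exon chains into one set of splicegraph edges
edges = set()
for tx in transcripts:
    edges.update(zip(tx, tx[1:]))

# exon v is skipped if an edge u -> w bypasses the path u -> v -> w
skipped = {v for (u, v) in edges for (v2, w) in edges
           if v == v2 and (u, w) in edges}
```

Intron retention and alternative 5'/3' events can be read off the same graph with analogous local patterns on edges and exon coordinates.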

Protein Subcellular Localization

Protein subcellular localization is a crucial ingredient in many important inferences about cellular processes, including the prediction of protein function and protein interactions. We investigate the problem of predicting the subcellular localization of a protein from its peptide sequence. We propose a general class of protein sequence kernels that considers all motifs, including motifs with gaps, and use multiple kernel learning to optimise over many kernels, obtaining state-of-the-art results.
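The gap-free special case of such sequence kernels is the classic spectrum kernel: represent each peptide by its k-mer counts and take the inner product of the count vectors. The peptide strings below are arbitrary examples, and the motif kernels in the text additionally allow gapped motifs.

```python
from collections import Counter

def spectrum_kernel(s, t, k=2):
    # count all length-k substrings of each sequence
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    # inner product of the two k-mer count vectors
    return sum(cs[m] * ct[m] for m in cs)

k_sim = spectrum_kernel("MKKLLPT", "MKKLAPT")   # shares MK, KK, KL, PT
k_diff = spectrum_kernel("MKKLLPT", "GGGGGGG")  # no shared 2-mers
```

Because the feature map is explicit (k-mer counts), this kernel is positive semidefinite by construction and can be plugged directly into an SVM or an MKL combination.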

Spliced Alignment

Despite many years of research on how to properly align sequences in the presence of sequencing errors, alternative splicing and micro-exons, the correct alignment of mRNA sequences to genomic DNA is still a challenging task. We present a novel approach based on large margin learning that combines accurate splice site predictions with common sequence alignment techniques. By solving a convex optimization problem, our algorithm -- called PALMA -- tunes the parameters of the model such that true alignments score higher than other alignments. We study the accuracy of alignments of mRNAs containing artificially generated micro-exons to genomic DNA.
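The classical dynamic-programming backbone that spliced aligners such as PALMA extend with splice-site and intron scores is global (Needleman-Wunsch) alignment, sketched below with illustrative match, mismatch, and gap scores rather than learned parameters.

```python
def align(a, b, match=2, mismatch=-1, gap=-2):
    # Needleman-Wunsch global alignment score via dynamic programming
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap              # leading gaps in b
    for j in range(1, m + 1):
        S[0][j] = j * gap              # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # (mis)match
                          S[i - 1][j] + gap,       # gap in b
                          S[i][j - 1] + gap)       # gap in a
    return S[n][m]

score = align("ACGT", "ACGGT")
```

PALMA's large-margin training then learns the scoring parameters themselves, so that the true spliced alignment outscores all alternatives, instead of fixing them by hand as above.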