Schedule for: 18w5202 - Statistical and Computational Challenges in High-Throughput Genomics with Application to Precision Medicine

Beginning on Sunday, November 4 and ending Friday, November 9, 2018

All times in Oaxaca, Mexico time, CST (UTC-6).

Sunday, November 4
14:00 - 23:59 Check-in begins (Front desk at your assigned hotel)
19:30 - 22:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
20:30 - 21:30 Informal gathering (Hotel Hacienda Los Laureles)
Monday, November 5
07:30 - 08:45 Breakfast (Restaurant at your assigned hotel)
08:45 - 09:00 Introduction and Welcome (Conference Room San Felipe)
09:00 - 09:30 Maria Avila: Afromexico genomics project: The genetic ancestry and health of the Afrodescendant population of Mexico.
Although Mexico received 200,000 Africans during the slave trade, no study in the country has focused on characterizing the African genetic ancestry of its Afro-descendant population. In this study we worked together with Afro-Mexican communities to characterize their genetic ancestry using dense genome-wide genotyping. The dataset consists of 380 self-identified Afro-descendant, indigenous and mestizo individuals from three Mexican states. To complement the genome-wide genotype data, we collected genealogical, self-identification and phenotypic data, including skin pigmentation, height, weight, hip-to-waist ratio and hemoglobin. By exploring local genetic ancestry and admixture patterns in this population, as well as correlations between genetic ancestry, self-identification and phenotypes, we have characterized aspects of their demographic history and some trends of public health relevance. Lastly, since Afro-Mexicans currently suffer from poverty, discrimination, lack of recognition as a vulnerable minority, and limited access to health services, this study contributes to their appreciation as part of Mexico’s mosaic of diversity and will hopefully set the stage for health interventions.
(Conference Room San Felipe)
09:30 - 10:00 Mashaal Sohail: Polygenic adaptation signals for height are confounded by population structure.
In the past six years, numerous publications have reported signals of directional polygenic selection on height. Here, we show that many of these results are severely confounded by uncontrolled population stratification. In particular, signals of polygenic adaptation based on summary statistics from the GIANT consortium meta-analysis are dramatically reduced in magnitude and, in many cases, no longer statistically significant when using summary statistics derived from the UK Biobank. This polygenic adaptation signal was apparently independently confirmed by a test based on the singleton density score statistic (SDS). However, we show that this signal too is only present when using GIANT but not UK Biobank summary statistics. Specifically, the Spearman correlation between p-value and SDS statistic is 2e-65 using GIANT statistics but only 0.077 using UK Biobank. We further show that correlations between effect size estimates and allele frequency differences between North and South European populations underlie most of these discrepancies. The confounding with population stratification appears to be most severe for less significant SNPs; restricting the analyses to genome-wide significant SNPs results in a higher concordance between GIANT and UK Biobank data. While stratification-free within-family estimates suggest that the phenotypic north-south height gradient in Europe is indeed paralleled by genetically predicted height as reported before, the magnitude of this effect had been greatly overestimated. Height is widely used as a model for polygenic traits, which raises the question whether other methods using similar summary statistics might suffer from the same kind of confounding.
Existing methods that aim to identify the presence of population stratification in GWAS summary statistics are not always applicable, and we provide simple suggestions that can supplement these methods in order to detect uncontrolled stratification and avoid resulting biases.
(Conference Room San Felipe)
10:00 - 10:30 Ingo Ruczinski: Inferring rare disease risk variants based on exact probabilities of sharing among multiple affected relatives.
Sequencing DNA in extended multiplex families can help to identify high penetrance disease variants that are too rare in the population to be detected through tests of association in population-based studies, but that co-segregate with disease in families. When only a few affected subjects per family are sequenced, evidence that a rare single nucleotide or copy number variant may be causal can be quantified from the probability that all affected relatives share the allele, given that it was seen in any one family member, under the null hypothesis of complete absence of linkage and association. We present a general framework for calculating such sharing probabilities when two or more affected subjects per family are sequenced, and show how information from multiple families can be combined by calculating a p-value as the sum of the probabilities of sharing events as (or more) extreme. We present case studies from families with multiple members born with oral clefts, and introduce the Bioconductor package RVS.
(Conference Room San Felipe)
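The multi-family p-value described in the abstract above can be illustrated with a minimal Python sketch. This is not the RVS implementation: the function name `multi_family_p_value`, the example probabilities, and the choice to rank sharing events as "as or more extreme" by their null probability are all assumptions made for illustration. Each family contributes one null probability that all of its sequenced affected relatives share the variant.

```python
from itertools import product
from math import prod


def multi_family_p_value(share_probs, observed):
    """Sum the null probabilities of all sharing configurations that are
    no more likely than the observed one.

    share_probs[i]: hypothetical null probability that every sequenced
        affected relative in family i shares the variant.
    observed[i]: True if complete sharing was observed in family i.
    """
    def config_prob(config):
        # Families are treated as independent under the null (assumption).
        return prod(p if shared else 1.0 - p
                    for p, shared in zip(share_probs, config))

    p_obs = config_prob(observed)
    tolerance = 1.0 + 1e-9  # guards against floating-point ties
    return sum(
        config_prob(config)
        for config in product([True, False], repeat=len(share_probs))
        if config_prob(config) <= p_obs * tolerance
    )


if __name__ == "__main__":
    # Two hypothetical families with sharing probabilities 0.1 and 0.2.
    print(multi_family_p_value([0.1, 0.2], [True, True]))
```

The enumeration is exponential in the number of families, so a sketch like this is only practical for a handful of families.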
10:30 - 11:00 Coffee Break (Conference Room San Felipe)
11:00 - 11:30 Chad Huff: Genomic analysis tools for familial and case-control sequencing association studies.
Advances in high-throughput sequencing technologies are transforming the landscape of biomedical research. As the generation of genomic data becomes commoditized, research efforts are increasingly being shifted to data analysis and interpretation. However, the interpretation of large-scale genomic datasets in the context of human disease is greatly complicated by complexities inherent to the population genetics of disease-causing variation and heterogeneity in high-throughput sequencing technologies. In this talk, I will present an overview of the computational tools we have developed to analyze high-throughput sequencing data for identification and characterization of genetic variation influencing disease risk. Topics will include relationship estimation, pedigree reconstruction, functional variant prediction, familial and case-control rare variant association analysis, and strategies for leveraging phenotypic ontologies. A particular focus of the presentation will involve techniques to overcome biases inherent to heterogeneous sequencing technologies in association studies which combine sequencing datasets from multiple sources. These techniques have recently been implemented in the Cross-Platform Association Toolkit (XPAT), a software package which includes a suite of tools to support large-scale association studies. Features implemented in XPAT include cross-platform aware variant calling, quality control filtering, gene-based association testing, and rare variant effect size estimation. I will highlight examples throughout the talk from Mendelian and complex diseases, with an emphasis on common cancers.
(Conference Room San Felipe)
11:30 - 12:00 Alejandra Eugenia Medina Rivera: Characterizing the effect of genetic variants within promoters with distal enhancer functions.
Mammalian gene regulation is mediated by interactions between promoters and enhancers. Promoters with enhancer functions have been reported in previous years; nevertheless, they were considered atypical. Using a capture array combined with a STARR-seq assay in HeLa and K562 cell lines, we reported the first high-throughput assessment of the enhancer functional potential of all annotated promoters in the human genome. Promoters with enhancer potential (ePromoters) were shown to have characteristic epigenetic marks and frequent interactions with other gene promoters. Moreover, we determined that GTEx-reported eQTLs within these regions were more likely to be found in ePromoters and to have a higher impact on gene expression, especially when these variants were determined to affect transcription factor binding sites, underlining the relevance of these regulatory mechanisms. We are now working on the integration of ePromoter target gene information, defined by ChIA-PET and eQTL interactions, with BioVU data of gene expression associated with disease annotations to assess the relevance of these regulatory mechanisms in human phenotypes.
(Conference Room San Felipe)
12:00 - 12:30 Mark Segal: A principal curve approach to three-dimensional chromatin configuration reconstruction.
The three-dimensional (3D) configuration of chromosomes within the eukaryote nucleus is consequential for several cellular functions including gene expression regulation and is also strongly associated with cancer-causing translocation events. While visualization of such architecture remains limited to low resolution and throughput imaging modalities, the ability to infer 3D structure at increasing resolution has been enabled by recently devised chromosome conformation capture techniques, notably Hi-C. Such assays, which utilize cross-linking, followed by restriction digestion and proximity ligation, enable identification of (pairwise) genomic loci that are spatially close via next generation sequencing. Subsequent binning yields a matrix of chromatin contact or interaction counts. Various algorithms have been advanced to operate on these contact matrices to produce reconstructed 3D configurations. Many of these are based on multidimensional scaling (MDS) following conversion of contact matrices to distance matrices. However, none of the proposed methods exploit, or actively impose, the fact that the target solution for an individual chromosome is a (smooth) one-dimensional (1D) curve in 3-space. This basic attribute of chromatin contiguity is either ignored or indirectly addressed by the imposition of constraints. Here we demonstrate the utility of principal curves in directly obtaining 1D solutions that best recapitulate the contact matrix. Our target 1D curve in 3D is a vector function with three coordinates, each indexed by 1D genomic distance. Since we seek coordinate functions that are smooth with respect to genomic distance, we represent each using a spline basis, parameterized such that the level of smoothness can be prescribed by a degrees of freedom (df) specification. This enables a principal curve solution to the metric scaling problem, namely (Frobenius norm) approximation of the contact matrix, using a readily obtained eigen-decomposition.
While the suite of solutions resulting from a range of df is informative with respect to differing scales of chromatin architecture, we also detail methods for selecting a single summary structure. Illustrative examples featuring chromosomes 20, 21 and 22 from IMR90 cells are showcased since the existence of orthogonal multiplex FISH imaging allows for external validation. Joint work with Trevor Hastie.
(Conference Room San Felipe)
12:30 - 12:40 Group Photo (Hotel Hacienda Los Laureles)
12:40 - 14:00 Lunch (Restaurant Hotel Hacienda Los Laureles)
15:00 - 15:30 Sunduz Keles: mHi-C to the rescue for leveraging multi-mapping reads of Hi-C datasets.
Current Hi-C analysis approaches are unable to account for reads that align to multiple locations, and hence underestimate biological signal from repetitive regions of genomes. We developed mHi-C, a multi-read mapping strategy to probabilistically allocate Hi-C multi-reads. On benchmarks, mHi-C exhibited superior performance over utilizing only uni-reads and over heuristic approaches aimed at rescuing multi-reads. Specifically, mHi-C increased the sequencing depth by an average of 20%, leading to higher reproducibility of contact matrices and a larger number of significant contacts across biological replicates. The impact of the multi-reads on the identification of novel significant contacts is influenced marginally by the relative contribution of multi-reads to the sequencing depth compared to uni-reads, the cis-to-trans ratio of contacts, and the broad data quality as reflected by the proportion of mappable reads of datasets. Computational experiments highlighted that in Hi-C studies with short read lengths, mHi-C rescued multi-reads can emulate the effect of longer reads. mHi-C also revealed biologically supported bona fide promoter-enhancer interactions and topologically associating domains involving repetitive genomic regions, thereby unlocking a previously masked portion of the genome for conformation capture studies. Joint work with Ye Zheng and Ferhat Ay.
(Conference Room San Felipe)
15:30 - 16:00 Arjun Baghela: Machine learning approaches to classify patients progressing to sepsis.
Sepsis is a syndrome that represents an abnormal immune response to infection. Despite advances in modern medicine, severe sepsis remains a major cause of mortality globally, with approximately 5 million deaths annually [1]. Currently, physicians rely on their interpretations of heterogeneous clinical symptoms, which often results in misdiagnosis. Biomarker discovery studies tend to be inconclusive, resulting in no effective prognostic tools [2]. In this ongoing global study of 1000 emergency room patients, we applied machine learning techniques to assess the predictive power of combining blood RNA-Sequencing (RNA-Seq) and Emergency Medical Record (EMR) data to identify patients progressing to severe sepsis. We applied the Support Vector Machine, AdaBoost, Random Forest, Gradient Tree Boosting, and naive Bayes classifiers, with and without feature selection, on RNA-Seq, EMR, and the combined data. The highest area under the receiver operating characteristic curve was 0.91, obtained using Random Forest on a subset of expression and clinical features. Feature importance was assessed by the mean decrease in Gini impurity over all trees for each feature. Key transcriptomic features included genes involved in Endotoxin Tolerance, a cellular reprogramming phenotype responsible for sepsis. Top predictive clinical features included white blood cell count, triage blood pressure, and acute kidney injury. Ultimately, supervised learning algorithms using combined features predicted different groups (endotypes) of ER patients that progress to sepsis, allowing clinicians to provide timely and specific medical interventions. References: [1] Angus D et al. (2001) Crit. Care Med. 29:1303-10. [2] Biron B et al. (2015) Biomarker Insights 10:7-17.
(Conference Room San Felipe)
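The abstract above ranks features by mean decrease in Gini impurity. As a rough illustration of the quantity being averaged, the following Python sketch computes the impurity decrease for a single split. The function names and the toy labels are hypothetical and are not taken from the study; a random forest would accumulate decreases like this over every split that uses a feature and average them across trees.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def impurity_decrease(parent, left, right):
    """Decrease in Gini impurity when `parent` is split into `left` and `right`.

    Child impurities are weighted by the fraction of samples in each child.
    """
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


if __name__ == "__main__":
    # Toy labels (hypothetical): 1 = progressed to sepsis, 0 = did not.
    parent = [0, 0, 1, 1]
    print(impurity_decrease(parent, [0, 0], [1, 1]))  # split separates classes
    print(impurity_decrease(parent, [0, 1], [0, 1]))  # split carries no signal
```

A feature whose splits consistently separate the classes accumulates large decreases, which is why it ranks highly under this importance measure.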
16:00 - 16:30 Benilton Carvalho: Brazilian initiative on precision medicine: statistical perspectives.
Significant efforts are being made in favor of the development of Genomic Medicine (GM). In this framework, clinicians use molecular profiles to guide the delivery of treatments to their patients. Such profiles are built using data generated from Next-Generation Sequencing and other high-throughput experiments, like DNA-Seq, RNA-Seq, ChIP-Seq, high-density microarrays and mass spectrometry. Building these profiles requires a series of complex statistical procedures that should be capable of handling terabytes of data in an efficient manner. In addition, it is essential that medical doctors have access to a comprehensive molecular profile of the reference population. This will allow the proper identification of molecular patterns that are not found in the healthy population and may become candidates for causal variants. I will discuss the strategies used by the Brazilian Initiative on Precision Medicine (BIPMed) to bring to the public information and methods on the genomic complexity of the Brazilian population. BIPMed is the first initiative of its kind in Latin America and has been recognized by the international community for its pioneering efforts on Precision Medicine. I will present the solutions proposed by BIPMed to provide and analyze genomic data in the medical context, including protocols for data distribution and statistical approaches implemented in our software, which is distributed freely by Bioconductor. I will present information on the molecular characterization of Brazilian subjects, discuss the importance of national initiatives of Genomic Medicine and present our latest findings in the field, showing the potential of associating Statistics and Medicine, resulting in significant changes in current medical practice.
(Conference Room San Felipe)
16:30 - 17:00 Coffee Break (Conference Room San Felipe)
17:30 - 18:00 Venkatraman Seshan: Copy number analysis of circulating cell-free DNA.
Cell-free plasma DNA (cfDNA) is defined as fragments of DNA present in the extracellular fluid. These DNA fragments are mostly derived from cells that underwent apoptosis. In a subject with cancer, cfDNA is derived from both normal and tumor cells and thus reflects somatic changes present in the tumor DNA. In this talk we will present a method to estimate tumor fraction from shallow whole genome sequencing and one to estimate allele-specific copy numbers from targeted deep sequencing. We will demonstrate the statistical issues and computational challenges faced with data from samples that have undergone these assays.
(Conference Room San Felipe)
18:30 - 20:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
Tuesday, November 6
07:30 - 09:00 Breakfast (Restaurant at your assigned hotel)
09:00 - 09:30 Kai Kammers: Novel and concordant eQTLs from analysis of iPSC-derived megakaryocytes and platelets in the Genetic Studies of Atherosclerosis Risk (GeneSTAR) project.
Authors: Kai Kammers [1], Margaret A Taub [2], Benjamin Rodriguez [4], Ingo Ruczinski [2], Lisa R Yanek [3], Andrew D Johnson [4], Nauder Faraday [3], Lewis C Becker [3], Rasika A Mathias [3] Affiliations: 1. Division of Biostatistics and Bioinformatics, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins, Johns Hopkins University School of Medicine, Baltimore, MD. 2. Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD. 3. The GeneSTAR Research Program, Johns Hopkins University School of Medicine, Baltimore, MD. 4. NHLBI Population Sciences Branch, The Framingham Heart Study, Framingham, MA. Abstract: GWAS studies have identified common variants associated with platelet aggregation, but because these variants are largely intronic/intergenic, their mechanistic link to platelet function is unclear. Additionally, extensive missing heritability may be resolved by integrating genetics and transcriptomics. To better understand the transcriptome signature and its genetic regulatory landscape in platelets and megakaryocytes (MKs), we performed expression-quantitative trait locus (eQTL) analyses of RNA sequencing (RNA-seq) data on both cell types in African American (AA) and European American (EA) subjects from the Genetic Studies of Atherosclerosis Risk (GeneSTAR) project. Using genotypes from the Illumina 1M GWAS array (1,003,451 SNPs), eQTL analyses were carried out stratified by ancestry and cell type, with a 1Mb window around each gene and adjusting for relevant covariates with the R package MatrixEQTL. Significance was defined as q-value < 0.05. Genes with median FPKM <= 1 were excluded, yielding ~10,000 genes in the MKs and ~3,000 in the platelets, 94% of which are also expressed in the MKs.
(Conference Room San Felipe)
09:30 - 10:00 Ana Beatriz Altamirano: PulmonDB: a gene expression lung diseases database.
A massive amount of transcriptomic data (microarrays and RNA-seq) has accumulated since the development of these technologies. Analyzing and integrating these data to study a complex disease can be an overwhelming task, principally because it requires integrating data from different technologies and platforms. Moreover, the lack of uniformity in experimental annotations in public databases such as Gene Expression Omnibus (GEO) adds to the challenge. By integrating transcriptomic datasets from different sources and their curated annotations, we developed an online web resource to facilitate the exploration of gene expression profiles of two respiratory diseases: Idiopathic Pulmonary Fibrosis (IPF) and Chronic Obstructive Pulmonary Disease (COPD). Our first aim was to build a database integrating existing transcriptomic data for the identification of differentially expressed genes that replicate in different experiments. This project sets the foundation to integrate transcriptomics data of other respiratory diseases and smoker phenotypes, facilitating the identification of common and divergent pathways that lead to a pathological state. In 2011, Engelen et al. developed COMMAND, a platform that allows the comparison and integration of transcriptomics data from different sources and platforms into a compendium. COMMAND has been successfully used to build transcriptome data compendia in bacteria (Engelen K. et al., 2011) and grapevine (Moretto M., et al., 2016). We used COMMAND to create a human lung database that allows us to integrate, analyze and explore gene expression data from different sources by making contrasts between controls and patients given clinical phenotypes (e.g., age, gender, status of the disease, FEV1, etc.) when available. We selected the relevant transcriptome experiments for IPF and COPD by querying GEO and ArrayExpress with selected key words.
Each experiment was downloaded and imported to COMMAND, the experimental conditions were annotated, the contrast group was selected, and a similar pipeline was used to normalize the data to create PulmonDB. PulmonDB is an exploratory web interface that contains gene expression data of IPF and COPD experiments; the platform will be expanded to include other abnormal lung phenotypes. This resource facilitates the exploration of gene expression profiles under different pathological conditions, and allows the identification of co-expression patterns. PulmonDB can help the scientific community to study which genes have a distinct expression profile related to a disease, explore the reproducibility across technologies and platforms, identify interesting co-expression patterns across diseases and find relationships among distinct clinical or experimental variables.
(Conference Room San Felipe)
10:00 - 10:30 Pei Wang: A new method to study the change of miRNA-mRNA interactions due to environmental exposures.
Integrative approaches characterizing the interactions among different types of biological molecules have been demonstrated to be useful for revealing informative biological mechanisms. One such example is the interaction between microRNA (miRNA) and messenger RNA (mRNA), whose deregulation may be sensitive to environmental insult leading to altered phenotypes. In this work, we introduce a new network approach—integrative Joint Random Forest (iJRF), which characterizes the regulatory system between miRNAs and mRNAs using a network model. iJRF is designed to work under the high-dimension low-sample-size regime, and can borrow information across different treatment conditions to achieve more accurate network inference. It also effectively takes into account prior information of miRNA–mRNA regulatory relationships from existing databases. We then apply iJRF to data from an animal experiment designed to investigate the effect of low-dose environmental chemical exposure on normal mammary gland development. We detected a few important miRNAs that regulated a large number of mRNAs in the control group but not in the exposed groups, suggesting the disruption of miRNA activity due to chemical exposure. Effects of chemical exposure on two affected miRNAs were further validated using breast cancer human cell lines.
(Conference Room San Felipe)
10:30 - 11:00 Coffee Break (Conference Room San Felipe)
11:00 - 11:30 Andrew McDavid: Combining immune cell repertoire sequencing and functional expression with applications to autoimmune disease.
The cells of the adaptive immune system have the capacity to respond to a lifetime of diverse pathogens, in part due to somatic recombination of genes comprising the T and B cell receptors ([TB]CR). Several protocols now exist for simultaneously resolving [TB]CR identity and scRNAseq expression of all other polyadenylated mRNAs. I describe a pipeline for inferring *clonal* subpopulations of cells, that is, cells that share a [TB]CR from a common ancestor. This pipeline can scale to tens of thousands of cells. A series of statistical models are proposed for testing for expansion of specific clones in covariate groups, as well as the omnibus propensity for clonal expansion. Lastly, I consider how [TB]CR identity can be combined with scRNAseq expression for unsupervised clustering using methods from multiview learning. I apply these methods to a data set of B cells isolated from the synovium of rheumatoid arthritis patients.
(Conference Room San Felipe)
11:30 - 12:00 Kasper Hansen: Co-expression patterns define epigenetic regulators associated with neurological dysfunction.
Coding variants in genes encoding for epigenetic regulators are an emerging cause of neurological dysfunction and cancer. However, a systematic effort to identify disease candidates within the human epigenetic machinery (EM) has not been performed, and it is unclear whether features exist that distinguish between variation-intolerant and variation-tolerant EM genes, and between EM genes associated with neurological dysfunction versus cancer. Here, we rigorously define a set of 295 human genes with a direct role in epigenetic regulation (writers, erasers, remodelers, readers). Systematic exploration of these genes reveals that while individual enzymatic functions are always mutually exclusive, readers often also exhibit enzymatic activity (dual function EM genes). We find that the majority of EM genes are very intolerant to loss-of-function variation, even when compared to the dosage sensitive group of transcription factors. Using this strategy, we identify 103 novel EM disease candidates. We show that the intolerance to loss-of-function variation is driven by the protein domains encoding the epigenetic function, strongly suggesting that disease is caused by a perturbed chromatin state. Unexpectedly, we also describe a large subset of EM genes that are co-expressed within multiple tissues. This subset is almost exclusively populated by extremely variation-intolerant EM genes, and shows enrichment for dual function EM genes. It is also highly enriched for genes associated with neurological dysfunction, even when accounting for dosage sensitivity, but not for cancer-associated EM genes. These findings prioritize novel disease candidate EM genes, and suggest that the co-expression itself may play a functional role in normal neurological homeostasis.
(Conference Room San Felipe)
12:00 - 12:30 Nuno Luis Barbosa-Morais: Biologist-intelligible alternative splicing analysis of RNA-seq data.
I will give an overview of our lab’s efforts on making the analysis of alternative splicing from RNA-seq datasets more intuitive and informative to biologists. I will introduce psichomics, a modular and extensible Bioconductor package with an intuitive Shiny-based graphical interface for alternative splicing quantification and downstream dimensionality reduction, differential splicing and gene expression and survival analyses based on The Cancer Genome Atlas, the Genotype-Tissue Expression, and the Sequence Read Archive (via recount2) projects, as well as user-provided data. These integrative analyses can incorporate clinical and molecular sample-associated features and be performed on a laptop. I will also discuss how Beta distributions can be exploited in modeling exon inclusion levels, incorporating information about the coverage-associated precision of their estimates by using the numbers of reads supporting exon inclusion and exclusion as surrogates of the distribution’s shape parameters. Beta distributions provide a sensible framework for differential splicing analysis of small sample size datasets.
(Conference Room San Felipe)
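The Beta-distribution idea in the abstract above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the psichomics implementation: the function name `beta_summary` is hypothetical, and the read counts are used directly as the two shape parameters, so both must be positive (a pseudocount would be one way to handle zero counts). The sketch shows how the same inclusion level is estimated with different precision depending on coverage.

```python
def beta_summary(inclusion_reads, exclusion_reads):
    """Mean and variance of Beta(a, b) with a = inclusion reads, b = exclusion reads.

    Using raw read counts as shape parameters is an assumption made here
    for illustration.
    """
    a = float(inclusion_reads)
    b = float(exclusion_reads)
    if a <= 0 or b <= 0:
        raise ValueError("both read counts must be positive")
    total = a + b
    mean = a / total  # point estimate of the exon inclusion level
    variance = (a * b) / (total ** 2 * (total + 1.0))  # shrinks with coverage
    return mean, variance


if __name__ == "__main__":
    # Same inclusion level (0.8) at low and high coverage (hypothetical counts).
    print(beta_summary(8, 2))
    print(beta_summary(80, 20))
```

Both calls give the same mean, but the higher-coverage event has a much smaller variance, which is the coverage-associated precision the abstract refers to.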
12:30 - 14:00 Lunch (Restaurant Hotel Hacienda Los Laureles)
15:00 - 15:30 Paul Scheet: Genomic profiling of normal, premalignant and heterogeneous tissues in cancer patients.
Normal tissues adjacent to tumor and premalignant lesions present an opportunity for in vivo human models of early disease pathology. Genomic studies of such “at risk” tissues may identify molecular pathways involved in a transition to malignant phenotypes and/or targets for personalized prevention or precision medicine. Yet, challenges to this objective include: 1) the small size of lesions or limited available tissue, often presenting “either or” choices for molecular technologies (e.g. DNA or RNA, NGS or arrays); and 2) low mutant cell fractions due to heterogeneous tissues and their corresponding early stages of disease. To confront these, we have 1) conducted targeted next-generation sequencing of DNA and RNA and, when possible, genome-wide DNA SNP arrays, 2) employed a suite of off-the-shelf single-nucleotide variant calling algorithms and considered various ensemble strategies for determining mutation authenticity, and 3) developed sensitive haplotype-based techniques (hapLOH) to determine megabase-scale regions of allelic imbalance that reflect chromosomal alterations such as deletions, duplications and copy-neutral loss-of-heterozygosity. We have applied combinations of these strategies to normal appearing epithelial airway samples adjacent to non-small cell lung cancers, premalignancies of the lung, colon and skin, and to public data from paired normal and tumor samples from 11,000 patients of The Cancer Genome Atlas (TCGA). In mutational analyses of tissues annotated by alterations discovered in paired tumors, we identify key molecular drivers, document two-hit models of tumorigenesis, highlight immune-related expression phenotypes in premalignant lesions and, where possible, construct phylogenetic trees of intra-patient samples. We also give examples of systematic errors in copy number changes in TCGA that can be corrected by hapLOH.
(Conference Room San Felipe)
15:30 - 16:00 Shrabanti Chowdhury: Proteogenomic analysis of carboplatin response in ovarian cancer cell lines and PDX models. (Conference Room San Felipe)
16:00 - 16:30 Kelly Street: Statistical methods and software for the study of stem cell differentiation using single-cell transcriptome sequencing.
Single-cell transcriptome sequencing (scRNA-Seq), which combines high-throughput single-cell extraction and sequencing capabilities, enables the transcriptomes of large numbers of individual cells to be assayed efficiently. Profiling of gene expression at the single-cell level for a large sample of cells is crucial for addressing many biologically relevant questions, such as the investigation of rare cell types or primary cells (e.g., stem cell differentiation) and the examination of subpopulations of cells from a larger heterogeneous population (e.g., classifying cells in brain tissues). I will discuss some of the statistical and computational issues that have arisen in the context of a collaboration with the UC Berkeley Ngai Lab concerning the analysis of olfactory stem cell fate trajectories in mice. These issues, ranging from so-called low-level to high-level analysis, include: experimental design, exploratory data analysis (EDA) of scRNA-Seq reads, quality assessment/control (QA/QC), normalization to account for nuisance technical effects, cluster analysis to identify novel cell types, cell lineage and pseudotime inference, and differential expression analysis to identify genes involved in the differentiation process. Our statistical methods are implemented in open-source R packages released through the Bioconductor Project.
(Conference Room San Felipe)
16:30 - 17:00 Coffee Break (Conference Room San Felipe)
17:00 - 17:30 Sohrab Shah: Single cell whole genome sequencing for population genetic inference of cancer dynamics. (Conference Room San Felipe)
17:30 - 18:00 Matt Ritchie: Design and analysis of a single cell RNA-seq benchmarking dataset to compare protocols and methods.
Authors: Matthew Ritchie [1], Luyi Tian [1], Xueyi Dong [1], Saskia Freytag [1], Shian Su [1], Daniela Amann-Zalcenstein [1], Tom Weber [1], Azadeh Seidi [2], Kim-Anh Lê Cao [3], Shalin Naik [1]
Affiliations: 1. Walter and Eliza Hall Institute of Medical Research, Parkville, Australia. 2. Australian Genome Research Facility, Parkville, Australia. 3. Melbourne Integrative Genomics, The University of Melbourne, Parkville, Australia.
Abstract: Single cell RNA sequencing (scRNA-seq) technology has undergone rapid development in recent years and has brought with it new challenges in data processing and analysis. This has led to an explosion of tailored analysis methods for scRNA-seq to address various biological questions, from cell type identification, to marker gene discovery and trajectory analysis. The current lack of gold-standard benchmarking datasets makes it difficult for researchers to evaluate the performance of the many different methods available in a systematic manner. To address this problem, we designed and generated a cross-platform benchmark dataset that has in-built truth in various forms as well as varying levels of biological noise. We use this dataset to compare different protocols, examine popular assumptions made in scRNA-seq analyses and compare methods for tasks ranging from normalization to trajectory analysis. We found significant differences in the results from the methods compared and have identified a few that performed well across protocols in high and low variability scenarios. Our dataset and analysis provide a valuable resource for understanding the nature of scRNA-seq data and can be used to guide algorithm selection in different biological settings.
(Conference Room San Felipe)
18:00 - 18:30 Gerald Quon: Characterizing cell type-specific responses to stimuli using single cell RNA sequencing.
Single cell RNA sequencing (scRNA-seq) technologies are quickly advancing our ability to characterize the transcriptional heterogeneity of biological samples, given their ability to identify novel cell types and characterize precise transcriptional changes during previously difficult-to-observe processes such as differentiation and cellular reprogramming. An emerging challenge in scRNA-seq analysis is the characterization of cell type-specific transcriptional responses to stimuli, when similar collections of cells are assayed under two or more conditions, such as in control/treatment or cross-organism studies. In this talk, we will present a novel computational strategy for identifying cell type-specific responses using deep neural networks to perform unsupervised domain adaptation. Compared to other existing approaches, ours does not require identification of all cell types before alignment, and can align more than two conditions simultaneously. We will discuss ongoing applications of our model to two problem domains: characterizing hematopoietic progenitor populations and their response to inflammatory challenges (LPS), in which we have identified putative subpopulations of long-term HSCs that differentially respond to the challenge, and characterizing the malaria cell cycle process, in which we identify transcriptional changes associated with sexual commitment.
(Conference Room San Felipe)
18:30 - 20:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
18:30 - 19:00 Maribel Hernández Rosales (Hotel Hacienda Los Laureles)
Wednesday, November 7
07:30 - 09:00 Breakfast (Restaurant at your assigned hotel)
09:00 - 09:30 Patrick Kimes: Reproducible and replicable comparisons of methods controlling false discoveries in computational biology.
With the advancement of high-throughput technologies, data and computing have become key components of scientific discovery in biology. New computational methods to analyze genomic data are constantly being developed, with several methods often addressing the same biological question. As a result, researchers are now faced with the challenge of deciding between a plethora of tools, each leading to slightly different answers. For several common analyses in computational biology, benchmark comparisons have been published to help users pick an appropriate tool from a subset of alternatives. Despite the popularity of these comparisons, the implementation is often ad hoc, with little consistency across studies. To address this problem, we developed SummarizedBenchmark, an R package and framework for organizing and structuring benchmark comparisons. SummarizedBenchmark defines a general grammar for benchmarking and allows for easier setup and execution of benchmark comparisons, while improving the reproducibility and replicability of such comparisons. Using this framework, we perform a systematic benchmark of several recently developed false discovery rate (FDR)-controlling methods for multiple testing correction. These modern methods have the potential to improve power in biological studies by leveraging additional pieces of information available in the data ("informative covariates") to prioritize, weight, and group hypotheses. We investigate the advantages and limitations of these methods against classical FDR-controlling methods across six biological case studies and various simulation settings. We provide a summary of our findings as a practical guide to aid users in the choice of methods to correct for false discoveries in future studies.
(Conference Room San Felipe)
09:30 - 10:00 Gabriela Cohen-Freue: Regularized instrumental variables estimators for disease classification.
Abstract: We have developed a novel approach under the framework of regularized instrumental variables estimators to build classifiers of a disease state. Instrumental variables estimators are analogous to classical regression estimators, but they borrow strength from supplemental variables (the instruments) to address problems in the model, such as measurement errors and confounding factors, that are commonly encountered in genomics studies. Genetic data can be used as instruments in genomic biomarker discovery studies to build tailored classifiers of a disease. Our approach can reduce the number of false positive discoveries and boost the identification of genomic biomarkers by exploiting and modeling the plausible biological mechanisms that relate genetics and gene expression information with a disease state.
Authors: Joe Watson, David Kepplinger, and Gabriela Cohen Freue
(Conference Room San Felipe)
10:00 - 10:30 Svetlana Lyalina: Evaluating the comparability of clinical trials and real world omics data.
Somatic mutation data generated in the clinical setting pose unique challenges related to the targeted sequencing approach used. As part of biomarker discovery, we frequently face the challenging task of maximizing the available sample size while dealing with dataset-specific effects. The goal of the work presented here is to assess the comparability of somatic mutation data from clinical trial patients and real world patients. Having adjusted for disparities in the data and accounted for the particulars of the sequencing modality, we investigate prognostic biomarkers that are informative across the wider group. With the increased numbers afforded by our synthetic control, we can also more robustly evaluate the effectiveness of multiple investigational drugs.
(Conference Room San Felipe)
10:30 - 11:00 Coffee Break (Conference Room San Felipe)
11:00 - 11:30 Qi Long: Bayesian generalized bi-clustering for integrative analysis of multi-omics data.
Biclustering is a popular method for the analysis of gene expression data. While existing biclustering methods have many desirable features, most of them are developed for continuous data and cannot efficiently handle omics data of different types, discrete and continuous, when used in the analysis of multi-omics data. In addition, none of the existing biclustering methods can utilize biological information such as that from functional genomics. Recent work has shown that incorporating such biological information can improve variable selection and prediction performance in linear regression and multivariate analysis. In this work, we propose a Bayesian generalized biclustering method that can handle both continuous and discrete data jointly. It uses a Bayesian adaptive structured shrinkage prior that enables feature selection guided by biological information. An efficient EM algorithm is developed for estimation. The proposed method is shown to outperform several existing biclustering methods in simulation studies and in analyses of several omics datasets.
(Conference Room San Felipe)
11:30 - 12:00 Annette Molinaro: Development of an enhanced deconvolution algorithm for methylation studies.
Immunomethylomics is a promising young field that efficiently differentiates immune cell types from a small quantity of (archived or fresh) tissue or blood. To quantify immune cell types, methylation arrays are employed and the data are pre-processed. Subsequently, an established deconvolution algorithm is used to estimate immune cell quantities per sample. In our recent transition from 450k to 850k methylation arrays, we have noted multiple issues with the current deconvolution algorithms in the estimation of quantities. There is an inefficiency in the selection of CpGs for the immune cell type library, which reduces the robustness of the resulting library, as well as an inability to distinguish multiple cell types with shared lineage, e.g., mMDSCs, gMDSCs, and granulocytes. Here we present our preliminary work on developing an enhanced deconvolution algorithm incorporating more advanced computational methods for precise estimation of the immune cell quantities.
(Conference Room San Felipe)
12:00 - 12:30 Garrett Weaver: Hierarchical regularized regression: incorporating external information in high-dimensional prediction models.
Author: Garrett M. Weaver, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, California, United States of America.
Abstract: Growing repositories of data that describe the structure and function of the genome may contain annotations that are relevant to the effects of genomic features on clinical outcomes. We propose a novel extension of regularized regression that enables the inclusion of such annotations and has the potential to improve the prediction of health-related outcomes in high-dimensional regression models that utilize genomic data. A sparsity-inducing penalty on the external information allows our model to identify relevant annotations for the prediction task at hand. Through simulation, we show that when the external data is informative, our model has improved predictive ability compared to standard approaches that do not include the external information. We also show that the additional penalty on the external data ensures there is little to no reduction in prediction performance when the external data is non-informative. Our model is applied to two public data sets: the first uses gene expression data to predict survival in breast cancer patients, while the second uses methylation data to predict chronological age. Our method is available as an R package (https://github.com/USCbiostats/hierr) that utilizes coordinate descent to efficiently fit our model, with the ability to apply some of the most commonly used penalties to both genomic features and external annotations.
(Conference Room San Felipe)
12:30 - 13:30 Lunch (Restaurant Hotel Hacienda Los Laureles)
13:30 - 18:30 Free Afternoon (Oaxaca)
18:30 - 20:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
Thursday, November 8
07:30 - 09:00 Breakfast (Restaurant at your assigned hotel)
09:00 - 09:30 Aki Nishimura: Bayesian sparse regression and variable selection for large data with weak signals.
Sparse regression and variable selection play an essential role in finding an interpretable structure in the presence of a large number of potential predictors. Despite the growth over the years in the amount of available data, statistical power remains an issue for many questions of scientific interest, either because of weak signals or because of a focus on fine-scale structures. The Bayesian framework is attractive for its favorable finite sample properties and its ability to incorporate scientifically motivated structures into statistical models. Computational challenges, however, have been major obstacles to the application of Bayesian methods to large-scale data. For example, the previously available computational methods for Bayesian sparse regression could only handle a small sample size, with some techniques strictly limited in scope to continuous outcomes. In this talk, I will discuss recent progress in scalable computational techniques for Bayesian sparse regression in the “large n and large p” setting for binary and survival time outcomes. When applied to a large-scale observational study with n = 72,489 and p = 22,175, our algorithm cuts down the computing time by an order of magnitude compared to the existing approach. My knowledge of genomics is limited, but I believe our proposed techniques are applicable to problems in statistical genomics; I look forward to getting feedback and learning more about genomics.
(Conference Room San Felipe)
09:30 - 10:00 Maribel Hernández Rosales: Mutational dynamics in the mouse mitochondrial genome.
A cell contains from hundreds to thousands of mitochondria. Mitochondrial mutant genomes can coexist with wild-type genomes. Mutations in the mitochondrial genome have been associated with aging and with several diseases, such as Alzheimer’s disease, Parkinson’s disease, some forms of cancer, infertility, and neuromuscular disorders. In this work, we address the following questions: What is the mutation load in the mitochondrial genome? Does the mutation load change in the mouse brain at different stages of life? Does the frequency of individual mutations change at different stages of life? How are mutations distributed in the mitochondrial genome? I will show preliminary results of this study in the mouse mitochondrial genome that will give us insights into the mutational dynamics in the human mitochondrial genome.
(Conference Room San Felipe)
10:00 - 10:30 Maria Chikina: PLIER: Pathway Level Information ExtractoR
Genome-scale molecular datasets are often highly structured, with many correlated observations. This general phenomenon can be related to the underlying data generating process. In gene expression assays, groups of genes are co-regulated through shared transcription factors and signaling pathways. In the first half of the talk we will present a new constrained matrix decomposition approach that directly aligns a lower dimension representation with known biological pathways. Our method achieves excellent accuracy in reconstructing known upstream variables through a biologically interpretable decomposition. The inferred pathway activity estimates can be used in any downstream analysis, and we will show how our approach can be used to gain new insights from several existing datasets.
(Conference Room San Felipe)
10:30 - 11:00 Coffee Break (Conference Room San Felipe)
11:00 - 11:30 Katherine Hoadley: Integrative genomic analyses of TCGA pan-cancer data.
The Cancer Genome Atlas has culminated over a decade of work characterizing more than 11,000 tumors from 33 different tumor types in a large-scale, multidimensional data analysis called the PanCancer Atlas. This multi-institutional project resulted in 26 papers under three main themes: Cell of Origin, Oncogenic Processes, and Signaling Pathways. In the Cell of Origin marker paper, we explored the molecular classification of samples by chromosome arm level aneuploidy, DNA methylation, mRNA, miRNA, and protein data. Integrative clustering approaches identified groups of samples with shared molecular characteristics and identified diversity within tumor types and similarities that cross tumor types. This work provided support for additional analyses looking at pan-organ systems including pan-gynecological, pan-gastrointestinal, pan-kidney, pan-squamous, and stemness.
(Conference Room San Felipe)
11:30 - 12:00 Jean Yang: Multi-omics integration for identifying prognostic biomarkers in complex disease.
Recent studies in cancer and other complex diseases continue to highlight the extensive genetic diversity between and within cohorts. This intrinsic heterogeneity poses one of the central challenges to predicting patient clinical outcome and the personalization of treatments. Here, we will first discuss the concept of ‘classifiability’ observed in multi-omics studies, where individual patients’ samples may be considered as either ‘hard’ or ‘easy’ to classify by different platforms, reflected in moderate error rates with large ranges. We demonstrate in a cohort of 45 stage III melanoma patients that clinico-pathologic biomarkers can identify those patients that are most likely to be misclassified by a molecular biomarker. We propose a novel multi-step procedure to incorporate this information and were able to improve classification accuracy overall as well as identify the specific clinical attributes that had made classification problematic in each cohort. Finally, we address an essential step towards utilizing these new biomarkers for therapeutic purposes by developing a novel standardization method which tackles this prospective experimental design problem.
(Conference Room San Felipe)
12:00 - 12:30 Ellis Patrick: Feature selection using differential correlation across ranked samples.
Genes act as a system and not in isolation. Thus, it is important to consider coordinated changes of gene expression rather than single genes when investigating biological phenomena. We have developed an approach, called Differential Correlation Across Ranked Samples (DCARS), for quantifying how the association between pairs of genes changes across an outcome of interest. Modelling gene correlation across a continuous sample ranking does not require the dichotomisation of samples into distinct classes and can hence identify differences in gene correlation across the full range of an outcome as opposed to just its extremities. We have recently demonstrated the utility of DCARS when assessing differential correlation across survival ranking in various cancers and have extended its use to single-cell RNA-sequencing data. As examples, when examining prognosis in melanoma and other cancers, DCARS consistently finds communities of genes that are enriched for known cancer-related genes, as well as further associations with somatic mutations in the genes belonging to the communities. When applied to single-cell RNA-sequencing of hepatoblasts, we observe clear evidence of a high level of network coordination to force cells down alternate differentiation paths. In these contexts, when DCARS is used in conjunction with network analysis and visualisation techniques, it becomes a powerful tool for extracting biological meaning from multi-layered and complex data.
(Conference Room San Felipe)
12:30 - 14:00 Lunch (Restaurant Hotel Hacienda Los Laureles)
15:00 - 15:30 Rob Scharpf: Integrated genomic analyses of ovarian cancer cell lines to predict drug sensitivity.
To improve our understanding of ovarian cancer, we performed genome-wide analyses of 45 ovarian cancer cell lines. Given the challenges of genomic analyses of tumors without matched normal samples, we developed approaches for detection of somatic sequence and structural changes and integrated these with epigenetic and expression alterations. Alterations not previously implicated in ovarian cancer included amplification or overexpression of ASXL1 and H3F3B, deletion or underexpression of CDC73 and TGF beta receptor pathway members, and rearrangements of YAP1-MAML2 and IKZF2-ERBB4. Dose-response analyses to targeted therapies revealed molecular dependencies, including increased sensitivity of tumors with PIK3CA and PPP2R1A alterations to PI3K inhibitor GNE-493, MYC amplifications to PARP inhibitor BMN673, and SMAD3/4 alterations to MEK inhibitor MEK162. Genome-wide rearrangements provided an improved measure of sensitivity to PARP inhibition. This study provides a comprehensive and broadly accessible resource of molecular information for development of new therapeutic avenues in ovarian cancer.
(Conference Room San Felipe)
15:30 - 16:00 Sara Mostafavi: Combining heterogeneous genomics data to understand complex human traits.
The recent availability of large-scale multi-omics datasets from human cohort studies presents new opportunities for deriving molecular mechanisms for complex disease. However, a central challenge in using these genomics data to understand complex traits is to disentangle causal, and hence reproducible and meaningful, associations from merely correlated ones. In this talk, I’ll describe ongoing projects that develop new statistical and computational methods for integrative analysis of multi-omics data, with the ultimate goal of providing insights into complex human diseases including major depression.
(Conference Room San Felipe)
16:00 - 16:30 Coffee Break (Conference Room San Felipe)
16:30 - 17:00 Kimberly Siegmund: Statistical approach for investigating change in mutational process during cancer growth and development.
Human cancer somatic mutations arise from a variety of biological processes. Different processes produce different patterns of somatic mutations called mutation signatures. Tumor growth, just like phylogeny and human development, requires genome replication, which generates intratumor heterogeneity from replication errors. Early somatic mutations accumulated between the zygote and the first initiating tumor cell should appear in all descendant cells, and those that appear later in growth, in progressively smaller subsets. These are called trunk and branch mutations, respectively. By multi-regional tumor sampling, we can distinguish trunk from branch mutations and ask whether the mutational signatures in the first tumor cell differ from the signatures of tumor growth. Presently, investigators use latent mixture models to infer the mutation signatures and the relative frequency each signature contributes to the overall tumor catalog. We develop a hierarchical mixed-membership model for testing whether the contributions of the signatures differ before and after tumor initiation. Our methods are applied to mutations identified using whole exome sequencing from a set of colon tumors.
(Conference Room San Felipe)
17:00 - 17:30 Richard Bonneau: Multi-task regulatory network inference applied to multi-study, multi-species and single cell genomic experimental designs. (Conference Room San Felipe)
17:30 - 19:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
Friday, November 9
07:30 - 09:00 Breakfast (Restaurant at your assigned hotel)
09:00 - 09:30 Ingo Ruczinski: Detection of de novo copy number deletions from targeted sequencing of trios
De novo copy number deletions have been implicated in many diseases, but no formal methods existed to identify de novo deletions in parent-offspring trios from capture-based sequencing platforms. We developed Minimum Distance for Targeted Sequencing (MDTS) to fill this void. MDTS has similar sensitivity (recall), but a much lower false positive rate compared to less specific CNV callers, resulting in a much higher positive predictive value (precision). MDTS also exhibited much better scalability. We applied our method to 1,305 case-parent trios with targeted sequencing data of regions previously implicated in oral cleft. Across the 6.3 Mb of capture, we detected one de novo deletion in the gene TRAF3IP3, in addition to one rare inherited deletion and two copy number polymorphic regions.
(Conference Room San Felipe)
09:30 - 10:00 Benilton Carvalho: The Brazilian Approach (Conference Room San Felipe)
10:00 - 10:30 Jean Yang: Exploring single cell data (Conference Room San Felipe)
10:30 - 11:00 Coffee Break (Conference Room San Felipe)
11:00 - 12:00 Discussion (Conference Room San Felipe)
12:00 - 14:00 Lunch (Restaurant Hotel Hacienda Los Laureles)