Feature specific quantile normalization enables cross-platform classification of molecular subtypes using gene expression data

JM Franks, G Cai, ML Whitfield - Bioinformatics, 2018 - academic.oup.com
Motivation
Molecular subtypes of cancers and autoimmune disease, defined by transcriptomic profiling, have provided insight into disease pathogenesis, molecular heterogeneity and therapeutic responses. However, technical biases inherent to different gene expression profiling platforms complicate the analysis of data generated by different studies, and effective methods designed to eliminate platform-based bias are currently lacking. We present a method to normalize RNA-seq data so that it can be classified by machine learning classifiers trained on DNA microarray data, and we apply it to molecular subtypes in two datasets: breast invasive carcinoma (BRCA) and colorectal cancer (CRC).
Results
Multiple analyses show that feature specific quantile normalization (FSQN) successfully removes platform-based bias from RNA-seq data, regardless of feature scaling or machine learning algorithm. We achieve up to 98% accuracy for BRCA data and 97% accuracy for CRC data in assigning molecular subtypes to RNA-seq data normalized using FSQN, with a support vector machine trained exclusively on DNA microarray data. Maximum accuracy was achieved when the RNA-seq dataset being normalized contained at least 25 samples. FSQN allows RNA-seq data to be compared with existing DNA microarray datasets. Using these techniques, information from existing gene expression data can be leveraged in new analyses even when different platforms were used for gene expression profiling.
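The abstract does not spell out the procedure. The following is a minimal Python sketch of one plausible reading of feature specific quantile normalization: each gene (column) of the RNA-seq matrix is mapped, by rank, onto the quantiles of the same gene in the microarray matrix. The function name `fsqn_sketch`, the samples-by-genes layout, the plotting positions `(i + 0.5) / n` and the tie handling are all assumptions made for illustration; this is not the authors' R implementation.

```python
import numpy as np


def fsqn_sketch(test, target):
    """Per-feature quantile normalization (illustrative sketch only).

    test   : (n_test_samples, n_genes) array, e.g. RNA-seq expression
    target : (n_target_samples, n_genes) array, e.g. microarray expression
    Returns an array shaped like `test` in which each gene's values follow
    the empirical distribution of that gene in `target`.
    """
    test = np.asarray(test, dtype=float)
    target = np.asarray(target, dtype=float)
    if test.ndim != 2 or target.ndim != 2:
        raise ValueError("both inputs must be 2-D (samples x genes)")
    if test.shape[1] != target.shape[1]:
        raise ValueError("inputs must have the same genes in the same order")

    n = test.shape[0]
    # Assumed plotting positions: one probability per test sample rank.
    probs = (np.arange(n) + 0.5) / n
    out = np.empty_like(test)

    for j in range(test.shape[1]):
        # Rank of each test sample within this gene (0 = smallest).
        # Ties are broken by sample order rather than averaged.
        order = np.argsort(test[:, j], kind="stable")
        ranks = np.empty(n, dtype=int)
        ranks[order] = np.arange(n)
        # Target quantiles for this gene, interpolated so that the two
        # datasets may contain different numbers of samples.
        target_q = np.quantile(target[:, j], probs)
        out[:, j] = target_q[ranks]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    microarray = rng.normal(loc=8.0, scale=1.5, size=(100, 4))   # synthetic
    rnaseq = rng.gamma(shape=2.0, scale=50.0, size=(30, 4))      # synthetic
    normalized = fsqn_sketch(rnaseq, microarray)
    print(normalized.shape)
```

Under this reading, the normalized RNA-seq matrix would then be passed to a classifier fitted only on the microarray matrix. Because the mapping is rank-based within each gene, the ordering of samples for a given gene is preserved while its scale is replaced by that of the target platform.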
Availability and implementation
FSQN has been submitted as an R package to CRAN. All code used for this study is available on GitHub (https://github.com/jenniferfranks/FSQN).
Supplementary information
Supplementary data are available at Bioinformatics online.
Oxford University Press