Representing single-cell RNA-seq (scRNA-seq) data as low-dimensional latent factors is a common way to capture the underlying gene expression patterns in high-dimensional data. Variational autoencoders (VAEs) are emerging as a popular dimension reduction tool because their deep neural networks can capture nonlinearity while remaining computationally efficient.
In this project, we develop a VAE method that incorporates annotated gene sets to guide the inference of the latent factors. Databases such as Reactome and MSigDB record large numbers of annotated gene sets, but because these sets were compiled from previous studies, they may contain errors. The proposed method incorporates the annotated gene sets flexibly to account for such potential errors. We also aim to understand the factor-to-gene relationships learned by the method by interpreting its neural networks.
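One way to picture "flexible" incorporation of gene sets is a soft penalty on factor-to-gene weights. The sketch below is an illustration under our own assumptions, not the method described here: it builds a 0/1 membership mask from annotated gene sets and applies an L1 penalty only to weights that fall outside the annotation, so that a gene missing from a set can still receive a nonzero weight if the data support it. The names `build_mask`, `soft_mask_penalty`, and `lam`, as well as the toy genes and weights, are hypothetical.

```python
# Hypothetical sketch: a soft gene-set penalty on factor-to-gene weights.
# All names and numbers here are illustrative assumptions, not the authors' code.

def build_mask(gene_sets, genes):
    """Return a factors x genes 0/1 matrix: 1 if the gene is annotated to the set."""
    index = {g: j for j, g in enumerate(genes)}
    mask = [[0] * len(genes) for _ in gene_sets]
    for i, members in enumerate(gene_sets):
        for g in members:
            if g in index:  # skip annotated genes that are absent from the data
                mask[i][index[g]] = 1
    return mask


def soft_mask_penalty(weights, mask, lam=1.0):
    """L1 penalty on weights lying outside the annotated gene sets.

    Weights inside a set (mask == 1) are not penalized; weights outside
    are discouraged but not forced to zero, which leaves room to correct
    annotation errors.
    """
    return lam * sum(
        abs(w) * (1 - m)
        for w_row, m_row in zip(weights, mask)
        for w, m in zip(w_row, m_row)
    )


# Toy example: two gene sets (one per latent factor) over four measured genes.
genes = ["G1", "G2", "G3", "G4"]
gene_sets = [["G1", "G2"], ["G3", "G5"]]  # "G5" is not measured
mask = build_mask(gene_sets, genes)

# Factor 0 loads on G4 although G4 is not annotated to its set.
weights = [[0.9, 0.8, 0.0, 0.5], [0.0, 0.0, 0.7, 0.0]]
penalty = soft_mask_penalty(weights, mask, lam=2.0)
```

In a full model, a term like `penalty` would be added to the VAE training loss, with `lam` controlling how strongly the annotations are trusted; a hard mask corresponds to the limit of a very large `lam`.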
We tested our method on simulated gene expression data with known ground truth. Given annotated gene sets containing errors, the method accurately recovered the ground truth by rectifying those errors. We also applied the method to real scRNA-seq data, illustrating its potential advantages. Overall, we are developing an interpretable and computationally efficient nonlinear dimension reduction technique that incorporates annotated gene sets into scRNA-seq data analysis.