Poster Presentation 44th Lorne Genome Conference 2023

Statistical modelling for single cell ATAC-seq data (#215)

Aaron Wing Cheung Kwok (1, 2), Heejung Shim (1), Davis J McCarthy (1, 2)
  1. Melbourne Integrative Genomics/School of Mathematics and Statistics, Faculty of Science, The University of Melbourne, Melbourne, VIC, Australia
  2. St. Vincent's Institute of Medical Research, Melbourne, VIC, Australia

Recently, the single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) has emerged as a promising technique for studying chromatin accessibility and gene regulation at single-cell resolution. As with virtually all modern 'omics data, performing appropriate statistical analysis on scATAC-seq data is a key challenge. One major hurdle is that chromatin accessibility is widely considered to be binary; hence, binarization of the count matrix is a common first step in many scATAC-seq pipelines, followed by Latent Semantic Indexing (LSI) for dimensionality reduction and subsequent downstream tasks. However, recent studies show that scATAC-seq counts are quantitative rather than qualitative, which violates the binary assumption made by most tools. A deep understanding of the statistical properties of scATAC-seq counts is also lacking. Here we show that, under the counting strategy recently proposed by Miao and Kim (2022), Paired-Insertion-Counting (PIC), chromatin accessibility counts display characteristics similar to those of single-cell RNA sequencing (scRNA-seq) counts. Our results also highlight key differences between modelling scATAC-seq counts and modelling scRNA-seq counts, and indicate that existing normalisation methods should be used with caution when downstream analysis involves statistical models that assume homoscedasticity. Lastly, we propose a hierarchical model that provides a principled approach to inferring the underlying chromatin accessibility states in scATAC-seq data and to identifying biologically heterogeneous features.
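
For readers unfamiliar with the conventional pipeline described above, the binarization and LSI steps can be sketched roughly as follows. This is a minimal illustration and not the authors' method: the function name `binarize_and_lsi`, the simulated Poisson counts, and the particular TF-IDF weighting are all assumptions made for the example, and published tools use several different TF-IDF variants.

```python
import numpy as np


def binarize_and_lsi(counts, n_components=2):
    """Hypothetical helper: binarize a cells x peaks count matrix, then run LSI.

    LSI here means TF-IDF weighting followed by a truncated SVD.
    The exact TF-IDF formula is an assumption; tools differ on this.
    """
    counts = np.asarray(counts)

    # Binarization: any nonzero count becomes 1, so count magnitude is discarded.
    binary = (counts > 0).astype(float)

    # Term frequency: scale each cell by its number of accessible peaks.
    cell_totals = binary.sum(axis=1, keepdims=True)
    tf = binary / np.maximum(cell_totals, 1.0)

    # Inverse document frequency: down-weight peaks that are open in many cells.
    n_cells = binary.shape[0]
    peak_freq = binary.sum(axis=0)
    idf = np.log1p(n_cells / np.maximum(peak_freq, 1.0))

    tfidf = tf * idf

    # Truncated SVD: keep the leading components as the low-dimensional embedding.
    u, s, _ = np.linalg.svd(tfidf, full_matrices=False)
    embedding = u[:, :n_components] * s[:n_components]
    return binary, embedding


# Simulated toy data (6 cells, 20 peaks); not real scATAC-seq counts.
rng = np.random.default_rng(0)
counts = rng.poisson(0.5, size=(6, 20))
binary, embedding = binarize_and_lsi(counts, n_components=2)
```

Because every nonzero entry is mapped to 1, a peak with a count of 1 and a peak with a count of 3 become indistinguishable after the first step, which is the loss of quantitative information that the abstract calls into question.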