Poster Presentation 44th Lorne Genome Conference 2023

CaraVaN: Prioritising Cardiac Variants in the Non-coding genome using Gradient Boosting algorithm. (#233)

Gulrez Chahal 1 , Sonika Tyagi 2 , Mirana Ramialison 1
  1. Stem Cell Medicine, Murdoch Children’s Research Institute, Royal Children’s Hospital; The Novo Nordisk Foundation Center for Stem Cell Medicine (reNEW) Melbourne, Parkville, VIC 3052, Australia
  2. Department of Infectious Disease and Monash eResearch Centre, The Alfred Hospital and Monash University, Melbourne, VIC, Australia

Congenital heart disease (CHD) comprises structural and functional defects that arise in the heart during its development. It is the most common birth defect in babies. Candidate genes have been identified that help explain CHD; however, in many cases the genetic cause remains unknown. This can be attributed to the fact that protein-coding genes make up only ~2% of the genome, while the non-coding genome (~98%) comprises functional regions, such as cis-regulatory elements, that regulate gene expression. Recent evidence indicates that variation in regulatory regions such as enhancers and insulators affects gene expression and can result in CHD. However, there is not yet a method to investigate or predict non-coding variants in CHD.

 

Finding disease-causing variants in the non-coding genome is challenging because they do not follow a genetic code. In recent years, several machine learning-based tools have been developed to prioritise variants in the non-coding genome; however, they are not disease-specific. Given the unique complexity of heart development and its associated defects, we present CaraVaN, a cardiac-specific model that annotates and prioritises potential CHD-causal variants in the non-coding genome using a decision-tree-based ensemble boosting algorithm. The model learns from cardiac-specific human functional, epigenomic and structural-consequence feature categories, comprising cardiac-specific open chromatin histone marks (n=18), human cardiac transcription factor binding sites (n=98), cardiac-specific 3D chromatin organisation (n=4) and deleteriousness scores from existing non-coding variant assessment tools, to capture non-coding single nucleotide variants (SNVs) with potential pathogenicity in CHD. When prioritising cardiac-pathogenic SNVs, CaraVaN showed improved performance (ROC AUC=0.704) compared with state-of-the-art tools that are not tissue-specific (ROC AUC=0.612). We also validated the ability of CaraVaN to prioritise a non-coding variant on chromosome 12 with a known functional role in CHD. We scored more than 48 million variants on chromosome 12; this variant achieved a high score of 0.609 and fell within the top 8% of the score distribution. Furthermore, gene ontology (GO) analysis of the top-scoring non-coding variants revealed their association with 57 genes involved in human heart disease phenotypes, including atrial fibrillation, supraventricular arrhythmia, primary atrial arrhythmia and abnormality of cardiovascular system physiology. Overall, CaraVaN is the first tool that evaluates non-coding variants in CHD and other heart-related diseases.
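
The two evaluation quantities reported above, ROC AUC and the position of a variant within a score distribution, can be illustrated with a minimal Python sketch. This is not the CaraVaN pipeline: the function names `roc_auc` and `top_fraction` are hypothetical, and the labels and scores are invented toy values used only to show how such numbers could be computed.

```python
from bisect import bisect_left


def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation.

    labels: 1 for pathogenic, 0 for benign; tied scores share an average rank.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    rank_sum_pos = 0.0
    n_pos = 0
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            if pairs[k][1] == 1:
                rank_sum_pos += avg_rank
                n_pos += 1
        i = j
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def top_fraction(score, all_scores):
    """Fraction of variants whose score is at least as high as `score`."""
    ordered = sorted(all_scores)
    n_below = bisect_left(ordered, score)  # count of strictly lower scores
    return (len(ordered) - n_below) / len(ordered)


# Toy data (invented for illustration, not results from the abstract).
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.2]

print(roc_auc(toy_labels, toy_scores))   # 0.75
print(top_fraction(0.8, toy_scores))     # 0.25
```

Under this reading, a statement such as "within the top 8% of the score distribution" would correspond to `top_fraction` returning a value of at most 0.08 for the variant of interest.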