Pan-genomic analysis of complex human diseases

PI: Sam Hokin
Co-PI: Alan Cleary

The overall goal of this project is to develop a pan-genomic analysis tool using Frequented Regions (FRs) and machine learning for the classification of disease morbidity in human genomes. To date, the primary tool for the genomic study of diseases has been the genome-wide association study (GWAS), which tests whether specific alleles, usually single-nucleotide polymorphisms (SNPs), segregate between affected and unaffected individuals and are therefore associated with the disease of interest. This method works well for identifying isolated variants associated with a condition, but it does not link those variants into combinations that may be even more strongly associated with the condition. In addition, GWAS tends to focus on SNPs and gives less attention to structural variants. It is also usually performed on variants called against the human reference genome, and is therefore biased toward that reference.
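As a rough illustration of the single-variant test that GWAS applies, the sketch below computes an allelic odds ratio and a Pearson chi-square statistic from a 2x2 table of allele counts. The function name and the counts are hypothetical and chosen only for illustration; real GWAS pipelines typically add covariate adjustment and multiple-testing correction, which are omitted here.

```python
def allele_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Return (odds_ratio, chi_square) for one variant's 2x2 allele-count table.

    Hypothetical helper for illustration. It assumes every cell is non-zero,
    otherwise the divisions below would fail.
    """
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    # Odds of carrying the alternate allele in cases relative to controls.
    odds_ratio = (a * d) / (b * c)
    # Pearson chi-square for a 2x2 table (1 degree of freedom, no continuity correction).
    chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return odds_ratio, chi_square


# Made-up counts: 30 of 100 case alleles and 15 of 100 control alleles are alternate.
odds_ratio, chi_square = allele_association(30, 70, 15, 85)
print(f"odds ratio = {odds_ratio:.3f}, chi-square = {chi_square:.3f}")
```

Each variant is scored on its own in this scheme, which is why combinations of variants are not captured.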

Given the prevalence of complex heritable diseases and the need to better understand their genomic origin in order to improve treatments, investigation of new analysis techniques is highly justified. The tool proposed here combines two new analysis concepts: pan-genomic graphs, which represent individuals’ genomes with paths through a graph of DNA sequence nodes; and Frequented Regions, a novel way of describing genomic variation within a pan-genomic graph. We combine these two concepts with the growing field of machine learning in order to produce a supervised classification algorithm for human diseases.
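To make these two concepts concrete, the following minimal sketch represents a toy pan-genomic graph as a dictionary of sequence nodes plus one path per individual, and scores a candidate region (a set of nodes) by how many paths traverse it. All data, names, and the `alpha` threshold are assumptions made for illustration, and the support rule is a deliberate simplification: the published FR definition includes further constraints that are not modeled here.

```python
from typing import Dict, List, Set

# Toy graph: node id -> DNA sequence segment (hypothetical data).
nodes: Dict[int, str] = {1: "ACGT", 2: "G", 3: "T", 4: "CCA", 5: "GGTA"}

# Each individual's genome is an ordered path through node ids.
paths: Dict[str, List[int]] = {
    "case_1": [1, 2, 4, 5],
    "case_2": [1, 2, 4, 5],
    "ctrl_1": [1, 3, 4, 5],
    "ctrl_2": [1, 3, 4],
}
# 1 = affected, 0 = unaffected (hypothetical labels).
labels: Dict[str, int] = {"case_1": 1, "case_2": 1, "ctrl_1": 0, "ctrl_2": 0}


def supports(path: List[int], cluster: Set[int], alpha: float) -> bool:
    """True if the path visits at least a fraction alpha of the cluster's nodes."""
    visited = cluster & set(path)
    return len(visited) >= alpha * len(cluster)


def support_count(cluster: Set[int], alpha: float) -> int:
    """Number of individuals whose path supports the cluster."""
    return sum(supports(p, cluster, alpha) for p in paths.values())


def feature_vector(sample: str, clusters: List[Set[int]], alpha: float) -> List[int]:
    """One 0/1 feature per candidate region, usable as classifier input."""
    return [int(supports(paths[sample], c, alpha)) for c in clusters]


clusters = [{2, 4}, {3, 4}]
features = {s: feature_vector(s, clusters, alpha=1.0) for s in paths}
for sample, vec in features.items():
    print(sample, labels[sample], vec)
```

In this toy data the region `{2, 4}` is supported only by the affected individuals and `{3, 4}` only by the unaffected ones, so the resulting 0/1 vectors are the kind of per-individual features a supervised classifier could be trained on.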

Our approach features several important improvements to genomic analysis of disease: (1) a pan-genomic approach avoids bias toward the human reference if the graph is constructed strictly from individuals’ DNA; (2) FRs are well-suited to the study of complex diseases, since they represent arbitrary genomic structures in the graph; (3) FRs are sensitive to any type of variation, since they are arbitrary clusters of DNA sequence; and (4) our approach is sensitive to the entire genome if the pan-genome is built from whole genome sequencing (WGS) reads.

To build this tool, we will employ a highly parallel GPU-based computational strategy to handle the vast amount of data in a pan-genome representing hundreds or thousands of individuals' DNA.

Although it carries a substantial risk of failure, our project, if successful, has the potential to greatly enhance human disease studies by providing a distinct and complementary method.

Publications

S. Hokin, A. Cleary and J. Mudge.
Disease association with frequented regions of genotype graphs
medRxiv, 2020
S. Hokin and A. Cleary.
Disease Classification with Pan-Genome Frequented Regions and Machine Learning
Gordon Research Conference, 2019
A. Cleary, T. Ramaraj, I. Kahanda, J. Mudge and B. Mumey.
Exploring frequented regions in pan-genomic graphs
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2018, DOI 10.1109/tcbb.2018.2864564
