The genomic revolution driven by Next Generation Sequencing is transforming the life sciences and their computational needs

Professor Simon Rasmussen, Center for Biological Sequence Analysis, DTU

Abstract:
Because DNA is a fundamental unit of life, the ability to study DNA sequences can benefit society by improving our capacity to develop novel medicines and bioindustries, as well as by deepening our understanding of our environment. Currently, Next Generation Sequencing (NGS) is driving a revolution in the life sciences, characterized by the ability to read the sequence of DNA molecules at unprecedented scale and producing vast amounts of data. A single Illumina HiSeq machine can produce up to 400 billion bases in a single 11-day run, the equivalent of more than 1 TB of data.
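The stated data volume can be checked with a back-of-envelope sketch. This assumes the run is written as FASTQ (one byte per base call and one per quality score); the overhead fraction for headers and separator lines is an assumption, since the exact figure depends on read lengths.

```python
# Back-of-envelope estimate of raw data size for one HiSeq run.
# Assumes FASTQ output: 1 byte per base + 1 byte per Phred quality score,
# plus header/'+'/newline overhead (the 10% figure is a rough guess).
bases = 400e9                 # ~400 billion bases per run
bytes_per_base = 2.0          # base call + quality score
overhead = 0.1                # headers, separator lines, newlines
size_tb = bases * bytes_per_base * (1 + overhead) / 1e12
print(f"{size_tb:.1f} TB")    # → 0.9 TB, i.e. on the order of 1 TB per run
```

Compression (e.g. gzipped FASTQ or BAM) reduces this substantially, but the raw text output is indeed around a terabyte.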

The ability to deal with these vast quantities of data has likewise transformed bioinformatics, spurring an explosion of software algorithms and creating a need for compute systems with extensive CPU, memory, I/O and storage capabilities. The analysis of NGS data is typically based either on alignment to a known genome sequence or on assembly of a new genome sequence without prior knowledge (de novo assembly). Both approaches require massive amounts of compute, and they are only the initial steps: often several iterations, or a combination of the two methods, are needed to achieve results. Alignment, based on the seed-and-extend approach, has been made tractable by a shift from hash-based algorithms to methods built on the Burrows-Wheeler transform and suffix arrays. For de novo assembly, traditional Overlap-Layout-Consensus strategies scale very poorly to these data volumes, so graph-based de Bruijn assemblers have been developed.
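The de Bruijn approach can be illustrated with a toy sketch: reads are chopped into k-mers, each k-mer becomes an edge between its prefix and suffix (k-1)-mers, and contigs are recovered by walking unbranched paths in the graph. The reads, k value, and function names below are illustrative only, not taken from any production assembler.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and each k-mer in a
    read adds an edge from its prefix (k-1)-mer to its suffix (k-1)-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unbranched(graph, start):
    """Extend a contig from `start`, following edges while the path is
    unambiguous; stops at a branch, a dead end, or a repeated node."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]          # the single outgoing neighbour
        if nxt in seen:
            break
        contig += nxt[-1]             # each step appends one new base
        seen.add(nxt)
        node = nxt
    return contig

# Three overlapping reads from the made-up sequence ATGGCGTGCAAT
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unbranched(graph, "ATG")
print(contig)  # → ATGGCGTGCAAT
```

Real assemblers add error correction, coverage filtering and Eulerian-path logic over graphs built from billions of k-mers, which is exactly where the extreme memory and compute requirements come from.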

Even so, state-of-the-art de novo assembly of a complex mammalian genome requires more than 500 CPU days and often terabytes of memory and storage. These requirements can be even higher for environmental samples such as soil. As data production is distributed across many centres and laboratories, it could be beneficial to reverse the traditional approach of downloading data to local hardware for analysis and instead provide shared computable storage that brings the data and the analysis tools together.

Revised
22 dec 2015