
Srinivas Aluru
Sequencing genomes is a path to new
discoveries in plant sciences, whether the findings are pieces of molecular
machinery or targets for trait enhancement.
Now new-generation DNA sequencing,
using massively parallel sequencing technologies, is reducing costs and
accelerating sequence data acquisition. But it comes hand in glove with a
new problem: how to assemble and decode all the information.
To sequence genomes, you first break
them down and then, like Humpty Dumpty, put the pieces back together again.
Normally the genome is randomly cut into a set of DNA fragments, each 700 to
1,000 base pairs long, and the fragments are sequenced. Next, long overlapping
sections are matched sequentially and the original genome is reassembled.
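The overlap-and-merge idea can be illustrated with a toy sketch. This is a simple greedy merger written for illustration only, not the method used by any production assembler; the function names, the `min_len` threshold, and the sample fragments are all assumptions made for the example.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments that share the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return frags  # nothing left to merge
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)


# Three made-up fragments that tile a 14-base toy sequence.
contigs = greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"])
```

With long, distinctive overlaps this greedy strategy recovers the toy sequence in one piece. It compares every pair of fragments on every pass, so it is only practical for small examples.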
New-generation DNA sequencing relies
upon simultaneously sequencing millions of short DNA pieces, each 36 to 75 base
pairs in length. The method is fast, but the short reads make reassembly difficult.
"The probability of overlaps
matching with high confidence is greatly diminished with short reads,"
explains Srinivas Aluru, the Ross Martin Mehl and Marylyne Munas Mehl Professor of Computer Engineering in the Department of
Electrical and Computer Engineering.
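A rough back-of-the-envelope calculation suggests why confidence drops. The sketch below assumes bases are independent and equally likely, which real genomes are not, so it understates the problem. The function names, the read count, and the overlap lengths of 20 and 40 bases are illustrative assumptions, not figures from Aluru's work.

```python
def chance_match_probability(overlap_len: int) -> float:
    """Probability that two unrelated sequences agree on overlap_len bases,
    assuming each base is uniformly random (a simplifying assumption)."""
    return 0.25 ** overlap_len


def expected_spurious_overlaps(num_reads: int, overlap_len: int) -> float:
    """Expected number of ordered read pairs whose ends match purely by chance."""
    ordered_pairs = num_reads * (num_reads - 1)
    return ordered_pairs * chance_match_probability(overlap_len)


# Hypothetical comparison: one million reads, with the overlap length
# limited by read length (short reads cannot offer long overlaps).
short_read_case = expected_spurious_overlaps(1_000_000, 20)
long_read_case = expected_spurious_overlaps(1_000_000, 40)
```

Under these assumptions, each extra base of overlap cuts the chance of a coincidental match by a factor of four, so the short overlaps that short reads allow are far less trustworthy than long ones.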
Aluru, with support from the
institute's Innovative Grants Program, is developing a software program that
assists genome reassembly and accommodates assemblies in which the ends of many
fragments resemble the start of another, a quandary hammered home for researchers
reassembling the maize genome, which proved to be 65 to 80 percent repetitive.
"One big problem is repetitive
sequence reads - genomic co-location," explains Aluru. "How do we
find where in the genome they come from when overlaps are not always from the
same genomic region?"
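The ambiguity Aluru describes shows up even in a toy example. In the sketch below, the made-up sequence `GENOME` contains the same seven-base stretch twice; the sequence and the helper `find_occurrences` are assumptions for illustration only.

```python
# A made-up 39-base sequence in which "GATTACA" appears twice.
GENOME = "ACGTTTGAC" + "GATTACA" + "CCGTAGCT" + "GATTACA" + "TTGCAAGC"


def find_occurrences(genome: str, read: str) -> list[int]:
    """Return every 0-based position where read occurs in genome."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)  # allow overlapping hits
    return positions


repeat_hits = find_occurrences(GENOME, "GATTACA")   # read taken from a repeat
unique_hits = find_occurrences(GENOME, "CCGTAGCT")  # read taken from unique sequence
```

The read drawn from the repeat fits two locations equally well, so an overlap with it says nothing about which copy it came from. The read from unique sequence has only one possible home.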
To appreciate the sheer number and
complexity of the reads the latest technology produces, one can imagine navigating
across Lake Superior from Sault Ste. Marie to Thunder Bay in a kayak, with only one
data set of thousands of short navigational reads to go by. If an exact
geographical location must be matched with each paddle stroke throughout the
entire length of the journey, then even one misplaced coordinate steers the
kayaker off course with no navigational information for reorientation.
Aluru's program, called YAGA (yet another genome assembler), works by coordinating
potentially thousands of high-performance supercomputer processors
simultaneously.
"This becomes a very large
optimization project," says Aluru. "Like building a house - if one
person does all the work it takes a long time, but there is no need to
coordinate planning. If 1,000 people are working, they must interface at the
correct time, doing equal amounts of work and completing their tasks at once."
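The "equal amounts of work" point can be sketched in miniature. The example below is not YAGA's scheme. It splits a list of reads into near-equal chunks, hands each chunk to a worker, and combines the partial results. Counting k-mers (substrings of length k) is only a stand-in task, Python threads stand in for supercomputer processors, and all function names are assumptions for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_evenly(items: list[str], num_workers: int) -> list[list[str]]:
    """Split items into num_workers chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), num_workers)
    chunks, start = [], 0
    for w in range(num_workers):
        size = base + (1 if w < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def count_kmers(reads: list[str], k: int) -> Counter:
    """Count every substring of length k across the given reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def parallel_kmer_counts(reads: list[str], k: int, num_workers: int) -> Counter:
    """Count k-mers chunk by chunk in worker threads, then merge the results."""
    chunks = split_evenly(reads, num_workers)
    total = Counter()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() yields one partial Counter per chunk, in chunk order.
        for partial in pool.map(count_kmers, chunks, [k] * num_workers):
            total.update(partial)
    return total
```

Balanced chunks matter because the combined result is ready only when the slowest worker finishes. The merge step is the point at which the workers must "interface", and on a real machine with thousands of processors that communication becomes a major cost.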
YAGA analyzes millions of
paired reads for vital clues and predicts the one logical path containing the
short reads that is most likely to be the genome sequence.
"We will be trying to apply the
YAGA program to various real biology projects, such as sequencing transcriptomes and genomes of other crop grasses,"
says Aluru. "In the process, we will discover what we need to do better,
helping further development."