Author
Li, Haiqiong
Other Contributors
Zaki, Mohammed J., 1971-; Newberg, Lee; Bystroff, Christopher, 1960-; Stewart, Charles V.;
Date Issued
2013-12
Subject
Computer science
Degree
MS;
Terms of Use
This electronic version is a licensed copy owned by Rensselaer Polytechnic Institute, Troy, NY. Copyright of original work retained by author.;
Abstract
In this thesis, a parallel implementation of the LSH-ALL-PAIRS algorithm is proposed. The PARALLEL-LSH-ALL-PAIRS algorithm is based on the message passing interface (MPI) and runs on high performance servers, computing clusters, and supercomputers. The PARALLEL-LSH-ALL-PAIRS algorithm first uses a pool of processes to evaluate the hashing function on billions of genomic subsequences in parallel. Then, d-mers with the same hash value are grouped together and redistributed among all the processes using MPI communication. Finally, each process performs pair-wise comparisons of the assigned subsequences and outputs groups of similar pairs. Experiments show that the PARALLEL-LSH-ALL-PAIRS algorithm achieves good scalability with an increasing number of cores and increasing sizes of the input data on the RPI's IBM Blue Gene/Q supercomputer.; The LSH-ALL-PAIRS algorithm is an important method for comparison of genomic DNA sequences in order to find conserved genome features (i.e., subsequences of genomic sequence with d bases, called d-mers) across different species (e.g., Escherichia coli and Mycobacterium tuberculosis). The algorithm is based on a randomized search technique called locality-sensitive hashing (LSH), which is first applied to all the d-mers. After the hashing step, d-mers with the same hash value are grouped together into a class and pair-wise comparison is performed to find similar d-mers up to a certain number of mismatched bases. Instead of performing pair-wise comparison on potentially very large number of the input d-mers, LSH-ALL-PAIRS algorithm divides the d-mers into classes and performs pair-wise comparison on each individual group. However, since the computational complexity for pair-wise comparison within each class is still O(N2), where N is the number of d-mers in each class, the sequential LSH-ALL-PAIRS algorithm cannot process very long genomic sequence with billions of bases.;
Description
December 2013; School of Science
Department
Dept. of Computer Science;
Publisher
Rensselaer Polytechnic Institute, Troy, NY
Relationships
Rensselaer Theses and Dissertations Online Collection;
Access
Restricted to current Rensselaer faculty, staff and students. Access inquiries may be directed to the Rensselaer Libraries.;