    All-pair comparison of billion-base genome sequences

    Author
    Li, Haiqiong
    View/Open
    170842_Li_rpi_0185N_10253.pdf (557.0Kb)
    Other Contributors
    Zaki, Mohammed J., 1971-; Newberg, Lee; Bystroff, Christopher, 1960-; Stewart, Charles V.
    Date Issued
    2013-12
    Subject
    Computer science
    Degree
    MS
    Terms of Use
    This electronic version is a licensed copy owned by Rensselaer Polytechnic Institute, Troy, NY. Copyright of original work retained by author.
    URI
    https://hdl.handle.net/20.500.13015/1028
    Abstract
    The LSH-ALL-PAIRS algorithm is an important method for comparing genomic DNA sequences in order to find conserved genome features (subsequences of a genomic sequence with d bases, called d-mers) across different species (e.g., Escherichia coli and Mycobacterium tuberculosis). The algorithm is based on a randomized search technique called locality-sensitive hashing (LSH), which is first applied to all the d-mers. After the hashing step, d-mers with the same hash value are grouped together into a class, and pair-wise comparison is performed within each class to find d-mers that are similar up to a certain number of mismatched bases. Instead of performing pair-wise comparison on a potentially very large number of input d-mers, the LSH-ALL-PAIRS algorithm divides the d-mers into classes and performs pair-wise comparison on each individual class. However, since the computational complexity of pair-wise comparison within each class is still O(N^2), where N is the number of d-mers in the class, the sequential LSH-ALL-PAIRS algorithm cannot process very long genomic sequences with billions of bases.

    In this thesis, a parallel implementation of the LSH-ALL-PAIRS algorithm is proposed. The PARALLEL-LSH-ALL-PAIRS algorithm is based on the Message Passing Interface (MPI) and runs on high-performance servers, computing clusters, and supercomputers. The algorithm first uses a pool of processes to evaluate the hashing function on billions of genomic subsequences in parallel. Then, d-mers with the same hash value are grouped together and redistributed among all the processes using MPI communication. Finally, each process performs pair-wise comparisons of its assigned subsequences and outputs groups of similar pairs. Experiments show that the PARALLEL-LSH-ALL-PAIRS algorithm achieves good scalability with an increasing number of cores and increasing sizes of the input data on RPI's IBM Blue Gene/Q supercomputer.
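    The steps described in the abstract (hash every d-mer, group by hash value, compare pairs within each class, and assign classes to processes) can be illustrated with a minimal single-machine sketch. This is not the thesis implementation. The abstract does not specify the hash function, so the sketch assumes a common LSH choice for mismatch-based similarity: sampling k of the d positions. The function names, the CRC-based `owner_rank` assignment, the restriction to cross-sequence pairs, and the `rounds` parameter are all hypothetical, and the redistribution step is simulated in one process rather than performed with MPI.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations


def dmers(seq, d):
    """Yield (position, d-mer) for every length-d window of seq."""
    for pos in range(len(seq) - d + 1):
        yield pos, seq[pos:pos + d]


def hamming(a, b):
    """Number of mismatched bases between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def build_classes(seqs, d, idx):
    """Group d-mers by hash key: the bases at the sampled positions idx.

    Assumption: position sampling stands in for the LSH function.
    """
    classes = defaultdict(list)
    for name, seq in seqs.items():
        for pos, mer in dmers(seq, d):
            key = "".join(mer[i] for i in idx)
            classes[key].append((name, pos, mer))
    return classes


def compare_class(members, max_mismatches):
    """Quadratic pair-wise comparison inside one class.

    Assumption: only pairs from different sequences are reported.
    """
    pairs = set()
    for (n1, p1, m1), (n2, p2, m2) in combinations(members, 2):
        if n1 != n2 and hamming(m1, m2) <= max_mismatches:
            pairs.add(((n1, p1), (n2, p2)))
    return pairs


def owner_rank(key, n_procs):
    """Hypothetical rule mapping a hash key to the process that owns it."""
    return zlib.crc32(key.encode("ascii")) % n_procs


def lsh_all_pairs(seqs, d, k, max_mismatches, rounds=4, n_procs=1, seed=0):
    """Run several hashing rounds and collect similar cross-sequence pairs."""
    rng = random.Random(seed)
    found = set()
    for _ in range(rounds):
        idx = sorted(rng.sample(range(d), k))
        classes = build_classes(seqs, d, idx)
        # Simulated redistribution: each class goes to exactly one rank.
        work = defaultdict(list)
        for key, members in classes.items():
            work[owner_rank(key, n_procs)].append(members)
        # In an MPI program each rank would run its own loop body.
        for rank in range(n_procs):
            for members in work[rank]:
                found |= compare_class(members, max_mismatches)
    return found


if __name__ == "__main__":
    toy = {"a": "ACGTACGTTGCA", "b": "ACGTACGATGCA"}
    for pair in sorted(lsh_all_pairs(toy, d=8, k=4, max_mismatches=1, n_procs=2)):
        print(pair)
```

    Because every class is owned by exactly one rank, the pair-wise comparisons of different classes are independent of one another, which is what makes the comparison step a natural candidate for distribution across processes. Repeating the hashing with different sampled positions is one way to reduce the chance that a similar pair never shares a class.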
    Description
    December 2013; School of Science
    Department
    Dept. of Computer Science
    Publisher
    Rensselaer Polytechnic Institute, Troy, NY
    Relationships
    Rensselaer Theses and Dissertations Online Collection
    Access
    Restricted to current Rensselaer faculty, staff and students. Access inquiries may be directed to the Rensselaer Libraries.
    Collections
    • RPI Theses Online (Complete)