Kamlesh Jangid, University of Georgia

Data Extraction Tools

DETools is a set of scripts for extracting a small subset of data from a larger dataset. Whether you are working with sequences in the FASTA format or a distance matrix prepared using the PHYLIP package, preparing SAMPLE files for running multiple LIBSHUFF comparisons, or simply want to know which pairs of organisms in a distance matrix share distance values below or above a user-determined cutoff, these tools are for you. Instead of spending time and energy on repeated realignments, you can simply extract the data you need from one BIG dataset.

Version 1 of these scripts was originally written in SCILAB, the open source platform for numerical computation, by Rajesh Jangid and me. Later on, we rewrote them in C++ for faster and more user-friendly computation. Version 2 is the latest improved version and is now available for download as a Windows Installer here. Individual Windows executable files can be downloaded below or at the Whitman Lab Homepage.

Sequence Extraction (SeqEx)

The SeqEx utility can be used to extract a set of sequences from a larger sequence dataset. SeqEx is written for sequences of no more than 7682 characters in length; shorter sequences are handled without any problem. It works with both aligned and unaligned sequence files in the FASTA format. The length limit makes it usable with the NAST aligner of the Greengenes database.

How does it work?

For the SeqEx utility to work, you must have two INPUT files: 1) The BIG sequence file in FASTA format from which you want to extract a set of sequences, and 2) A list file in TEXT format containing the IDs of the sequences you wish to extract, with one sequence ID per line. Make sure that all the sequence IDs listed in this list file are present in the BIG dataset. Execution will be easier if the SeqEx executable is in the same folder as the two input files. Simply double click on the executable, enter the name of the list file, then the name of the BIG sequence file, followed by the output file name. Your smaller sequence FASTA file is now ready.
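
The extraction step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the SeqEx source code (which is written in C++). The function names are hypothetical, and two assumptions are made that the description does not state: a sequence ID is taken to be the first whitespace-delimited word of the FASTA header, and the output follows the order of the list file.

```python
def read_fasta(lines):
    """Return a dict mapping sequence ID to sequence, from FASTA text lines."""
    records = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the ID is the first word of the header line.
            words = line[1:].split()
            current_id = words[0] if words else ""
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def extract_sequences(fasta_lines, wanted_ids):
    """Return (ID, sequence) pairs for wanted_ids, in list-file order."""
    records = read_fasta(fasta_lines)
    missing = [seq_id for seq_id in wanted_ids if seq_id not in records]
    if missing:
        # Every listed ID must be present in the BIG dataset.
        raise KeyError("IDs not found in the BIG file: " + ", ".join(missing))
    return [(seq_id, records[seq_id]) for seq_id in wanted_ids]


def to_fasta(pairs, width=60):
    """Format (ID, sequence) pairs as FASTA text, wrapped at `width` columns."""
    out = []
    for seq_id, seq in pairs:
        out.append(">" + seq_id)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

With files on disk, one would pass the lines of the BIG FASTA file and the stripped lines of the list file to extract_sequences, then write the result of to_fasta to the output file.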

Where is it available?

Click here to download the Windows executable of the SeqEx tool.

Matrix Extraction (DistEx)

The DistEx utility is used to extract a distance matrix for a set of sequences which form part of a larger distance matrix. DistEx is very useful when you have to run multiple LIBSHUFF comparisons for a set of libraries in different combinations.

How does it work?

The DistEx utility works similarly to SeqEx, as described above, in that it needs two INPUT files: 1) The BIG distance matrix file in the PHYLIP format, and 2) A sequence list file in TEXT format. The list file contains the sequence IDs (one ID per line) for which you wish to extract the distance matrix. Make sure that all the sequence IDs listed in this list file are present in the BIG matrix.

DistEx will generate two output files: a PHYLIP formatted distance matrix and a LIBSHUFF compatible SAMPLE file. The SAMPLE file can be used directly as an input in LIBSHUFF. The DistEx output follows the order of the input sequence list and is independent of the order in which the sequences appear in the BIG matrix. As with SeqEx, execution will be simpler if the DistEx executable is in the same folder as the two input files. Simply double click on the executable, enter the name of the list file, then the name of the BIG matrix file, followed by the output file names. Your smaller distance matrix is now ready.
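
The matrix extraction can be sketched in Python as follows. Again, this is an illustration and not the DistEx source code; the function names are hypothetical. It assumes a square PHYLIP matrix with each row on a single line and whitespace between the name and the distances (wrapped rows and lower-triangular matrices are not handled). It does not attempt to write the SAMPLE file, since that format is not described here.

```python
def parse_phylip_square(lines):
    """Parse a square PHYLIP distance matrix into (names, matrix)."""
    rows = [line.split() for line in lines if line.strip()]
    n = int(rows[0][0])  # the first line holds the number of sequences
    names, matrix = [], []
    for fields in rows[1:n + 1]:
        # Assumption: one row per line, name first, then n distances.
        names.append(fields[0])
        matrix.append([float(x) for x in fields[1:n + 1]])
    return names, matrix


def extract_submatrix(names, matrix, wanted_ids):
    """Return the sub-matrix for wanted_ids, in list-file order."""
    index = {name: i for i, name in enumerate(names)}
    missing = [seq_id for seq_id in wanted_ids if seq_id not in index]
    if missing:
        # Every listed ID must be present in the BIG matrix.
        raise KeyError("IDs not found in the BIG matrix: " + ", ".join(missing))
    picks = [index[seq_id] for seq_id in wanted_ids]
    # Row and column order both follow the list, not the BIG matrix.
    return [[matrix[i][j] for j in picks] for i in picks]


def format_phylip(names, matrix):
    """Format names and a square matrix as PHYLIP text."""
    out = [str(len(names))]
    for name, row in zip(names, matrix):
        out.append(name.ljust(10) + " " + " ".join(f"{v:.4f}" for v in row))
    return "\n".join(out) + "\n"
```

Because the rows and columns are picked by the position of each ID in the list, the same BIG matrix can be reused for any combination of libraries simply by changing the list file.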

Where is it available?

Click here to download the Windows executable of the DistEx tool.

Sequence CutOff (CutOff)

The CutOff utility works on either a similarity matrix or a distance matrix. It filters out those pairs of sequences whose similarity or distance values are lower than the given cutoff. Hence, with a similarity matrix as input, the output file contains the most similar sequence pairs. In contrast, with a distance matrix as input, it contains the most dissimilar pairs.

How does it work?

For the CutOff utility to work, all you need is a PHYLIP generated distance matrix or a similarity matrix. The similarity matrix can be prepared from the PHYLIP distance matrix using the JukesCantor MS Word macro written by Jose Gonzalez. CutOff will generate a single output file in tab-delimited format. Simply double click on the executable, enter the name of the matrix file, then the cutoff value, followed by the output file name. Your sequence pairs are now ready.
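
The filtering step can be sketched in Python as below. This is an illustration, not the CutOff source code, and the function names are hypothetical. It takes a list of names and a square matrix already loaded into memory, reports each pair once, and keeps pairs whose value is equal to the cutoff. Keeping equal values is an assumption, as is the column layout of the output (name, name, value).

```python
def pairs_at_or_above(names, matrix, cutoff):
    """Return (name_i, name_j, value) for pairs not below the cutoff."""
    kept = []
    for i in range(len(names)):
        # Visit each unordered pair once and skip the diagonal.
        for j in range(i + 1, len(names)):
            value = matrix[i][j]
            # Pairs with values lower than the cutoff are filtered out.
            if value >= cutoff:
                kept.append((names[i], names[j], value))
    return kept


def to_tab_delimited(pairs):
    """Format the kept pairs as tab-delimited text, one pair per line."""
    return "".join(f"{a}\t{b}\t{value}\n" for a, b, value in pairs)
```

The same function serves both cases: given a similarity matrix it keeps the most similar pairs, and given a distance matrix it keeps the most dissimilar ones.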

Where is it available?

Click here to download the Windows executable of the CutOff tool.