edu.rit.compbio.phyl
Class DnaSequenceList

java.lang.Object
  extended by edu.rit.compbio.phyl.DnaSequenceList
All Implemented Interfaces:
Iterable<DnaSequence>

public class DnaSequenceList
extends Object
implements Iterable<DnaSequence>

Class DnaSequenceList provides a list of DnaSequences. Methods for reading and writing textual files of DNA sequences are provided.

Each DNA sequence consists of a sequence of sites. Each site has a state, which is a set of bases. The four bases are adenine, cytosine, guanine, and thymine. For textual I/O, each state is represented by a single character as follows:

Char.  Meaning      Set
A      Adenine      (A)
C      Cytosine     (C)
G      Guanine      (G)
T      Thymine      (T)
Y      pYrimidine   (C or T)
R      puRine       (A or G)
W      "Weak"       (A or T)
S      "Strong"     (C or G)
K      "Keto"       (G or T)
M      "aMino"      (A or C)
B      not A        (C or G or T)
D      not C        (A or G or T)
H      not G        (A or C or T)
V      not T        (A or C or G)
X      unknown      (A or C or G or T)
-      deletion     ()
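
The state characters above can be pictured as small sets of bases. The following is a minimal sketch, not the library's implementation: the class name DnaStateCodes, the method toBaseSet(), and the one-bit-per-base encoding are all assumptions made for illustration.

```java
// Illustrative sketch only: the class, the method, and the bit assignment
// are hypothetical and are not part of edu.rit.compbio.phyl.
final class DnaStateCodes {
    // Assumed encoding: one bit per base.
    static final int A = 1, C = 2, G = 4, T = 8;

    /**
     * Returns the set of bases (as a bit mask) for a state character,
     * or -1 if the character is not one of the sixteen state characters.
     * Uppercase and lowercase are treated the same.
     */
    static int toBaseSet(char ch) {
        switch (Character.toUpperCase(ch)) {
            case 'A': return A;
            case 'C': return C;
            case 'G': return G;
            case 'T': return T;
            case 'Y': return C | T;
            case 'R': return A | G;
            case 'W': return A | T;
            case 'S': return C | G;
            case 'K': return G | T;
            case 'M': return A | C;
            case 'B': return C | G | T;
            case 'D': return A | G | T;
            case 'H': return A | C | T;
            case 'V': return A | C | G;
            case 'X': return A | C | G | T;
            case '-': return 0;          // deletion: the empty set
            default:  return -1;         // not a state character
        }
    }
}
```

With this encoding, a set union is a bitwise OR and a set intersection is a bitwise AND, so "Y" (C or T) and "S" (C or G) intersect in C alone.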

The DNA sequence file format is that used by Joseph Felsenstein's Phylogeny Inference Package (PHYLIP). While the file is a plain text file, it often has the extension ".phy" to indicate that it is in PHYLIP format. For further information, see:

Here is an example of an input file:

  5    42
 Turkey     AAGCTNGGGC ATTTCAGGGT 
 Salmo gair AAGCCTTGGC AGTGCAGGGT 
 H. Sapiens ACCGGTTGGC CGTTCAGGGT 
 Chimp      AAACCCTTGC CGTTACGCTT 
 Gorilla    AAACCCTTGC CGGTACGCTT 
 
 GAGCCCGGGC AATACAGGGT AT
 GAGCCGTGGC CGGGCACGGT AT
 ACAGGTTGGC CGTTCAGGGT AA
 AAACCGAGGC CGGGACACTC AT
 AAACCATTGC CGGTACGCTT AA

The first line contains the number of species S and the number of sites N in each sequence. S must be >= 2. N must be >= 1.

The next S lines contain the initial data for each species. The first ten characters contain the sequence name. This must be exactly ten characters, padded with blanks if necessary. Then comes one character for each site in the sequence. Uppercase and lowercase are considered the same. Characters other than those for the states listed above are ignored. Often, a blank is inserted every ten characters for readability, but this is not necessary. After these S lines come zero or more blank lines for readability, which are ignored. If there is more sequence data, the next S lines give the states for the next sites in the sequences. This continues for the rest of the file.

This is known as the "interleaved" file format. There is also a "sequential" file format, but the sequential file format is not supported.

Thus, the complete sequence for each species in the example is:

Species     Sequence
Turkey      AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT
Salmo gair  AAGCCTTGGCAGTGCAGGGTGAGCCGTGGCCGGGCACGGTAT
H. Sapiens  ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA
Chimp       AAACCCTTGCCGTTACGCTTAAACCGAGGCCGGGACACTCAT
Gorilla     AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA
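
The way interleaved blocks are stitched together into complete sequences can be sketched as follows. This is a simplified illustration, not the code behind read(): the class InterleavedPhylipSketch and its parse() method are hypothetical, and for brevity the sketch strips only whitespace from the site data rather than filtering out every non-state character.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not the implementation of DnaSequenceList.read().
final class InterleavedPhylipSketch {
    /** Parses interleaved PHYLIP text into a name-to-sequence map. */
    static Map<String, String> parse(String text) {
        // Blank lines are only there for readability, so drop them first.
        List<String> lines = new ArrayList<>();
        for (String line : text.split("\\R")) {
            if (!line.trim().isEmpty()) lines.add(line);
        }
        // First line: number of species S and number of sites N.
        String[] header = lines.get(0).trim().split("\\s+");
        int s = Integer.parseInt(header[0]);
        int n = Integer.parseInt(header[1]);

        String[] names = new String[s];
        StringBuilder[] sites = new StringBuilder[s];
        for (int k = 1; k < lines.size(); k++) {
            int species = (k - 1) % s;       // lines cycle through the species
            String line = lines.get(k);
            if (k <= s) {
                // Initial block: the first ten characters are the name.
                names[species] = line.substring(0, 10).trim();
                sites[species] = new StringBuilder();
                line = line.substring(10);
            }
            // Simplification: remove whitespace only.
            sites[species].append(line.replaceAll("\\s", ""));
        }

        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < s; i++) {
            if (sites[i] == null || sites[i].length() != n) {
                throw new IllegalArgumentException(
                    "Species " + i + " does not have " + n + " sites");
            }
            result.put(names[i], sites[i].toString());
        }
        return result;
    }
}
```

Given the example input file above (with each name padded to exactly ten characters), parse() returns the five complete 42-site sequences listed in the table, keyed by species name.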

In the input file, the following alternate characters can be used: X, N, and ? all mean "unknown." O (capital letter O) and - (hyphen) both mean "deletion." The character . (period) means "the same as the corresponding site in the first species." Here is another input file with the same sequences as the one above:

  5    42
 Turkey     AAGCTNGGGC ATTTCAGGGT 
 Salmo gair ..G.CTT... AG.G...... 
 H. Sapiens .CCGGTT... .G........ 
 Chimp      ..A.CCTT.. .G..AC.CT. 
 Gorilla    ..A.CCTT.. .GG.AC.CT. 
 
 GAGCCCGGGC AATACAGGGT AT
 .....GT... CGGG..C... ..
 ACAGGTT... CG.T...... .A
 A.A..GA... CGGGACACTC ..
 A.A..ATT.. CGGTAC.CT. .A
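
The alternate characters and the period shorthand can be resolved with a small amount of code. The sketch below is illustrative only: the class AlternateCharsSketch and its methods canonical() and expand() are hypothetical names, and the choice to normalize alternates to X and - is an assumption, not documented library behavior.

```java
// Illustrative sketch only; names and normalization choices are assumptions.
final class AlternateCharsSketch {
    /** Maps an alternate input character to a canonical state character. */
    static char canonical(char ch) {
        switch (Character.toUpperCase(ch)) {
            case 'N':
            case '?': return 'X';   // both mean "unknown"
            case 'O': return '-';   // capital letter O means "deletion"
            default:  return Character.toUpperCase(ch);
        }
    }

    /**
     * Expands periods in a row of site characters by copying the
     * corresponding site from the first species, then normalizes
     * alternate characters. Both strings must be the same length
     * and must already have whitespace removed.
     */
    static String expand(String first, String row) {
        if (first.length() != row.length()) {
            throw new IllegalArgumentException("Rows differ in length");
        }
        StringBuilder sb = new StringBuilder(row.length());
        for (int j = 0; j < row.length(); j++) {
            char ch = row.charAt(j);
            sb.append(canonical(ch == '.' ? first.charAt(j) : ch));
        }
        return sb.toString();
    }
}
```

For instance, expanding the "Salmo gair" row "..G.CTT..." against the Turkey row "AAGCTNGGGC" gives "AAGCCTTGGC", the same ten sites as in the first example file.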

Here are some more example DNA sequence files:


Constructor Summary
DnaSequenceList(DnaSequenceList list)
          Construct a new DNA sequence list that is a copy of the given DNA sequence list.
 
Method Summary
 int[] countAbsentStates()
          Determine the number of absent states after adding each sequence in this DNA sequence list to a tree.
 int exciseUninformativeSites()
          Excise uninformative sites from the DNA sequences in this DNA sequence list.
 int informativeSiteCount()
          Returns the number of informative sites in this DNA sequence list.
 Iterator<DnaSequence> iterator()
          Returns an iterator for the DNA sequences in this list.
 int length()
          Obtain this DNA sequence list's length.
static DnaSequenceList read(File file)
          Read a DNA sequence list from the given input file.
 DnaSequence seq(int i)
          Get the DNA sequence at the given index in this DNA sequence list.
 DnaSequenceTree toTree(int[] signature)
          Create a DNA sequence tree from this DNA sequence list and the given tree signature.
 void truncate(int len)
          Truncate this DNA sequence list to the given length.
 void write(File file)
          Write this DNA sequence list to the given output file.
 void write(File file, int sites, boolean periods, boolean bold)
          Write this DNA sequence list to the given output file.
 void write(PrintStream ps, int sites, boolean periods, boolean bold)
          Write this DNA sequence list to the given print stream in interleaved PHYLIP format.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DnaSequenceList

public DnaSequenceList(DnaSequenceList list)
Construct a new DNA sequence list that is a copy of the given DNA sequence list.

Note: The DNA sequences in the new list are copies of (not references to) the DNA sequences in the given list.

Parameters:
list - DNA sequence list to copy.
Throws:
NullPointerException - (unchecked exception) Thrown if list is null.

Method Detail

length

public int length()
Obtain this DNA sequence list's length.

Returns:
Length N (number of DNA sequences).

seq

public DnaSequence seq(int i)
Get the DNA sequence at the given index in this DNA sequence list.

Parameters:
i - Index, 0 ≤ i ≤ N−1.
Returns:
DNA sequence.
Throws:
ArrayIndexOutOfBoundsException - (unchecked exception) Thrown if i is out of bounds.

read

public static DnaSequenceList read(File file)
                            throws IOException
Read a DNA sequence list from the given input file. The input file must be in interleaved PHYLIP format.

The DNA sequences' sites and names are read from the input file. The DNA sequences' scores are set to 0.

Parameters:
file - File.
Returns:
DNA sequence list.
Throws:
NullPointerException - (unchecked exception) Thrown if file is null.
IOException - Thrown if an I/O error occurred or if the input file's contents were invalid.

write

public void write(File file)
           throws IOException
Write this DNA sequence list to the given output file. The output file is in interleaved PHYLIP format. There are 70 sites on each output line. Periods are not used. Informative sites are not marked in bold.

Parameters:
file - File.
Throws:
NullPointerException - (unchecked exception) Thrown if file is null.
IOException - Thrown if an I/O error occurred.

write

public void write(File file,
                  int sites,
                  boolean periods,
                  boolean bold)
           throws IOException
Write this DNA sequence list to the given output file. The output file is in interleaved PHYLIP format.

Parameters:
file - File.
sites - Number of sites per output line.
periods - True to use periods, false not to use periods.
bold - True to mark informative sites in bold, false not to.
Throws:
NullPointerException - (unchecked exception) Thrown if file is null.
IllegalArgumentException - (unchecked exception) Thrown if sites <= 10.
IOException - Thrown if an I/O error occurred.
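
To make the effect of the sites and periods parameters concrete, here is a rough sketch of an interleaved writer. It is not the library's output routine: the class InterleavedWriterSketch and its format() method are hypothetical, the exact spacing of the header and data lines is assumed, the argument check is the sketch's own, and the bold option is omitted because this page does not say how bold text is marked.

```java
import java.util.List;

// Illustrative sketch only; the exact layout produced by write() may differ.
final class InterleavedWriterSketch {
    /**
     * Formats sequences in interleaved layout with the given number of
     * sites per line. If periods is true, a site that matches the first
     * sequence is written as '.' in every sequence after the first.
     */
    static String format(List<String> names, List<String> seqs,
                         int sitesPerLine, boolean periods) {
        if (sitesPerLine <= 0) {
            // The sketch's own guard, so the block loop always advances.
            throw new IllegalArgumentException("sitesPerLine must be > 0");
        }
        int n = seqs.get(0).length();
        StringBuilder out = new StringBuilder();
        out.append(seqs.size()).append(' ').append(n).append('\n');
        for (int start = 0; start < n; start += sitesPerLine) {
            if (start > 0) out.append('\n');   // blank line between blocks
            int end = Math.min(n, start + sitesPerLine);
            for (int s = 0; s < seqs.size(); s++) {
                if (start == 0) {
                    // Names appear only in the first block, padded or
                    // truncated to exactly ten characters.
                    out.append(String.format("%-10.10s", names.get(s)));
                }
                for (int j = start; j < end; j++) {
                    char ch = seqs.get(s).charAt(j);
                    if (periods && s > 0 && ch == seqs.get(0).charAt(j)) {
                        ch = '.';
                    }
                    out.append(ch);
                }
                out.append('\n');
            }
        }
        return out.toString();
    }
}
```

A smaller sitesPerLine produces more, shorter blocks; turning periods on makes the columns where species agree with the first one easy to spot.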

write

public void write(PrintStream ps,
                  int sites,
                  boolean periods,
                  boolean bold)
           throws IOException
Write this DNA sequence list to the given print stream in interleaved PHYLIP format.

Parameters:
ps - Print stream.
sites - Number of sites per output line.
periods - True to use periods, false not to.
bold - True to mark informative sites in bold, false not to.
Throws:
NullPointerException - (unchecked exception) Thrown if ps is null.
IllegalArgumentException - (unchecked exception) Thrown if sites <= 10.
IOException - Thrown if an I/O error occurred.

truncate

public void truncate(int len)
Truncate this DNA sequence list to the given length. If this list is already shorter than len, the truncate() method does nothing.

Parameters:
len - Length.
Throws:
NegativeArraySizeException - (unchecked exception) Thrown if len < 0.

exciseUninformativeSites

public int exciseUninformativeSites()
Excise uninformative sites from the DNA sequences in this DNA sequence list.

Each site in the DNA sequences is either "uninformative" or "informative," defined as follows:

Since the uninformative sites do not affect the outcome of a maximum parsimony phylogenetic tree search, the uninformative sites can be omitted from the tree scoring process to save time. The informative sites do affect the outcome and must be included in the tree scoring process.

The exciseUninformativeSites() method removes the uninformative sites from the DNA sequences in this list. The DNA sequences' scores and names are unchanged.

Returns:
Number of state changes the (excised) uninformative sites contribute to the parsimony score.

informativeSiteCount

public int informativeSiteCount()
Returns the number of informative sites in this DNA sequence list.

Returns:
Number of informative sites.

countAbsentStates

public int[] countAbsentStates()
Determine the number of absent states after adding each sequence in this DNA sequence list to a tree. The return value A is an N-element array, where N is the length of this DNA sequence list. As sequences from this list are added to a tree in order from i = 0 to N−1, A[i] is the number of character states that do not yet appear in the tree. Thus, the number of state changes in the tree must increase by at least A[i] when the sequences after sequence i are added to the tree. This can be used to prune a branch-and-bound search.

Returns:
Array A.
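
One plausible reading of the array A can be sketched as follows. This is an interpretation, not the library's algorithm: the class AbsentStatesSketch and its countAbsent() method are hypothetical, sequences are plain strings restricted to the unambiguous bases A, C, G, and T, and "absent" is taken to mean "a base that occurs at a site somewhere in the list but not in sequences 0 through i".

```java
// Illustrative sketch only; one interpretation of the description above,
// limited to unambiguous bases. The library's computation may differ.
final class AbsentStatesSketch {
    /** Returns the assumed one-bit-per-base code for A, C, G, or T. */
    static int bit(char ch) {
        switch (Character.toUpperCase(ch)) {
            case 'A': return 1;
            case 'C': return 2;
            case 'G': return 4;
            case 'T': return 8;
            default:
                throw new IllegalArgumentException("Unsupported state: " + ch);
        }
    }

    /** Computes a[i] = bases still unseen after sequences 0..i, summed over sites. */
    static int[] countAbsent(String[] seqs) {
        int sites = seqs[0].length();
        // all[j] = every base that occurs at site j anywhere in the list.
        int[] all = new int[sites];
        for (String seq : seqs) {
            for (int j = 0; j < sites; j++) all[j] |= bit(seq.charAt(j));
        }
        int[] seen = new int[sites];        // bases seen so far at each site
        int[] a = new int[seqs.length];
        for (int i = 0; i < seqs.length; i++) {
            int absent = 0;
            for (int j = 0; j < sites; j++) {
                seen[j] |= bit(seqs[i].charAt(j));
                absent += Integer.bitCount(all[j] & ~seen[j]);
            }
            a[i] = absent;
        }
        return a;
    }
}
```

Under this reading, the sequences "AA", "AC", "GC" give A = [2, 1, 0]: after the first sequence, G at site 0 and C at site 1 are still missing; after the second, only G at site 0 is missing.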

toTree

public DnaSequenceTree toTree(int[] signature)
Create a DNA sequence tree from this DNA sequence list and the given tree signature. The tree signature is an array of indexes of length N, where N is the length of this list. To construct the tree, for all i from 0 to N−1, the DNA sequence at index i in this list is added to the tree at index signature[i] using the DnaSequenceTree.add() method. For all i, signature[i] must be in the range 0 .. 2(i − 1), except signature[0] is 0.

Note: The returned tree has references to (not copies of) the DNA sequences in this list.

Parameters:
signature - Tree signature (array of tree indexes).
Returns:
Tree.
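
The range rule for tree signatures can be checked mechanically. The sketch below is illustrative only: the class TreeSignatureSketch and its methods are hypothetical helpers, not part of this package. It validates a signature against the stated range and counts how many distinct signatures of a given length satisfy it.

```java
// Illustrative sketch only; these helpers are not part of the library.
final class TreeSignatureSketch {
    /**
     * Returns true if signature[0] is 0 and, for every i >= 1,
     * signature[i] lies in the range 0 .. 2(i - 1).
     */
    static boolean isValid(int[] signature) {
        if (signature.length == 0) return true;
        if (signature[0] != 0) return false;
        for (int i = 1; i < signature.length; i++) {
            if (signature[i] < 0 || signature[i] > 2 * (i - 1)) return false;
        }
        return true;
    }

    /**
     * Counts the signatures of length n that satisfy the range rule.
     * Position i >= 1 allows 2(i - 1) + 1 values, so the count is
     * 1 * 3 * 5 * ... * (2n - 3).
     */
    static long countSignatures(int n) {
        long count = 1;
        for (int i = 1; i < n; i++) {
            count *= 2L * (i - 1) + 1;
        }
        return count;
    }
}
```

For a list of five sequences, countSignatures(5) is 1 × 3 × 5 × 7 = 105, which gives a sense of how quickly the number of candidate trees grows with N.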

iterator

public Iterator<DnaSequence> iterator()
Returns an iterator for the DNA sequences in this list.

Specified by:
iterator in interface Iterable<DnaSequence>
Returns:
Iterator.


Copyright © 2005-2012 by Alan Kaminsky. All rights reserved. Send comments to ark@cs.rit.edu.