java.lang.Object
  edu.rit.compbio.phyl.DnaSequenceList
public class DnaSequenceList
Class DnaSequenceList provides a list of DnaSequences. Methods for reading and writing textual files of DNA sequences are provided.
Each DNA sequence consists of a sequence of sites. Each site has a state, which is a set of bases. The four bases are adenine, cytosine, guanine, and thymine. For textual I/O, each state is represented by a single character as follows:
| Char. | Meaning | Set |
|---|---|---|
| A | Adenine | (A) |
| C | Cytosine | (C) |
| G | Guanine | (G) |
| T | Thymine | (T) |
| Y | pYrimidine | (C or T) |
| R | puRine | (A or G) |
| W | "Weak" | (A or T) |
| S | "Strong" | (C or G) |
| K | "Keto" | (G or T) |
| M | "aMino" | (A or C) |
| B | not A | (C or G or T) |
| D | not C | (A or G or T) |
| H | not G | (A or C or T) |
| V | not T | (A or C or G) |
| X | unknown | (A or C or G or T) |
| - | deletion | () |
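
The table above can be read as a mapping from a character to a set of bases. The following is a minimal sketch of that mapping, assuming a one-bit-per-base encoding; the class name `StateCodes`, the bit values, and the helper methods are hypothetical and are not taken from `DnaSequenceList` or `DnaSequence`.

```java
public class StateCodes {
    // Hypothetical encoding: one bit per base (an assumption of this sketch).
    public static final int A = 1, C = 2, G = 4, T = 8;

    // Translate a state character from the table into a bitmask of bases.
    // Uppercase and lowercase are treated the same.
    public static int stateOf(char c) {
        switch (Character.toUpperCase(c)) {
            case 'A': return A;
            case 'C': return C;
            case 'G': return G;
            case 'T': return T;
            case 'Y': return C | T;
            case 'R': return A | G;
            case 'W': return A | T;
            case 'S': return C | G;
            case 'K': return G | T;
            case 'M': return A | C;
            case 'B': return C | G | T;
            case 'D': return A | G | T;
            case 'H': return A | C | T;
            case 'V': return A | C | G;
            case 'X': return A | C | G | T;
            case '-': return 0; // deletion: the empty set
            default:
                throw new IllegalArgumentException("Not a state character: " + c);
        }
    }

    // Render a bitmask in the same "(C or T)" style the table uses.
    public static String describe(int state) {
        StringBuilder sb = new StringBuilder("(");
        String bases = "ACGT";
        for (int i = 0; i < 4; i++) {
            if ((state & (1 << i)) != 0) {
                if (sb.length() > 1) sb.append(" or ");
                sb.append(bases.charAt(i));
            }
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        for (char c : "ACGTYRWSKMBDHVX-".toCharArray()) {
            System.out.println(c + " " + describe(stateOf(c)));
        }
    }
}
```

A bitmask keeps set operations such as union and intersection cheap, which is one plausible reason to represent states this way; the class's actual internal representation is not stated here.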
The DNA sequence file format is that used by Joseph Felsenstein's Phylogeny Inference Package (PHYLIP). While the file is a plain text file, it often has the extension ".phy" to indicate that it is in PHYLIP format. For further information, see:
Here is an example of an input file:

    5 42
    Turkey    AAGCTNGGGC ATTTCAGGGT
    Salmo gairAAGCCTTGGC AGTGCAGGGT
    H. SapiensACCGGTTGGC CGTTCAGGGT
    Chimp     AAACCCTTGC CGTTACGCTT
    Gorilla   AAACCCTTGC CGGTACGCTT

    GAGCCCGGGC AATACAGGGT AT
    GAGCCGTGGC CGGGCACGGT AT
    ACAGGTTGGC CGTTCAGGGT AA
    AAACCGAGGC CGGGACACTC AT
    AAACCATTGC CGGTACGCTT AA

The first line contains the number of species S and the number of sites N in each sequence. S must be >= 2. N must be >= 1.
The next S lines contain the initial data for each species. The first ten characters contain the sequence name. This must be exactly ten characters, padded with blanks if necessary. Then comes one character for each site in the sequence. Uppercase and lowercase are considered the same. Characters other than those for the states listed above are ignored. Often, a blank is inserted every ten characters for readability, but this is not necessary. After these S lines come zero or more blank lines for readability, which are ignored. If there is more sequence data, the next S lines give the states for the next sites in the sequences. This continues for the rest of the file.
This is known as the "interleaved" file format. There is also a "sequential" file format, but the sequential file format is not supported.
Thus, the complete sequence for each species in the example is:
| Species | Sequence |
|---|---|
| Turkey | AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT |
| Salmo gair | AAGCCTTGGCAGTGCAGGGTGAGCCGTGGCCGGGCACGGTAT |
| H. Sapiens | ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA |
| Chimp | AAACCCTTGCCGTTACGCTTAAACCGAGGCCGGGACACTCAT |
| Gorilla | AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA |
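
The interleaved layout described above can be sketched as a small standalone reader. This is only an illustration under stated assumptions: the class `InterleavedSketch`, its `parse` method, and the `KEPT` character set are hypothetical, the input is taken from a string rather than a `File`, and the code is not the implementation behind `read(File)`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InterleavedSketch {
    // State characters kept while reading. N is included because the example
    // file uses it; every other character (mostly blanks) is skipped.
    static final String KEPT = "ACGTYRWSKMBDHVXN-";

    // The example input file, one array element per line.
    public static final String EXAMPLE = String.join("\n",
        "5 42",
        "Turkey    AAGCTNGGGC ATTTCAGGGT",
        "Salmo gairAAGCCTTGGC AGTGCAGGGT",
        "H. SapiensACCGGTTGGC CGTTCAGGGT",
        "Chimp     AAACCCTTGC CGTTACGCTT",
        "Gorilla   AAACCCTTGC CGGTACGCTT",
        "",
        "GAGCCCGGGC AATACAGGGT AT",
        "GAGCCGTGGC CGGGCACGGT AT",
        "ACAGGTTGGC CGTTCAGGGT AA",
        "AAACCGAGGC CGGGACACTC AT",
        "AAACCATTGC CGGTACGCTT AA");

    // Returns species name -> complete sequence, in file order.
    public static Map<String, String> parse(String text) {
        String[] lines = text.split("\\R");
        String[] header = lines[0].trim().split("\\s+");
        int s = Integer.parseInt(header[0]); // number of species
        int n = Integer.parseInt(header[1]); // number of sites
        if (s < 2 || n < 1) {
            throw new IllegalArgumentException("Need S >= 2 and N >= 1");
        }

        List<String> names = new ArrayList<>();
        List<StringBuilder> seqs = new ArrayList<>();
        int row = 0; // species index within the current block of S lines
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i];
            if (line.trim().isEmpty()) continue; // blank lines are ignored
            String data = line;
            if (names.size() < s) {
                // First block: the first ten characters are the name.
                if (line.length() < 10) {
                    throw new IllegalArgumentException("Name must be ten characters");
                }
                names.add(line.substring(0, 10).trim());
                seqs.add(new StringBuilder());
                data = line.substring(10);
            }
            StringBuilder sb = seqs.get(row);
            for (char c : data.toCharArray()) {
                char u = Character.toUpperCase(c);
                if (KEPT.indexOf(u) >= 0) sb.append(u);
            }
            row = (row + 1) % s;
        }

        if (names.size() != s) {
            throw new IllegalArgumentException("Expected " + s + " species");
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < s; i++) {
            if (seqs.get(i).length() != n) {
                throw new IllegalArgumentException("Wrong length for " + names.get(i));
            }
            result.put(names.get(i), seqs.get(i).toString());
        }
        return result;
    }

    public static void main(String[] args) {
        parse(EXAMPLE).forEach((name, seq) -> System.out.println(name + " " + seq));
    }
}
```

The sketch trims the padded ten-character name before using it as a key; whether the class itself stores names padded or trimmed is not stated here.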
In the input file, the following alternate characters can be used: X, N, and ? all mean "unknown." O (capital letter O) and - (hyphen) both mean "deletion." The character . (period) means "the same as the corresponding site in the first species." Here is another input file with the same sequences as the one above:

    5 42
    Turkey    AAGCTNGGGC ATTTCAGGGT
    Salmo gair..G.CTT... AG.G......
    H. Sapiens.CCGGTT... .G........
    Chimp     ..A.CCTT.. .G..AC.CT.
    Gorilla   ..A.CCTT.. .GG.AC.CT.

    GAGCCCGGGC AATACAGGGT AT
    .....GT... CGGG..C... ..
    ACAGGTT... CG.T...... .A
    A.A..GA... CGGGACACTC ..
    A.A..ATT.. CGGTAC.CT. .A

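
The alternate characters and the period shorthand can be resolved in one pass over the rows of a block. The sketch below assumes the rows have already been stripped of names and blanks; the class `PeriodExpansion` and its methods are hypothetical, and folding N and ? into X and O into a hyphen is this sketch's choice, not something the class is documented to do.

```java
public class PeriodExpansion {
    // Map an alternate input character onto the canonical one from the
    // state table (assumption of this sketch: N and ? become X, O becomes -).
    public static char canonical(char c) {
        char u = Character.toUpperCase(c);
        switch (u) {
            case 'N':
            case '?':
                return 'X';
            case 'O':
                return '-';
            default:
                return u;
        }
    }

    // rows[0] is the first species; a period in any later row means
    // "same as the corresponding site in rows[0]".
    public static String[] expand(String[] rows) {
        String first = rows[0];
        String[] out = new String[rows.length];
        for (int r = 0; r < rows.length; r++) {
            if (rows[r].length() != first.length()) {
                throw new IllegalArgumentException("Rows must have equal length");
            }
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < first.length(); i++) {
                char c = rows[r].charAt(i);
                if (c == '.') {
                    if (r == 0) {
                        throw new IllegalArgumentException("Period in first species");
                    }
                    c = first.charAt(i); // copy the first species' character
                }
                sb.append(canonical(c));
            }
            out[r] = sb.toString();
        }
        return out;
    }

    public static void main(String[] args) {
        String[] block = { "AAGCTNGGGC", "..G.CTT..." };
        for (String row : expand(block)) {
            System.out.println(row);
        }
    }
}
```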
Here are some more example DNA sequence files:
| Constructor Summary | |
|---|---|
| DnaSequenceList(DnaSequenceList list) | Construct a new DNA sequence list that is a copy of the given DNA sequence list. |
| Method Summary | | |
|---|---|---|
| int[] | countAbsentStates() | Determine the number of absent states after adding each sequence in this DNA sequence list to a tree. |
| int | exciseUninformativeSites() | Excise uninformative sites from the DNA sequences in this DNA sequence list. |
| int | informativeSiteCount() | Returns the number of informative sites in this DNA sequence list. |
| Iterator<DnaSequence> | iterator() | Returns an iterator for the DNA sequences in this list. |
| int | length() | Obtain this DNA sequence list's length. |
| static DnaSequenceList | read(File file) | Read a DNA sequence list from the given input file. |
| DnaSequence | seq(int i) | Get the DNA sequence at the given index in this DNA sequence list. |
| DnaSequenceTree | toTree(int[] signature) | Create a DNA sequence tree from this DNA sequence list and the given tree signature. |
| void | truncate(int len) | Truncate this DNA sequence list to the given length. |
| void | write(File file) | Write this DNA sequence list to the given output file. |
| void | write(File file, int sites, boolean periods, boolean bold) | Write this DNA sequence list to the given output file. |
| void | write(PrintStream ps, int sites, boolean periods, boolean bold) | Write this DNA sequence list to the given print stream in interleaved PHYLIP format. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
public DnaSequenceList(DnaSequenceList list)
Note: The DNA sequences in the new list are copies of (not references to) the DNA sequences in the given list.
list - DNA sequence list to copy.
NullPointerException - (unchecked exception) Thrown if list is null.

| Method Detail |
|---|
public int length()
public DnaSequence seq(int i)
i - Index, 0 ≤ i ≤ N−1, where N is this list's length.
ArrayIndexOutOfBoundsException - (unchecked exception) Thrown if i is out of bounds.
public static DnaSequenceList read(File file)
throws IOException
The DNA sequences' sites and names are read from the input file. The DNA sequences' scores are set to 0.
file - File.
NullPointerException - (unchecked exception) Thrown if file is null.
IOException - Thrown if an I/O error occurred or if the input file's contents were invalid.
public void write(File file)
throws IOException
file - File.
NullPointerException - (unchecked exception) Thrown if file is null.
IOException - Thrown if an I/O error occurred.
public void write(File file,
int sites,
boolean periods,
boolean bold)
throws IOException
file - File.
sites - Number of sites per output line.
periods - True to use periods, false not to use periods.
bold - True to mark informative sites in bold, false not to.
NullPointerException - (unchecked exception) Thrown if file is null.
IllegalArgumentException - (unchecked exception) Thrown if sites <= 10.
IOException - Thrown if an I/O error occurred.
public void write(PrintStream ps,
int sites,
boolean periods,
boolean bold)
throws IOException
ps - Print stream.
sites - Number of sites per output line.
periods - True to use periods, false not to.
bold - True to mark informative sites in bold, false not to.
NullPointerException - (unchecked exception) Thrown if ps is null.
IllegalArgumentException - (unchecked exception) Thrown if sites <= 10.
IOException - Thrown if an I/O error occurred.
public void truncate(int len)
len - Length.
NegativeArraySizeException - (unchecked exception) Thrown if len < 0.
public int exciseUninformativeSites()
Each site in the DNA sequences is either "uninformative" or "informative," defined as follows:
Since the uninformative sites do not affect the outcome of a maximum parsimony phylogenetic tree search, the uninformative sites can be omitted from the tree scoring process to save time. The informative sites do affect the outcome and must be included in the tree scoring process.
The exciseUninformativeSites() method removes the uninformative sites from the DNA sequences in this list. The DNA sequences' scores and names are unchanged.
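
As a rough illustration of excising sites, the sketch below drops columns from a set of equal-length strings. It assumes the common textbook criterion for a parsimony-informative site (at least two different characters each occur in at least two sequences) and compares raw characters rather than state sets, so it may not match the definition this class uses. The class `ExciseSketch` and its methods are hypothetical.

```java
public class ExciseSketch {
    // Textbook criterion (an assumption of this sketch): a site is informative
    // if at least two distinct characters each appear in two or more sequences.
    // Characters are assumed to be ASCII.
    public static boolean isInformative(String[] seqs, int site) {
        int[] counts = new int[128];
        for (String s : seqs) {
            counts[s.charAt(site)]++;
        }
        int repeated = 0;
        for (int c : counts) {
            if (c >= 2) repeated++;
        }
        return repeated >= 2;
    }

    // Return copies of the sequences with only the informative sites kept.
    public static String[] excise(String[] seqs) {
        int n = seqs[0].length();
        StringBuilder[] kept = new StringBuilder[seqs.length];
        for (int r = 0; r < seqs.length; r++) {
            kept[r] = new StringBuilder();
        }
        for (int site = 0; site < n; site++) {
            if (isInformative(seqs, site)) {
                for (int r = 0; r < seqs.length; r++) {
                    kept[r].append(seqs[r].charAt(site));
                }
            }
        }
        String[] out = new String[seqs.length];
        for (int r = 0; r < seqs.length; r++) {
            out[r] = kept[r].toString();
        }
        return out;
    }

    public static void main(String[] args) {
        String[] seqs = { "AAGT", "AAGC", "ACCT", "ACCG" };
        for (String s : excise(seqs)) {
            System.out.println(s);
        }
    }
}
```

Unlike the method documented here, this sketch returns new strings instead of modifying a list in place, which keeps the example free of any dependency on the library's classes.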
public int informativeSiteCount()
public int[] countAbsentStates()
public DnaSequenceTree toTree(int[] signature)
Note: The returned tree has references to (not copies of) the DNA sequences in this list.
signature - Tree signature (array of tree indexes).
public Iterator<DnaSequence> iterator()
iterator in interface Iterable<DnaSequence>