Using MUMmer to Identify Similar Regions in Large Sequence Sets

Internet, 2013-12-31

Abstract

The MUMmer sequence alignment package is a suite of computer programs designed to detect regions of homology in long biological sequences. Version 2.1 makes several improvements to the package, including: increased speed and reduced memory requirements; the ability to handle both protein and DNA sequences; the ability to handle multiple sequence fragments; and new algorithms for clustering together basic matches. The system is particularly efficient at comparing highly similar sequences, such as alternative versions of fragment assemblies or closely related strains of the same bacterium.

GO TO THE FULL PROTOCOL:

PDF or HTML at Wiley Online Library

Table of Contents

Basic Protocol 1: MUMmer2: Comparing a Set of Sequences to a Single Reference Sequence
Alternate Protocol 1: NUCmer: Comparing a Set of Sequences to Another Set of Sequences
Alternate Protocol 2: PROmer: Comparing Sequences Using Protein Translations
Alternate Protocol 3: MUMmer1: Aligning Two Single Sequences
Support Protocol 1: Obtaining and Installing the MUMmer Package
Guidelines for Understanding Results
Commentary
Figures

Materials

Basic Protocol 1: MUMmer2: Comparing a Set of Sequences to a Single Reference Sequence

Necessary Resources

Hardware
- Unix or Linux workstation. The largest program used in this protocol requires approximately 20 bytes of main memory per base of reference sequence plus 1 byte per base of query sequence. Thus, to compare 2 million bases of query sequence to 3 million bases of reference sequence, the computer should have at least (20 bytes × 3 million bases) + (1 byte × 2 million bases) = 62 MB of main memory.
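
The memory rule of thumb stated above can be written as a small calculation. This is only a sketch of the arithmetic in the text; the function name and its parameters are illustrative and are not part of the MUMmer package.

```python
def mummer2_memory_bytes(reference_bases: int, query_bases: int,
                         bytes_per_reference_base: int = 20,
                         bytes_per_query_base: int = 1) -> int:
    """Approximate main memory needed, using the per-base costs quoted in the text."""
    return (bytes_per_reference_base * reference_bases
            + bytes_per_query_base * query_bases)


# The example from the text: 3 million reference bases, 2 million query bases.
estimate = mummer2_memory_bytes(reference_bases=3_000_000, query_bases=2_000_000)
print(f"{estimate / 1e6:.0f} MB")  # prints "62 MB"
```

Because the reference sequence costs roughly 20 times more memory per base than the query, the shorter of two inputs is the cheaper choice for the reference role when memory is tight.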

Software
- MUMmer 2.12 package (see Support Protocol 1 for download and installation)

Files
- A multi‐FASTA query file and a single‐FASTA reference file (see appendix 1B for information on FASTA). The files used in this example are complete genomic sequences from two strains of Helicobacter pylori, known as 26695 and J99. These sequences can be downloaded from TIGR's Comprehensive Microbial Resource at http://www.tigr.org/CMR, from the NCBI at http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html, or from the Current Protocols in Bioinformatics Web site at http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm.

Alternate Protocol 1: NUCmer: Comparing a Set of Sequences to Another Set of Sequences

Necessary Resources

Hardware
- Unix or Linux workstation. The largest program in the suite requires approximately 20 bytes of main memory per base of reference sequence plus 1 byte per base of query sequence. Thus, to compare 2 million bases of query sequence to 3 million bases of reference sequence, the computer should have at least (20 bytes × 3 million bases) + (1 byte × 2 million bases) = 62 MB of main memory.

Software
- NUCmer is included in the MUMmer 2.12 package (see Support Protocol 1 for download and installation)

Files
- A multi‐FASTA query file and a multi‐FASTA reference file (see appendix 1B for information on FASTA). The files used in this example are sequences extracted from alignment regions of the H. pylori genomes used in Basic Protocol 1. File 26695parts.seq has five 2‐kb sequences extracted in order from file hp26695.seq, and file j99parts.seq has five corresponding 2‐kb sequences from file hpj99.seq but in permuted order, with two sequences reversed. The positions of the sequences in the files from which they were extracted are indicated in the FASTA header lines in the files. These files can be obtained from the Current Protocols in Bioinformatics Web site at http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm. The first field after the > on the FASTA header line of each sequence is used to identify that sequence; these fields should therefore be unique both within and between the query and reference files.
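
The uniqueness requirement on the first header field can be checked before running the alignment. The following is a minimal sketch, not a tool from the MUMmer package; the function names are hypothetical, and the header lines in the example are invented for illustration.

```python
from collections import Counter
from typing import Iterable, List


def fasta_tags(lines: Iterable[str]) -> List[str]:
    """Return the first whitespace-delimited field after '>' on each header line."""
    tags = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            tags.append(fields[0] if fields else "")
    return tags


def duplicated_tags(reference_lines: Iterable[str],
                    query_lines: Iterable[str]) -> List[str]:
    """List tags that occur more than once within or between the two files."""
    counts = Counter(fasta_tags(reference_lines) + fasta_tags(query_lines))
    return sorted(tag for tag, n in counts.items() if n > 1)


# Invented header lines; "seqA" is used in both files and is therefore reported.
reference = [">seqA 100 2100", "ACGT", ">seqB 5000 7000", "GGCC"]
query = [">seqC 1 2000", "TTAA", ">seqA 300 2300", "ACGT"]
print(duplicated_tags(reference, query))  # prints ['seqA']
```

Open file objects can be passed in place of the lists, since the functions only iterate over lines.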

Alternate Protocol 2: PROmer: Comparing Sequences Using Protein Translations

Necessary Resources

Hardware
- Unix or Linux workstation. The largest program in the suite requires approximately 20 bytes of main memory per base of reference sequence plus 1 byte per base of query sequence. Thus, to compare 2 million bases of query sequence to 3 million bases of reference sequence, the computer should have at least (20 bytes × 3 million bases) + (1 byte × 2 million bases) = 62 MB of main memory.

Software
- PROmer is included in the MUMmer 2.12 package (see Support Protocol 1 for download and installation)

Files
- A multi‐FASTA query file and a multi‐FASTA reference file (see appendix 1B for information on FASTA). The files used in this example are sequences extracted from alignment regions of the H. pylori genomes used in Basic Protocol 1. File 26695parts.seq has five 2‐kb sequences extracted in order from file hp26695.seq, and file j99parts.seq has five corresponding 2‐kb sequences from file hpj99.seq but in permuted order, with two sequences reversed. The positions of the sequences in the files from which they were extracted are indicated in the FASTA header lines in the files. These files can be obtained from the Current Protocols in Bioinformatics Web site at http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm. The first field after the > on the FASTA header line of each sequence is used to identify that sequence; these fields should therefore be unique both within and between the query and reference files.

Alternate Protocol 3: MUMmer1: Aligning Two Single Sequences

Necessary Resources

Hardware
- Same as for the MUMmer2 protocol (Basic Protocol 1), except that more memory is required: ∼25 bytes per base of both the query and reference sequences. Thus, comparing two 2‐megabase genomes requires ∼100 MB of main memory.

Software
- The MUMmer1 script is included in the MUMmer 2.12 package (see Support Protocol 1 for download and installation)

Files
- A multi‐FASTA query file and a single‐FASTA reference file (see appendix 1B for information on FASTA). The files used in this example are complete genomic sequences from two strains of Helicobacter pylori, known as 26695 and J99. These sequences can be downloaded from TIGR's Comprehensive Microbial Resource at http://www.tigr.org/CMR, from the NCBI at http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html, or from the Current Protocols in Bioinformatics Web site at http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm. These are the same files used in the example for Basic Protocol 1.

Figures

Figure 10.3.1 Beginning of output from program mummer2 for H. pylori strains 26695 and J99. The first line is the FASTA header line from the query sequence in the file hpj99.seq. The following lines list the locations of all MUMs between this sequence and the reference sequence, which was in the file hp26695.seq. The first column is the beginning position of the MUM in the reference sequence 26695; the second column is the beginning position of the MUM in the query sequence J99; and the third column is the length of the MUM. The MUMs are listed in order by position in the query sequence.

Figure 10.3.2 Beginning of the second part of the output from program mummer2 for H. pylori strains 26695 and J99 showing the matches for the reverse‐complement strand (indicated by the word Reverse at the end of the header line) of the query sequence J99. The format is the same as in the previous figure. If the query file contained more than one sequence, the matches for each sequence would follow in the same format, with each set of matches preceded by the FASTA header line of the sequence in the query file.
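
The layout described in the two captions above (a header line, then rows of reference start, query start, and length) can be read with a short parser. This is a sketch under stated assumptions: header lines are taken to begin with ">", the reverse‐strand block is recognized only by a trailing word Reverse, and the sample values are invented rather than taken from the H. pylori output. The names Mum and parse_mummer2_output are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Mum(NamedTuple):
    ref_start: int    # column 1: start of the MUM in the reference
    query_start: int  # column 2: start of the MUM in the query
    length: int       # column 3: length of the MUM


def parse_mummer2_output(lines: Iterable[str]) -> Dict[Tuple[str, bool], List[Mum]]:
    """Group MUM rows by (first header field, is_reverse_strand)."""
    matches: Dict[Tuple[str, bool], List[Mum]] = {}
    key = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            tag = words[0] if words else ""
            # Assumption: the reverse-strand block ends its header with "Reverse".
            is_reverse = len(words) > 1 and words[-1] == "Reverse"
            key = (tag, is_reverse)
            matches.setdefault(key, [])
            continue
        if key is None:
            raise ValueError("MUM row found before any header line")
        ref_start, query_start, length = (int(v) for v in line.split()[:3])
        matches[key].append(Mum(ref_start, query_start, length))
    return matches


# Invented sample in the layout described by the captions.
sample = """> hpj99
   100   120   25
   300   340   40
> hpj99 Reverse
   900    15   30
"""
result = parse_mummer2_output(sample.splitlines())
print(len(result[("hpj99", False)]))  # prints 2
print(len(result[("hpj99", True)]))   # prints 1
```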

Figure 10.3.3 Part of the output from program mgaps showing the beginning of one of the clusters of MUMs. As before, the FASTA header lines from the query file are reproduced and clusters for the same query are separated by lines containing a single # character. The first three columns are the MUMs. The fourth column indicates any overlap between MUMs. The final two columns are the number of characters between the start of the MUM and the end of the preceding MUM in the reference and query sequence, respectively.
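
The cluster layout described in this caption can be split into per‐query lists of clusters. The sketch below relies only on what the caption states: header lines from the query file, a line containing a single # between clusters, and three leading MUM columns. The remaining columns are kept as raw text because their exact spelling in the file is not specified here; the placeholder values in the sample (none and -) are assumptions, and the function name is hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# (reference start, query start, length, remaining columns as raw text)
Row = Tuple[int, int, int, List[str]]


def parse_mgaps_clusters(lines: Iterable[str]) -> Dict[str, List[List[Row]]]:
    """Return, for each query header, its clusters as lists of rows."""
    clusters: Dict[str, List[List[Row]]] = {}
    header: Optional[str] = None
    current: List[Row] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">") or line == "#":
            # A header or a separator closes the cluster being collected.
            if current and header is not None:
                clusters[header].append(current)
                current = []
            if line.startswith(">"):
                header = line[1:].strip()
                clusters.setdefault(header, [])
            continue
        if header is None:
            raise ValueError("cluster row found before any header line")
        fields = line.split()
        ref_start, query_start, length = (int(v) for v in fields[:3])
        current.append((ref_start, query_start, length, fields[3:]))
    if current and header is not None:
        clusters[header].append(current)
    return clusters


# Invented sample: two clusters for one query sequence.
sample = [
    "> hpj99",
    "100 120 25 none - -",
    "140 160 30 none 15 15",
    "#",
    "5000 7000 40 none - -",
]
parsed = parse_mgaps_clusters(sample)
print(len(parsed["hpj99"]))  # prints 2
```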

Figure 10.3.4 Part of the output from program combineMUMs showing the beginning of the same cluster as in the previous figure. Note the alignment above for MUM 131785 124710 17, which overlaps the preceding MUM by 6 characters (indicated by the −6 in the column that has none when there is no overlap). The overlap can be seen in that the last 6 characters before the gap match the last 6 characters in the gap. The phenomenon of overlapping MUMs generally indicates a variation in the number of occurrences of a tandem repeat in the two sequences. In this instance, the 63‐character insertion occurred within a repeated run of gtttt, where the B sequence had one more occurrence of gtttt than did the A sequence.

Figure 10.3.5 Contents of file nuc.coords. This file summarizes every match found between a query sequence and a reference sequence. The first two columns indicate the position of the match in the reference sequence. The next two columns are the position of the match in the query sequence. Note that if the start position [S2] is greater than the end position [E2] then the match is to the reverse‐complement strand of the query sequence. Columns 5 and 6 are the lengths of the matches in the respective sequences. Column 7 is the percentage of bases that match between the two sequences. Columns 8 and 9 are the tags of the sequences—these are the first fields on the FASTA header lines of the sequence in the input files. Optionally, in column 10 there may be a note about the match (not shown). A note of [DUPLICATE] means this match is identical to the one preceding it. [CONTAINS] indicates this match contains the preceding match within it. [SHADOWED] indicates this match is contained within the preceding match. [OVERLAPS] indicates this match shares some positions with the preceding match. These conditions occur because overlapping MUMs may fail to combine with one another.
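
The column meanings in this caption map naturally onto a small record type. The sketch below assumes each row has already been split into the ten fields listed in the caption; how the columns are delimited in the actual file is not stated here, so no file parsing is attempted. The class and function names are hypothetical, and the sample row is invented.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class CoordsRow:
    s1: int                 # column 1: match start in the reference
    e1: int                 # column 2: match end in the reference
    s2: int                 # column 3: match start in the query
    e2: int                 # column 4: match end in the query
    len1: int               # column 5: match length in the reference
    len2: int               # column 6: match length in the query
    pct_identity: float     # column 7: percentage of matching bases
    tag1: str               # column 8: sequence tag, in file order
    tag2: str               # column 9: sequence tag, in file order
    note: Optional[str] = None  # column 10, e.g. [DUPLICATE] or [SHADOWED]

    @property
    def on_reverse_strand(self) -> bool:
        """True when S2 > E2, i.e. the match is to the reverse complement of the query."""
        return self.s2 > self.e2


def row_from_fields(fields: Sequence[str]) -> CoordsRow:
    """Build a CoordsRow from nine or ten already-separated text fields."""
    note = fields[9] if len(fields) > 9 else None
    return CoordsRow(int(fields[0]), int(fields[1]), int(fields[2]), int(fields[3]),
                     int(fields[4]), int(fields[5]), float(fields[6]),
                     fields[7], fields[8], note)


# Invented row: S2 (2000) is greater than E2 (1), so this is a reverse-strand match.
row = row_from_fields("1 2000 2000 1 2000 2000 98.50 refA qryB".split())
print(row.on_reverse_strand)  # prints True
```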

Figure 10.3.6 Beginning of an alignment created by the show‐aligns program for NUCmer output.

Figure 10.3.7 Beginning of file pro.coords. This file summarizes every match found between a query sequence and a reference sequence. The first 7 and last 3 columns are as in the nuc.coords file (Figure 10.3.5). Column 8 [%SIM] is the percent similarity according to the specified BLOSUM matrix. Column 9 [%STP] is the percentage of codons in the match that are stop codons. Columns 10 and 11 [FRM] indicate the reading frame of the match in the reference and query sequence, respectively: positive for the forward strand, negative for the reverse strand.

Figure 10.3.8 Beginning of an alignment created by the show‐aligns program for PROmer output.

Figure 10.3.9 Dot plot of MUMs between H. pylori strains 26695 and J99. MUMs between both forward strands are shown in red and MUMs between the forward and reverse strands are shown in blue. Each line segment connects the start position of the MUM to the end position of the MUM.

Literature Cited
	Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403‐410.
	Arabidopsis Genome Initiative 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796‐815.
	Chang, W.I. and Lawler, E.L. 1994. Sublinear expected time approximate string matching and biological applications. Algorithmica 12:327‐344.
	Chao, K.M., Zhang, J., Ostell, J., and Miller, W. 1995. A local alignment tool for very long DNA sequences. Comput. Appl. Biosci. 11:147‐153.
	Delcher, A.L., Kasif, S., Fleischmann, R.D., Peterson, J., White, O., and Salzberg, S.L. 1999. Alignment of whole genomes. Nucleic Acids Res. 27:2369‐2376.
	Delcher, A.L., Phillippy, A., Carlton, J., and Salzberg, S.L. 2002. Fast algorithms for large‐scale genome alignment and comparison. Nucleic Acids Res. 30:2478‐2483.
	Eisen, J.A., Heidelberg, J.F., White, O., and Salzberg, S.L. 2000. Evidence for symmetric chromosomal inversions around the replication origin in bacteria. Genome Biol 1:research11.01‐09.
	Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York.
	Henikoff, J.G., Pietrokovski, S., McCallum, C.M., and Henikoff, S. 2000. Blocks‐based methods for detecting protein homology. Electrophoresis 21:1700‐1706.
	Kurtz, S. 1999. Reducing the space requirement of suffix trees. Software Practice and Experience 29:1149‐1171.
	Lin, X., Kaul, S., Rounsley, S., Shea, T.P., Benito, M.I., Town, C.D., Fujii, C.Y., Mason, T., Bowman, C.L., and Barnstead, M. et al. 1999. Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana. Nature 402:761‐768.
	Mural, R.J., Adams, M.D., Myers, E.W., Smith, H.O., Miklos, G.L.G., Wides, R., Halpern, A., Li, P.W., Sutton, G., and Nadeau, J. et al. 2002. A comparison of whole‐genome shotgun‐derived mouse chromosome 16 and the human genome. Science 296:1661‐1671.
	Pearson, W.R. 2000. Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 132:185‐219.
	Perna, N.T., Plunkett, G., 3rd, Burland, V., Mau, B., Glasner, J.D., Rose, D.J., Mayhew, G.F., Evans, P.S., Gregor, J., and Kirkpatrick, H.A. et al. 2001. Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature 409:529‐533.
	Schwartz, S., Zhang, Z., Frazer, K.A., Smit, A., Riemer, C., Bouck, J., Gibbs, R., Hardison, R., and Miller, W. 2000. PipMaker—a web server for aligning two genomic DNA sequences. Genome Res. 10:577‐586.
	Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., and Holt, R.A. et al. 2001. The sequence of the human genome. Science 291:1304‐1351.
Key References
	Delcher et al., 1999. See above.
	This describes the original MUMmer1 algorithm.
	Delcher et al., 2002. See above.
	This describes the enhancements in version 2 of MUMmer, including improved efficiency, more flexible clustering and alignment options, and the ability to handle files with multiple sequences.
	Gusfield, 1997. See above.
	This is a comprehensive treatment of suffix trees and sequence alignment algorithms for those interested in computer science details.
Internet Resources
	http://www.tigr.org/software/mummer
	The MUMmer homepage.
