This unit outlines a variety of methods by which DNA sequences can be manipulated by computer. Procedures for entering sequence data into the computer and assembling raw sequence data into a contiguous sequence are described first, followed by methods of analyzing and manipulating sequences (e.g., verifying sequences, constructing restriction maps, designing oligonucleotides, identifying protein‐coding regions, and predicting secondary structures). This unit also provides information on the large amount of software available for sequence analysis. The appendix to this unit lists some of the commercial software, shareware, and free software related to DNA sequence manipulation. The goal of this unit is to serve as a starting point for researchers interested in utilizing the tremendous sequencing resources available to the computer‐knowledgeable molecular biology laboratory.

  • Sequence Data Entry
  • Sequence Data Verification
  • Restriction Mapping
  • Prediction of Nucleic Acid Structure
  • Oligonucleotide Design Strategy
  • Identification of Protein‐Coding Regions
  • Homology Searching
  • Genetic Sequence Databases and Other Electronic Resources Available to Molecular Biologists
  • Figure 7.7.1 Commonly used sequence file formats having specific defined elements and defining codes. (A) EMBL comment lines begin with two‐letter codes: ID, short sequence name; DE, description; and SQ, sequence length. DNA or protein sequence follows; sequence end is denoted by two slashes (//) on a separate line. (B) GenBank comments precede the sequence and are separated from it by the code “ORIGIN”. Sequence end is denoted by two slashes on a separate line. The actual text of this entry has been abbreviated; see Fig. 19.2.3 for a more complete example of a GenBank file. (C) GCG comments precede the sequence and are separated from it by two dots (..). (D) Intelligenetics comment lines begin with semicolons (;). A single description line follows, and then the sequence begins on a separate line. Sequence end is denoted by a numeral one (1). (E) NBRF (also called PIR format) first line starts with four required characters: a greater‐than sign (>); either “D” for DNA or “P” for protein; either “L” for linear or “C” for circular; and a semicolon. The short sequence name follows on the same line. The next line is a description line. Sequence starts on a new line and its end is denoted by an asterisk (*). (F) DNA Strider Text is similar to the Intelligenetics format, but lacks the description line. (G) FASTA (sometimes called Pearson format) first line begins with a greater‐than sign (>), followed by the sequence name and a short description. Sequence data then starts on a separate line. Note: Some formats (including GenBank, GCG, and NBRF) allow numbers to be included within the sequence for ease of reading (the numbers are ignored during sequence analysis).
  • Figure 7.7.2 Multiple sequence editor. The GCG program GELASSEMBLE displays the aligned sequences on the top of the screen and a schematic of the sequenced fragments on the bottom. Arrows indicate the direction of sequencing; the asterisk in the lower part of the display indicates the position of the cursor in the sequence alignment as the user edits the sequence.
  • Figure 7.7.3 One type of graphical restriction map. This figure was produced by the free PlotZ program; GCG MAPPLOT produces similar output.

  • Figure 7.7.4 (A) Text‐based output from Zuker's RNA‐folding program, available from GCG under the name of FOLD. This type of representation is difficult to visualize, but acceptable when only a quick view of the possible folded structures is desired. (B) Graphic representation of the structure shown in part A, produced by the GCG Squiggles program. The free LoopViewer program (for Macintosh) produces similar representations.

  • Figure 7.7.5 Text of a message sent to the EBI FASTA mail server (e‐mail address: ). This message requests that the sequence be searched against the Other Mammalian section of the EMBL database. The answer will include the top 100 matching sequences and alignments of the top 20 matching sequences.
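The format rules summarized in the Figure 7.7.1 caption can be turned into a rough format check. The following Python sketch is an illustration only, not part of the unit: the function names (`guess_format`, `clean_sequence`, `read_fasta`) are hypothetical, and the tests rely solely on the markers named in the caption, so real files with unusual headers may be misclassified.

```python
import re


def guess_format(text: str) -> str:
    """Guess a sequence file format from the markers listed in Figure 7.7.1.

    Heuristic sketch: it looks only at the defining codes named in the
    caption and does not validate the file.
    """
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "unknown"
    first = lines[0]
    # NBRF/PIR: '>' + D or P + L or C + ';' (the four required characters)
    if re.match(r">[DP][LC];", first):
        return "NBRF"
    # FASTA: any other first line beginning with '>'
    if first.startswith(">"):
        return "FASTA"
    # Intelligenetics and DNA Strider both open with ';' comment lines;
    # the caption gives no marker that separates them reliably.
    if first.startswith(";"):
        return "Intelligenetics or DNA Strider"
    # GenBank: comments are separated from the sequence by "ORIGIN"
    if any(ln.startswith("ORIGIN") for ln in lines):
        return "GenBank"
    # EMBL: two-letter line codes starting with ID, ending with '//'
    if first.startswith("ID ") and lines[-1] == "//":
        return "EMBL"
    # GCG: comments are separated from the sequence by two dots
    if any(ln.endswith("..") for ln in lines):
        return "GCG"
    return "unknown"


def clean_sequence(raw: str) -> str:
    """Drop position numbers and whitespace, which analysis programs ignore."""
    return re.sub(r"[\d\s]", "", raw)


def read_fasta(text: str) -> tuple[str, str, str]:
    """Split a single-record FASTA text into (name, description, sequence)."""
    header, *seq_lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    name, _, description = header[1:].partition(" ")
    return name, description, clean_sequence("".join(seq_lines))
```

For example, `guess_format(">DL;MYSEQ\nexample\nACGT*")` returns `"NBRF"`, while the same text with a plain `>MYSEQ` header is reported as `"FASTA"`. The NBRF test has to run before the FASTA test because both formats begin with a greater‐than sign.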


