Mapping, PCR, and DNA Sequencing
Software for:
Making Restriction Maps
Designing PCR primers
Assembling fragments, detecting SNPs, and managing raw sequence data
For: GCG, Mac, PC, UNIX, and on the Web
Restriction Mapping
-Making restriction maps was my first use of “Molecular Biology” software
-Making restriction maps is a routine lab activity that is necessary for any type of cloning project.
-High quality maps are important for publications and exchange of information between researchers or between labs.
Archiving Data
-Maps are a common way for labs to archive information about entire libraries of plasmid constructs.
-It is important that map data are stored in a reliable format so that the obsolescence of a particular computer program does not render archives unusable.
Mapping Software
-Programs vary greatly in sophistication and ease of use
-Simple drawing programs (vector graphics)
-The venerable DNA Strider
-GCG (not a strong point of the package)
-Comprehensive Mac/PC MolBio programs
-Dedicated plasmid drawing programs
-Can it be done on the Web?
-Making high quality graphical restriction maps is one area where Mac/PC programs are much better than GCG or the Web
GCG Mapping Programs
GCG has a full set of functional restriction mapping tools.
This may be your best bet if you simply need to locate some restriction sites to plan a cloning project.
MAP allows you to search for any enzyme site in REBASE.
MAPPLOT makes a graphical output from MAP.
MAPSORT simulates a restriction digest and predicts the sizes of digest products with any combination of enzymes.
FINDPATTERNS allows you to search for short sequence patterns (enzyme sites, promoters, enhancers, etc.).
PLASMIDMAP is a GCG program that produces a "publication quality circular map" of a plasmid construct.
(I know of no one who has ever successfully used this program.)
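The kind of digest prediction that MAPSORT performs can be sketched in a few lines of Python. This is an illustrative sketch, not GCG's code: the small `ENZYMES` table and the function names are assumptions made for the example, and it only handles palindromic sites on the top strand.

```python
# Hypothetical mini enzyme table: name -> (recognition site, cut offset in site).
# A real tool would load the full enzyme list from REBASE.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "BamHI": ("GGATCC", 1),    # G^GATCC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
}


def cut_positions(seq, enzyme_names):
    """Return the sorted top-strand cut positions for the chosen enzymes."""
    seq = seq.upper()
    cuts = set()
    for name in enzyme_names:
        site, offset = ENZYMES[name]
        start = seq.find(site)
        while start != -1:
            cuts.add(start + offset)
            start = seq.find(site, start + 1)
    return sorted(cuts)


def digest_linear(seq, enzyme_names):
    """Fragment lengths (in order) from digesting a linear molecule."""
    bounds = [0] + cut_positions(seq, enzyme_names) + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def digest_circular(seq, enzyme_names):
    """Fragment lengths from digesting a circular plasmid.

    Simplification: sites that span the origin of the sequence are not found.
    """
    cuts = cut_positions(seq, enzyme_names)
    if not cuts:
        return [len(seq)]  # uncut circle
    frags = [b - a for a, b in zip(cuts, cuts[1:])]
    frags.append(len(seq) - cuts[-1] + cuts[0])  # fragment across the origin
    return frags


plasmid = "AAAAGAATTCTTTTTTGGATCCAAAA"
print(digest_linear(plasmid, ["EcoRI", "BamHI"]))
print(digest_circular(plasmid, ["EcoRI", "BamHI"]))
```

A double digest is just the union of the cut positions, which is why any combination of enzymes can be handled by the same function.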
Using MAP
MAP is the main GCG restriction mapping program. Like a lot of GCG programs, it is very powerful and quite complex.
-Restriction sites can be mapped for all enzymes (the default), or a set of enzymes that you specify by name.
-You can also select just enzymes with 4, 5, or 6 base recognition sites; and 5’, 3’, or blunt end cutters.
-You can allow a single base mismatch between the enzyme recognition site and your target sequence.
-The output can be viewed as a linear map or in a table format.
-MAP provides protein translations (in 3, 6, or any single reading frame).
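The single-mismatch option can be illustrated with a simple sliding-window search. This is only a sketch of the idea, not how MAP is implemented; the function name and return format are assumptions.

```python
def find_sites(seq, site, max_mismatches=1):
    """Return (position, matched text, mismatch count) for each window of seq
    that differs from the recognition site at no more than max_mismatches bases."""
    seq, site = seq.upper(), site.upper()
    hits = []
    for i in range(len(seq) - len(site) + 1):
        window = seq[i:i + len(site)]
        mismatches = sum(1 for a, b in zip(window, site) if a != b)
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits


# One exact EcoRI site and one near-site (GACTTC) with a single mismatch.
for hit in find_sites("TTGAATTCAAGACTTCTT", "GAATTC"):
    print(hit)
```

Near-sites are useful when planning mutagenesis, since a single base change would create a new restriction site at that position.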
Web Mapping Tools
There are some free mapping tools on the web for finding restriction sites and making text maps, but not for nice graphical maps.
Webcutter (Max Heiman, Yale Univ.): http://www.firstmarket.com/cutter/cut2.html
EMBOSS Restrict (EMBL Institut Pasteur)
http://bioweb.pasteur.fr/seqanal/interfaces/restrict.html
Restriction Maps (Colorado State Univ.)
http://arbl.cvmbs.colostate.edu/molkit/mapper/index.html (uses Java)
WebCutter
Webcutter is a free online tool for restriction mapping a nucleotide sequence (text output).
http://www.firstmarket.com/cutter/cut2.html
Webcutter includes the option of finding restriction sites that can be introduced into a sequence by silent mutagenesis.
EMBOSS Restrict
http://bioweb.pasteur.fr/seqanal/interfaces/restrict.html
A JAVA Program
http://arbl.cvmbs.colostate.edu/molkit/mapper/index.html
From R. A. Bowen at Colorado State University.
Java applets displayed on a web page cannot be printed directly; you can only grab screen shots, which are useless for publication.
Mac & PC Mapping Programs
Restriction mapping is one of the simplest molecular biology computing tasks - many MolBio software packages provide this function:
DNA Strider (very old, Mac only) (download from: http://endeavor.med.nyu.edu)
MacVector (Mac only, RCR has a site license)
OMIGA (Windows only, RCR has a site license)
Sequencher (Mac and Windows, RCR has a site license)
Gene Construction Kit(Mac only)
Vector NTI(Mac and Windows)
Plasmid Premier (Windows only)
DNA Strider is simple, but still elegant
Vector NTI
Vector NTI puts the plasmid map at the center of all program functions.
PCR Primer Design
The design of PCR (and sequencing) primers is relatively simple from a computational point of view: just search along a sequence and find short sub-sequences that fit certain criteria.
However, since the molecular biology of PCR is very complex, the nature of these criteria is not at all obvious.
All primer design software uses approximately the same criteria and computing algorithms. Graphical output is not necessary.
Molecular Biology of PCR
The fundamental Molecular Biology of PCR is not well understood.
We know what happens in a descriptive sense, but not the physical chemistry/thermodynamics.
The rules for choosing PCR primers are a rough combination of educated guesses and old fashioned trial-and-error.
None of the published formulas for calculating annealing temperatures has been proven to give better than a rough estimate.
The PCR Process
In a nutshell, PCR works like this:
DNA and two primers are combined in a salt solution with dNTPs and a heat stable DNA polymerase enzyme.
The primers match some sequence in the target DNA.
The solution is rapidly heated to DNA denaturing temperatures (~95°C) and cooled to a temperature where the polymerase can function.
Each thermal cycle generates copies of the sequence between the primers, so the total number of fragments amplifies in an exponential fashion: 2, 4, 8, 16, 32, 64, etc.
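The doubling arithmetic is easy to check with a short Python function. The `efficiency` parameter is an assumption added for illustration (1.0 means perfect doubling every cycle, which real reactions do not sustain).

```python
def copies_after(cycles, start_copies=1, efficiency=1.0):
    """Idealized copy number after a given number of thermal cycles.

    efficiency is the fraction of templates copied per cycle (assumed constant).
    """
    return start_copies * (1 + efficiency) ** cycles


# Perfect doubling reproduces the series 2, 4, 8, 16, 32, 64.
print([int(copies_after(n)) for n in range(1, 7)])

# Thirty ideal cycles from a single template molecule.
print(int(copies_after(30)))
```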
Primer Design Rules
primers should be at least 15 base pairs long
have at least 50% G/C content
anneal at a temperature in the range of 50-65 degrees C
Usually higher annealing temperatures (Tm) are better (i.e. more specific for your desired target)
forward and reverse primer should anneal at approximately the same temperature
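These rules translate directly into a simple screening function. The sketch below uses the Wallace rule, Tm = 2(A+T) + 4(G+C), as a stand-in for the annealing temperature; it is only a rough estimate intended for short oligos, and the 5 degree limit for matching a primer pair is an assumption chosen for the example.

```python
def gc_fraction(primer):
    """Fraction of bases that are G or C."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def check_primer(primer, min_len=15, min_gc=0.5, tm_range=(50, 65)):
    """Return a list of rule violations (empty if the primer passes)."""
    problems = []
    if len(primer) < min_len:
        problems.append(f"shorter than {min_len} bases")
    if gc_fraction(primer) < min_gc:
        problems.append(f"G/C content below {min_gc:.0%}")
    tm = wallace_tm(primer)
    if not tm_range[0] <= tm <= tm_range[1]:
        problems.append(f"estimated Tm {tm} C outside {tm_range[0]}-{tm_range[1]} C")
    return problems


def tm_matched(fwd, rev, max_diff=5):
    """True if the two primers have similar estimated Tm (max_diff is an assumption)."""
    return abs(wallace_tm(fwd) - wallace_tm(rev)) <= max_diff


print(check_primer("ATGCGTACGTTAGCCGAT"))  # passes all three rules
print(check_primer("ATATAT"))              # fails all three rules
```

Primer design programs essentially slide a window along the template and keep the candidates for which a check like this returns no problems.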
Primer Problems
primers should flank the sequence of interest
primer sequences should be unique
primers that match multiple sequences will give multiple products
repeated sequences can be amplified - but only if unique flanking regions can be found where primers can bind
primers can have self-annealing regions within each primer (i.e. hairpin and foldback loops)
pairs of primers can anneal to each other to form the dreaded "primer dimers"
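A crude primer-dimer screen checks whether the 3' end of one primer can base-pair with any stretch of the other. This is a minimal sketch under simplifying assumptions: it requires a perfect match of `k` bases (the default of 4 is arbitrary) and ignores thermodynamics entirely.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def three_prime_dimer(primer_a, primer_b, k=4):
    """True if the last k bases of primer_a can pair perfectly with primer_b.

    Pass the same primer twice to screen for self-dimers.
    """
    tail = primer_a.upper()[-k:]
    return reverse_complement(tail) in primer_b.upper()


fwd = "ATCGATTACGGCTA"
rev = "GGTTTAGCAAGT"   # contains TAGC, the reverse complement of fwd's GCTA tail
print(three_prime_dimer(fwd, rev))   # the pair can form a dimer
print(three_prime_dimer(fwd, fwd))   # no self-dimer at the 3' end
```

The 3' end matters most because that is where the polymerase extends, so even a short paired stretch there can be amplified.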
Differential Primers
-New challenges for PCR primer design
gene-specific primers (for multi-gene families)
identify specific species or strains of organisms
molecular diagnostics/detectors
Consensus primers amplify a gene from all of a diverse group of organisms (e.g. bacterial 16S rDNA)
-Need to work with multiple alignments and find differential or conserved regions
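Finding conserved regions in a multiple alignment can be sketched as a column-by-column scan. The function below is a hypothetical helper, assuming the alignment is a list of equal-length strings with "-" for gaps; it reports windows wide enough to hold a consensus primer.

```python
def conserved_windows(alignment, width):
    """Start columns of windows in which every sequence is identical and gap-free."""
    length = len(alignment[0])
    conserved = [
        len(set(column)) == 1 and column[0] != "-"
        for column in zip(*alignment)
    ]
    return [
        start
        for start in range(length - width + 1)
        if all(conserved[start:start + width])
    ]


alignment = [
    "ACGTACGTAA",
    "ACGTTCGTAA",
    "ACGTACGTCA",
]
print(conserved_windows(alignment, 3))
```

For differential (gene-specific or strain-specific) primers the logic is inverted: look for windows where the target sequence differs from all the others, ideally at the 3' end of the primer.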
Other Technologies
-Multiplex PCR
-GeneScan (ABI)
-PCR related technologies
-Primer extension
-Taqman (ABI)
-Orchid
-Pyrosequencing
-Ligase chain reaction
-Oligos for microarrays
GCG PRIME
The GCG program PRIME is a good tool for the design of primers for PCR and sequencing
For PCR primer pair selection, you can choose a target range of the template sequence to be amplified
In selecting appropriate primers, PRIME allows you to specify a variety of constraints on the primer and amplified product sequences.
-upper and lower limits for primer and product melting temperatures
-primer and product GC contents.
-a range of acceptable primer sizes
-a range of acceptable product sizes.
-required bases at the 3' end of the primer (3' clamp)
-maximum difference in melting temperatures between a pair of PCR primers
Other Features of PRIME
PRIME uses a simulated annealing test to check individual primers for self-complementarity and to check the two primers in a PCR primer pair for complementarity to each other.
Using this same annealing test, PRIME optionally can screen against non-specific primer binding on the template sequence and on any repeated sequences you specify.
Primer Design on the Web
There are a bunch of good PCR primer design programs on the web:
Primer 3 at the MIT Whitehead Institute
http://www.genome.wi.mit.edu/cgi-bin/primer/primer3_www.cgi
Cassandra at the Univ. of Southern California
http://www-hto.usc.edu/software/procrustes/cassandra/cass_frm.html
GeneFisher by Folker Meyer & Chris Schleiermacher at Bielefeld University, Germany
http://bibiserv.TechFak.Uni-Bielefeld.DE/genefisher/
Xprimer at the Virtual Genome Center, Univ. Minnesota Medical School
http://alces.med.umn.edu/rawprimer.html
Mac/PC Software
There are a number of (expensive) dedicated PCR primer design programs for personal computers that have “special features” such as nested and multiplex PCR:
Oligo (Molecular Biology Insights, Inc.)
Primer Premier (Premier Biosoft)
Many of the comprehensive MolBio programs also have PCR features:
MacVector
OMIGA
Vector NTI
GeneTool
Using Computers for DNA Sequencing
The Biological Basis of DNA Sequencing Technology
Virtually all DNA sequencing (both automated and manual) relies on the Sanger method:
DNA replication with dideoxy chain termination, followed by separation of the resulting molecules by polyacrylamide gel electrophoresis.
The DNA fragment to be sequenced must first be cloned into a vector (plasmid or lambda).
Then the cloned DNA must be copied in a test tube (in vitro ) by a DNA polymerase enzyme to obtain a sufficient quantity to be sequenced.
Limitations of the technology
Sequences can only be determined in approximately 400-800 base pair chunks known as “reads.”
This is due to both the biochemistry of the DNA polymerase enzyme and the resolution of polyacrylamide gel electrophoresis.
Most genes contain many thousands of bp and many modern sequencing projects are intended to produce complete sequences of large genomic regions (millions of bp).
Assembly of Contigs
As a result, all sequencing projects must involve the division of the target DNA into a set of overlapping ~500 bp fragments,
and then the assembly of these fragments into complete sequences (contigs).
Contig=contiguous sequenced region
Assembly of overlapping fragments is a computational problem.
Contig Assembly Problems
1) The 500 bp reads of sequence data have errors of both incorrectly determined bases and insertions/deletions.
2) The error rate is highest at the beginning and ends of the reads - precisely the regions that must be overlapped.
3) Some sequence from cloning vectors is often included at the ends of sequence reads.
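The third problem, vector sequence at the start of a read, can be handled by comparing the beginning of each read against the known end of the vector. The sketch below is an assumption-laden illustration (function name, `min_match`, and `max_mismatches` are all made up for the example), not the method used by any particular package.

```python
def trim_vector_prefix(read, vector_end, min_match=8, max_mismatches=1):
    """Strip a leading stretch of the read that matches the end of the vector.

    Tries the longest possible match first and tolerates a few mismatches,
    since the start of a read is error-prone.
    """
    read, vector_end = read.upper(), vector_end.upper()
    longest = min(len(read), len(vector_end))
    for size in range(longest, min_match - 1, -1):
        mismatches = sum(
            a != b for a, b in zip(vector_end[-size:], read[:size])
        )
        if mismatches <= max_mismatches:
            return read[size:]
    return read  # no vector found


vector_end = "GGATCCTCTAGAGTCGAC"          # hypothetical polylinker sequence
read = "TCTAGAGTCGAC" + "ATGGCCATTG"       # 12 bases of vector, then insert
print(trim_vector_prefix(read, vector_end))
```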
Sequence Assembly Algorithms
Different from similarity searching
Look for ungapped overlaps at the ends of fragments (method of Wilbur and Lipman, SIAM J. Appl. Math. 44:557-567, 1984).
High degree of identity over a short region: want to exclude chance matches, but not be thrown off by sequencing errors.
Vector removal uses a similar approach, but less stringent: it should recognize small regions of identity and tolerate more mismatches.
Overlap at ends, not internal
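The end-overlap idea can be illustrated with a brute-force search. To be clear, this is not the Wilbur and Lipman method (which is much faster); it is a naive sketch that tries every overlap length, and the thresholds `min_len` and `max_mismatch_rate` are assumptions.

```python
def best_overlap(left, right, min_len=20, max_mismatch_rate=0.05):
    """Longest ungapped overlap between the 3' end of left and the 5' end of right.

    Returns (overlap length, mismatch count), or None if no overlap qualifies.
    A small mismatch rate is tolerated so that sequencing errors do not hide
    a true overlap, while min_len guards against chance matches.
    """
    left, right = left.upper(), right.upper()
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        tail, head = left[-size:], right[:size]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_rate * size:
            return size, mismatches
    return None


read1 = "TTTTTTTTTTACGGATCA"
read2 = "ACGGATCAGGGGGGGGGG"
print(best_overlap(read1, read2, min_len=5, max_mismatch_rate=0.0))

# The same overlap with one miscalled base is still found if errors are tolerated.
read2_err = "ACGCATCAGGGGGGGGGG"
print(best_overlap(read1, read2_err, min_len=5, max_mismatch_rate=0.15))
```

Only the ends of the reads are compared, which is what distinguishes this from a similarity search that would also report internal matches.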
Software determines strategy
Based on their faith in the speed and reliability of sequence analysis/assembly software, researchers have generally taken one of three different approaches to planning sequencing projects.
Ordered cloning
People who don't trust software generally put a lot of time into dividing large pieces of DNA into small ordered overlapping fragments
This strategy requires much more initial cloning work in the laboratory, but it minimizes the number of actual sequencing reads required to complete a project
It is easy to assemble the reads since it is known how they should fit together to form the final contig
Primer Walking
Make a new primer from the end of each new sequence read
It requires very fast and accurate analysis of sequence reads since each step uses information from the previous read
Skips the sub-cloning step entirely since all sequencing reactions can be done on one large clone
Expensive to make a lot of PCR primers, but the price of primer synthesis keeps dropping & there is an economy of scale
Assembly problems are minimized since both the order and the amount of overlap of reads are known
Shotgun Sequencing
Shotgun sequencing takes maximum advantage of the speed and low cost of automated sequencing
It relies totally on software to assemble a jumble of essentially random sequence reads into a coherent and accurate contig
TIGR demonstrated “proof of concept” on the genomes of Haemophilus influenzae, Methanococcus jannaschii, and Mycoplasma genitalium
Celera Genomics demonstrated the ability to shotgun sequence the entire human genome (?)
Human Genome Assembly
The HGP vs. Celera race to sequence the entire human genome was a classic battle of different strategies
The HGP used an ordered cloning approach
Breaking the genome into mapped BAC clones, then shotgun sequencing the BACs
Celera used a modified shotgun method
Random clones of various sizes (size selected libraries)
Plus relative mapping of clone ends (they must be located in the assembly at the correct distance and orientation)
Created custom software to handle the assembly
Celera did make use of the “scaffold” built by the HGP
Other Large Sequencing Projects
Phylogenetic identification/analysis
medical studies of bacteria
environmental samples
EST sequencing - differential expression
cDNA studies
alternate splicing
full length transcripts
Genotyping
score known alleles
identify new mutations
Automation
The "pipeline" approach:
Vector removal
Assembly of identical and/or overlapping fragments
Identify genes
Look up on the genome for a fully sequenced organism, or on genome contigs for partially sequenced organisms.
BLAST search of GenBank for similar genes
Look up in a specialized database of "predicted genes", e.g. ENSEMBL
Project-specific analysis: differentials between sets, phylogenetics.
DATABASE
What these projects all share is a need to keep track of a lot of data.
Hundreds to thousands of sequences
Many fields of information about each one
Organism, library, plate ID for each clone
the sequence itself
cluster/contig membership
best BLAST hit (accession #, e-value, alignment)
genome position
Can't keep track just using folders and text files on your hard drive.
Design the database to include all possible fields.
(it’s a lot harder to add info later)
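As an illustration, the fields listed above map naturally onto a single database table. The sketch below uses Python's built-in `sqlite3` module; the table name, column names, and helper functions are assumptions for the example, and a real project would likely split this into several related tables.

```python
import sqlite3


def create_db(path=":memory:"):
    """Create (or open) a database with one table of sequence reads."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS reads (
            read_id         TEXT PRIMARY KEY,
            organism        TEXT,
            library         TEXT,
            plate_id        TEXT,
            sequence        TEXT NOT NULL,
            contig_id       TEXT,   -- cluster/contig membership
            blast_accession TEXT,   -- best BLAST hit
            blast_evalue    REAL,
            blast_alignment TEXT,
            genome_position TEXT
        )
        """
    )
    return conn


def add_read(conn, read_id, sequence, organism=None, library=None,
             plate_id=None, contig_id=None):
    """Insert one read; analysis fields can be filled in later with UPDATE."""
    conn.execute(
        "INSERT INTO reads (read_id, sequence, organism, library, plate_id, contig_id) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (read_id, sequence, organism, library, plate_id, contig_id),
    )
    conn.commit()


conn = create_db()
add_read(conn, "r1", "ACGT", organism="E. coli", contig_id="c1")
print(conn.execute("SELECT read_id, organism, contig_id FROM reads").fetchall())
```

Leaving the BLAST and genome position columns empty at insert time reflects the pipeline: each analysis step fills in its own fields as it runs.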
Computer tools for sequencing
A wide variety of different software tools have been created to aid DNA sequencing projects.
Each genome project lab has built its own custom software
UNIX-based, built around a particular workflow design, e.g. PHRED, PHRAP, and Consed.
Many packages for the individual investigator - included in most “comprehensive” molecular biology products: MacVector, LaserGene, DNundefined, etc.
I will focus on the assembly tools in GCG, Consed and the dedicated sequence assembly program Sequencher
The GCG Fragment Assembly System
GCG has a complete set of programs that allow data entry, and assembly of overlapping nucleotide sequence fragments into one contig
SEQED: a single sequence editor
GELSTART: creates fragment assembly projects
GELENTER: adds sequences (reads) to an assembly project, input of new sequences from keyboard, digitizer, or import of existing text files
GELMERGE: assembles individual sequences into contigs, can automatically remove vector sequences
GELASSEMBLE: multiple sequence editor for viewing and editing contigs, allows manual alignment of fragments, insertion/deletion of gaps, and changing of individual bases
GELDISASSEMBLE: breaks up contigs into individual sequences within a project
GELVIEW: displays contigs as a schematic display of overlapping fragments
SeqLab has a Chromatogram viewer
Other Chromatogram Viewers
Applied Biosystems has a free viewer/editor program for sequence chromatograms.
It is called EditView and it is a Macintosh only program (does not work in System 9.1 and newer).
http://cancer-seqbase.uchicago.edu/documents/EditView.hqx
There are a couple of viewers for Windows machines.
ABIView is free from David H. Klatte.
http://bioinformatics.weizmann.ac.il/software/abiview/abiinfo.html
Chromas is $50 shareware from Conor McCarthy, Technelysium Pty Ltd in Australia.
http://www.technelysium.com.au/chromas.html
Sequencher
Sequencher is a commercial program from the Gene Codes company (its only product) that is entirely dedicated to DNA fragment assembly;
View multiple alignments of the actual chromatograms;
Automatic vector removal;
Integrated views of sequence, chromatograms, and project overview (graphic representations);
Translation and restriction mapping tools (identify polylinkers);
The RCR has a NYU site license for Sequencher for both Mac and Windows.
Consed/PolyPhred
Consed is a graphical sequence assembly editor for UNIX;
It uses an X-windows interface (like SeqLab);
It works together with Phred and Phrap to give the best possible fragment assembly tools;
Uses information from the trace file to build a consensus using the best quality base at each position;
With PolyPhred it can automatically find SNPs, alleles, and heterozygotes.
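The idea of taking the best quality base at each position can be sketched as follows. This is a deliberately simplified illustration, not the actual Phrap/Consed algorithm: it assumes the reads are already aligned and that each column is given as a list of (base, quality score) pairs, one per read covering that position.

```python
def quality_consensus(columns):
    """Build a consensus by picking the highest-quality base call in each column.

    columns: list of alignment columns, each a list of (base, quality) tuples.
    """
    consensus = []
    for calls in columns:
        base, _quality = max(calls, key=lambda call: call[1])
        consensus.append(base)
    return "".join(consensus)


columns = [
    [("A", 40), ("A", 35)],   # both reads agree
    [("C", 10), ("G", 38)],   # disagreement: trust the high-quality G
    [("T", 20)],              # only one read covers this position
]
print(quality_consensus(columns))
```

A column where two different bases both have high quality is not an error to be resolved but a candidate SNP or heterozygote, which is the kind of position PolyPhred is designed to flag.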