Mapping, PCR, and DNA Sequencing

Source: Internet, 2008-06-19

Software for:

Making Restriction Maps

Designing PCR primers

Assembling fragments, detecting SNPs, and managing raw sequence data

For: GCG, Mac, PC, UNIX, and on the Web

Restriction Mapping

-Making restriction maps was my first use of “Molecular Biology” software

-Making restriction maps is a routine lab activity that is necessary for any type of cloning project.

-High quality maps are important for publications and exchange of information between researchers or between labs.

Archiving Data

-Maps are a common way for labs to archive information about entire libraries of plasmid constructs.

-It is important that map data are stored in a reliable format so that the obsolescence of a particular computer program does not render archives unusable.

Mapping Software

-Programs vary greatly in sophistication and ease of use

-Simple drawing programs (vector graphics)

-The venerable DNA Strider

-GCG (not a strong point of the package)

-Comprehensive Mac/PC MolBio programs

-Dedicated plasmid drawing programs

-Can it be done on the Web?

-Making high quality graphical restriction maps is one area where Mac/PC programs are much better than GCG or the Web

GCG Mapping Programs

GCG has a full set of functional restriction mapping tools.

This may be your best bet if you simply need to locate some restriction sites to plan a cloning project.

MAP allows you to search for any enzyme site in REBASE.

MAPPLOT makes a graphical output from MAP.

MAPSORT simulates a restriction digest and predicts the sizes of digest products with any combination of enzymes.

FINDPATTERNS allows you to search for short sequence patterns (enzyme sites, promoters, enhancers, etc.).

PLASMIDMAP is a GCG program that produces a "publication quality circular map" of a plasmid construct.

(I know of no one who has ever successfully used this program.)

Using MAP

MAP is the main GCG restriction mapping program. Like a lot of GCG programs, it is very powerful and quite complex.

-Restriction sites can be mapped for all enzymes (the default), or a set of enzymes that you specify by name.

-You can also select just enzymes with 4, 5, or 6 base recognition sites; and 5’, 3’, or blunt end cutters.

-You can allow a single base mismatch between the enzyme recognition site and your target sequence.

-The output can be viewed as a linear map or in a table format.

-MAP provides protein translations (in 3, 6, or any single reading frame).
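
The site search and digest prediction described above (MAP and MAPSORT) can be illustrated with a minimal Python sketch. This is not GCG code; the function names, the three-enzyme table, and the demo sequence are assumptions made for illustration, and a real tool would read its enzyme list from REBASE.

```python
import re

# Recognition sites and top-strand cut offsets for three well-known enzymes.
# This small table is an illustrative stand-in for a REBASE-derived list.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "BamHI": ("GGATCC", 1),    # G^GATCC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
}


def find_cut_positions(seq, site, offset):
    """Return 0-based cut positions for every occurrence of site in seq."""
    seq = seq.upper()
    # A lookahead pattern also reports overlapping occurrences.
    return [m.start() + offset for m in re.finditer("(?=" + site + ")", seq)]


def digest_linear(seq, enzyme_names):
    """Return fragment lengths, in order, from cutting a linear sequence."""
    cuts = set()
    for name in enzyme_names:
        site, offset = ENZYMES[name]
        cuts.update(find_cut_positions(seq, site, offset))
    bounds = [0] + sorted(cuts) + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Hypothetical 25-base test sequence with one EcoRI and one BamHI site.
demo = "AAAA" + "GAATTC" + "CCCCC" + "GGATCC" + "TTTT"
fragments = digest_linear(demo, ["EcoRI", "BamHI"])
```

A circular plasmid would need the first and last fragments joined, which this sketch does not handle.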

Web Mapping Tools

There are some free mapping tools on the web for finding restriction sites and making text maps, but not for nice graphical maps.

Webcutter (Max Heiman, Yale Univ.): http://www.firstmarket.com/cutter/cut2.html

EMBOSS Restrict (EMBL Institut Pasteur)

http://bioweb.pasteur.fr/seqanal/interfaces/restrict.html

Restriction Maps (Colorado State Univ.)

http://arbl.cvmbs.colostate.edu/molkit/mapper/index.html (uses Java)

WebCutter

Webcutter is a free online tool that produces restriction maps of nucleotide sequences (text output).

http://www.firstmarket.com/cutter/cut2.html

Webcutter includes the option of finding restriction sites that can be introduced into a sequence by silent mutagenesis.

EMBOSS Restrict

http://bioweb.pasteur.fr/seqanal/interfaces/restrict.html

A JAVA Program

http://arbl.cvmbs.colostate.edu/molkit/mapper/index.html

From R. A. Bowen at Colorado State University.

Java applets displayed on a web page cannot be printed directly; you can only grab screen shots, which are useless for publication.

Mac & PC Mapping Programs

Restriction mapping is one of the simplest molecular biology computing tasks - many MolBio software packages provide this function:

DNA Strider (very old, Mac only) (download from: http://endeavor.med.nyu.edu)

MacVector (Mac only, RCR has a site license)

OMIGA (Windows only, RCR has a site license)

Sequencher (Mac and Windows, RCR has a site license)

Gene Construction Kit(Mac only)

Vector NTI(Mac and Windows)

Plasmid Premier (Windows only)

DNA Strider is simple, but still elegant

Vector NTI

Vector NTI puts the plasmid map at the center of all program functions.

PCR Primer Design

The design of PCR (and sequencing) primers is relatively simple from a computational point of view: just search along a sequence and find short sub-sequences that fit certain criteria.

However, since the molecular biology of PCR is very complex, the nature of these criteria is not at all obvious.

All primer design software uses approximately the same criteria and computing algorithms. Graphical output is not necessary.

Molecular Biology of PCR

The fundamental Molecular Biology of PCR is not well understood.

We know what happens in a descriptive sense, but not the physical chemistry/thermodynamics.

The rules for choosing PCR primers are a rough combination of educated guesses and old fashioned trial-and-error.

None of the published formulas for calculating annealing temperatures has been proven to give better than a rough estimate.

The PCR Process

In a nutshell, PCR works like this:

DNA and two primers are combined in a salt solution with dNTPs and a heat stable DNA polymerase enzyme.

The primers match some sequence in the target DNA.

The solution is rapidly heated to DNA denaturing temperatures (~95°C), then cooled to a temperature where the primers can anneal and the polymerase can function.

Each thermal cycle generates copies of the sequence between the primers, so the total number of fragments amplifies in an exponential fashion: 2, 4, 8, 16, 32, 64, etc.
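
The doubling series above is just powers of two. A small Python sketch makes the arithmetic explicit; the `efficiency` parameter is an assumption added for illustration (1.0 means perfect doubling, which real reactions only approximate).

```python
def copies_after(cycles, starting_copies=1, efficiency=1.0):
    """Idealized copy number after a given number of PCR cycles.

    efficiency is a hypothetical per-cycle yield: 1.0 doubles every cycle.
    """
    return starting_copies * (1 + efficiency) ** cycles


# Reproduces the 2, 4, 8, 16, 32, 64 series for cycles 1 through 6.
series = [int(copies_after(n)) for n in range(1, 7)]
```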

Primer Design Rules

primers should be at least 15 bases long

have at least 50% G/C content

anneal at a temperature in the range of 50-65 degrees C

Usually higher annealing temperatures (Tm) are better (i.e. more specific for your desired target)

forward and reverse primer should anneal at approximately the same temperature
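
The rules above can be turned into a simple scan along the template. The following is a minimal sketch, not the method of any particular program: the function names and default thresholds are assumptions, and the melting temperature uses the simple "2 degrees per A/T, 4 degrees per G/C" rule of thumb, which gives only a rough estimate.

```python
def gc_fraction(primer):
    """Fraction of G and C bases in the primer."""
    return (primer.count("G") + primer.count("C")) / len(primer)


def rough_tm(primer):
    """Rule-of-thumb Tm in degrees C: 2 per A/T plus 4 per G/C."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def candidate_primers(seq, length=18, min_gc=0.5, tm_range=(50, 65)):
    """Slide a fixed-length window along seq and keep windows passing the rules."""
    seq = seq.upper()
    low, high = tm_range
    hits = []
    for start in range(len(seq) - length + 1):
        window = seq[start:start + length]
        tm = rough_tm(window)
        if gc_fraction(window) >= min_gc and low <= tm <= high:
            hits.append((start, window, tm))
    return hits
```

To pick a pair, one would run the scan on both strands and keep forward and reverse candidates whose estimated Tm values are close to each other.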

Primer Problems

primers should flank the sequence of interest

primer sequences should be unique

primers that match multiple sequences will give multiple products

repeated sequences can be amplified - but only if unique flanking regions can be found where primers can bind

primers can have self-annealing regions within each primer (i.e. hairpin and foldback loops)

pairs of primers can anneal to each other to form the dreaded "primer dimers"
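
Both self-annealing and primer dimers come down to finding stretches of one primer that can base-pair with another (or with itself). A minimal sketch of such a check follows; the function names are hypothetical, and it only counts perfect contiguous base pairs, ignoring thermodynamics and mismatches.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def longest_complementary_run(primer_a, primer_b):
    """Length of the longest stretch of primer_a that can pair with primer_b.

    Antiparallel pairing means a stretch of primer_a must equal a stretch of
    the reverse complement of primer_b, so this is a longest-common-substring
    search between those two strings.
    """
    a = primer_a.upper()
    target = reverse_complement(primer_b)
    best = 0
    prev = [0] * (len(target) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            if a[i - 1] == target[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

Calling `longest_complementary_run(p, p)` checks a primer against itself; calling it with the forward and reverse primers checks the pair for possible dimers.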

Differential Primers

-New challenges for PCR primer design

gene-specific primers (for multi-gene families)

identify specific species or strains of organisms

molecular diagnostics/detectors

Consensus primers amplify a gene from all of a diverse group of organisms (e.g. bacterial 16S rDNA)

-Need to work with multiple alignments and find differential or conserved regions
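
Finding conserved regions in a multiple alignment can be sketched as a scan for runs of identical columns. This is a simplified illustration with a hypothetical function name; it assumes the sequences are already aligned to equal length, with `-` marking gaps.

```python
def conserved_windows(alignment, width):
    """Return start columns of windows where every column is identical.

    alignment is a list of equal-length aligned sequences; a column with a
    gap ('-') is never counted as conserved.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = [len(set(col)) == 1 and col[0] != "-" for col in zip(*alignment)]
    hits = []
    run = 0
    for i, ok in enumerate(conserved):
        run = run + 1 if ok else 0
        if run >= width:
            hits.append(i - width + 1)
    return hits
```

Windows returned here are candidate sites for consensus primers; for differential primers one would instead look for columns that differ between the groups to be distinguished.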

Other Technologies

-Multiplex PCR

-GeneScan (ABI)

-PCR related technologies

-Primer extension

-Taqman (ABI)

-Orchid

-Pyrosequencing

-Ligase chain reaction

-Oligos for microarrays

GCG PRIME

The GCG program PRIME is a good tool for the design of primers for PCR and sequencing

For PCR primer pair selection, you can choose a target range of the template sequence to be amplified

In selecting appropriate primers, PRIME allows you to specify a variety of constraints on the primer and amplified product sequences.

-upper and lower limits for primer and product melting temperatures

-primer and product GC contents.

-a range of acceptable primer sizes

-a range of acceptable product sizes.

-required bases at the 3' end of the primer (3' clamp)

-maximum difference in melting temperatures between a pair of PCR primers

Other Features of PRIME

PRIME uses a simulated annealing test to check individual primers for self-complementarity and to check the two primers in a PCR primer pair for complementarity to each other.

Using this same annealing test, PRIME optionally can screen against non-specific primer binding on the template sequence and on any repeated sequences you specify.

Primer Design on the Web

There are a bunch of good PCR primer design programs on the web:

Primer 3 at the MIT Whitehead Institute

http://www.genome.wi.mit.edu/cgi-bin/primer/primer3_www.cgi

Cassandra at the Univ. of Southern California

http://www-hto.usc.edu/software/procrustes/cassandra/cass_frm.html

GeneFisher by Folker Meyer & Chris Schleiermacher at Bielefeld University, Germany

http://bibiserv.TechFak.Uni-Bielefeld.DE/genefisher/

Xprimer at the Virtual Genome Center, Univ. Minnesota Medical School

http://alces.med.umn.edu/rawprimer.html

Mac/PC Software

There are a number of (expensive) dedicated PCR primer design programs for personal computers that have “special features” such as nested and multiplex PCR:

Oligo (Molecular Biology Insights, Inc.)

Primer Premier (Premier Biosoft)

Many of the comprehensive MolBio programs also have PCR features:

MacVector

OMIGA

Vector NTI

GeneTool

Using Computers for DNA Sequencing

The Biological Basis of DNA Sequencing Technology

Virtually all DNA sequencing (both automated and manual) relies on the Sanger method:

DNA replication with dideoxy chain termination, followed by separation of the resulting molecules by polyacrylamide gel electrophoresis.

The DNA fragment to be sequenced must first be cloned into a vector (plasmid or lambda).

Then the cloned DNA must be copied in a test tube (in vitro) by a DNA polymerase enzyme to obtain a sufficient quantity to be sequenced.

Limitations of the technology

Sequences can only be determined in approximately 400-800 base pair chunks known as “reads.”

This is due to both the biochemistry of the DNA polymerase enzyme and the resolution of polyacrylamide gel electrophoresis.

Most genes contain many thousands of bp and many modern sequencing projects are intended to produce complete sequences of large genomic regions (millions of bp).

Assembly of Contigs

As a result, all sequencing projects must involve the division of the target DNA into a set of overlapping ~500 bp fragments,

and then the assembly of these fragments into complete sequences (contigs).

Contig=contiguous sequenced region

Assembly of overlapping fragments is a computational problem.

Contig Assembly Problems

1) The 500 bp reads of sequence data have errors of both incorrectly determined bases and insertions/deletions.

2) The error rate is highest at the beginning and ends of the reads - precisely the regions that must be overlapped.

3) Some sequence from cloning vectors is often included at the ends of sequence reads.

Sequence Assembly Algorithms

Different than similarity searching

Look for ungapped overlaps at the ends of fragments (method of Wilbur and Lipman, SIAM J. Appl. Math. 44:557-567, 1984).

Require a high degree of identity over a short region: the goal is to exclude chance matches, but not be thrown off by sequencing errors.

Vector removal uses a similar approach, but less stringent: it should recognize small regions of identity and tolerate more mismatches.

Overlap at ends, not internal
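
The idea of an ungapped end overlap that tolerates a few sequencing errors can be sketched as follows. This is a deliberately naive illustration, not the Wilbur and Lipman method or any production assembler; the function name and the default thresholds are assumptions.

```python
def best_end_overlap(left, right, min_len=20, max_mismatch_rate=0.05):
    """Find the longest ungapped overlap of left's end with right's start.

    Returns (overlap_length, mismatches), or None if no overlap of at least
    min_len bases stays within the allowed mismatch rate.
    """
    longest = min(len(left), len(right))
    for size in range(longest, min_len - 1, -1):
        suffix = left[-size:]
        prefix = right[:size]
        mismatches = sum(1 for a, b in zip(suffix, prefix) if a != b)
        if mismatches <= max_mismatch_rate * size:
            return size, mismatches
    return None
```

Raising `min_len` guards against chance matches, while `max_mismatch_rate` sets how many sequencing errors are forgiven. A vector screen could reuse the same function with a shorter `min_len` and a higher mismatch rate, matching the "less stringent" approach described above.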

Software determines strategy

Based on their faith in the speed and reliability of sequence analysis/assembly software, researchers have generally taken one of three different approaches to planning sequencing projects.

Ordered cloning

People who don't trust software generally put a lot of time into dividing large pieces of DNA into small ordered overlapping fragments

This strategy requires much more initial cloning work in the laboratory, but it minimizes the number of actual sequencing reads required to complete a project

It is easy to assemble the reads since it is known how they should fit together to form the final contig

Primer Walking

Make a new primer from the end of each new sequence read

It requires very fast and accurate analysis of sequence reads since each step uses information from the previous read

Skips the subcloning step entirely since all sequencing reactions can be done on one large clone

Expensive to make a lot of PCR primers, but the price of primer synthesis keeps dropping & there is an economy of scale

Assembly problems are minimized since both the order and the amount of overlap of reads are known

Shotgun Sequencing

Shotgun sequencing takes maximum advantage of the speed and low cost of automated sequencing, and relies totally on software to assemble a jumble of essentially random sequence reads into a coherent and accurate contig

TIGR demonstrated “proof of concept” on the genomes of Haemophilus influenzae, Methanococcus jannaschii, and Mycoplasma genitalium

Celera Genomics demonstrated the ability to shotgun sequence the entire human genome (?)
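
To get a feel for what "essentially random sequence reads" means for an assembler, the sketch below draws reads from random start positions and counts how many reads cover each base. The function names, the fixed read length, and the seeded random generator are all assumptions for illustration; real reads vary in length and contain errors.

```python
import random


def shotgun_reads(genome, n_reads, read_len, seed=0):
    """Sample fixed-length reads from random start positions in genome."""
    rng = random.Random(seed)  # seeded so repeated runs give the same reads
    starts = [rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads)]
    return [(start, genome[start:start + read_len]) for start in starts]


def coverage_depth(genome_len, reads):
    """Count how many reads cover each position of the genome."""
    depth = [0] * genome_len
    for start, read in reads:
        for pos in range(start, start + len(read)):
            depth[pos] += 1
    return depth
```

Positions with a depth of zero are gaps that no amount of software can close, which is why shotgun projects sequence the target many times over.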

Human Genome Assembly

The HGP vs. Celera race to sequence the entire human genome was a classic battle of different strategies

The HGP used an ordered cloning approach

Breaking the genome into mapped BAC clones, then shotgun sequencing the BACs

Celera used a modified shotgun method

Random clones of various sizes (size selected libraries)

Plus relative mapping of clone ends (they must be located in the assembly at the correct distance and orientation)

Created custom software to handle the assembly

Celera did make use of the “scaffold” built by the HGP

Other Large Sequencing Projects

Phylogenetic identification/analysis

medical studies of bacteria

environmental samples

EST sequencing - differential expression

cDNA studies

alternate splicing

full length transcripts

Genotyping

score known alleles

identify new mutations

Automation

The "pipeline" approach:

Vector removal

Assembly of identical and/or overlapping fragments

Identify genes

Look up on the genome for a fully sequenced organism, or on genome contigs for partially sequenced organisms.

BLAST search of GenBank for similar genes

Look up in a specialized database of "predicted genes", e.g. ENSEMBL

Project-specific analysis: differentials between sets, phylogenetics.

DATABASE

What these projects all share is a need to keep track of a lot of data.

Hundreds to thousands of sequences

Many fields of information about each one

Organism, library, plate ID for each clone

the sequence itself

cluster/contig membership

best BLAST hit (accession #, e-value, alignment)

genome position

Can't keep track just using folders and text files on your hard drive.

Design the database to include all possible fields.

(it’s a lot harder to add info later)
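
As a rough illustration of the fields listed above, here is a minimal sketch using Python's built-in `sqlite3` module. The table name, column names, and sample values are hypothetical; a real project would likely split clones, reads, and BLAST hits into separate tables.

```python
import sqlite3


def create_read_table(conn):
    """Create one table holding the per-sequence fields from the list above."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS reads (
            read_id         INTEGER PRIMARY KEY,
            organism        TEXT,
            library         TEXT,
            plate_id        TEXT,
            sequence        TEXT NOT NULL,
            contig          TEXT,     -- cluster/contig membership
            blast_accession TEXT,     -- best BLAST hit
            blast_evalue    REAL,
            genome_position INTEGER
        )
        """
    )


conn = sqlite3.connect(":memory:")  # in-memory database for the example
create_read_table(conn)
conn.execute(
    "INSERT INTO reads (organism, library, plate_id, sequence, contig) "
    "VALUES (?, ?, ?, ?, ?)",
    ("E. coli", "lib1", "P01-A1", "ACGTACGT", "contig_1"),
)
rows = conn.execute(
    "SELECT plate_id, contig FROM reads WHERE organism = ?", ("E. coli",)
).fetchall()
```

Columns such as the BLAST hit and genome position are left empty at insert time and filled in as the pipeline produces them, which is why they belong in the design from the start.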

Computer tools for sequencing

A wide variety of different software tools have been created to aid DNA sequencing projects.

Each genome project lab has built its own custom software

UNIX-based and built around a particular workflow design, e.g. PHRED, PHRAP, and Consed.

Many packages for the individual investigator - included in most “comprehensive” molecular biology products: MacVector, LaserGene, DNundefined, etc.

I will focus on the assembly tools in GCG, Consed and the dedicated sequence assembly program Sequencher

The GCG Fragment Assembly System

GCG has a complete set of programs that allow data entry, and assembly of overlapping nucleotide sequence fragments into one contig

SEQED: a single sequence editor

GELSTART: creates fragment assembly projects

GELENTER: adds sequences (reads) to an assembly project, input of new sequences from keyboard, digitizer, or import of existing text files

GELMERGE: assembles individual sequences into contigs, can automatically remove vector sequences

GELASSEMBLE: multiple sequence editor for viewing and editing contigs, allows manual alignment of fragments, insertion/deletion of gaps, and changing of individual bases

GELDISASSEMBLE: breaks up contigs into individual sequences within a project

GELVIEW: displays contigs as a schematic display of overlapping fragments

SeqLab has a Chromatogram viewer

Other Chromatogram Viewers

Applied Biosystems has a free viewer/editor program for sequence chromatograms.

It is called EditView and it is a Macintosh only program (does not work in System 9.1 and newer).

http://cancer-seqbase.uchicago.edu/documents/EditView.hqx

There are a couple of viewers for Windows machines.

ABIView is free from David H. Klatte.

http://bioinformatics.weizmann.ac.il/software/abiview/abiinfo.html

Chromas is $50 shareware from Conor McCarthy, Technelysium Pty Ltd in Australia.

http://www.technelysium.com.au/chromas.html

Sequencher

Sequencher is a commercial program from the Gene Codes company (its only product) that is entirely dedicated to DNA fragment assembly;

View multiple alignments of the actual chromatograms;

Automatic vector removal;

Integrated views of sequence, chromatograms, and project overview (graphic representations);

Translation and restriction mapping tools (identify polylinkers);

The RCR has a NYU site license for Sequencher for both Mac and Windows.

Consed/PolyPhred

Consed is a graphical sequence assembly editor for UNIX;

It uses an X-windows interface (like SeqLab);

It works together with Phred and Phrap to give the best possible fragment assembly tools;

Uses information from the trace file to build a consensus using the best quality base at each position;

With PolyPhred it can automatically find SNPs, alleles, and heterozygotes.
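
The idea of building a consensus from the best quality base at each position can be sketched in a few lines. This is a simplified illustration and not the actual Phred/Phrap/Consed algorithm; the function name and the input layout (already aligned reads with one quality score per column, `-` where a read has no base) are assumptions.

```python
def quality_consensus(aligned_reads):
    """Pick, at each column, the base with the highest quality score.

    aligned_reads is a list of (bases, qualities) pairs of equal length;
    '-' marks columns a read does not cover. Uncovered columns become 'N'.
    """
    length = len(aligned_reads[0][0])
    consensus = []
    for pos in range(length):
        best_base, best_quality = "N", -1
        for bases, qualities in aligned_reads:
            if bases[pos] != "-" and qualities[pos] > best_quality:
                best_base, best_quality = bases[pos], qualities[pos]
        consensus.append(best_base)
    return "".join(consensus)


# Two hypothetical overlapping reads that disagree at the third column.
reads = [
    ("ACGT-", [30, 30, 10, 30, 0]),
    ("-CTTA", [0, 20, 40, 35, 25]),
]
consensus = quality_consensus(reads)
```

Columns where high-quality bases disagree between reads are also the ones worth flagging as possible SNPs or heterozygotes rather than silently resolving.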