The TRC shRNA Design Methods and Rules

Source: Internet, 2008-06-18


Overview

We design shRNA molecules with an algorithm that uses several criteria to rank potential 21mer targets within each human and mouse RefSeq transcript. The algorithm applies a set of rules, including rules derived from the siRNA literature, from our cloning scheme, and from constraints on oligonucleotide synthesis, among others. In applying the algorithm, we aim to balance two competing goals: make hairpins that effectively knock down the target transcript and, as far as possible, design hairpins that knock down only one gene and not other, so-called 'off-target', genes.

Each goal presents distinct challenges. The criteria for predicting effective knockdown with either siRNA or shRNA are not well understood. Our rules are primarily derived from the siRNA literature; how well they apply to shRNA design is unclear. Genome evolution also constrains target specificity: many genes belong to extensive gene families, which may make targeting any one gene difficult, and functionally distinct genes share many motifs. Our knowledge of transcript structure and variants is also still very incomplete.

For all these reasons and more, we construct 5 shRNAs for each transcript, expecting a range of knockdown efficiencies across the set and at least one or two that knock down the target effectively.

Users of this database should be aware that, to keep annotation consistent and reliable, the TRC consortium decided early on to use NCBI's RefSeq collection of transcripts as the definitive source of the primary target sequence for shRNA design.

As a general rule in constructing the library, we build shRNA molecules targeting just the first RefSeq transcript reported for each NCBI gene. Partly as a result of our design process, described below, the majority of the shRNAs target all known transcript variants.

A brief narrative of the candidate selection process

Get the Candidate Sequences

For each human and mouse RefSeq transcript, we generate all 21mers, from those starting 25 bp after the beginning of the CDS up to those starting 150 bp from the end of the transcript. Each 21mer is called a 'candidate'.
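The enumeration described above can be sketched in a few lines of Python. This is only an illustration, not the TRC code: the function name, the 0-based `cds_start` coordinate, and the choice to treat the last allowed start position as inclusive are all assumptions.

```python
def generate_candidates(transcript, cds_start, length=21,
                        lead_offset=25, tail_offset=150):
    """Yield (start, 21mer) pairs for one transcript.

    Assumptions (not stated in the source): coordinates are 0-based,
    `cds_start` is the index of the first CDS base, and the final
    allowed start position is inclusive.
    """
    first = cds_start + lead_offset        # 25 bp after the CDS start
    last = len(transcript) - tail_offset   # 150 bp from the transcript end
    for start in range(first, last + 1):
        kmer = transcript[start:start + length]
        if len(kmer) == length:            # guard for small tail offsets
            yield start, kmer


# Toy 400-nt transcript whose CDS begins at index 50 (made-up values).
toy = "ACGT" * 100
candidates = list(generate_candidates(toy, cds_start=50))
print(len(candidates), candidates[0])
```

Transcripts too short to hold both offsets simply yield no candidates.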

Score the Candidate Sequences For Knockdown Efficiency

Each candidate is given an "original score" by applying a set of rules that penalize or reward features predictive of successful knockdown, as well as clone-design considerations, and then calculating the product of all the penalties and rewards. The individual rules are listed below. The candidates are then sorted by score, and all those above a minimum score are stored.
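As a rough sketch of the multiplicative scoring, the snippet below treats each rule as a function that returns a factor below 1 (penalty) or above 1 (reward) and multiplies the factors together. The two rules and their numeric values are hypothetical placeholders chosen only to make the example run; they are not the TRC rules.

```python
import math


def gc_rule(seq):
    # Hypothetical rule: penalize GC content outside 30-60%.
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return 1.0 if 0.30 <= gc <= 0.60 else 0.5


def homopolymer_rule(seq):
    # Hypothetical rule: penalize runs of four or more identical bases.
    return 0.1 if any(base * 4 in seq for base in "ACGT") else 1.0


RULES = [gc_rule, homopolymer_rule]  # placeholder rule set


def original_score(seq, rules=RULES):
    """Product of all penalty/reward factors for one candidate."""
    return math.prod(rule(seq) for rule in rules)


def rank_candidates(seqs, min_score, rules=RULES):
    """Score, drop candidates at or below min_score, sort best first."""
    scored = [(original_score(s, rules), s) for s in seqs]
    kept = [pair for pair in scored if pair[0] > min_score]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)
```

Because the score is a product, a single severe penalty can push a candidate below the cutoff no matter how well it does on the other rules.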

Score the Candidate Sequences for Specificity

We must balance the prediction of knockdown efficiency against the desire to minimize interaction with off-target genes, without a clear understanding of how to predict off-target "hits". We calculate a "specificity score" to promote candidates without obvious off-target transcripts. Each candidate is compared by BLASTN to two distinct abstractions of the transcriptome: the NCBI UniGene "unique" database (vaguely defined by NCBI as the "longest, best" sequence from each UniGene cluster) and the transcripts from RefSeq. We deem a 'miss' any sequence pair with at least three differences, at least two of which fall in the core positions 3-19, i.e., not at the ends of the 21mer target region. We then determine whether each candidate hits one UniGene cluster, one LocusLink transcript, one LocusLink gene and, for genes with multiple transcripts, all the transcripts in the gene. Using just the "hits-One-Unigene" and "hits-One-NM" values, we apply a "specificity score" to each candidate: candidates that uniquely hit one UniGene cluster AND one LocusLink transcript are rewarded; those that hit one UniGene cluster OR one LocusLink transcript are rewarded, but less so; and those with neither UniGene nor LocusLink specificity are penalized. After determining and storing this "specificityScore", we re-sort the candidates.
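The 'miss' rule and the three-tier specificity score can be illustrated as follows. This sketch assumes a gapless, equal-length comparison of the 21mer against an aligned 21-nt window (real BLASTN alignments may be shorter or gapped), and it assumes that any pair not classed as a miss counts as a hit. The numeric multipliers are invented for illustration; the source says only that the first tier is rewarded more than the second and that the third is penalized.

```python
CORE = range(2, 19)  # 0-based indices of 1-based core positions 3..19


def is_miss(candidate, other):
    """True if the pair differs enough to be treated as a 'miss'.

    Miss = at least three differences overall, with at least two of
    them in the core positions 3-19 of the 21mer.
    """
    if len(candidate) != len(other):
        raise ValueError("sequences must have equal length")
    diffs = [i for i, (a, b) in enumerate(zip(candidate, other)) if a != b]
    core_diffs = sum(1 for i in diffs if i in CORE)
    return len(diffs) >= 3 and core_diffs >= 2


def specificity_multiplier(hits_one_unigene, hits_one_nm,
                           both=1.5, either=1.2, neither=0.5):
    """Three-tier reward/penalty; the numeric values are hypothetical."""
    if hits_one_unigene and hits_one_nm:
        return both
    if hits_one_unigene or hits_one_nm:
        return either
    return neither
```

Under this reading, a pair with three mismatches of which only one lies in the core is not a miss, so it would still count against the candidate's specificity.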

Spacing the Candidate 21mers Along the Transcript

Since we synthesize 5 oligo pairs for each transcript, and since we hypothesize a role for the secondary structure of the target transcript in the effectiveness of an shRNA, we want the candidates spread out along the transcript, with one from the 3' UTR and four along the CDS. To pick the five candidates, the highest-scoring 3' UTR candidate, if available, is chosen first. Next, the top-scoring CDS candidate is chosen. A position penalty is then applied to all the other CDS candidates; the penalty is more severe the closer a candidate is to the first CDS candidate picked. After the position penalty is applied, all the CDS candidates are re-sorted by their newly calculated, position-weighted score. From the list of remaining CDS candidates, the highest-scoring candidate is chosen, and the position penalty is applied again to all the remaining candidates based on the CDS candidates already picked. This process is repeated until all the candidates are rescored. Finally, the top 5 position- and specificity-weighted candidates are chosen for oligo synthesis.
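The greedy pick-and-penalize loop can be sketched as below. The source does not give the form of the position penalty, so this example assumes a linear weight based on the distance to the nearest CDS candidate already picked, with a made-up 200-nt window. The `Candidate` class, the region labels, and the behaviour of filling all five slots from the CDS when no 3' UTR candidate exists are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    start: int    # start position of the 21mer on the transcript
    score: float  # score after specificity weighting
    region: str   # "CDS" or "3UTR" (labels are hypothetical)


def position_weight(start, picked_starts, window=200):
    """Hypothetical penalty: scales linearly from 0 (same position) to
    1 (at least `window` nt from the nearest picked CDS candidate)."""
    if not picked_starts:
        return 1.0
    nearest = min(abs(start - p) for p in picked_starts)
    return min(1.0, nearest / window)


def pick_spaced(candidates, total=5, window=200):
    """Pick the best 3' UTR candidate, then greedily add CDS candidates,
    re-weighting the remaining ones after every pick."""
    utr = [c for c in candidates if c.region == "3UTR"]
    remaining = [c for c in candidates if c.region == "CDS"]
    picked = []
    if utr:
        picked.append(max(utr, key=lambda c: c.score))
    picked_cds_starts = []
    while remaining and len(picked) < total:
        best = max(
            remaining,
            key=lambda c: c.score * position_weight(c.start, picked_cds_starts, window),
        )
        picked.append(best)
        picked_cds_starts.append(best.start)
        remaining.remove(best)
    return picked
```

With this weighting, a high-scoring candidate that sits a few bases from one already chosen loses out to a lower-scoring candidate elsewhere in the CDS, which is the spreading behaviour the text describes.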