Identifying ncRNAs with Bioinformatics Methods

Source: Internet, 2013-08-26

Introduction:

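Attempts to identify ncRNAs with bioinformatics methods began at the same time that gene-finding software was first used to predict protein-coding regions. However, ncRNA genes lack the typical features of protein-coding genes, such as promoters and terminators, open reading frames, specific splice sites, polyadenylation sites, and CpG islands. They are also small, and the motifs that gene-finding software relies on vary considerably. As a result, there is still no efficient, general-purpose algorithm for predicting ncRNA genes.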

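The gene-finding programs that do predict ncRNAs successfully are generally designed to search for a single class of ncRNA: tRNAscan-SE searches for tRNAs, snoscan for C/D box snoRNAs, snoGPS for H/ACA box snoRNAs, MiRscan for microRNAs, and so on.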

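Some programs based on motif clustering, such as RNAMotif, ERPIN, and PatSearch, are also used to search for ncRNAs, but their sensitivity and specificity are both poorer than those of the single-class programs. In practice, very few experimentally verified ncRNAs were identified with this kind of software.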

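As genome projects for more and more species have been carried out, comparative analysis of genome sequences can be used to detect the secondary structures of ncRNAs and cis-regulatory RNAs. For example, QRNA has detected ncRNAs in Escherichia coli, Saccharomyces cerevisiae, and Pyrococcus furiosus that were confirmed in subsequent experiments.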

Examples of ncRNA identification methods:

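1. Sequence homology methods

In some cases, when two species are separated by a short evolutionary distance, a simple sequence similarity comparison with BLAST or FASTA is enough to identify an RNA gene. For closely related RNA genes, such homology searches are the first step.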
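To make the idea of a sequence similarity comparison concrete, here is a minimal sketch of Smith-Waterman local alignment scoring, the kind of local alignment that BLAST and FASTA approximate with faster heuristics. It is not how either tool is implemented, and the scoring values (match, mismatch, gap) are illustrative assumptions rather than the defaults of any program.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2,
                          mismatch: int = -1,
                          gap: int = -2) -> int:
    """Return the best Smith-Waterman local alignment score of a and b.

    Uses a linear gap penalty. The scoring values are assumptions
    chosen for illustration only.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Two toy sequences sharing the core "ACGT".
    print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```

A high score relative to what unrelated sequences produce suggests homology; real tools add statistics (such as E-values) to make that judgment.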

2. Pattern matching and covariance models

For the identification of P/MRP RNA as well as IREs, we used a combination of pattern searches and secondary structure profile searches with cmsearch from the Infernal package. Nuclear P RNA and MRP RNA are poorly conserved in sequence. However, they share three conserved regions: CR-I, CR-IV and CR-V. For nuclear P RNA there are also conserved elements in domain 2 to take into account: CR-II and CR-III. Therefore, to identify P and MRP RNA we used a pattern based on consensus features, including the CR-I, CR-IV and CR-V motifs as well as base-pairing rules consistent with helix P2. When a P or MRP RNA gene was not found using these patterns, new searches were carried out in which mismatches were allowed. After the pattern matching procedure, sequences fitting the secondary structure template were further analyzed with Rfam covariance models. High-scoring candidates were then examined for characteristics typical of P/MRP RNA secondary structure: base pairing between the CR-I and CR-V motifs, and the presence of CR-IV as well as helices P1, P2 and P3.

IREs were also identified using a combination of pattern matching and covariance models. To identify as many potential IREs as possible, we primarily searched available mRNA sequences. Where no mRNA was available, genomic sequences were searched for regions homologous to available proteins/mRNAs. Whenever an IRE candidate was found in a genomic sequence, it was checked for reasonable proximity to the protein/mRNA match. Candidate sequences were checked for conserved primary sequence motifs and for the ability to fold into a secondary structure typical of the iron responsive element.
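The "search exactly, then allow mismatches" step described above can be sketched as follows. This is only an illustration of the strategy: the function names are hypothetical, the motif `GGCT` is a made-up placeholder and not the real CR-I, CR-IV or CR-V consensus, and the base-pairing rules and covariance model scoring used in the actual procedure are not modeled here.

```python
from typing import List, Optional, Tuple


def find_motif(seq: str, motif: str,
               max_mismatches: int = 0) -> List[Tuple[int, int]]:
    """Return (start, mismatches) for each window matching the motif.

    An 'N' in the motif matches any base.
    """
    hits = []
    m = len(motif)
    for start in range(len(seq) - m + 1):
        window = seq[start:start + m]
        mism = sum(1 for p, c in zip(motif, window) if p != "N" and p != c)
        if mism <= max_mismatches:
            hits.append((start, mism))
    return hits


def search_relaxed(seq: str, motif: str,
                   max_allowed: int = 2
                   ) -> Tuple[Optional[int], List[Tuple[int, int]]]:
    """Try an exact search first, then allow one more mismatch at a time.

    Returns the mismatch level at which hits first appeared and the hits,
    or (None, []) if nothing is found up to max_allowed.
    """
    for k in range(max_allowed + 1):
        hits = find_motif(seq, motif, k)
        if hits:
            return k, hits
    return None, []


if __name__ == "__main__":
    # Placeholder motif and sequence, for illustration only.
    print(search_relaxed("AAGGATAA", "GGCT"))
```

Relaxing the pattern only when the strict search fails keeps the number of false candidates low for genomes in which the motifs are well conserved.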

3. Profile HMMs of highly conserved regions in P and MRP RNA

For prediction of P and MRP RNAs we also used profile HMMs created from CR-I and CR-V multiple alignments. We further analyzed all genomic sequences that contained both the CR-I and CR-V motifs with a distance between the two motifs of less than 3000 bases. The advantages of this method are that large genomes can be searched quickly (100 Mbases in a few minutes) and that it identifies the P and MRP RNA genes with high specificity. Candidates identified in the search based on HMM profiles were further analyzed to check that other conserved features of the RNA were present.
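The distance filter applied to the HMM hits can be sketched as below. The sketch assumes the profile HMM search has already produced start coordinates for each motif on the same strand, and that CR-I lies upstream of CR-V; both are assumptions made for the example, and the function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def pair_motif_hits(cr1_hits: Iterable[int],
                    cr5_hits: Iterable[int],
                    max_distance: int = 3000) -> List[Tuple[int, int]]:
    """Pair each CR-I hit with downstream CR-V hits closer than max_distance.

    Coordinates are motif start positions on one strand (an assumption
    of this sketch).
    """
    cr5_sorted = sorted(cr5_hits)
    pairs = []
    for s1 in sorted(cr1_hits):
        for s5 in cr5_sorted:
            gap = s5 - s1
            if gap <= 0:
                continue  # CR-V hit is not downstream of this CR-I hit
            if gap >= max_distance:
                break  # hits are sorted, so all later ones are too far
            pairs.append((s1, s5))
    return pairs


if __name__ == "__main__":
    # Toy coordinates: only the first CR-I hit has a CR-V partner in range.
    print(pair_motif_hits([100, 5000], [400, 3200, 9000]))
```

Each surviving pair delimits a candidate region that would then be checked for the other conserved features.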

4. Identification of protein homologues

An efficient method for protein identification is PSI-BLAST (Position-Specific Iterative BLAST). PSI-BLAST searches the target database repeatedly, using a multiple alignment of the high-scoring sequences found in each round to generate a new, more sensitive scoring matrix that can find distantly related sequences sometimes missed by a BLAST search. Multiple PSI-BLAST searches with different query sequences were carried out in order to identify as many homologues as possible belonging to a given protein family. The NCBI GenBank protein set was used as the primary source, but additional proteins were identified from individual genome projects or from TBLASTN searches of genome sequences. Whenever relevant, these novel sequences were included in the set of sequences used as the database in the PSI-BLAST search. We also used profile HMMs from the Pfam database for Pop1, Pop3 (Rpp38), Pop5, Rpp14, Rpp20, Rpp25, Rpp40 and Rpr2 (Rpp21) to identify homologues. In cases where Pfam models were unavailable or not sufficient, new models were created from multiple alignments and used with the HMMER package to find additional homologues.

To identify homologues of previously known proteins whose mRNAs are known to contain IREs, we mainly used BLAST to search the NCBI GenBank set of proteins. Some gene sequences that were not in GenBank were identified with Genewise [160]. Genewise combines comparative analysis (it aligns proteins to genomic sequences) with statistical signals to predict genes. For classification of proteins we also made use of phylogenetic analysis, including parsimony, maximum likelihood and neighbour-joining methods.

5. ncRNA prediction using de novo methods

In contrast to the methods described above, which detect new members of already known ncRNA families (IRE and MRP/P RNA identification), we also used two de novo methods, QRNA and RNAz, to computationally screen the S. cerevisiae genome for ncRNAs.

QRNA predicts ncRNAs from pairwise alignments. It compares the scores of three distinct models of sequence evolution to decide which one best describes the given alignment: a pair SCFG models the evolution of secondary structure, a pair hidden Markov model (HMM) describes the evolution of protein-coding sequence, and a different pair HMM implements the independent model, in which the sequence evolves with a random pattern consistent with neither a secondary structure nor a protein-coding sequence. QRNA is currently limited to pairwise alignments and is rather slow for ncRNA gene prediction at a genomic scale. A program similar to QRNA is ddbRNA, which tests for compensatory mutations in three-sequence multiple alignments. It searches for common stems in the multiple alignments in a greedy fashion. The significance of the conserved structure is assessed using shuffled alignments.
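The decision rule behind comparing three models can be illustrated with a small sketch. It assumes the log-likelihood of the alignment under each model has already been computed elsewhere; the labels `RNA`, `COD` and `OTH` are shorthand chosen here for the structure, coding and independent models, the numbers are invented, and the function is hypothetical rather than part of QRNA.

```python
from typing import Dict, Tuple


def classify_alignment(log_likelihoods: Dict[str, float],
                       null_model: str = "OTH") -> Tuple[str, float]:
    """Pick the best-scoring model and report its log-odds vs the null.

    log_likelihoods maps a model label to the log-likelihood of the
    alignment under that model (values supplied by the caller).
    """
    best = max(log_likelihoods, key=log_likelihoods.get)
    log_odds = log_likelihoods[best] - log_likelihoods[null_model]
    return best, log_odds


if __name__ == "__main__":
    # Invented scores for one pairwise alignment.
    scores = {"RNA": -95.0, "COD": -110.0, "OTH": -102.5}
    print(classify_alignment(scores))
```

A window would be reported as a candidate ncRNA only when the structure model wins by a clear margin over the independent model.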

The program RNAz predicts ncRNAs from multiple sequence alignments. It uses two independent criteria for classification: a z-score measuring the thermodynamic stability of the individual sequences, and a structure conservation index obtained by comparing the folding energies of the individual sequences with that of the predicted consensus folding. The two criteria are then combined to detect conserved and stable RNA secondary structures with high sensitivity and specificity. Yet another application suitable for multiple alignments is MSARI. This approach uses information from a larger set of sequence-aligned orthologs to detect significant ncRNA secondary structures. Because primary sequence alignments are often inaccurate, one part of the MSARI method tries to correct errors in the multiple alignments through energy minimisation calculations.
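The two RNAz criteria reduce to simple arithmetic once the folding energies are known. The sketch below assumes the minimum free energies have already been computed by a folding program; the energy values are invented, the function names are hypothetical, and the z-score is estimated from a supplied sample of randomized sequences, which is a simplification for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def mfe_z_score(native_mfe: float, random_mfes: Sequence[float]) -> float:
    """z-score of a sequence's MFE against MFEs of randomized sequences.

    Negative values mean the native sequence folds more stably than
    the random background.
    """
    sigma = stdev(random_mfes)
    if sigma == 0:
        raise ValueError("random MFEs have zero variance")
    return (native_mfe - mean(random_mfes)) / sigma


def structure_conservation_index(consensus_energy: float,
                                 single_mfes: Sequence[float]) -> float:
    """Consensus folding energy divided by the mean single-sequence MFE.

    Values near 1 indicate that the consensus structure is almost as
    stable as the individual folds; values near 0 indicate no common fold.
    """
    return consensus_energy / mean(single_mfes)


if __name__ == "__main__":
    # Invented energies in kcal/mol, for illustration only.
    print(mfe_z_score(-30.0, [-20.0, -22.0, -18.0]))
    print(structure_conservation_index(-18.0, [-20.0, -22.0, -18.0]))
```

A strongly negative z-score together with a high structure conservation index is the combination that points to a conserved, stable structure.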
