Identifying ncRNA with Bioinformatics Methods
Source: Internet
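Introduction: Attempts to identify ncRNA with bioinformatics methods began at the same time that gene-finding software was being used to predict protein-coding regions. However, ncRNA genes lack the typical features of protein-coding genes, such as promoters and terminators, open reading frames, specific splice sites, polyadenylation sites, and CpG islands. They are also small, and the motifs that gene-finding software relies on vary considerably. As a result, there is so far no efficient, general-purpose algorithm for predicting ncRNA genes.

The gene-finding programs that currently predict ncRNA successfully are generally designed to search for a single class of ncRNA: tRNAScan-SE for tRNA, snoScan for C/D box snoRNAs, SnoGps for H/ACA box snoRNAs, mirScan for microRNA, and so on.

Some programs based on motif clustering, such as RNAmotifs, Erpin, and Patsearch, are also used to search for ncRNA, but their sensitivity and specificity are both lower than those of the single-class programs. In practice, very few experimentally confirmed ncRNAs were identified with this kind of software.

As genome projects for more and more species have been carried out, comparative analysis of genome sequences can be used to detect the secondary structures of ncRNA and cis-regulatory RNA. For example, QRNA has detected ncRNAs in Escherichia coli, Saccharomyces cerevisiae, and Pyrococcus furiosus, and these were confirmed in subsequent experiments.

Examples of ncRNA identification methods:

1. Sequence homology methods. In some cases, when two species are evolutionarily close, a simple sequence similarity search with BLAST or FASTA is enough to identify an RNA gene. Such homology searches are the first step when comparing closely related RNA genes.

2. Pattern matching and covariance models.

3. Profile HMMs of highly conserved regions in P and MRP RNA that other conserved features of the RNA were present

4. Identification of protein homologues. An efficient method for protein identification is PSI-BLAST (Position Specific Iterative BLAST). PSI-BLAST searches the target database repeatedly, using a multiple alignment of the high-scoring sequences found in each round to generate a new, more sensitive scoring matrix that can find distantly related sequences sometimes missed by a BLAST search. Multiple PSI-BLAST searches with different query sequences were carried out in order to identify as many homologues as possible belonging to a certain protein family. The NCBI GenBank protein set was used as the primary source, but additional proteins were identified from individual genome projects or from TBLASTN searches of genome sequences. Whenever relevant, these novel sequences were included in the set of sequences used as the database in the PSI-BLAST search. We also used profile HMMs at the Pfam database for Pop1, Pop3 (Rpp38), Pop5, Rpp14, Rpp20, Rpp25, Rpp40, and Rpr2 (Rpp21) to identify homologues.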
In cases where the available Pfam models were insufficient or absent, new models were created from multiple alignments and used with the HMMER package to find additional homologues.

5. ncRNA prediction using de novo methods. QRNA predicts ncRNA from pairwise alignments. It compares the scores of three distinct models of sequence evolution to decide which one best describes the given alignment: a pair SCFG models the evolution of secondary structure, a pair hidden Markov model (HMM) describes the evolution of protein-coding sequence, and a different pair HMM implements the independent model, in which the sequence evolves with a random pattern consistent with neither a secondary structure nor a protein-coding sequence. QRNA is currently limited to pairwise alignments and is rather slow for ncRNA gene prediction at a genomic scale.

A program similar to QRNA is ddbRNA, which tests for complementary mutations in three-sequence multiple alignments. It searches for common stems in the multiple alignments in a greedy fashion. The significance of the conserved structure is assessed against shuffled alignments.

The program RNAz predicts ncRNA from multiple sequence alignments. It uses two independent criteria for classification: a z-score measuring the thermodynamic stability of the individual sequences, and a structure conservation index obtained by comparing the folding energies of the individual sequences with that of the predicted consensus folding. The two criteria are then combined to detect conserved and stable RNA secondary structures with high sensitivity and specificity.

Yet another application suitable for multiple alignments is MSARI. The approach uses information from a larger set of sequence-aligned orthologs to detect significant ncRNA secondary structures. Primary sequence alignments are often inaccurate, so one part of the MSARI method tries to correct errors in multiple alignments through energy minimisation calculations.
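The two RNAz criteria described above can be illustrated with a little arithmetic. The following is a minimal Python sketch, not RNAz's actual implementation. It assumes the folding energies have already been computed by an RNA folding program. The function names and the numbers in the example are hypothetical. The ratio form of the structure conservation index (consensus energy divided by the mean individual energy) is the commonly cited definition and is an assumption relative to this text. The step that combines the two criteria is not shown.

```python
from statistics import mean, stdev
from typing import Sequence


def stability_z_score(native_mfe: float, background_mfes: Sequence[float]) -> float:
    """z-score of a native folding energy against background energies.

    background_mfes are assumed to be minimum free energies (kcal/mol) of
    shuffled or random sequences of comparable length and composition.
    A strongly negative z-score means the native sequence folds into a more
    stable structure than expected by chance.
    """
    mu = mean(background_mfes)
    sigma = stdev(background_mfes)  # sample standard deviation, needs >= 2 values
    if sigma == 0:
        raise ValueError("background energies have zero variance")
    return (native_mfe - mu) / sigma


def structure_conservation_index(consensus_energy: float,
                                 individual_mfes: Sequence[float]) -> float:
    """Ratio of the consensus folding energy to the mean individual energy.

    Values near 1 suggest the aligned sequences share one structure;
    values near 0 suggest no common structure (assumed ratio definition).
    """
    avg = mean(individual_mfes)
    if avg == 0:
        raise ValueError("mean individual energy is zero")
    return consensus_energy / avg


if __name__ == "__main__":
    # Hypothetical energies in kcal/mol, for illustration only.
    z = stability_z_score(-30.0, [-20.0, -22.0, -18.0, -21.0, -19.0])
    sci = structure_conservation_index(-25.0, [-30.0, -28.0, -32.0])
    print(f"z-score = {z:.2f}, SCI = {sci:.2f}")
```

Keeping the two measures as separate functions mirrors the statement that they are independent criteria: the z-score looks at each sequence on its own, while the index only makes sense for an alignment.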