GCG: Database Searching

Database searches require efficiency and speed. These cannot be achieved with the methods described in the previous chapters on sequence comparison: calculating an alignment path matrix between the query sequence and every database sequence would take much too long. Precision is still needed, however, because a search of even a “small” database of 10,000 sequences can no longer be controlled interactively by the researcher. The computer should therefore still be able to separate statistical noise from real “similarity.” This goal, however, cannot be fully achieved within a realistic time frame. Fig. 1A shows a typical distribution of alignment scores between a query sequence and the database sequences. The identities are clearly separated. Interspecies homologies may be clearly visible, but the interesting sequences, the distantly related ones, may well be hidden in the statistical noise. The “noise” is marked with arrows on top of the score bars to indicate that these bars are extremely large. The problem is even greater if you are trying to identify distantly related sequences. Then the identity matches and the interspecies homology matches will be missing, and the resulting plot will show a very broad band of statistical noise (see Fig. 1B). The following considerations will guide you in searching the database for a sequence without being easily trapped.
Fig. 1. Scoring histograms of typical database searches. The number of hits is plotted vs the “score” each hit receives during the search. Subsequent alignment might change these scores because of gaps and homologies. A. Result of searching the EMBL database with human calmodulin DNA. The related protein, troponin C, is found in the steep descent of the statistical noise. B. Result of searching with a randomized sequence (again, calmodulin) under precisely the same conditions. Note the random hits with low scores, and the change of scale on the X axis. C. Result of searching with the human calmodulin protein sequence using tfastu. Note the difference in scores relative to A. D. Result of searching with an alignment of calmodulins using the profilesearch method. The reading frames of 10 calmodulins were extracted from the database and aligned as described in Chapter 9. Note the difference in the scores relative to A.
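
The comparison between panels A and B of Fig. 1 can be sketched in a few lines of Python. This is a toy illustration only, not the GCG implementation: the word-counting score, the word size `k`, the function names (`word_score`, `score_histogram`, `noise_threshold`), and the made-up query and database are all assumptions chosen for brevity. The sketch scores a query against every database sequence, tabulates the number of hits per score, and uses a shuffled copy of the query to estimate where the statistical noise ends.

```python
import random
from collections import Counter


def word_score(query: str, subject: str, k: int = 3) -> int:
    """Count the k-letter words of the query that also occur in the subject."""
    subject_words = {subject[i:i + k] for i in range(len(subject) - k + 1)}
    return sum(
        1 for i in range(len(query) - k + 1) if query[i:i + k] in subject_words
    )


def score_histogram(query: str, database: list, k: int = 3) -> Counter:
    """Map each score to the number of database sequences that obtain it."""
    return Counter(word_score(query, subject, k) for subject in database)


def shuffled(seq: str, rng: random.Random) -> str:
    """Return a randomized sequence with the same composition as seq."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def noise_threshold(query: str, database: list, k: int = 3, seed: int = 0) -> int:
    """Best score of a shuffled query: a crude estimate of the noise ceiling."""
    rng = random.Random(seed)
    randomized = shuffled(query, rng)
    return max(word_score(randomized, subject, k) for subject in database)


if __name__ == "__main__":
    rng = random.Random(42)
    query = "ATGGCTGACCAGCTGACTGAGGAGCAGATT"  # arbitrary made-up query
    # Hypothetical database: 200 random sequences plus one containing the query.
    database = ["".join(rng.choice("ACGT") for _ in range(60)) for _ in range(200)]
    database.append("CC" + query + "GG")

    hist = score_histogram(query, database)
    ceiling = noise_threshold(query, database)
    for score in sorted(hist):
        flag = "  <- above shuffled-query noise" if score > ceiling else ""
        print(f"score {score:3d}: {hist[score]:4d} hits{flag}")
```

Running the randomized query under the same conditions as the real one is the design choice that matters here: any hit that does not score above the shuffled-query ceiling cannot be distinguished from noise by this score alone, which is exactly the situation of distantly related sequences described above.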
