GCG: Database Searching

Source: Internet, 2014-01-21


Database searches demand efficiency and speed, which the sequence-comparison methods described in the previous chapters cannot deliver: calculating an alignment path matrix between the query sequence and every database sequence would take far too long. Precision is still needed, however, because a search of even a “small” database of 10,000 sequences can no longer be checked interactively by the researcher. The computer itself must be able to separate statistical noise from real “similarity.” This goal cannot be fully achieved within a realistic time frame. Fig. 1A shows a typical distribution of alignment scores between a query sequence and the database sequences. Identities are clearly separated. Interspecies homologies may also be clearly visible, but the interesting hits, the distantly related sequences, may well be hidden in the statistical noise. The “noise” is marked with arrows on top of the bars to indicate that these bars are extremely large. The problem is even greater if you are trying to identify distantly related sequences. In that case there are no identity or interspecies homology matches, and the resulting plot shows only a very broad band of statistical noise (see Fig. 1B). The following considerations will guide you in searching the database for a sequence without being easily trapped.

http://img.dxycdn.com/trademd/upload/asset/meeting/2014/01/21/A1389855971.jpg

Fig. 1. Scoring histograms of typical database searches. The number of hits is plotted against the “score” each hit obtains during the search. Subsequent alignment might change these scores because of gaps and homologies. A. Result of searching the EMBL database with human calmodulin DNA. The related protein, troponin C, is found in the steep descent of the statistical noise. B. Result of searching with a randomized sequence (again, calmodulin) under precisely the same conditions. Note the random hits with low scores, and the change of scale on the x-axis. C. Result of searching with the human calmodulin protein sequence using tfasta. Note the difference in scores relative to A. D. Result of searching with an alignment of calmodulins using the profilesearch method. The reading frames of 10 calmodulins were extracted from the database and aligned as described in Chapter 9. Note the difference in the scores relative to A.
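The comparison between panels A and B can be illustrated with a small sketch. It is not the GCG scoring scheme: the score used here (the number of shared six-letter words) is a stand-in chosen for brevity, and all function names, the word length, and the random test data are assumptions made for illustration. The idea is only that searching with a shuffled copy of the query, under the same conditions, gives a background distribution against which a real hit can be judged.

```python
import random
from collections import Counter
from statistics import mean, stdev


def kmers(seq, k=6):
    """Return the set of distinct k-letter words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def word_score(query, target, k=6):
    """Toy similarity score (an assumption, not the GCG score):
    the number of distinct k-letter words shared by both sequences."""
    return len(kmers(query, k) & kmers(target, k))


def shuffled(seq, rng):
    """Randomize the residue order while keeping the composition unchanged."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def score_histogram(query, database, k=6):
    """Map each score to the number of database entries that obtain it."""
    return Counter(word_score(query, entry, k) for entry in database)


def z_score(score, background):
    """Number of standard deviations by which score exceeds the background mean."""
    mu, sd = mean(background), stdev(background)
    if sd == 0:
        raise ValueError("background scores have no spread")
    return (score - mu) / sd


# Hypothetical test data: 500 random DNA sequences plus one identical copy of the query.
rng = random.Random(0)


def random_dna(n):
    return "".join(rng.choice("ACGT") for _ in range(n))


query = random_dna(200)
database = [random_dna(200) for _ in range(500)] + [query]

# Search with the real query (as in panel A) and with a shuffled query (as in panel B).
real_scores = [word_score(query, entry) for entry in database]
control_query = shuffled(query, rng)
control_scores = [word_score(control_query, entry) for entry in database]

print("histogram (score, hits):", sorted(score_histogram(query, database).items()))
print("best real score:", max(real_scores))
print("z-score of best hit vs. shuffled control:", z_score(max(real_scores), control_scores))
```

In such a sketch the identical copy stands far above the control distribution, whereas a distantly related sequence would score only slightly above the background and could easily be lost in it, which is the situation the figure describes.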