A question about the algorithms behind PMF database searching
丁香园论坛
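Is anyone here studying the algorithms used in proteomics to identify proteins by searching a database with a peptide mass fingerprint (PMF)?
I have recently been looking into one of them, the probability-based algorithm. Its representative software is Mascot, whose predecessor is MOWSE. I have gone through a lot of literature, but very little of it goes into detail; the following is all I have found: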
“MOWSE Scoring scheme
The final scoring scheme is based on the frequency of a
fragment molecular weight being found in a protein of a given
range of molecular weight. OWL database sequence entries were
initially grouped into 10 kDalton intact molecular weight
intervals. For each 10 kDalton protein interval, peptide fragment
molecular weights were assigned to cells of 100 Dalton intervals.
The cells therefore contained the number of times a particular
fragment molecular weight occurred in a protein of any given size.
This operation was performed for each enzyme. Cell frequency
values were calculated by dividing each cell value by the total
number of peptides in each 10 kD protein interval. Cell frequency
values for each 10 kDalton interval were then normalised to the
largest cell value (Fmax), with all the cell values recalculated
as:
Cell value = Old value / Fmax
to yield floating point numbers between 0 and 1. These
distribution frequency values, calculated for each cleavage
reagent, were then built into the MOWSE search program. For
every database entry scanned, all matching fragments contribute to
the final score. In the current implementation, non-matching
fragments are ignored (neutral). For each matching peptide Mw a
score is assigned by looking up the appropriate normalised
distribution frequency value. In the case of multiple 'hits' in
any one target protein (i.e. more than one matching peptide Mw),
the distribution frequency scores are multiplied. The final
product score is inverted and then normalised to an 'average'
protein Mw of 50 kDaltons to reduce the influence of random score
accumulation in large proteins (>200 kDaltons). The final score is
thus calculated as:
Score = 50/(Pn x H)
Where Pn is the product of n distribution scores and H the 'hit'
protein molecular weight in kD.
Important consequences of this type of scoring scheme
are that matches with peptides of higher Mw carry more scoring
weight, and that the non-random distribution of fragment molecular
weights in proteins of different sizes is compensated for.
”
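In other words, there are only two formulas.
If any of the experts here have studied this, I would appreciate your advice, especially any more detailed material on the Mascot algorithm and its implementation. Anyone else who is interested is also welcome to post and join the discussion.
Information and opinions on the algorithm of another PMF search program, ProFound, would also be welcome.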
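To check my own reading of the quoted passage, here is a minimal Python sketch of the two formulas. It is only my interpretation, not the actual MOWSE or Mascot code. The function names, the mass tolerance `tol_da`, the toy masses, and the choice to return 0 when nothing matches are all my own assumptions; the passage does not specify them.

```python
from collections import defaultdict
from math import prod

PROTEIN_BIN_DA = 10_000  # 10 kDa intact-protein intervals (from the passage)
PEPTIDE_BIN_DA = 100     # 100 Da peptide cells (from the passage)


def build_frequency_table(proteins):
    """Build normalised cell frequencies for one enzyme.

    proteins: iterable of (protein_mw_da, [peptide_mw_da, ...]).
    Returns {protein_bin: {peptide_cell: value in (0, 1]}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for protein_mw, peptides in proteins:
        p_bin = int(protein_mw // PROTEIN_BIN_DA)
        for pep_mw in peptides:
            counts[p_bin][int(pep_mw // PEPTIDE_BIN_DA)] += 1

    table = {}
    for p_bin, cells in counts.items():
        total = sum(cells.values())  # all peptides in this protein interval
        freqs = {cell: n / total for cell, n in cells.items()}
        f_max = max(freqs.values())
        # Cell value = Old value / Fmax
        table[p_bin] = {cell: f / f_max for cell, f in freqs.items()}
    return table


def mowse_score(table, protein_mw, protein_peptides, observed, tol_da=1.0):
    """Score = 50 / (Pn x H) for one candidate protein.

    Assumption: a theoretical peptide "matches" if any observed mass lies
    within tol_da of it. Non-matching peptides are ignored (neutral).
    """
    cells = table.get(int(protein_mw // PROTEIN_BIN_DA), {})
    matched = []
    for pep in protein_peptides:
        if any(abs(pep - obs) <= tol_da for obs in observed):
            value = cells.get(int(pep // PEPTIDE_BIN_DA))
            if value is not None:  # assumption: skip cells absent from the table
                matched.append(value)
    if not matched:
        return 0.0  # assumption: no hits means no score
    p_n = prod(matched)           # product of the n distribution scores
    h_kd = protein_mw / 1000.0    # 'hit' protein molecular weight in kD
    return 50.0 / (p_n * h_kd)


if __name__ == "__main__":
    # Toy database: two made-up proteins with made-up peptide masses.
    database = [
        (25_000.0, [850.4, 1250.6, 1275.1, 2100.9]),
        (27_500.0, [860.2, 1510.7]),
    ]
    table = build_frequency_table(database)
    score = mowse_score(table, 25_000.0, database[0][1], [1250.5, 2101.0], tol_da=0.5)
    print(score)  # prints 4.0
```

In this toy case the two matched cells have values 1.0 and 0.5, so Pn = 0.5 and the score is 50 / (0.5 x 25) = 4. A rarer cell has a smaller value and therefore raises the score, which seems to be why the passage says matches with higher-Mw peptides carry more weight.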