Monday, March 31, 2008

My Geek Pride is hurt: BLOSUM matrices

BLOSUM (BLOcks SUbstitution Matrix) matrices are the canonical substitution matrices used for scoring protein sequence alignments. In essence, the method calculates the relative frequencies of all amino acids at each position within an alignment and assigns a log-odds score to the substitution of one residue by another. BLOSUM matrices built from closely related sequences are more stringent and carry higher numbers: the number (80 in BLOSUM80) is the percent-identity threshold used when building the matrix. Sequences sharing at least that identity are clustered and counted as one, so BLOSUM80 reflects substitutions between sequences that are less than 80% identical.
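To make the idea concrete, here is a minimal sketch of the log-odds calculation on a toy ungapped block. It is not the Henikoff & Henikoff program: it deliberately skips the clustering/weighting step, and the function names (`pair_counts`, `log_odds_scores`) and the three-sequence block are made up for illustration. The half-bit scaling (2 × log2) is my understanding of how the published matrices are scaled.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(block):
    """Count unordered residue pairs, column by column, in an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds_scores(block):
    """Return rounded half-bit log-odds scores for every observed residue pair.

    Simplification (assumption): no clustering by percent identity, so every
    sequence in the block contributes with full weight.
    """
    counts = pair_counts(block)
    total = sum(counts.values())

    # Observed frequency q_ij of each unordered pair.
    q = {pair: n / total for pair, n in counts.items()}

    # Background frequency p_i of each residue, derived from the pair frequencies.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        # Expected pair frequency if residues paired up purely by chance.
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Hypothetical toy block: three aligned sequences, two columns.
toy_block = ["AL", "AL", "AI"]
scores = log_odds_scores(toy_block)
for pair, score in sorted(scores.items()):
    print(pair, score)
```

With a block this small the scores mean nothing biologically; the point is only to show where the observed and expected frequencies come from. Only pairs that actually occur in the block get a score here, since an unobserved pair would need the logarithm of zero.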

BLOSUM matrices were developed in 1992 by Henikoff and Henikoff and since then have been extensively used in all analyses involving protein sequences...

and then, here comes the "AAARRGHHHH!!!"

Styczynski et al. (2008) were killing time looking at the evolution of the BLOCKS database and found the unthinkable.... an error in the source code of the program that calculates the BLOSUM matrices!!!! That means... the BLOSUM matrices everybody uses differ significantly from what the algorithm described by Henikoff & Henikoff should produce... merde!

Weirdest thing of all... when the code was corrected and the matrices were tested in database sequence searches, it turned out that the "wrong" matrices performed better at retrieving protein homologs than the "corrected" ones.

Fortunately, it seems that although the difference is statistically significant, it is not big. That means we haven't fucked it up so bad.

Epilogue to the blosum...

1) 16 years of extensive usage doesn't mean it is RIGHT.

2) how come no one, in 16 years, ever noticed this difference!!! THAT is what happens with dogma... when you take anything for granted

3) messing things up is not always THAT bad...

4) I didn't understand from the article whether they propose that the matrices be corrected even though the corrected ones perform worse...

5) I would expect to see a huge ocean of errata everywhere because "when using the revised blosum matrices... our results from the past ten years have completely changed!!"

8 Comments:

Xiuh said...

Well, if you are an experimental biologist you have results that help you dodge the problem; if your whole paper was bioinformatic predictions, hm....

Daemios said...

Ha! There you have to remove the bias in the sample you take, the critter you work with, the DNA extraction, the amplification bias and, finally, your criterion for picking the "good" peaks in the chromatograms...

my point: everything we do is pure lies!

luis said...

Corollary of Murphy's Law for experimental science:

Never let yourself be fooled by the facts.

Xiuh said...

mmmmmm, you see! Why suffer with metagenomics! Biochemistry is nobler

Anonymous said...

but we already knew that almost everything is a lie, didn't we? Pure interpretation.

Look on the bright side, now you can publish hundreds of new papers hahaha, what fun!!!!

MyS

Xiuh said...

and by the way, taking advantage of the fact that your post has a quorum: you owe us a meal!!!!!!

Anonymous said...

yes, you owe us a meal!!! at the very least a message with coffee included!!
MyS

Luisito in the Labyrinth said...

Let's hope that 16 years from now some resentful researcher refutes Styczynski's results through some (bio)informatic trick.

I agree about the experimental bias

Best regards!!!!!