Rudimenthos: ResearchInEnglish

Thursday, May 06, 2010

Better the metagenome you know than the metagenome you don't...

Morgan, J., Darling, A., & Eisen, J. (2010). Metagenomic Sequencing of an In Vitro-Simulated Microbial Community PLoS ONE, 5 (4) DOI: 10.1371/journal.pone.0010209

A new era for the design of metagenomic controls starts! Morgan et al. present the benchmarking of metagenomic tools using artificial "microbial communities" mixed up in the lab.

The Hook...
Metagenomics is a fancy name for what's actually a large and obscure toolbox of molecular biology procedures and computational algorithms that promises to help us understand whole, natural microbial communities. It is so exciting because it allows us to study organisms (bacteria and archaea specifically) that would otherwise remain unacknowledged because we cannot grow them in the lab. It also provides, for the first time, the opportunity to analyse whole natural communities, and not only sectors of them (like a "granivorous community" or a "photosynthetic guild"). Comparing natural functional communities would help us understand a lot about how communities are assembled, how they evolve and change over time, and how they are affected by external disturbances.

Having said that, we still lack the tools to analyse such large databases and the quality standards to produce and compare metagenomes. This happens each time a new technology appears, because there has not been enough time to experiment with it and learn its flaws accurately. It is even worse with metagenomics, since no whole community has ever been studied and so we don't really know, or even have a guess at, what our data should look like. Here's where Morgan et al. come to the rescue with a very neat approach.

The Setting...
Their logic is simple and clear: since we do not have any community whose composition is completely known, let's make one. So they retrieved from culture collections ten different microorganisms whose genomes have already been sequenced, and prepared aliquots so that they would have the same number of cells from each organism. Then they mixed them up, extracted the whole community DNA with three different DNA-extraction protocols and sequenced four metagenome databases (one was replicated with an alternative sequencing method).

The Bad...
Surprisingly, none of the sequenced metagenomes reflected the original composition of the community mix. This can be caused by a number of factors: the size of a genome and the number of genome copies per cell affect the probability of sequencing; differences in cell wall and matrix thickness and composition could prevent efficient DNA extraction; specific DNA segments might be harder to clone and/or sequence... When they compared the metagenomes with each other, they found that most differences were due to the type of DNA extraction used. That is, the same community will result in different metagenomes when different DNA extraction methods are used. This also means that metagenomes obtained with different DNA extraction protocols should not be compared. Ever.
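To make the genome-size and copy-number point concrete, here is a minimal Python sketch. It assumes that, with perfect extraction and unbiased sequencing, the share of reads from an organism is proportional to cells × genome size × genome copies per cell. The function name and all the numbers are made up for illustration; none of them come from Morgan et al.

```python
def expected_read_fractions(organisms):
    """Expected share of reads per organism under an idealized model.

    organisms maps a name to (cells, genome_size_bp, genome_copies_per_cell).
    Assumption: reads are drawn in proportion to the total DNA contributed.
    """
    dna = {
        name: cells * size * copies
        for name, (cells, size, copies) in organisms.items()
    }
    total = sum(dna.values())
    return {name: amount / total for name, amount in dna.items()}


# Hypothetical community: equal cell numbers, different genomes.
community = {
    "organism_A": (1_000_000, 2_000_000, 1),
    "organism_B": (1_000_000, 6_000_000, 1),
    "organism_C": (1_000_000, 2_000_000, 2),
}

fractions = expected_read_fractions(community)
for name, share in fractions.items():
    print(f"{name}: {share:.2f}")
```

Even in this idealized case, equal numbers of cells do not give equal numbers of reads, so "even" has to be defined before a metagenome can be said to deviate from it.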

One thing still puzzles me: the love for BLAST. Even when they assigned each sequence to a specific organism by "blasting" each read from the metagenomes against the ten complete genomes of the organisms in the mix, there was a large number of sequences that could not be mapped back to the source organism. Sure, there seems to be a phage infecting some cultures that was not in the sequenced genome. But it is surprising that a large number of reads actually hit a Bacillus, when there were five Lactobacillus strains in the mix. My point is that BLAST is a very poor algorithm for recovering precise hits, and the short length of the sequences reduces the taxonomic resolution it can attain, which makes the results misleading. If we add a really biased and incomplete reference database, it ends up being almost impossible to accurately define the genomic composition of a natural community. This also calls for better and more precise methods for assigning or binning metagenomic sequences.
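For readers who have never done it, this is roughly what "assign each read by its best BLAST hit" looks like in Python. It is only a sketch and not the pipeline used in the paper: it assumes the standard 12-column BLAST tabular output (query, subject, % identity, alignment length, ..., e-value, bit score), and the identity and length thresholds, the read names and the genome names are all hypothetical.

```python
def best_hits(blast_tabular, min_identity=90.0, min_length=50):
    """Map each read to the genome of its highest bit-score hit.

    Hits below the (assumed) identity or length thresholds are ignored,
    so some reads end up with no assignment at all.
    """
    best = {}
    for line in blast_tabular.strip().splitlines():
        fields = line.split("\t")
        read, genome = fields[0], fields[1]
        identity, length = float(fields[2]), int(fields[3])
        bitscore = float(fields[11])
        if identity < min_identity or length < min_length:
            continue
        if read not in best or bitscore > best[read][1]:
            best[read] = (genome, bitscore)
    return {read: genome for read, (genome, _) in best.items()}


# Made-up hits, one tuple per line of tabular output.
rows = [
    ("read1", "Lactobacillus_A", "99.0", "200", "2", "0", "1", "200", "1000", "1199", "1e-100", "360"),
    ("read1", "Bacillus_X", "92.0", "180", "14", "0", "1", "180", "500", "679", "1e-60", "250"),
    ("read2", "Bacillus_X", "85.0", "120", "18", "0", "1", "120", "10", "129", "1e-20", "100"),
    ("read3", "Lactobacillus_B", "95.0", "40", "2", "0", "1", "40", "70", "109", "1e-10", "60"),
]
table = "\n".join("\t".join(row) for row in rows)

assignments = best_hits(table)
unassigned = sorted({row[0] for row in rows} - set(assignments))
print(assignments)
print("unassigned:", unassigned)
```

Note how much depends on the cutoffs: loosen them and read2 happily becomes a Bacillus; tighten them and the pile of unassigned reads grows.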

The Good...
Since "all different" is not a very hopeful result, they prepared three replicates of each DNA extraction method to see which of them showed lower variability and hence would be more reliable. It turned out that the kit-based DNA extraction protocol has higher repeatability, most likely because there's less variation in reagent concentrations.
And then again, although there's large variability between and within protocols, there are no radical changes in the relative abundance of each organism. That is, dominance does not shift from one organism to another. The abundances still do not reflect the "true" ones, though.
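One simple way to put a number on "repeatability" is the coefficient of variation (standard deviation divided by the mean) of an organism's relative abundance across replicates. The Python sketch below uses that statistic as an assumption of mine; it is not necessarily the one used in the paper, and the protocol names and abundances are invented.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless spread)."""
    return stdev(values) / mean(values)


# Hypothetical relative abundance of ONE organism in three replicates
# of two extraction protocols. These values are made up.
replicates = {
    "kit_protocol": [0.20, 0.21, 0.19],
    "other_protocol": [0.10, 0.20, 0.30],
}

cv = {name: coefficient_of_variation(vals) for name, vals in replicates.items()}
most_repeatable = min(cv, key=cv.get)
print(cv, most_repeatable)
```

Both hypothetical protocols have the same mean abundance here, which is exactly why the mean alone tells you nothing about which one to trust.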

The Ugly...
One of the samples was sequenced twice, once with classic Sanger capillary sequencing and once with pyrosequencing. This helped them to show that differences between extraction methods are far greater than differences between sequencing platforms. Still, I sensed a bit of anti-pyrosequencing sentiment in it. Sure, pyrosequencing gives shorter reads and so a larger number of reads will be unassignable to reference organisms (at least by BLAST standards). But I'm not sure that these results actually demonstrate that cloning bias is not so important. It would be necessary to repeat each sample with pyrosequencing to demonstrate this. And it would also be great to replicate the same sample as they did with Sanger. This would actually show how much of this variability is really attributable to DNA extraction and how much of it is attributable to cloning bias.

The Finale...
We desperately need more research like this, which would help us not only to standardize the technology behind metagenomics but also to build the robust theoretical framework that metagenomics (and community ecology in general) so badly needs. This kind of work should be complemented with in silico simulations of metagenomes (like that in Mavrommatis et al. 2007), and also with the development of better algorithms to cluster and assign taxonomy to sequenced reads.
After all the metagenomic hype, we still do not know the true structure and composition of sequenced microbial communities. But we do know a lot more than before.

Wednesday, October 08, 2008

Metagenome Sequence Simulators

Richter DC, Ott F, Auch AF, Schmid R, & Huson DH (2008). MetaSim: a sequencing simulator for genomics and metagenomics. PloS one, 3 (10) PMID: 18841204

An article from Huson's group at Tübingen University has just come out in the Open Access (and scientific publishing innovator) journal PLoS ONE, describing MetaSim, a program that produces artificial (or synthetic, or in silico) metagenomes out of a selection of completely sequenced genomes.

This is just "heaven-sent" for me since I've been working on a set of synthetic metagenomes for the past two months, and will be happy to use this software first hand like... today. It seems that the software not only lets you choose the source genomes from a phylogenetic tree (figures reproduced here from the original article at PLoS ONE thanks to the Creative Commons License), but also lets you choose among three different types of sequencing technology output (Sanger, 454 and Illumina) and generate a theoretical metagenome.
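To give an idea of what "generating an in silico metagenome" means at its most basic, here is a toy Python sketch. It is emphatically not MetaSim's algorithm: there is no error model and no technology-specific read-length distribution, and every name and number in it is hypothetical. It simply picks a source genome with probability proportional to abundance × genome length and cuts out a fixed-length fragment at a random position.

```python
import random


def simulate_reads(genomes, abundances, n_reads, read_length, seed=0):
    """Draw error-free, fixed-length reads from a set of genomes.

    genomes: name -> sequence string (each at least read_length long).
    abundances: name -> relative number of genome copies in the community.
    Returns a list of (source_name, read_sequence) tuples.
    """
    rng = random.Random(seed)
    names = list(genomes)
    # Longer and more abundant genomes contribute more DNA, hence more reads.
    weights = [abundances[name] * len(genomes[name]) for name in names]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights, k=1)[0]
        start = rng.randrange(len(genomes[name]) - read_length + 1)
        reads.append((name, genomes[name][start:start + read_length]))
    return reads


# Two made-up "genomes" of random sequence, just to have something to sample.
maker = random.Random(42)
genomes = {
    "genome_small": "".join(maker.choice("ACGT") for _ in range(500)),
    "genome_large": "".join(maker.choice("ACGT") for _ in range(2000)),
}
abundances = {"genome_small": 1.0, "genome_large": 1.0}

reads = simulate_reads(genomes, abundances, n_reads=100, read_length=50)
print(reads[0])
```

Because the source of every read is known, a data set like this can be fed to any binning or assignment method and scored against the truth, which is the whole point of a simulated metagenome.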

This is the continuation of a very important change in genomic sciences, moving from experiments far too expensive or long to be replicated, and hence beyond the reach of hard statistical comparison, to null-model-based in silico genomic analysis.

The first effort to analyze the true scope of metagenomic analysis was presented by Kostas Mavrommatis and others from the Genome Biology group at JGI (unfortunately published in a non-OA journal), where they produced three simulated metagenomes of contrasting complexity to assess assembly, gene prediction and annotation (SPOILER: the best combination assessed was the Arachne assembler with the Fgenesb predictor and PhyloPythia for binning, and BLAST "performed poorly" as usual). This work also produced a database for the Fidelity of Analysis of Metagenomic Samples (FAMeS), a great effort to standardize metagenomic analysis software. A great alternative is ProxyGene annotation, as reported by the Markowitz group.

Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, McHardy AC, Rigoutsos I, Salamov A, Korzeniewski F, Land M, Lapidus A, Grigoriev I, Richardson P, Hugenholtz P, & Kyrpides NC (2007). Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nature methods, 4 (6), 495-500 PMID: 17468765

I'll play a little with the software and post some of my impressions here... and maybe in the original PLoS ONE webpage since it is totally open to post-publication review!!!!

You can download MetaSim at Huson's Labpage!!!

Sunday, April 13, 2008

Bacillus coahuilensis: the genomic TexMex

After a long publication struggle, two articles from two close friends have finally been published. The first is the description of the novel species Bacillus coahuilensis in the IJSEM journal by René Cerritos (a.k.a. Dr. Chapultepec), my former bacteriology teacher, my former owner while I was doing my Social Service, and actually the one to blame for my joining the lab I work in now. The other is the publication of the complete genome sequence of the very same strain in PNAS by Luis Alcaraz (a.k.a. The Dude), my former schoolmate, my former Representative in the University Council and beermate. Both are the product of a weird collaboration between CINVESTAV and LANGEBIO in Irapuato, the Texan universities of Rice and Houston, and the Institutes of Biotechnology and Ecology at UNAM, where I'm at.

In short:

Cuatrociénegas Valley is a basin 750 m above sea level in northern Mexico, deep in the Chihuahuan Desert, and was a coastal region during the Jurassic. It is characterized by the presence of many oligotrophic ponds in the middle of the desert that support large bacterial communities, apparently of marine origin (as shown by Souza and Desnues), which have been studied by my lab group led by Valeria Souza and Luis Eguiarte (the very same place where I'm conducting my thesis). Cerritos isolated many bacilli strains from one of the widest and shallowest ponds (Churince's Laguna Grande) and found many moderately halophilic species (that tolerate slightly salty environments). A novel aerobic strain (m44) belonged to a group of aquatic, moderately halophilic species (B. marisflavi, B. aquimaris, B. vietnamensis), and could not grow on most sugar-containing media (uncommon for the bacilli). The team at CINVESTAV (led by Gaby Olmedo and Luis Herrera-Estrella) sequenced the genome and Alcaraz annotated it. He also conducted most of the sequence analyses, with some help from Siefert from Rice University, Putonti from the UofH and me, during our stay in Houston a year ago. The genome turned out to be the smallest within the bacilli, at 3.35 Mbp, with many mobile elements.

The most important feature of B. coahuilensis is that this is the second Mexican microbial genome sequenced to date (the two bacterial genomes sequenced in Mexico are Rhizobium etli by the CCG and this one), and one whose sequence has been analyzed in the light of ecology and evolution (remember Dobzhansky's maxim?), that is, the adaptations of a formerly marine lineage to an oligotrophic lentic environment.

Specifically, the sequence pointed towards an adaptation to growth in a low-phosphorus environment: namely the presence of sulfoquinovose synthase (sqd1), which synthesises sulfolipids to replace membrane phospholipids (which constitute around 30% of the total phosphorus), never reported before outside chloroplasts and unicellular cyanobacteria. The CINVESTAV team looked into the membrane and corroborated its sulfolipid composition.

The genome also codes for a sensory bacteriorhodopsin gene, reminiscent of its marine origin, where these genes are very abundant (see work by Venter and Rusch). The expression analyses showed it to be constitutive and not light-dependent, suggesting it is an adaptation to shallow-water irradiance exposure.

Analysing the encoded transmembrane importers is a good way to find out what the organism is taking up from the environment, that is, its "feeding habits". The family of iron-siderophore importers is overrepresented in B. coahuilensis, a feature shared with other aquatic bacilli, suggesting that marine bacilli actively scavenge for iron. It also shows a preference for the uptake of single amino acids and not large polypeptides, with an absolute requirement for 8 amino acids and a partial one for another 5, a feature shared with the aquatic, small-genome organism Minibacterium massiliensis.

This, taken together with the fact that it has the lowest number of genes involved in the nitrogen cycle and with the experimental evidence that it cannot utilize a wide variety of sugars, suggests that this organism depends totally on the rest of the community to live, and has evolved from a primitive bacterial component of that community with specific adaptations to a novel environment.

I'm very proud of the product of this collaboration and expect it to continue this way. And I'm also very happy because, from the moment of this publication on, The Dude is able to obtain his PhD!!!

Monday, March 31, 2008

My Geek Pride is hurt: BLOSUM matrices

BLOSUM (BLOcks of Amino Acid SUbstitution Matrix) are the canonical substitution matrices used for scoring protein sequence alignments. In essence, the method counts how often each pair of amino acids is observed in the same position within an alignment, compares that with the frequency expected by chance, and assigns a score to the substitution of a particular residue. BLOSUM matrices built to compare closely related sequences are more stringent and have high numbers (BLOSUM80); the number is the percent-identity threshold applied to the sequences when the matrix is built (in the latter case, sequences sharing at least 80% identity are clustered together and counted as a single sequence).
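For the curious, the core of the scoring idea fits in a few lines of Python. This is a simplified sketch of a log-odds calculation in the style described by Henikoff and Henikoff: count residue pairs per alignment column, estimate observed and expected pair frequencies, and score each pair as twice the base-2 logarithm of their ratio. It deliberately leaves out the sequence-clustering step (the part the matrix number refers to), the toy block is invented, and it has nothing to do with the actual BLOSUM source code.

```python
from collections import Counter
from itertools import combinations
from math import log2


def log_odds_scores(block):
    """Rounded half-bit log-odds scores from an ungapped alignment block.

    block: list of equal-length sequences. No clustering or weighting of
    similar sequences is done here, which real BLOSUM construction requires.
    """
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        if a == b:
            expected = background[a] ** 2
        else:
            expected = 2 * background[a] * background[b]
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores


# Invented three-column block over a two-letter alphabet.
toy_block = ["AAC", "AAC", "ACC", "ACC"]
scores = log_odds_scores(toy_block)
print(scores)
```

Pairs seen more often than chance predicts get positive scores and pairs seen less often get negative ones, which is all a substitution matrix really is.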

BLOSUM matrices were developed in 1992 by Henikoff and Henikoff and since then have been extensively used in all analyses involving protein sequences...

and then, here comes the "AAARRGHHHH!!!"

Styczynski et al. (2008) were killing time looking at the evolution of the BLOCKS database and found the unthinkable.... an error in the source code for the algorithm that calculates the BLOSUM matrices!!!! That means... the results obtained with the available BLOSUM matrices differ significantly from what the algorithm described by Henikoff & Henikoff should produce... merde!

Weirdest thing of all... when the code was corrected and the matrices were tested in database sequence searches, it turned out that the "wrong" matrices performed much better at retrieving protein homologs than the "corrected" matrices.

Fortunately, it seems that though the difference is statistically significant, it is not big. That means we haven't fucked it up so bad.

Epilogue to the blosum...

1) 16 years of extensive usage doesn't mean it is RIGHT.

2) how come no one, ever, in 16 years, noticed this difference!!! THAT is what happens with dogma... when you take anything for granted

3) messing things up is not always THAT bad...

4) I didn't understand from the article whether they proposed that the matrices be corrected even though the corrected ones performed worse...

5) I would expect to see a huge ocean of errata everywhere because "when using the revised blosum matrices... our results from the past ten years have completely changed!!"

Wednesday, November 21, 2007

To Metagenome or not to Metagenome

A masterpiece of science blogging was posted here (http://suicyte.wordpress.com/2007/11/20/smallest-primate-ever-discovered/), addressing the finding of primate sequences in the GOS dataset, just one of the unassessed ambiguities in metagenomics.

The post is beautifully written and made me laugh really loud. The point is that metagenomics is sometimes being oversold in the very same way as genomics has been (see Eisen's blog for just some examples) and this leads to a skeptical counterwave.

My brief cons and pros:

a) metagenomics offers indeed the unprecedented opportunity to explore unculturable microbial diversity, which no other tool can do.

b) metagenomics is not only a technological advance like genomics was, it fits perfectly in an ecological (community/ecosystemic) theoretical background

c) no matter what everybody says, the GOS sampling has provided an incredible amount of data on previously unknown (and often unimagined) microbial diversity

d) criticisms of the amount of money spent on metagenomics seem to me like questioning the financial support for Humboldt's (or any other naturalist's) voyages, which were exploratory and not precisely focused on any hypothesis.

e) metagenomics is obviously error-prone, and its biases have been poorly evaluated

f) metagenomics is much more useful in small, simple communities where a reasonable coverage can be achieved

g) metagenomics is much more useful in well known, deeply studied natural communities where it is employed to answer specific biological questions

h) metagenomics is expensive!

i) a great deal of work is still to be done on: defining parameters for comparing different samples, assessing taxonomical and functional biases, increasing assembly effectiveness and contig construction, improving functional prediction, developing tools for the analysis of such huge datasets, etc.

j) metagenomics is best when interdisciplinary, that is when it's used along with techniques and analyses from other disciplines that might provide physiological, evolutionary or ecological information

That being said, metagenomics rocks!

Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, Wu D, Eisen JA, Hoffman JM, Remington K, Beeson K, Tran B, Smith H, Baden-Tillson H, Stewart C, Thorpe J, Freeman J, Andrews-Pfannkoch C, Venter JE, Li K, Kravitz S, Heidelberg JF, Utterback T, Rogers YH, Falcón LI, Souza V, Bonilla-Rosso G, Eguiarte LE, Karl DM, Sathyendranath S, Platt T, Bermingham E, Gallardo V, Tamayo-Castillo G, Ferrari MR, Strausberg RL, Nealson K, Friedman R, Frazier M, & Venter JC (2007). The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS biology, 5 (3) PMID: 17355176
