Introduction to Metabolomics and Flow Chart of Needs for Metabolomics Software
Introduction to the Metabolomics Pipeline Concept
To gather metabolomics data from a biological sample, one typically uses a combination of chromatography [e.g., gas chromatography (GC) or liquid chromatography (LC)] with forms of detection that gather the most global information on the full range of metabolites of interest. Due to its global detection of nearly all forms of metabolites, mass spectrometry (MS) is nearly always a component. A typical metabolomics analysis might thus use GC-MS or LC-MS. While the pipeline concept for analysis we describe here can be applied to these and many other analytical platforms, we will limit our discussion here to data derived from LC-MS. In our lab, we use an ion-trap MS, but the data processing we recommend would likely work with most MS instrumentation. Finally, while the recommendations we make here will work with any biological organism, we will illustrate our pipeline (in our applications and tutorials) with examples from plant tissues.
Once one has obtained LC-MS chromatograms for a sample of interest, it is often of great interest to compare the metabolites in that sample to those in others. Examples include comparisons of different tissues, of the same tissue over time (developmental), of tissues from different plant cultivars, of plants with different levels of resistance to an insect or pathogen, and of infected and control tissues. However, comparing complex metabolomics data from multiple biological samples is very challenging. A truly global comparison of metabolites between samples may involve many hundreds, if not thousands, of metabolites. Among the major challenges with such complex data are the resolution and alignment of the metabolites.
Figure 1 shows the many steps needed to reach the final array of data (mass vs. retention time for each of the multiple samples) used for modeling and exploration. Once raw LC-MS data is obtained, it must first be converted into a chromatogram format that allows cross comparisons in the software of interest. Next, the data needs to be deconvoluted, which means that the mass spectral data for each individual peak within a group of overlapping peaks is separated to the extent possible. Next, one rescans the chromatogram to separate peaks from background noise. This is a critical step since it defines both the inclusiveness and the rigor of the data. One then aligns the data between multiple chromatograms; the algorithm differs depending on the software used, but it basically involves iterative tentative alignments with retention time corrections. In our lab, we then perform a peak quality assessment to make certain that the aligned peak database contains peaks that meet our criteria. It is important to note that each of these data preparation steps involves many different parameters that can be adjusted to determine how inclusive and rigorous the final data set is. These are described more fully in our evaluations of current software and our tutorials. Finally, a comparative array is established in a spreadsheet format that is much like a nucleic acid microarray. This data table will include, at a minimum, the mass, retention time, and peak magnitude (e.g., total ion count) for each biological sample. It can also include other elements of interest, such as isotopes, adducts, and peak quality parameters, depending on the software used in processing the data.
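To make the alignment and comparative-array steps more concrete, the following is a minimal sketch in Python. It is not the algorithm used by any particular software package or by our pipeline: it is a simple one-pass greedy grouping with fixed tolerances, without the iterative retention time correction described above. The names Peak, align_peaks, mz_tol, and rt_tol, the tolerance values, and the example peak lists are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Peak:
    mz: float         # mass-to-charge ratio
    rt: float         # retention time in minutes
    intensity: float  # peak magnitude, e.g. ion count


def align_peaks(samples, mz_tol=0.5, rt_tol=0.2):
    """Group peaks from several samples into shared features.

    samples maps a sample name to its list of Peak objects.
    Returns one row per feature with the mean mass, mean retention
    time, and one intensity column per sample (0.0 when absent).
    The tolerances are illustrative assumptions, not recommendations.
    """
    features = []
    for name, peaks in samples.items():
        for p in peaks:
            match = None
            for f in features:
                close = (abs(f["mz"] - p.mz) <= mz_tol
                         and abs(f["rt"] - p.rt) <= rt_tol)
                # Allow at most one peak per sample in a feature.
                if close and name not in f["intensities"]:
                    match = f
                    break
            if match is None:
                features.append({"mz": p.mz, "rt": p.rt, "n": 1,
                                 "intensities": {name: p.intensity}})
            else:
                # Update the running means of mass and retention time.
                n = match["n"]
                match["mz"] = (match["mz"] * n + p.mz) / (n + 1)
                match["rt"] = (match["rt"] * n + p.rt) / (n + 1)
                match["n"] = n + 1
                match["intensities"][name] = p.intensity

    rows = []
    for f in sorted(features, key=lambda f: (f["rt"], f["mz"])):
        row = {"mz": round(f["mz"], 3), "rt": round(f["rt"], 3)}
        for name in samples:
            row[name] = f["intensities"].get(name, 0.0)
        rows.append(row)
    return rows


# Hypothetical peak lists for two tissues.
samples = {
    "leaf": [Peak(301.1, 5.02, 1000.0), Peak(455.3, 9.80, 250.0)],
    "root": [Peak(301.2, 5.10, 400.0), Peak(611.2, 12.40, 90.0)],
}
table = align_peaks(samples)
for row in table:
    print(row)
```

Each printed row corresponds to one line of the comparative array: a mass, a retention time, and a peak magnitude for every sample. Tightening or loosening the two tolerances is one simple example of how a parameter choice changes the inclusiveness and rigor of the final data set.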
The final step is to use software to visualize and analyze the very complex data set. This includes routine statistical analyses and modeling efforts such as scatter plots, Venn diagrams, principal component analysis, hierarchical trees, heat maps, etc.
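As an illustration of one of these analyses, the sketch below computes the first principal component of a small sample-by-feature intensity table in plain Python. It uses power iteration on the covariance matrix, which is only one of several ways to compute a principal component and is not necessarily what any given statistics package does internally. The function name first_principal_component, the fixed iteration count, and the example intensities are assumptions made for illustration.

```python
import math


def first_principal_component(matrix, iterations=200):
    """Return (loadings, scores) for the first principal component.

    matrix holds one row per sample and one column per feature
    (e.g. peak magnitudes from the comparative array).
    """
    n_samples = len(matrix)
    n_features = len(matrix[0])

    # Center each feature on its mean across samples.
    means = [sum(row[j] for row in matrix) / n_samples
             for j in range(n_features)]
    centered = [[row[j] - means[j] for j in range(n_features)]
                for row in matrix]

    # Sample covariance matrix (features x features).
    cov = [[sum(r[i] * r[j] for r in centered) / (n_samples - 1)
            for j in range(n_features)]
           for i in range(n_features)]

    # Non-uniform starting vector; power iteration fails only if the
    # start happens to be exactly orthogonal to the first component.
    v = [float(j + 1) for j in range(n_features)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]

    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(n_features))
             for i in range(n_features)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance in the data
            break
        v = [x / norm for x in w]

    # Project each centered sample onto the component.
    scores = [sum(r[j] * v[j] for j in range(n_features))
              for r in centered]
    return v, scores


# Hypothetical intensities: 4 samples x 3 features.
intensities = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [3.0, 6.0, 0.0],
    [4.0, 8.0, 0.0],
]
loadings, scores = first_principal_component(intensities)
print(loadings)
print(scores)
```

The scores give one coordinate per sample and are what would be drawn in a scatter plot of samples, while the loadings indicate which features contribute most to the separation.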
Most of the components of Figure 1 are available individually in various public and commercial software packages, but we found no overall package that included all components with the rigor and completeness that we desired. We thus undertook an exhaustive, nearly two-year evaluation of available software to form a pipeline for the overall analysis. Please see our evaluation page for details on our considerations. Our final recommendations include two major components, with additional scripts that we wrote to make up for small deficiencies and to link the components together effectively. The full comparison of multiple biological samples, from raw data to the visualization and exploration of modeled data, routinely takes us about ½ hour.