Our Evaluations of Current Metabolomics Software
Considerations for Choice of Metabolomics Software
MR Kelly, S Opiyo, TL Graham
I. Overview: Preparation and Analysis of Metabolomics Data
The major processes in the conversion of raw LC-MS data into a format for bioinformatic analysis and subsequent modeling are shown in Figure 1. Of these, perhaps the most important steps are the filtering of data, picking of discrete peaks, assessment of the rigor and quality of peaks, alignment of the peaks across chromatograms, statistical analyses, and data modeling. Here we present our experiences, notes and recommendations on the relative strengths and weaknesses of currently available platforms for all of the steps outlined in Figure 1. We also present our current recommendations and a short manual for what we feel is the most user-friendly yet rigorous pipeline for these processes.
II. Current Platforms for Metabolomics - Reviews and Recommendations
In choosing the recommended elements of our metabolomics pipeline, we evaluated many public and several commercial programs for their effectiveness in the various steps alluded to above and in Figure 1. This involved using the programs for the analysis of soybean tissues in many experiments, including tissue-specific and/or developmental changes in metabolites, changes in metabolites during infection or elicitor treatment, different lines with resistance or susceptibility to pathogens or insects, etc. We chose soybean because of our extensive use of metabolic profiling and our knowledge of the secondary products in this plant species ( ), which allowed us to validate all analyses for effectiveness, rigor and completeness. This process took several months of use of each platform, including troubleshooting of possible bugs or anomalies. Here we pass on the most important attributes, strengths and weaknesses of each platform.
A) Data Extraction and Data Summary
1. Advanced Chemical Development (ACD) Intelliextract (IX) Compare
ACD IX-Compare is a commercial program with a very professional user interface, developed specifically for use with chromatographic data. It is an excellent platform for extracting data from raw LC-MS data, aligning chromatograms, filtering data and forming a data summary table across treatments. A particular strength is its extraction of high quality peaks, including an editable formula to calculate a peak quality index. Export of comparative data into Excel is easy, but some data manipulation is required within Excel to achieve a meaningful data comparison. A major weakness is that all comparisons across chromatograms are made by comparing only those peaks present in a reference chromatogram (e.g., a control). Peaks that are not present in the reference chromatogram are not considered or presented in data summaries. Another obvious weakness is the inability to access the code for the software, either to evaluate the actual processes used in data analysis or to make changes in data handling.
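The practical consequence of a reference-restricted comparison can be illustrated with a minimal Python sketch. This is not ACD's algorithm; the peak identifiers, intensities and function names below are invented for illustration, and a peak absent from a chromatogram is simply assumed to have an intensity of zero.

```python
# Hypothetical peak tables: peak ID -> intensity (all values invented).
control = {"M255T310": 1200.0, "M271T405": 850.0}
treated = {"M255T310": 1100.0, "M271T405": 4300.0, "M339T512": 2600.0}


def compare_to_reference(reference, sample, missing=0.0):
    """Compare only the peaks that are present in the reference chromatogram."""
    return {peak: (reference[peak], sample.get(peak, missing))
            for peak in sorted(reference)}


def compare_union(reference, sample, missing=0.0):
    """Compare every peak seen in either chromatogram."""
    peaks = sorted(set(reference) | set(sample))
    return {peak: (reference.get(peak, missing), sample.get(peak, missing))
            for peak in peaks}


reference_only = compare_to_reference(control, treated)
all_peaks = compare_union(control, treated)

# Peaks that a reference-restricted summary would never report.
dropped = sorted(set(all_peaks) - set(reference_only))
print(dropped)
```

In this toy case the peak induced only in the treated sample is exactly the one that disappears from the reference-restricted summary, which is often the peak of greatest biological interest.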
2. MZMine
MZMine, written in Java, is a public platform that provides a very thorough extraction of peaks within chromatograms and good peak alignment and comparison. Particular strengths include the generation of scatter plots within the program and the ability to access all data for any metabolite within a scatter plot (including a very useful mini-chromatogram for that peak) simply by clicking on its data point. Another strength is that the scatter plots include all data for the samples being compared, rather than just those peaks present at detectable levels in all samples. A major weakness is the lack of methods for data filtration based on peak quality. While you can visualize the quality of peaks through a peak list that includes mini-chromatograms and reject peaks of poor quality, this is a tedious, manual and subjective process.
3. XCMS
XCMS is a set of scripts written in R that extracts and transforms raw LC-MS data into a table that can be exported into Excel or plotted with other R scripts. There are many parameters for each step in data extraction and treatment, and together these steps encompass all of the processes shown in Figure 1. While it takes some time to optimize these parameters, the final scripts that we developed (modified from those suggested by Smith, ) work very well as a starting point for all data sets we have analyzed. In our manual for our recommended metabolomics pipeline, we provide descriptions of these parameters and their usage, including the major parameters for which modifications may be efficacious. Peak alignment and high quality peak extraction are excellent. CAMERA is an add-on for XCMS and comprises scripts that work seamlessly with XCMS to extract additional data, including annotations of isotopes and adducts. This is very useful in efforts for peak identification, since adducts (e.g., sodium adducts) are commonly encountered and multiple peaks due to isotopes are always present. The ability to identify isotopes is also very useful in accounting for them in the overall quantification of individual metabolites. Strengths of XCMS include ultimate control over all aspects of the processes applied to your data as well as extraction of the other information provided by CAMERA. All of this can be readily exported into "csv" files for import into Excel. A weakness of XCMS/CAMERA, which is however shared with all of the data extraction platforms, is that it does not include built-in graphic modules for data visualization and modeling. While these can be added with additional R scripts, we found that XCMS truly excels primarily in data extraction and summary.
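The idea behind isotope and adduct annotation can be shown with a small Python sketch. It is not CAMERA's algorithm, which also groups peaks by retention time and peak shape. It only flags pairs of co-eluting m/z values whose spacing matches a 13C isotope or a sodium adduct, assuming singly charged positive ions. The tolerance, the function name and the example values (which approximate those expected for the soybean isoflavone daidzein) are assumptions made for illustration.

```python
from itertools import combinations

C13_SPACING = 1.003355   # mass difference between 13C and 12C (Da)
NA_MINUS_H = 21.981943   # spacing between [M+Na]+ and [M+H]+ (Da)
TOLERANCE = 0.005        # assumed m/z tolerance (Da); instrument dependent


def annotate_pairs(mz_values, tolerance=TOLERANCE):
    """Label pairs of co-eluting m/z values by their characteristic spacing."""
    annotations = []
    for low, high in combinations(sorted(mz_values), 2):
        delta = high - low
        if abs(delta - C13_SPACING) <= tolerance:
            annotations.append((low, high, "13C isotope"))
        elif abs(delta - NA_MINUS_H) <= tolerance:
            annotations.append((low, high, "sodium adduct"))
    return annotations


# Hypothetical m/z values observed at one retention time.
coeluting = [255.0652, 256.0686, 277.0471]
annotations = annotate_pairs(coeluting)
for low, high, label in annotations:
    print(f"{high:.4f} is a {label} of {low:.4f}")
```

Once such relationships are known, the three peaks can be treated as one metabolite for identification and their intensities combined for quantification.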
B) Programs for Data Visualization and Modeling
Although some of the data extraction programs noted above provide means for some data visualization and modeling (e.g., MZMine and custom R scripts to complement XCMS), none of them provides a full range of options for these processes.
1. Array Star (a component of DNA Star)
We initially felt that the similarity between metabolomic and transcriptomic data might allow programs already developed for nucleic acid microarray data to be used for visualization and modeling of metabolomics data. Following a brief preliminary evaluation of these programs, we chose Array Star (a commercial platform) for further evaluation. It did an outstanding job of loading the data, including the identification and averaging of replicate sets. The various visualization and modeling options (scatter plots, heat maps, Venn diagrams, etc.) were very easy to use and of excellent graphical quality. Weaknesses, however, some of which were not apparent until much deeper examination, included the fact that the "mean" values presented for replicates were either geometric means of linear data or means of the log2 data. All plots use log2 data as a default. There was no option to generate arithmetic means. The use of geometric means caused a very prominent skewing of the linear (semi-)quantitative data from LC-MS, and, at least for the metabolomics data we used, this led to some problems in proper data interpretation. Also, the scatter plots included only those metabolites present in all samples, with no ability to override this. Other weaknesses included the high cost of the software and the lack of transparency of the algorithms used.
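The size of this skew is easy to check numerically. The following Python sketch uses three invented replicate intensities for a single peak and compares the arithmetic mean with the geometric mean and with the back-transformed mean of the log2 values; the last two are mathematically the same quantity.

```python
import math
import statistics

# Hypothetical replicate intensities for one peak (values invented).
replicates = [200.0, 400.0, 6400.0]

arithmetic = statistics.fmean(replicates)
geometric = statistics.geometric_mean(replicates)

# Averaging log2 values and transforming back gives the geometric mean.
log2_mean = statistics.fmean([math.log2(x) for x in replicates])
back_transformed = 2 ** log2_mean

print(f"arithmetic mean:        {arithmetic:.1f}")        # about 2333.3
print(f"geometric mean:         {geometric:.1f}")         # about 800.0
print(f"2 ** mean(log2 values): {back_transformed:.1f}")  # about 800.0
```

With variable replicates the geometric mean falls well below the arithmetic mean (here by nearly threefold), so any platform that reports only geometric or log2 means will understate peaks with one or two high replicates.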
2. MeVp
MeVp, written in Java, has many of the features of Array Star in a public software format. We found, however, that it was prohibitively slow in execution, which made data modeling very difficult. While it runs somewhat better on a Mac, it may also have contributed to a serious problem with one of our PCs.
3. XLStat and 3D-Plot
Our overriding concern with keeping complete control over our data throughout the processes of data visualization and modeling led us to evaluate a very thorough statistical package that one of us had previously found very effective in data visualization and principal component analysis. The overall modeling involves two programs from Addinsoft, XLStat and 3D-Plot, that work seamlessly together and allow extensive options for data sampling, data reduction, statistical analysis and data visualization and modeling. They are very well integrated as macros within Excel, so that all analyses after data extraction (e.g., with XCMS) are very convenient, rapid and user-friendly. A particular strength of XLStat/3D-Plot is that entire data sets (including individual data for each replicate, averages, calculated values for fold change between treatments, annotations of peak identity, etc.) are very rapidly pulled into 3D-Plot in a format from which many types of analyses and plots can be very rapidly generated. The data are conveniently organized into projects, and all plots, modeling efforts, etc. are maintained for future comparison. Moreover, a modeling tab allows very rapid changes in the choice of data that are plotted for comparison and how they are presented. The user also has ultimate control over which overall data matrix is used for plotting, simply by sorting/filtering within Excel. The three-dimensional nature of the scatter plots, and the ability to rotate them, presents data in a format that is superior to normal scatter plots and even Venn diagrams for data visualization and modeling. Additional dimensions for data visualization are available through the size or the color of the data points plotted. For example, if one uses data point size as a fourth dimension, four separate treatments can be compared on one graph, while color can be used to present the mass of the metabolites or their metabolic class.
PCA analyses can also be generated in three dimensions, allowing three factors to be compared directly, with data point size and color used to present additional information. Trellis and hierarchical plots can also readily be generated and are very valuable as alternatives for data interpretation. Clicking on a data point in each type of plot generates a complete summary of the data and information for that metabolite present in the original Excel file. Moreover, data points of interest can be readily selected and exported into a new Excel or text file for further analysis. Although heat maps are not available, we have written a separate macro to generate these directly in the Excel file.
III. Recommendations for a Metabolomics Pipeline
The results of nearly two years of evaluation of platforms for all of the steps alluded to in Figure 1 have led us to the following recommendations. All of these steps are very easily carried out with minimal training.
1) Data Extraction: Import raw data from the LC-MS into XCMS for data extraction and summarization. Perform XCMS, with or without CAMERA, on the data. Use our customized R scripts as a starting point and refer to our manual for optional changes in parameters. Export the data from XCMS into a csv file and import that into Excel.
2) Data Summary: Perform whatever manipulations on the data you wish in Excel. Major examples of this would include averaging replicate data, fold differences between treatments, simple statistical analyses and annotations of peak identity. Once you have decided what manipulations you wish to do, you could develop macros to facilitate this step.
3) Generate Heat Map: Use a macro (here is a simple example, link) to make a heat map of the data for rapid browsing and preliminary comparison.
4) Data Filtration: Sort and filter the data in Excel to choose the most relevant data for visualization and modeling.
5) Data Modeling: Import the final data summary in Excel into the 3D-Plot module associated with XLStat. Use the various plotting and modeling options within 3D-Plot to model your data. For PCA analysis, run this within the main XLStat module and replot the Factor values in 3D-Plot. See our manual for some suggestions and examples.
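For readers who prefer to script steps 2 and 4 rather than perform them by hand in Excel, the following Python sketch shows one possible approach. It is not part of our pipeline or our macros. The csv layout, column names, intensity floor and two-fold threshold are all assumptions chosen for illustration, and the table is embedded as text so that the example is self-contained.

```python
import csv
import io
import statistics

# Hypothetical csv export: one row per peak, one column per replicate.
CSV_TEXT = """peak,control_1,control_2,control_3,treated_1,treated_2,treated_3
M255T310,1000,1100,900,4000,4200,3800
M271T405,500,520,480,510,490,500
M339T512,0,0,0,2500,2700,2600
"""


def summarize(rows, control_cols, treated_cols, floor=1.0):
    """Average replicates and compute the treated/control fold change.

    `floor` is an assumed minimum intensity that avoids division by zero
    when a peak is absent from one treatment.
    """
    summary = []
    for row in rows:
        control = statistics.fmean([float(row[c]) for c in control_cols])
        treated = statistics.fmean([float(row[c]) for c in treated_cols])
        summary.append({
            "peak": row["peak"],
            "control_mean": control,
            "treated_mean": treated,
            "fold_change": max(treated, floor) / max(control, floor),
        })
    return summary


def filter_by_fold(summary, min_fold=2.0):
    """Keep peaks changed at least `min_fold` in either direction, largest first."""
    kept = [r for r in summary
            if r["fold_change"] >= min_fold or r["fold_change"] <= 1.0 / min_fold]
    return sorted(kept, key=lambda r: r["fold_change"], reverse=True)


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
control_cols = ["control_1", "control_2", "control_3"]
treated_cols = ["treated_1", "treated_2", "treated_3"]

summary = summarize(rows, control_cols, treated_cols)
selected = filter_by_fold(summary)
for r in selected:
    print(r["peak"], round(r["fold_change"], 1))
```

To use this on a real export, replace `io.StringIO(CSV_TEXT)` with an open file handle and adjust the column lists. The filtered rows can then be written back to csv for import into Excel and 3D-Plot.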