
Manual for Our Recommended Software Pipeline

A Software Pipeline for Modeling of LC-MS Based Metabolomics Data

MR Kelly, SO Opiyo, TL Graham

I. Introduction

Figure 1 shows the major steps required for the conversion of raw LC-MS data into a data summary for bioinformatics analysis and modeling.  We describe below our recommendations for a software pipeline that achieves this using very simple programs, most of which are publicly available.  For details of why we have chosen the components described here, please refer to our reviews of various available platforms (  ).  We wish to emphasize upfront that while all of these steps are very easily carried out with minimal training, the proper and rigorous interpretation of the results, particularly the very important annotation of the identities of metabolites, requires some fairly advanced training in analytical and organic chemistry (see section XX).

A brief summary of the steps in the pipeline, from raw data to modeling, is given below.  In the pipeline, we use a set of publicly available scripts programmed in R called XCMS (an acronym for various forms (X) of chromatography mass spectrometry; [1]).  XCMS is our program of choice for data extraction (steps 1-X in Figure 1).  We then employ a commercially available set of macros for Excel designed for statistics and modeling (steps X and X in Figure 1).  The macros, collectively called XLStat and 3D-Plot (from Addinsoft), are packaged together and work seamlessly with the Excel output from XCMS for data modeling.  Thus, in essence there are only two steps in the pipeline, although we also describe optional steps for preparation and filtration of the XCMS Excel output to provide a more complete summary and intuitive format for modeling.

In short, the various steps in the pipeline are:

1)    Data Extraction (see section XX for details): Import raw data from the LC-MS into XCMS for data extraction and summarization.  Run XCMS, with or without CAMERA, on the data.  Use our customized R scripts as a starting point and refer to Section XX for optional changes in parameters.  Export the data from XCMS into a csv file and import it into Excel.

2)    Data Preparation for Modeling (see section XX for details): Perform whatever manipulations on the data you wish in Excel.  Major examples include averaging replicate data, calculating fold differences between treatments, performing simple statistical analyses, and annotating peak identities.  Once you have decided which manipulations you wish to do, you could develop macros to facilitate this step.  Finally, use a macro to make a heat map of the data for rapid browsing, preliminary comparisons, and choice of data for further analysis.  We present a very simple macro in Section XX.

3)    Data Modeling (see Section XX for details):  Import the final Excel data summary from Step 2 into the 3D-Plot module within XLStat.  Use the various plotting and modeling options within 3D-Plot to model your data.  For PCA, run the analysis within the main XLStat module and replot the factor values in 3D-Plot.  If desired, sort and filter the data in Excel to choose the most relevant data for visualization and modeling.  This is often of great use in highlighting specific changes from sample to sample.  See Section XX for details and some suggestions and examples.


II. Data Extraction

XCMS is a set of scripts written in R that extracts and transforms raw LC-MS data into a table that is then exportable into Excel for further analysis.  XCMS carries out steps 1-X in Figure 1.  There are many parameters for each step in data extraction and transformation shown in Figure 1.  While it takes some time to optimize these, the final scripts that we developed (modified from those suggested by Smith et al. [1]) work very well as a starting point for all data sets we have analyzed.

Setting up R to run the XCMS Scripts.  To run XCMS, you must first have R installed.

Download R from the CRAN webpage:

Download and run the executable file for the latest release (at the time of writing, R 2.15.1).  For example, download R 2.15.1 for Windows and run the R-2.15.1.exe executable file.  After launching R, set the R working directory to tell R where to find the data file and design file; this also tells R where to store the analysis results and associated figures.  To set the working directory, choose 'Change dir...' from the 'File' menu, and a dialog box appears that lets you browse to the directory (folder) to use as the working directory.

Starting and Quitting R
Start: launch R (or type R at the command line).
Quit: q(); you will be prompted to save the workspace image.
Working directory: getwd(), setwd().
List objects: ls(), objects().
Remove objects: rm(), remove().
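As an illustration, a short R session using these commands (the objects and directories below are examples only):

```r
old_wd <- getwd()        # report the current working directory
setwd(tempdir())         # point R at the folder holding your data files
x <- 1:10                # create an example object in the workspace
ls()                     # list the objects now in the workspace
rm(x)                    # remove the object when it is no longer needed
setwd(old_wd)            # restore the original working directory
# q() quits R and prompts to save the workspace image
```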

Installing RStudio
RStudio is a free and open-source integrated development environment (IDE) for R.  RStudio can only be used after R has been installed.  Download the Desktop version of RStudio from the RStudio website.

Preparation of data files for import into XCMS.


XCMS Scripts.  The complete documentation for the R scripts for XCMS can be found at XXX.  Additional shorter descriptions for getting started for metabolite extraction from LC-MS data can be found in XXX.  We have, of course, drawn very heavily from these resources.  After a very extensive evaluation of parameters for each command line in XCMS on many complex data sets (see Appendix X for a description of our instrumentation and data sets), we developed the following set of scripts for XCMS.  Please note that for simplicity we have not included all possible parameters in the command lines.  Modifications in the parameters may be required to optimize data extraction from the raw data for your particular applications or instrumentation.  In Appendix XX we provide a short description of the command lines and parameters as well as our suggestions for adapting the parameters to particular data sets.

Additional Data Filtration Scripts Added to XCMS.  We have added several scripts to the standard XCMS scripts that we find add to its usefulness.  Identified in bold, these include two sets of data filtration steps, which set limits on the ranges of retention time and mass that are output by XCMS.  For instance, in most of our applications we limit the mass range to 100-1500, since we have found that many of the putative peaks from 1500-2000 are fragments of polymers (proteins and polysaccharides) in which we are not interested.  Also, in our chromatograms, peaks at very early and very late retention times include small molecules and very non-polar molecules, such as chlorophyll and carotenoids, that are not of interest to us.  While we retain these data in our raw chromatographic data, their inclusion greatly adds to the complexity of the data for modeling.  These command lines can simply be removed if they are not of interest to you, and the ranges can be adjusted by modifying the scripts before pasting them into R.
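Such a filtration step can be sketched in base R on a generic peak table (the column names mz and rt, the toy values, and the cutoff windows below are illustrative assumptions, not part of our XCMS scripts):

```r
# Toy peak table in the shape produced by data extraction.
peaks <- data.frame(
  mz = c(85.1, 250.3, 880.6, 1750.2),   # mass-to-charge values
  rt = c(15, 320, 610, 1150),           # retention times, in seconds
  intensity = c(1e4, 5e5, 2e5, 8e3)     # integrated ion counts
)

# Keep only peaks inside the mass and retention-time windows of interest.
filtered <- subset(peaks, mz >= 100 & mz <= 1500 & rt >= 60 & rt <= 1100)
```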


Optional use of CAMERA.  CAMERA (XXX) comprises additional R scripts that work seamlessly with XCMS to extract additional data, including annotations of isotopes and adducts.  The scripts for CAMERA are described in documentation at XXX.  CAMERA is very useful in peak identification, since adducts (e.g., sodium adducts, which add a mass of approximately 23 to the MS spectra of individual metabolites) are commonly encountered and, if not taken into account, can greatly confound the identification of peaks.  We have found that CAMERA works well in identifying adducts, but it takes some work to decide which putative adducts are truly present.  Also, in our experience, it does not catch all adducts, so some manual interpretation is necessary.  Please see Section XX for more discussion of adducts and suggestions on accurate peak identification.

CAMERA also identifies potential isotopes.  The most common isotopes encountered in LC-MS chromatograms are those corresponding to 13C.  If a metabolite has many carbons, several isotope peaks (e.g., M+1 and M+2) may be present.  Isotopes are always present and, in fact, their presence is sometimes used as a criterion for the quality of the data.  In practice, with very complex metabolic profiles, and especially with data from lower-resolution machines (such as the ion trap we use as a workhorse in our research), we have found that isotopes are not always abundant enough to be recognized by XCMS/CAMERA.  Nonetheless, the ability to identify isotopes is very useful in accounting for them in the overall quantification of individual metabolites.
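The mass arithmetic behind these adduct and isotope annotations is simple; a sketch in base R (the neutral monoisotopic mass M below is an illustrative example, and the mass constants are rounded):

```r
M      <- 254.0579    # example neutral monoisotopic mass (here, daidzein, C15H10O4)
proton <- 1.00728     # mass of a proton, in Da
sodium <- 22.98922    # mass of the sodium cation, in Da

mz_H  <- M + proton       # expected m/z of the protonated ion [M+H]+
mz_Na <- M + sodium       # expected m/z of the sodium adduct [M+Na]+
mz_M1 <- mz_H + 1.00336   # expected m/z of the M+1 (one 13C) isotope peak of [M+H]+
```

A sodium adduct thus appears about 21.98 m/z units above the protonated ion, and successive 13C isotope peaks are spaced about 1.003 m/z units apart.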

Our starting XCMS scripts with CAMERA added in are given below.  CAMERA simply adds additional columns to the csv file exported to Excel.  Their use is further discussed in Section XXX on Metabolite Identification and Annotation.

III. Data Preparation for Modeling

The output from the R scripts for XCMS or XCMS/CAMERA contains all of the data extracted from the raw LC-MS data, including columns for mass, retention time, ion counts, putative isotopes, adducts, etc., for each metabolite; the metabolites are listed as rows.  Depending on how you set up your experiment and import your files into XCMS, the data for each replicate treatment in an experiment may also be shown.  We routinely add columns with formulas for calculation of the arithmetic mean of the replicates.  Simple statistics on the replicate means can also readily be added in Excel, as can columns for ANOVA, etc.  A very important addition to the data set is a column for specific peak annotation or identity (an example for soybean would be the malonyl-glucosyl conjugate of the isoflavone daidzein, which we abbreviate MGD).  In our case, where we know the identity of many peaks, we also add a column for metabolic class (examples for soybean would be "isoflavone" or "flavonol glycoside").  As we will see in Section XX on XLStat, these additional columns are of great use in data modeling.
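The same replicate-averaging and fold-difference calculations can also be sketched in base R (the peak names, replicate layout, and values are illustrative assumptions):

```r
# Toy summary table: three replicates each for a control and a treatment.
d <- data.frame(
  peak   = c("MGD", "daidzein"),
  ctrl_1 = c(100, 50), ctrl_2 = c(120, 55), ctrl_3 = c(110, 60),
  trt_1  = c(300, 45), trt_2  = c(330, 50), trt_3  = c(360, 40)
)

# Arithmetic means of the replicates, then the fold difference between treatments.
d$ctrl_mean <- rowMeans(d[, c("ctrl_1", "ctrl_2", "ctrl_3")])
d$trt_mean  <- rowMeans(d[, c("trt_1", "trt_2", "trt_3")])
d$fold      <- d$trt_mean / d$ctrl_mean
```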

Generation of a heat map within Excel.  A very simple macro for generation of a heat map within your final Excel summary file is given below.  The cutoff values and colors can be readily modified.  Thanks to XXX for the basic macro.  This macro creates a heat map that is superimposed on the data for the replicates and their averages.  The result is a heat map with the actual data values underneath, which we found is very valuable for browsing and cross-checking the data.  For instance, it becomes readily apparent if replicates differ widely in total ion count, which might suggest outliers or possibly defective LC-MS chromatograms.  The heat map, especially when combined with data sorting (e.g., by fold difference), also provides a very rapid tool for choosing data for further modeling.  Finally, if you wish to make a traditional heat map, without the data shown underneath, you can cut and paste the data fields into another instance of Excel and simply clear the contents of the cells.


Generation of Venn diagrams using the R package VennDiagram
Before running the Venn diagram scripts, install an R package called VennDiagram.
Organize your data so that the first variable starts in column two.  Use Venn2_script to generate a two-set Venn diagram, Venn3_script to generate a three-set Venn diagram, and Venn4_script to generate a four-set Venn diagram.
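For example, an input file in the expected layout (an identifier in column one and one set per column from column two onward) can be written from base R as follows; the file path and set contents are illustrative assumptions:

```r
# Each set column lists the member identifiers of that group.
venn_input <- data.frame(
  id      = 1:4,
  Control = c("M100", "M205", "M330", "M412"),
  Treated = c("M100", "M205", "M518", "M630")
)

# Write the file that the Venn scripts read; the sets begin in column two.
out_file <- file.path(tempdir(), "your_file_name.csv")
write.csv(venn_input, out_file, row.names = FALSE)
```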

IV. Data Modeling in XLStat

While one can do much data plotting and modeling directly in Excel, especially after sorting and filtering of data, we highly recommend the Excel macros contained in the packages called XLStat and 3D-Plot from Addinsoft.  These are currently available as an academic package for about $700, which includes a perpetual license for a single computer; additional licenses can be bought at a discount.  The two programs work seamlessly together and allow extensive options for data sampling, data reduction, statistical analysis, and data visualization and modeling.  Their use is very well integrated as macros within Excel, so that all analyses after data extraction (e.g., with XCMS) are very convenient, rapid, and user friendly.

Particular strengths of XLStat/3D-Plot are that entire data sets (including individual data for each replicate, averages, calculated values for fold change between treatments, annotations of peak identity, etc.) are very rapidly pulled into 3D-Plot in a format in which many types of analyses and plots can be generated.  The data are conveniently organized into projects, and all plots, modeling efforts, etc., are retained for future comparison.  Moreover, a modeling tab allows very rapid changes in the choice of data that are plotted for comparison and in how they are presented.  The user also has ultimate control over the overall data matrix used for plotting, simply by sorting and filtering within Excel.  The three-dimensional nature of the scatter plots, and the ability to rotate them, presents data in a format that is superior to normal scatter plots and even Venn diagrams for data visualization and modeling, and additional dimensions for visualization are available through the size or color of the plotted data points.

As an example, if one uses data point size, four separate treatments can be compared on one graph, and color can be used to present the mass of the metabolites or their metabolic class, etc.  PCA analyses can also be generated in three dimensions, allowing three factors to be compared directly, with data point size and color presenting additional information.  Trellis and hierarchical plots can also readily be generated and are very valuable as alternatives for data interpretation.  Clicking on a data point in any type of plot generates a complete summary of the data and information for that metabolite in the original Excel file.  Moreover, data points of interest can readily be selected and exported into a new Excel or text file for further analysis.









1. Smith CA, Want EJ, O'Maille G, Abagyan R, Siuzdak G (2006) XCMS: Processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem 78:779-787





# Venn2_script: R script for generating a two-set Venn diagram.
library(VennDiagram)                      # also attaches the grid package
Data <- read.csv("C:\\R_directory\\your_file_name.csv", header=TRUE, blank.lines.skip=FALSE)
Group1 <- Data[,2]                        # first set: column two
Group2 <- Data[,3]                        # second set: column three
VennList <- list(Var1=Group1, Var2=Group2)
VennDia <- venn.diagram(VennList, filename=NULL, cex=1.5, cat.cex=1.5, cat.dist=0.07, cat.pos=0)
grid.draw(VennDia)                        # display the diagram in the active device

# Venn3_script: R script for generating a three-set Venn diagram.
library(VennDiagram)                      # also attaches the grid package
Data <- read.csv("C:\\R_directory\\your_file_name.csv", header=TRUE, blank.lines.skip=FALSE)
Group1 <- Data[,2]                        # first set: column two
Group2 <- Data[,3]                        # second set: column three
Group3 <- Data[,4]                        # third set: column four
VennList <- list(Var1=Group1, Var2=Group2, Var3=Group3)
VennDia <- venn.diagram(VennList, filename=NULL, cex=1.5, cat.cex=1.5, cat.dist=0.07, cat.pos=0)
grid.draw(VennDia)                        # display the diagram in the active device

# Venn4_script: R script for generating a four-set Venn diagram.
library(VennDiagram)                      # also attaches the grid package
Data <- read.csv("C:\\R_directory\\your_file_name.csv", header=TRUE, blank.lines.skip=FALSE)
Group1 <- Data[,2]                        # first set: column two
Group2 <- Data[,3]                        # second set: column three
Group3 <- Data[,4]                        # third set: column four
Group4 <- Data[,5]                        # fourth set: column five
VennList <- list(Var1=Group1, Var2=Group2, Var3=Group3, Var4=Group4)
VennDia <- venn.diagram(VennList, filename=NULL, cex=1.5, cat.cex=1.5, cat.dist=0.07, cat.pos=0)
grid.draw(VennDia)                        # display the diagram in the active device