From LipidomicsWiki
Protein Mass Spectrometry
Protein mass spectrometry is used by Protagen AG (among others) for the following applications:
- Identification of protein biomarkers previously found in experiments conducted within Protagen AG or by partners in the project
- Determination of binding partners in protein-protein or protein-lipid interaction studies
- Structural characterisation of proteins with respect to disulfide bridge analysis and analysis of post-translational modifications such as oxidation or deamidation
- Analysis of complex protein mixtures such as bodily fluids
- Label-free quantification of proteins or peptides in complex protein mixtures
- Absolute quantification of proteins by using isotopically labelled peptides
- N- and C-terminal sequencing of purified peptides
The standardisation of the data sets from such experiments is crucial to the reproducibility and comparability of the analysis results. This is even more true in the case of mass spectrometry, where very complex measurements give rise to equally complex data sets.
In general, two major cases can be distinguished based upon the mass spectrometers used for data acquisition.
Matrix Assisted Laser Desorption Ionization Mass Spectrometry (MALDI)
There are several main characteristics of this method with respect to data processing and the standardization thereof.
The MALDI instrument can measure a large number of samples in a relatively short time by peptide mass fingerprinting (PMF): 384 samples can be analysed in a first round within 30 min. The resulting data sets are essentially time-of-flight vs. intensity plots, which are converted into mass-to-charge vs. intensity plots by external calibration. These discretely spaced data sets are then processed to pick out the peaks in the mass spectra and, thereafter, the monoisotopic peak of each peak cluster. This task is performed using defined scripts that were optimised to detect a large number of monoisotopic peaks without labelling noise. The result of this processing is a list of mass-to-charge values, each with the corresponding charge, peak intensity and signal-to-noise ratio.
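The conversion and peak-picking steps can be illustrated with a minimal sketch. It assumes the textbook linear time-of-flight relation m/z = a·(t − t0)², with a and t0 obtained from two calibrant peaks, and a simple local-maximum picker with a median-based noise estimate. The function names, the calibration model and the S/N threshold are hypothetical; the actual scripts are not described here, and the sketch leaves out monoisotopic peak selection and charge assignment.

```python
import math
from statistics import median


def fit_calibration(t1, mz1, t2, mz2):
    """Solve m/z = a * (t - t0)**2 from two calibrant peaks (assumed model)."""
    r1, r2 = math.sqrt(mz1), math.sqrt(mz2)
    sqrt_a = (r2 - r1) / (t2 - t1)
    t0 = t1 - r1 / sqrt_a
    return sqrt_a ** 2, t0


def tof_to_mz(t, a, t0):
    """Convert a flight time to mass-to-charge with the fitted constants."""
    return a * (t - t0) ** 2


def pick_peaks(mz, intensity, min_sn=3.0):
    """Return local maxima whose signal-to-noise ratio reaches min_sn.

    Noise is estimated as the median intensity of the spectrum, which is
    a simplification chosen for this sketch.
    """
    noise = median(intensity) or 1.0
    peaks = []
    for i in range(1, len(intensity) - 1):
        if intensity[i] > intensity[i - 1] and intensity[i] >= intensity[i + 1]:
            sn = intensity[i] / noise
            if sn >= min_sn:
                peaks.append({"mz": mz[i], "intensity": intensity[i], "sn": sn})
    return peaks
```

Each entry of the returned list carries the mass-to-charge value, the intensity and the signal-to-noise ratio, mirroring the columns of the peak list described above.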
A second type of data set (peptide fragment fingerprint, PFF) is subsequently generated on the same instrument by fragmenting prominent peaks and recording fragment-ion data. These data sets differ considerably from the PMF data, so a different set of parameters has to be applied for correct processing. This again yields a peak list for each fragment-ion spectrum.
Taken together, for any given sample one PMF spectrum and typically 0≤n≤20 PFF spectra are generated, and thus n+1 peak lists have to be handled.
Liquid Chromatography-Electrospray Ionization Mass Spectrometry (LC-ESI-MS/MS)
The main characteristic of LC-ESI-MS/MS data sets is that, due to the coupling with an HPLC separation, there are many more PFF spectra for any given sample, and all of them are contained within a single data set. In addition to the data mentioned above, each PFF spectrum carries the retention time of the separation. For a typical sample, such a data set contains between 1000 and 2000 MS/MS spectra, and a single data set is about 2 GB in size.
Here, the standardized procedures for processing the data sets include not only the detection of peaks as above but also the detection of chromatographic compounds. The processing parameters were optimized and standardized.
Protein Identification
The mass lists from the MALDI or the LC-ESI experiment are further processed in a relational database (ProteinScape) that ensures a standardized handling of large data sets. For each type of measurement and goal of analysis, an optimized parameter set was developed with which the protein identification was carried out. This includes parameters such as enzyme specificity, mass tolerance, possible modifications, protein database and missed cleavages. The resulting data sets are linked within the database to the original experimental data such as mass spectra, LC separation and 2D-gel separation.
Additional measures were taken to standardize the data processing, such as quality control steps during analysis. These range from simple measures, such as determining the number of peptides measured or evaluating the chromatographic separation (where applicable), to more complex ones, such as filtering known contaminants from the data sets.
An important step in standardization was to use several algorithms in parallel for database searching, thus making the resulting protein lists more reliable. A meta-score combining the results of the individual algorithms greatly increases the standardization. In addition, the use of a defined false discovery rate (where possible) leads to more reliable data sets, as spurious results are better excluded.
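How a defined false discovery rate translates into a score cutoff can be sketched as follows. The sketch assumes a target-decoy search, a common way of estimating the FDR that is not confirmed here as the method actually used; the function name and the input layout are hypothetical.

```python
def score_threshold_at_fdr(hits, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR stays within max_fdr.

    hits: iterable of (score, is_decoy) pairs, higher score = better match.
    The FDR at a cutoff is estimated as decoys / targets among all hits
    scoring at or above it (target-decoy assumption, tied scores ignored).
    Returns None if no cutoff satisfies the limit.
    """
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in sorted(hits, key=lambda h: h[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff
```

Identifications scoring below the returned cutoff would be excluded from the protein list.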
The resulting protein identifications are then reported in a standardized report format containing all information important for the further work in the project.
Protein chips
Protein microarrays are produced by Protagen on nitrocellulose-coated glass slides (1x3 inch). Depending on the chip type (AV400, AV2000), up to several thousand different recombinant human proteins are spotted on a microarray. The protein arrays can then be incubated with serum samples or purified antibodies to characterize their antigen binding profile via fluorescence intensities. Image analysis of scanned microarrays is performed with GenePix Pro 6. Using a chip-specific GAL file, which describes content and layout, the software performs spot finding, background subtraction and intensity quantification. The numerical values are saved in a single GenePix results file (gpr) for each microarray and represent the input data for all further analysis. Several software tools have been implemented for processing these high-throughput protein chip data; they are described in more detail in the following sections.
UNIchip Data Analysis Tool
The UNIchip® Data Analysis Tool is an Excel macro for analyzing numerical data from UNIchip® Protein Biochips used for antibody profiling and characterization. It automatically performs analyses of microarray raw data and creates Excel sheets containing the results. The Analysis Tool supports Excel 2000, Excel 2002, Excel 2003 and Excel 2007. After starting the program, the main window is displayed (see below), which explains the different steps that have to be taken.
First, several input parameters have to be specified that configure the behavior of the tool. This information ranges from details regarding the chip (number, type, replicates) to antigen and process control details. Next, the raw data can be imported by copying and pasting the required values directly from GenePix or by specifying the appropriate GenePix result files. Finally, the analysis can be started, which generates several new result sheets. All of these sheets are documented in the user guide that is available at http://www.protagen.de/customers_downloads/Man_UNIchip_DAT.pdf. Of special importance are:
Worksheet - Analysis Single Normalized
In the “analysis single normalized” sheet, the median of every single protein is normalized to the median of the chosen process control. As the median of the chosen process control is defined as 100%, relative signal intensities for every protein are reported as a percentage (%). If signal calculation with local background subtraction is used, negative values may be listed for very weak signals combined with a stronger local background. Finally, the binding profile normalized to the process control for each UNIchip® is visualized in a separate chart.
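The normalization described for this sheet amounts to a simple ratio of medians. The following is a minimal sketch under the assumption that replicate spot signals are available as plain lists; the function name and the data layout are hypothetical and do not reflect the Excel macro's internals.

```python
from statistics import median


def normalize_to_process_control(protein_replicates, control_replicates):
    """Express each protein's median signal as a percentage of the control median.

    protein_replicates: dict mapping protein name -> list of replicate signals.
    control_replicates: list of signals of the chosen process control.
    The control median is defined as 100 %.
    """
    control_median = median(control_replicates)
    if control_median == 0:
        raise ValueError("process control median is zero")
    return {
        name: 100.0 * median(values) / control_median
        for name, values in protein_replicates.items()
    }
```

With background-subtracted input, a protein whose replicates are, for example, -5 and -15 against a control median of 200 comes out at -5 %, which is the kind of negative value mentioned above.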
Worksheet - Binding Profile Antigen
The sheet “Binding Profile Antigen” is only visible when using UNIchip® Premium versions, as it shows the median of every single protein normalized to the user-specific antigen. Relative signal intensities, listed for every protein and protein biochip, are reported as a percentage (%). The value of the user-specific (target) antigen is defined as 100%. The scaling of the specific antigen signal (to 100%) is based on the antigen spots with the antigen concentration chosen in the configuration tab “Mapping” => “Concentration to normalize”. If a value > 0 is entered in the field “cut-off (antigen)” (see tab “Advanced” in the configuration), the values are filtered according to the cutoff. At the bottom of the worksheet you will find additional information summarizing the number of off-target activities (OTAs), the average signal intensity of the OTAs (in %), and the highest and lowest OTA values. Finally, the binding profile normalized to the antigen for single, duplicate or triplicate UNIchips is visualized in a separate chart.
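The OTA summary at the bottom of this worksheet can be sketched as below. The sketch assumes that an OTA is any non-target protein whose antigen-normalized signal reaches the cutoff, and that a cutoff of 0 disables filtering; both are interpretations of the description above, and the function name and data layout are hypothetical.

```python
def summarize_otas(signals, antigen, cutoff_percent=0.0):
    """Normalize signals to the target antigen (= 100 %) and summarize OTAs.

    signals: dict mapping protein name -> median signal, including the antigen.
    cutoff_percent: if > 0, only values at or above it count as OTAs
    (assumed reading of the "cut-off (antigen)" field).
    """
    antigen_signal = signals[antigen]
    normalized = {
        name: 100.0 * value / antigen_signal
        for name, value in signals.items()
        if name != antigen
    }
    if cutoff_percent > 0:
        normalized = {n: p for n, p in normalized.items() if p >= cutoff_percent}
    values = list(normalized.values())
    if not values:
        return {"count": 0, "mean": None, "max": None, "min": None}
    return {
        "count": len(values),
        "mean": sum(values) / len(values),
        "max": max(values),
        "min": min(values),
    }
```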
Worksheet - Report P1
The sheet “Report P1” summarizes the chosen antibody mapping and lists the number of off-target activities (OTAs), the average signal intensity of the OTAs (in %), and the highest and lowest OTA values for every antibody.
Worksheet - Report P32
The sheet “Report P32” summarizes the results of the tables generated in “Report P22”, which shows OTAs of the antibodies normalized to the antigen. For every protein the name, gi number and OTA are reported. The following figure shows a screenshot of this worksheet.
chipQM
Protein arrays can be used to characterise the binding behaviour of single antibodies or to study the antibody repertoire of serum samples from individuals with a certain medical condition. In the latter case, protein arrays are used as a diagnostic tool to classify healthy versus non-healthy serum samples. For such a classification to yield significant results, it is important to monitor the quality of the incubated protein chips. This is the purpose of chipQM. It reads the GenePix result files of all chips in the study and generates a graphical representation of several quality measures, as shown in the following figure. In this example, roughly 130 chips belonging to three different batches have been analysed.
From bottom to top these quality measures are:
sick/healthy: This is simply a visual indicator to make sure sera belonging to healthy and non-healthy individuals are incubated in alternating order across all batches.
normF: This is the normalization factor for this chip. Each chip contains 64 copies of IgGs that are used as internal standard for normalization. normF is the mean value of all the copies that are flagged good (see IgG goodFrac).
IgG CV: Coefficient of variation of all normalization spots that are flagged good.
IgG CVpin: The print head that is used for spotting the proteins consists of 16 pins, each of which transfers a slightly different amount of protein. This CV is the mean of the 16 pin specific CVs.
meanChip intAbs: This is the mean of the absolute intensities of all proteins on the chip without control spots.
meanChip intNorm: This is the mean of the normalized intensities of all proteins on the chip without control spots.
meanChip backAbs: This is the mean of the absolute background intensities of all spots on the chip.
IgG goodFrac: Fraction of the 64 copies of IgGs that are used for normalization which are flagged as good. A spot is flagged good if it fulfills certain criteria regarding signal to noise ratio, circularity, saturation, etc.
goodFrac: Fraction of all spots on the chip, excluding normalization spots, that are flagged as good. Since antibodies are expected to react with only a small number of the spotted proteins, it is normal for this number to be smaller than the fraction of IgGs flagged as good.
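The IgG-based measures in this list can be computed along the following lines. This is a sketch only: the spot records, their keys and the use of the sample standard deviation for the CV are assumptions, since the chipQM implementation is not described here.

```python
from collections import defaultdict
from statistics import mean, stdev


def cv(values):
    """Coefficient of variation using the sample standard deviation (assumed)."""
    return stdev(values) / mean(values)


def igg_quality(igg_spots):
    """Compute normF, IgG CV, IgG CVpin and IgG goodFrac for one chip.

    igg_spots: list of dicts with hypothetical keys
    'intensity' (float), 'pin' (int) and 'good' (bool).
    Assumes at least two spots are flagged good.
    """
    good = [s for s in igg_spots if s["good"]]
    intensities = [s["intensity"] for s in good]

    # Group good spots by print-head pin to get one CV per pin.
    by_pin = defaultdict(list)
    for s in good:
        by_pin[s["pin"]].append(s["intensity"])
    pin_cvs = [cv(v) for v in by_pin.values() if len(v) >= 2]

    return {
        "normF": mean(intensities),
        "IgG CV": cv(intensities),
        "IgG CVpin": mean(pin_cvs) if pin_cvs else float("nan"),
        "IgG goodFrac": len(good) / len(igg_spots),
    }
```

Pins with fewer than two good spots are skipped when averaging the pin-specific CVs, which is a choice made for this sketch.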
mergeGPR
This is a utility program that merges all GenePix result files of a classification study into one large data file suitable as input for the classification software DTREG (see below). mergeGPR also combines spot replicates (duplicates or quadruplicates, depending on chip type) and performs normalization.
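The merging step can be pictured with a small sketch that works on already-parsed spot data. It assumes that replicates are combined by their mean and that normalization divides by a per-chip factor such as normF; neither detail is confirmed here, and the function name and input layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def merge_chips(chips):
    """Build one chip x protein table from per-chip spot lists.

    chips: dict mapping chip id -> {"norm_factor": float,
                                    "spots": [(protein_id, intensity), ...]}
    Replicate spots of a protein are averaged (assumption) and divided by
    the chip's normalization factor. Proteins missing on a chip become None.
    """
    rows = {}
    for chip_id, chip in chips.items():
        by_protein = defaultdict(list)
        for protein_id, intensity in chip["spots"]:
            by_protein[protein_id].append(intensity)
        rows[chip_id] = {
            p: mean(v) / chip["norm_factor"] for p, v in by_protein.items()
        }

    proteins = sorted({p for row in rows.values() for p in row})
    table = [["chip"] + proteins]
    for chip_id in sorted(rows):
        table.append([chip_id] + [rows[chip_id].get(p) for p in proteins])
    return table
```

The first row of the result is a header, and each following row holds the normalized values of one chip, a layout that could be written out as a delimited text file.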
DTREG
This commercial software (www.dtreg.com) is used for the actual classification process via support vector machines. The software performs cross-validation to avoid overfitting, can handle missing values, and generates various diagrams and performance measures of the classification result. In addition to sensitivity and specificity, it also calculates receiver operating characteristic (ROC) curves, as shown in the next figure.
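The performance measures named here can be illustrated independently of DTREG. The sketch below is not DTREG code; it computes sensitivity and specificity at one decision threshold and the area under the ROC curve from classifier scores, with hypothetical function names.

```python
def sensitivity_specificity(labels, scores, threshold):
    """labels: 1 = non-healthy, 0 = healthy; a score >= threshold predicts 1."""
    tp = sum(1 for y, s in zip(labels, scores) if y and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if not y and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if not y and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(labels, scores):
    """Area under the ROC curve as the probability that a positive sample
    scores higher than a negative one (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Sweeping the threshold over all observed scores and plotting sensitivity against 1 - specificity yields the ROC curve itself.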
selectFeatures
This program has been developed by Protagen to rank the proteins spotted on a microarray according to their ability, as biomarkers, to differentiate between sera from healthy and non-healthy individuals. The program works on the same input data as DTREG and performs a single-marker ranking using a Mann-Whitney test. This test is a non-parametric counterpart of the t-test; it does not rely on normally distributed data and is more robust against outliers. To address the problem of multiple testing, selectFeatures also calculates corrections for the p-values. As a very conservative approach, a Bonferroni correction is performed; as a less restrictive alternative, the false discovery rate (FDR) according to Benjamini & Hochberg is calculated.
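The ranking procedure can be sketched in plain Python. This is not the selectFeatures code: the Mann-Whitney p-value below uses the normal approximation with tie correction and no continuity correction, which is a simplification suited to larger sample sizes, and all function names are hypothetical.

```python
import math
from collections import Counter


def midranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney p-value via the normal approximation."""
    n1, n2 = len(x), len(y)
    pooled = list(x) + list(y)
    n = n1 + n2
    u1 = sum(midranks(pooled)[:n1]) - n1 * (n1 + 1) / 2
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:
        return 1.0
    z = (u1 - n1 * n2 / 2) / sigma
    return math.erfc(abs(z) / math.sqrt(2))


def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=pvalues.__getitem__)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

In use, one p-value would be computed per protein from the healthy and non-healthy intensities, and the proteins would then be ranked by raw or adjusted p-value.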

