DiagnoProt

Tutorial

Introduction

DiagnoProt is a tool that finds discriminative mass spectra among different biological conditions and ultimately performs spectral profiling classification of unknown conditions by comparing sets of tandem mass spectra (MS/MS). The software can pinpoint high-quality spectra that meet quality control criteria but nonetheless remained unidentified by proteomic search engines. By applying pattern matching and artificial intelligence algorithms, the software can make a very interesting use of these "gold" (i.e., unidentified by peptide spectrum matching search engines) spectra.

This tutorial describes all the necessary steps for using and understanding DiagnoProt. The installation section shows how to install and configure the software. The how it works section details the theory and methods employed. We strongly recommend its reading before any attempt to use the software, as it will provide an understanding of what is going on behind the scenes so that the parameters can be properly set according to the experimental setup at hand. Finally, the how to use it section shows where one can click to get things done.

If there are still questions after reading this tutorial, a Forum is made available for further assistance.

How it works

Overview

Figure 1 DiagnoProt workflow.

The general workflow of DiagnoProt is summarized in Figure 1. A collection of MS/MS spectra originating from a biological condition is first filtered by the spectral quality control (QC). The QC removes noisy spectra and spectra that are unlikely to be identified by proteomic search engines. Then the “good” spectra are clustered by spectral similarity. A spectral cluster, in DiagnoProt's context, is a mass spectrum elected to be representative of a set of very similar spectra, according to a similarity function. Once all the clusters are computed, they are stored in a knowledge base (KB). All biological conditions have their own collection of spectral clusters in the KB. Once a KB with at least two conditions is created, DiagnoProt finds clusters that have discriminative power between each pair of conditions. These clusters might be identified by a proteomic search engine (such as Comet, ProLuCID, etc.). The spectra that remain unidentified are considered high-quality candidates (or gold spectra), since they passed the QC test in the first place. The following sections provide details on DiagnoProt's modules.

Quality Control

The QC is implemented as several quality filters that are controlled by their parameters: minimum number of peaks (Min. No. Peaks), minimum relative intensity (Min. Rel. Intensity), Min. Ret. Time, minimum Xrea (Min. Xrea), maximum Balance (Max. Balance), and minimum spectral count (Min. Spec. Count). Table 1 shows these parameters and their respective default values.

Table 1 QC parameters.

Parameter	Default Value
Min. Num. Peaks	50
Min. Rel. Intensity	0.01
Min. Ret. Time	10 minutes
Min. Xrea	0.3
Max. Balance	1
Min. Spec. Count	2

All spectra whose number of peaks are below Min. Num. Peaks are disregarded. Poor fragmentation usually produces spectra with few peaks. The default value is 50, but you can adjust this filter to a value more appropriate to your experimental setup.

The relative intensity r of a peak in a spectrum is the ratio of its intensity to the intensity of the base peak. All peaks such that r < Min. Rel. Intensity are removed from the spectrum. This filter removes those very small grass peaks that are usually regarded as noise.

Spectra with chromatography retention times smaller than Min. Ret. Time are disregarded. The default value is 10 minutes, but the appropriate setting for this parameter depends on the equipment being used, on laboratory conditions, and on the samples being processed. Moreover, the acetonitrile-to-water gradient used in the mass spectrometer may also affect the value to be chosen. For example, a setting of 10 minutes may be appropriate for a process going from 5% to 40% acetonitrile (ACN) over two hours. If the process applies a gradient of 5% to 40% ACN over one hour, then a 5-minute minimum retention time may be more appropriate. In most cases, peptides only elute from the chromatographic column after a certain percentage of ACN is achieved.

Xrea is a signal-to-noise score proposed by Na and Paek [1], that only considers relative and cumulative intensities without taking into account the distribution of intensities along the m/z axis. A high Xrea score usually means a spectrum with some distinctive peaks in relation to other, low-intensity peaks. On the other hand, a low Xrea means a spectrum with many peaks of close intensity all over, that is, a spectrum in which it is difficult to discriminate between signal peaks and noise peaks. All spectra that have Xrea bellow Min. Xrea are discarded. The empirically recommended value for Min. Xrea is between 0.3 and 0.4.

You may find spectra with high Xrea values but with no hope of identification due to their peak distributions. The Balance score is a quality score that measures the divergence between a model distribution and the peak distribution of a spectrum. The lower the Balance score, the closer a spectrum is to the model. All spectra that have Balance > Max. Balance are not included in the KB.

Min. Spec. Count is a QC filter that is effective only during search, therefore it doesn't prevent a spectrum from being inserted into the KB. Let c be the number of spectra in a spectral cluster. If c < Min. Spec. Count, then the cluster will be disregarded during search.

Binning

Figure 2 Example of a binned spectra.

Binning is an operation that transforms a spectrum into a vector (Figure 2). The objective of binning is to make a summary of the spectrum and to optimize mathematical calculations. A bin is by definition an interval [l,u) of m/z values. The bin’s lower (l) and upper (u) bounds are represented as red lines in Figure 2. The binning procedure is controlled by the parameters in Table 2.

Table 2 Binning parameters.

Parameter	Default Value
Bin Offset	0.4
Bin Size	1.0005
Min. Bin m/z	200
Max. Bin m/z	1700

The binning procedure works as follows. All peaks whose m/z are bellow Min. Bin m/z or above Max. Bin m/z are discarded. Let o be the value of Bin Offset and s the value of Bin Size. A number of bins is then created, the first one being [o, o+s), the second [o+s,o+2s), and the ith one for i > 0 being [o+(i-1)s, o+is). Of course, the total number of bins will be limited by the binning parameters. When bins are created, all peak intensities inside each bin are summed up and saved in a normalized vector.

Clustering

The objective of clustering is to find spectra that are very similar, thus doing away with redundancy in the KB. The clustering algorithm is controlled by the parameters in Table 3. Note that the retention time tolerance (Ret. Time Tolerance) is an optional parameter, and by default it is set to positive infinity.

Table 3 Clustering parameters.

Parameter	Default Value
Similarity Threshold	0.95
Precursor Tolerance	3.5
Ret. Time Tolerance	∞

Two spectra are said to be similar to each other if, and only if, the difference in their precursors’ m/z is within Precursor Tolerance, the difference in their chromatography retention times is within Ret. Time Tolerance, and the similarity function between them is above the Similarity Threshold. DiagnoProt finds all similar spectra, cluster them and stores them at the KB.

Discriminating analysis

A discriminative analysis requires a KB with at least 2 conditions. By comparing the collections of clusters, the algorithm finds all discriminating clusters for each biological condition in the KB taking into account spectral similarities and quantitative data.

If a proteomic search such as peptide spectrum matching (PSM) is performed on these clusters, it is expected that many of them will remain unidentified. In order to find these “gold”, high-quality unidentified spectra, you must run a proteomic search on your KB and set a maximum XCorr score. The default Max. XCorr is 1.5, which is a very low value, so a score bellow this value is usually considered a false identification. All discriminating spectra that are bellow Max. XCorr are said to be “gold” spectra because they passed the QC but could not be identified by the proteomic search engine.

Spectral profile classifier

Figure 3 Spectral Profile Classifier.

The spectral profile classifier tries to discover the biological condition of an unknown sample by making cluster comparisons between the known conditions in the KB. The spectra of a sample of unknown condition pass through the same QC and clustering procedures used in the KB creation. Then, using a cluster set similarity score, the unknown sample is assigned to the highest ranking condition in the KB.

Principal Component Analysis

A Principal Component Analysis (PCA) may be performed on the KB to provide an intuitive view of the biological samples and their relations. The PCA plots the dissimilarities/distances between samples and conditions in a chart. With this procedure, it is possible to have a graphical overview of the samples in the KB. It is expected that biological samples belonging to the same biological condition will share much biological material, and therefore they may be graphically closer in the plot.

How to use it

Creating a Knowledge Base

Before searching for discriminative spectra or starting the spectral profile classifier, a KB must be generated with at least 2 biological conditions. In what follows, the creation of a KB using DiagnoProt's Knowledge Base Manager is demonstrated.

Figure 4 KB Manager interface.

As soon as DiagnoProt is launched, the KB Manager interface can be seen (Figure 4). One can either create an empty KB or create a KB automatically from a dataset of MS files.

To create the KB automatically, set the clustering parameters (Similarity Threshold, Precursor Tolerance, and Ret. Time Tolerance), chose a general setup (Bottom-Up, for shotgun proteomics data, or Top-Down*), add a KB description, and click on the button "create a knowledge base from dataset". A dialog box will prompt the user to inform the root directory of the MS files, and that’s it! With just a few clicks of the mouse, the software will start computing clusters and will create a KB automatically. Needless to say, the MS files must be supported by the software. Currently, DiagnoProt supports the RAW (Thermo Fischer Scientific®) and MS2 file formats, but other formats will be supported shortly. The fungi dataset, described in our paper, is available for download. In addition, for the "KB with one click" option to work, the MS files must be organized in a logical structure enabling DiagnoProt to determine the biological conditions linked to each file. An example of a valid directory structure is shown in Figure 5. Fungi_Dataset is the dataset's root directory. Bellow the root directory, the directories of each biological conditions are found. In turn, under each biological condition directory, directories with biological replicates, containing their respective MS, are found. In this way, just by parsing the directory structure, DiagnoProt will take all MS files in the Aoryzae1 directory to belong to one biological sample of Aspergillus oryzae.

Figure 5 MS files directory structure for automatic KB creation.

Besides creating a KB from a dataset of MS files, one can also create an empty KB from scratch by clicking on the button “create empty knowledge base”. In either way, one will end up with a .PKB file that stores the KB.

*DiagnoProt for top-down proteomics is still under development.

Appending new data into a Knowledge Base

Once a *.pkb file is created, one can add new biological conditions to the KB and new spectra to an existing condition. In the "KB Manager > Creation > Append to biological sample" tab, select the KB file you wish to update, select an existing biological condition or add a new one by clicking the "add" button, and chose a directory with MS files. All the spectra in the MS files will be clustered and merged with the spectra already saved in the KB.

Shortlisting discriminative spectra

Figure 6 KB Manager Analysis interface.

The discriminating analysis is done using the KB Manager Analysis interface (Figure 6). One must input the KB file, set the QC parameters: Min. Spec. Count, Min. Xrea and Max. Balance, and click on the "compare biological conditions" button. When the computations are complete, all clusters that occur in other conditions and the unique clusters of each condition will be shown in the clusters tab. Each cell in the clusters table shows the number of spectra in the intersection, and the last column shows the number of unique spectra.

Figure 7 Spectrum Viewer explorer.

If one double-clicks on any cell of the clusters table, the Spectrum Viewer explorer will pop up listing all the spectra linked to that cell (Figure 7). Here, one can view the spectrum and other relevant information, such as precursor m/z, precursor charge, Xrea, and Balance score. This interface also enables saving the original spectra to the MS2 format for further analysis by a search engine.

Finding gold spectra

To find "gold spectra", that is, high-quality discriminative spectra that remain unidentified by a search engine, DiagnoProt recognizes the SQT formats generated by search engines like Comet [2] (embedded in PatternLab [3]), SEQUEST, or ProLuCID. To evaluate the spectra, save the exclusive clusters to MS2 file format using the Spectral Viewer explorer (Figure 7), run the search on these MS2 files, and then open the Find Gold Spectra using the utilities menu of DiagnoProt (Figure 8).

Figure 8 Find Gold Spectra on Utils menu item.

In the Find Gold Specra interface (Figure 9), inform the MS2 file that was saved, the SQT file(s), and the KB file. Then set the QC parameters and a maximum XCorr for spectra to be considered unidentified. The default value for Max. XCorr is 1.5, which is a very low cross-correlation score and is usually not considered a match. Click on the save button, and when the search is finished, DiagnoProt will show all high-quality spectra that were not identified by the PSM search engine in the Spectrum Viewer explorer (Figure 7).

Figure 9 Find Gold Spectra interface.

Spectral Profile Classifier

Figure 10 Spectral Profile Classifier validation interface.

The Spectral Profile Classifier's performance can be validated on your KB. DiagnoProt's validation tab (Figure 10) runs a leave-one-out cross-validation (LOOCV) test on your data and gives you an estimate of the classification error outside your sample dataset. This test works by temporarily removing all the clusters from a biological sample from the KB, treating it as a unknown sample, and then submitting it to the classifier. After repeating the procedure for all biological samples, a validation error estimate will be shown as well as the classifier’s assignment to each replicate. To run the test, you must set the clustering parameters, the QC parameters, and the dataset root directory. Your MS files in the root directory must follow the same directory structure previously described for creating a KB with a single click.

Figure 11 Diagnostics tab.

If you're satisfied with the validation results, you can apply the classifier and your KB data in the classification of unknown conditions. This is done in the Diagnostics tab (Figure 11). This tool performs a spectral cluster profile classification of unknown samples using Jaccard index comparisons between all the conditions in the KB and the unknown sample. To run the diagnostics, chose a KB, chose a directory with MS files from the unknown condition, set QC parameters, and click on the "GO" button. When all computations are done, the Jaccard indexes between the unknown sample and all conditions will be shown in the table.

PCA plot

Figure 12 PCA plot of biological samples tool.

DiagnoProt can generate a PCA plot showing how biological samples are close to one another (Figure 12). To do this, chose a KB, set QC parameters, and click on the "compare biological conditions" button. DiagnoProt will compute Jaccard indexes between all pairs of biological samples, perform a projection on the direction of greatest data variance, and plot each sample in a two-dimensional chart. Each point in the chart represents a biological condition. It is expected that samples from the same condition will appear close to one another in the PCA plot.

Limitations

The spectral library should be generated only with files originating from the mass spectrometry protocol. At this point, we do not recommend clustering spectra from different types of equipment.

References

S. Na and E. Paek, "Quality Assessment of Tandem Mass Spectra Based on Cumulative Intensity Normalization", Journal of Proteome Research, vol. 5, no. 12, pp. 3241-3248, 2006.
Eng, M. Hoopmann, T. Jahan, J. Egertson, W. Noble and M. MacCoss, "A Deeper Look into Comet—Implementation and Features", Journal of The American Society for Mass Spectrometry, vol. 26, no. 11, pp. 1865-1874, 2015.
P. Carvalho, D. Lima, F. Leprevost, M. Santos, J. Fischer, P. Aquino, J. Moresco, J. Yates and V. Barbosa, "Integrated analysis of shotgun proteomic data with PatternLab for proteomics 4.0", 2016.