Partition Assessment Tool

The Partition Assessment Tool (PAT) is a clustering assessment framework for proteomics.

WARNING: PAT automatically runs the following clustering tools: O-Cluster, hclust, CAST, N-Cluster, igraph, and DBSCAN. PRIDE Cluster and MS-Cluster require separate downloads, as described in the Requirements section; enter their local paths in PAT’s graphical user interface.

Requirements

1. Download and install PatternLab for Proteomics from http://patternlabforproteomics.org/.

2. Download and install R for Windows from https://www.r-project.org/.

3. If you want to run the PRIDE Cluster algorithm, download the spectra-cluster-cli from https://github.com/spectra-cluster/spectra-cluster-cli.

4. If you want to run the MS-Cluster algorithm, download it from http://proteomics.ucsd.edu/software-tools/ms-clusterarchives/.

Download

1. The Partition Assessment Tool can be downloaded here.

2. An example project is available here.

Tutorial

1. Obtaining a list of confident peptide identifications

1.1 PAT requires a list of peptide identifications in the SEPro format, which can be generated with PatternLab for Proteomics 4.0. Please refer to our bioinformatics protocol for details [doi.org/10.1038/nprot.2015.133].

2. How to assess the performance of clustering algorithms on a proteomics dataset

2.1 Click on the “Browse” button and select a SEPro file. Make sure the SEPro file was saved with the original MS2 spectra.

2.2 Click on the “Settings” button and choose the clustering algorithms and the appropriate clustering, reference partition, and spectra processing settings.

2.3 Click on the “Cluster” button and wait for the analysis to complete. It may take a few hours depending on your computer settings.

2.4 Click on the “Save” button and save the clustering results.

2.5 The “Reference Partition” tab will show information about the dataset’s reference partition: the number of identified spectra, the number of identified peptides, the number of identified ion species, the C/N indicator of dataset challenge level, and the reference partition’s cluster size distribution plot.

2.6 Click on the “Candidate Partitions” tab to show the clustering results.

2.7 On the “Candidate Partitions / Table” tab, you can see the cluster assessment measures (number of clusters, Gaussian Biased True Similar Pairs (GBTSP), Jaccard Index, Variation in Partition Size, and Purity) for each algorithm run with each similarity threshold. You can show all partitions or only the best partitions after GBTSP screening.

2.8 Click on the “Candidate Partitions / Plots” tab to show the plots.

2.9 The “Standardized True Similar Pairs” plot will either show the GBTSP (if the Gaussian Filter is checked) or the S(a) (if the Gaussian Filter is unchecked) as a function of the number of clusters for all candidate partitions. You can also plot the Purity distribution on the y2-axis. This plot shows that, in general, Purity is biased toward the highest number of clusters and GBTSP is biased toward the number of clusters of the reference partition.

2.10 The “True Similar Pairs” plot shows the true similar pairs count as a function of the number of clusters from all candidate partitions, the maximum number of true similar pairs (max[a]) given by the reference partition, and the expected number of true similar pairs (E[a]) given by the fixed number of clusters (H_num) null model. In practice, for most proteomics datasets (that is, datasets in which the number of spectra and the number of clusters are high), E[a] tends to zero.

2.11 The “True Dissimilar Pairs” plot shows the equivalent plot for true dissimilar pairs. In general, E[d] is close to max[d], which is why the true dissimilar pairs count is not a reliable indicator of partition agreement.

2.12 The “Purity” plot shows the average purity as a function of cluster size for the best candidate partitions after GBTSP screening.

2.13 The “Incorrectly Clustered Spectra” plot shows the Proportion of Clustered Spectra as a function of the Proportion of Incorrectly Clustered Spectra (the complement of Purity) for the best candidate partitions after GBTSP screening. (The number annotation below the algorithm’s coordinate is the similarity threshold for its selected candidate partition.)

2.14 The “Retainment of Identified Peptides” plot shows the Retainment of Identified Peptides as a function of the Proportion of Spectra Remaining for the best candidate partitions after GBTSP screening. (The number annotation below the algorithm’s coordinate is the similarity threshold for its selected candidate partition.)
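The pair-based measures named in steps 2.7, 2.10, and 2.11 can be illustrated with a small sketch. It uses the standard pair-counting definitions of true similar pairs (a), true dissimilar pairs (d), the Jaccard Index, and Purity; it is not PAT's own code, and PAT's implementation may differ in detail. GBTSP and S(a) are not reproduced here. The function names (pair_counts, jaccard_index, purity) and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import comb  # requires Python 3.8+


def pair_counts(reference, candidate):
    """Count spectrum pairs by agreement between two partitions.

    Returns (a, b, c, d):
      a: pairs clustered together in both partitions (true similar pairs)
      b: pairs together in the reference only
      c: pairs together in the candidate only
      d: pairs separated in both partitions (true dissimilar pairs)
    """
    n = len(reference)
    if n != len(candidate):
        raise ValueError("both partitions must label the same spectra")
    together_ref = sum(comb(m, 2) for m in Counter(reference).values())
    together_cand = sum(comb(m, 2) for m in Counter(candidate).values())
    # Pairs sharing both labels are together in both partitions.
    a = sum(comb(m, 2) for m in Counter(zip(reference, candidate)).values())
    b = together_ref - a
    c = together_cand - a
    d = comb(n, 2) - a - b - c
    return a, b, c, d


def jaccard_index(reference, candidate):
    """a / (a + b + c); defined as 1.0 when no pair is together anywhere."""
    a, b, c, _ = pair_counts(reference, candidate)
    return a / (a + b + c) if (a + b + c) else 1.0


def purity(reference, candidate):
    """Fraction of spectra carrying the majority reference label of their cluster."""
    majority = Counter()
    for (cand_label, _ref_label), m in Counter(zip(candidate, reference)).items():
        majority[cand_label] = max(majority[cand_label], m)
    return sum(majority.values()) / len(reference)


# Toy example (hypothetical labels): six spectra, three reference peptides.
reference = ["pepA", "pepA", "pepA", "pepB", "pepB", "pepC"]
candidate = [0, 0, 1, 1, 1, 2]

if __name__ == "__main__":
    a, b, c, d = pair_counts(reference, candidate)
    print("a =", a, "max[a] =", a + b)
    print("d =", d, "max[d] =", c + d)
    print("Jaccard =", jaccard_index(reference, candidate))
    print("Purity =", purity(reference, candidate))
```

In this toy example, a = 2 out of max[a] = 4 and d = 9 out of max[d] = 11. A candidate partition made only of singletons reaches a Purity of 1, which is consistent with the remark in step 2.9 that Purity is biased toward the highest number of clusters.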

3. How to estimate the selection probability of cluster assessment measures

3.1 Click on the “Tools / Selection Bias Simulator” menu. Choose the number of elements, the range of the number of clusters for the ranged distribution of random candidate partitions, the number of clusters of the reference partition, and the number of simulation steps, then click on the “START” button.

3.2 Click on the “SAVE” button to save the simulation results.

3.3 Click on the “Selection Probability” tab to see the estimated (ranged and full) selection probability plots for the following assessment measures: Adjusted Rand Index, Jaccard Index, Standardized True Similar Pairs, Gaussian Biased True Similar Pairs, Purity, Proportion of Clustered Spectra, Retainment of Identified Peptides, and Proportion of Spectra Remaining.

3.4 Click on the “Simulation Statistics” tab to assess the simulation quality and compare the theoretical and experimental expected values, standard deviations, and coefficients of variation for true similar pairs and true dissimilar pairs.
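The comparison between theoretical and experimental expected values can be sketched outside the tool. The example below rests on a loudly simplified null model: each element is assigned independently and uniformly to one of k labels. This is an assumption made for illustration and is not PAT's fixed number of clusters (H_num) model, since some labels may end up unused. Under this simplified model, a pair is co-clustered with probability 1/k, so E[a] = max[a]/k and E[d] = max[d](1 - 1/k). All function names are hypothetical.

```python
import random
from collections import Counter
from math import comb  # requires Python 3.8+


def similar_dissimilar(reference, candidate):
    """Return (a, d): pairs together in both partitions, pairs apart in both."""
    n = len(reference)
    together_ref = sum(comb(m, 2) for m in Counter(reference).values())
    together_cand = sum(comb(m, 2) for m in Counter(candidate).values())
    a = sum(comb(m, 2) for m in Counter(zip(reference, candidate)).values())
    d = comb(n, 2) - together_ref - together_cand + a
    return a, d


def theoretical(reference, k):
    """Expected a and d under the simplified null model (assumption, not H_num)."""
    n = len(reference)
    max_a = sum(comb(m, 2) for m in Counter(reference).values())
    max_d = comb(n, 2) - max_a
    return max_a / k, max_d * (1 - 1 / k), max_a, max_d


def simulate(reference, k, steps, seed=0):
    """Average a and d over random candidates with uniform independent labels."""
    rng = random.Random(seed)
    n = len(reference)
    total_a = total_d = 0
    for _ in range(steps):
        candidate = [rng.randrange(k) for _ in range(n)]
        a, d = similar_dissimilar(reference, candidate)
        total_a += a
        total_d += d
    return total_a / steps, total_d / steps


if __name__ == "__main__":
    # Hypothetical reference partition: 200 elements in 50 clusters of size 4.
    reference = [i // 4 for i in range(200)]
    exp_a, exp_d, max_a, max_d = theoretical(reference, k=50)
    emp_a, emp_d = simulate(reference, k=50, steps=500)
    print("E[a] theory:", exp_a, "simulated:", emp_a, "max[a]:", max_a)
    print("E[d] theory:", exp_d, "simulated:", emp_d, "max[d]:", max_d)
```

For this reference partition and k = 50, the simplified model gives E[a] = 6 against max[a] = 300, and E[d] = 19208 against max[d] = 19600. This mirrors the behavior described in steps 2.10 and 2.11: with many clusters, E[a] is a small fraction of max[a], while E[d] stays close to max[d].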