Journal of Chromatography A 1660 (2021) 462656

Automated high confidence compound identification of electron ionization mass spectra for nontargeted analysis

Joseph Bendik a,1, Richa Kalia a,1, Jeet Sukumaran b, William H. Richardot c,d, Eunha Hoh d, Scott T. Kelley a,b,∗

a Department of Biology, San Diego State University, San Diego, CA, USA
b Department of Biology, San Diego State University, 5500 Campanile Drive, San Diego, CA 92104, USA
c San Diego State University Research Foundation, San Diego, CA, USA
d School of Public Health, San Diego State University, San Diego, CA, USA

Article info

Article history: Received 30 July 2021; Revised 26 October 2021; Accepted 27 October 2021; Available online 31 October 2021

Keywords: ChromaTOF; PyAutoGUI; Mass spectral comparison; Nontargeted analysis; Suspect screening; Machine learning

Abstract

Nontargeted analysis based on mass spectrometry is a rising practice in environmental monitoring for identifying contaminants of emerging concern. Nontargeted analysis performed using comprehensive two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GC×GC/TOF-MS) generates large numbers of possible analytes. Moreover, the default spectral library similarity score-based search algorithm used by LECO® ChromaTOF® does not ensure that high similarity scores result in correct library matches. Therefore, additional manual screening is necessary, but it leads to human errors, especially when dealing with large amounts of data. To improve the speed and accuracy of chemical identification, we developed CINeMA.py (Classification Is Never Manual Again). This programming suite automates GC×GC/TOF-MS data interpretation by determining the confidence of a match between the observed analyte mass spectrum and the LECO® ChromaTOF® software generated library hit from the NIST Electron Ionization
Mass Spectral (NIST EI-MS) library. Our script allows the user to evaluate the confidence of the match using an algorithmic method that mimics the manual curation process and two different machine learning approaches (neural networks and random forest). The script allows the user to adjust various parameters (e.g., similarity threshold) and study their effects on prediction accuracy. To test CINeMA.py, we used data from two different environmental contaminant studies: an EPA study on household dust and a study on stormwater runoff. Using a reference set based on the analysis performed by highly trained users of the ChromaTOF and GC×GC/TOF-MS systems, the random forest model had the highest prediction accuracies of 86% and 83% on the EPA and Stormwater data sets, respectively. The algorithmic approach had the second-best prediction accuracy (82% and 79%), while the neural network had the lowest accuracy (63% and 67%). All the approaches required less than to classify 986 observed analytes, whereas manual data analysis required hours or days to complete. Our methods were also able to detect high confidence matches missed during the manual review. Overall, CINeMA.py provides users with a powerful suite of tools that should significantly speed up data analysis while reducing the possibility of manual errors and discrepancies among users, and it can be applied to other GC/EI-MS instrument-based nontargeted analyses.

© 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

∗ Corresponding author at: Department of Biology, San Diego State University, San Diego, CA, USA. E-mail address: skelley@sdsu.edu (S.T. Kelley).
1 These authors contributed equally to this work.

1. Introduction

Environmental monitoring for chemical contaminants typically requires using targeted analysis, in which a priori information (mass spectra, retention times, etc.)
on specific chemicals is used to detect compounds of interest. While these methods are sensitive and quantitative for a known set of compounds, they miss undefined compounds regardless of their abundance. Nontargeted analysis (NTA), including suspect screening, was developed to detect multiple compounds simultaneously, including novel compounds, and involves comprehensive sample preparation and chromatography followed by full mass spectrometry analysis [1–3]. Comprehensive two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GC×GC/TOF-MS) has proven to be a useful technique for performing NTA of environmental samples [4–7]. Compared to one-dimensional GC coupled to a quadrupole MS, GC×GC/TOF-MS has a superior ability to identify compounds because of the enhanced sensitivity and separation power of the GC×GC chromatography system and because the fast acquisition rate of the TOF-MS yields full scan mass spectra at low concentrations [8]. As a result, GC×GC/TOF-MS based NTA generates thousands of chromatographic features, each with a full scan mass spectrum and chromatographic information. Data analysis must process these thousands of features, and in NTA projects a significant number of analytes need to be studied [9].

https://doi.org/10.1016/j.chroma.2021.462656
0021-9673/© 2021 The Authors. Published by Elsevier B.V.

In GC×GC/TOF-MS based NTA, the raw data is analyzed using data processing software such as LECO® ChromaTOF®. ChromaTOF’s “automatic peak search” first identifies features based upon certain conditions (e.g., S/N ratio, GC retention time).
Additionally, ChromaTOF’s peak table alignment feature, “Statistical Compare”, enables users to make comparisons between groups of samples (e.g., Samples vs. Controls) to efficiently isolate compounds of interest. “Statistical Compare” aligns peaks across sample groups based upon 1st and 2nd dimension GC retention times, as well as mass spectral similarity. To identify compounds of interest, each peak is compared against the National Institute of Standards and Technology electron ionization mass spectral (NIST EI-MS) library (or a custom MS library, depending on the user). This generates a ranked list of suggested compounds (Library Hits), each with a “similarity score” that ChromaTOF computes using the NIST similarity score, which is based on the relative abundances of the matched pairs of masses and the abundance ratios of adjacent matching peaks [10,11]. Afterwards, each library hit must be manually reviewed to further evaluate the confidence of a match between the library hit mass spectra and the observed mass spectra (after deconvolution), known as the Peak True mass spectra in ChromaTOF. Currently, the observed mass spectra and library hit mass spectra are either manually reviewed in ChromaTOF, or the data can be exported as a PDF. Fig. 1A shows the workflow for manual data analysis [12]. Once the best matches are obtained using the spectral library search algorithms, analytical reference standards are procured, and their respective retention times and mass spectra are obtained under the same instrumental conditions of GC×GC/TOF-MS. The verification success rates were 94% and 96% in our studies [4,13]. This supports the notion that the manual review works for determining high confidence identifications. However, this manual review can be time consuming and error prone when the data size is large, and results can be inconsistent among users. For instance, reviewing thousands of compounds’ mass spectra and their matching mass spectra from an MS library (e.g., the NIST EI-MS library) can take many
hours or even days depending on user experience. This high level of manual data handling leads to numerous errors, necessitating multiple independent reviews of all results. Thus, automation of these tasks would be extremely valuable to improve the accuracy and increase the analysis throughput [14].

To improve the speed and accuracy of identification based on mass spectral matching, we developed two programs: chromaTOF_auto.py and CINeMA.py (Classification Is Never Manual Again). The chromaTOF_auto.py script automates GC×GC/TOF-MS data download from the LECO® ChromaTOF® software, while CINeMA.py facilitates the confirmation of analyte matches between the NIST mass spectral library and the experimental mass spectra using two different approaches: an algorithmic method based directly on the manual curation method, and machine learning approaches using neural networks and random forests trained on manually curated data sets. These machine learning techniques have been used for similar mass spectrometry applications in previous studies, demonstrating their potential ability to aid compound identification [15–17]. Our results show that our scripts greatly reduce the time needed for GC×GC/TOF-MS based nontargeted analysis, while still maintaining high accuracy.

2. Materials and methods

2.1. Automated data collection

All samples used to develop and test our scripts were analyzed using a Pegasus 4D GC×GC/TOF-MS (LECO, St. Joseph, MI). LECO® ChromaTOF® software (version 4.50.8.0 optimized for Pegasus) was used for data processing. The stormwater runoff samples (the Stormwater data set) were collected by the San Francisco Estuary Institute (SFEI) from Napa, Sonoma, and Santa Rosa counties in California following the 2017 Northern California wildfires [13]. The household dust samples (the EPA data set) were provided as part of the U.S. Environmental Protection Agency (EPA)’s Nontargeted Analysis Collaborative Trial (ENTACT), an inter-laboratory study designed to compare
the various workflow techniques implemented within the NTA research community [18,19]. In brief, participants were given a series of samples in a blind trial, some of which had been spiked with a cocktail of various compounds, and were instructed to conduct NTA. The EPA data set contained 986 observed analytes from the analysis by the LECO® ChromaTOF® software, and the Stormwater data set contained 892 observed analytes. In the EPA data set, 409 compounds were manually reviewed to be high confidence matches, and 577 were reviewed as low confidence. In the Stormwater data set, 373 were reviewed as high confidence and 519 were reviewed as low confidence.

The LECO® ChromaTOF® software assigns each chromatographic peak a name based upon mass spectral similarity to compounds within the 2011 NIST EI-MS library. After isolating all compounds of interest during review, the user sorts the “peak table” in ChromaTOF so that each compound of interest is in sequential order. To do so, the “peak table” was sorted by “comment” and “peak number”. The “peak true” (deconvoluted mass spectra) data of all compounds of interest were then exported in MSP format (peak_true.msp). Next, the mass spectra of each compound’s assigned name from the 2011 NIST EI-MS library (library hit) were exported using the chromaTOF_auto.py script. The chromaTOF_auto.py script is based on PyAutoGUI, a Python module that controls the mouse and keyboard to automate interactions with any graphical user interface. PyAutoGUI reproduces human actions such as moving, clicking and dragging the mouse, pressing and holding keys, and pressing keyboard hotkey combinations [20]. Using this script, an analyst can easily extract the GC×GC/TOF-MS library hit data from the LECO® ChromaTOF® software for further analysis in significantly less time and with negligible human effort. The chromaTOF_auto.py script does not modify, manipulate, or extend the software or databases of the LECO® ChromaTOF® software. Fig. 1B shows the workflow for automated data download
with chromaTOF_auto.py. The LECO® ChromaTOF® workspace is composed, in left-to-right order, of the following components: the directory for accessing tools and options (Acquisition Queue, GC and MS Methods, Acquired Samples, etc.), the peak table, and the library hit mass spectrum (Fig. S1). The chromaTOF_auto.py script saves the library hit files sequentially in the directory most recently used by the user, renaming the files (1.msp, 2.msp, etc.) for easy access.

Fig. 1. Workflow for manual (A) or automated (B) data analysis. An environmental sample, once collected, is processed using GC×GC/TOF-MS and analyzed using the LECO® ChromaTOF® software. The LECO® ChromaTOF® software outputs a list of observed analytes present in the sample. For a manual analysis, this processed data for each observed analyte and its respective library hit is then manually downloaded by the analyst. Next, the analyst reviews this manually downloaded data to evaluate the confidence of the match (High or Low) between the mass spectrum of each observed analyte and its corresponding library hit. For the automated analysis, the user creates a directory to save the observed analyte’s (SA) library hit files and then downloads them sequentially using the chromaTOF_auto.py script.

2.2. Data parsing

The data obtained from the GC×GC/TOF-MS data analysis by the LECO® ChromaTOF® software on both the EPA and Stormwater data sets was parsed using CINeMA.py to extract: (1) Analyte name (“Name”); (2) Mass-to-charge ratio (m/z) of the ions and their respective intensities; (3) Similarity Score between the observed analyte and library hit #1 from the LECO® ChromaTOF® software (only present in library hits); and (4) Total number of ions in the mass spectrum. This data was necessary for CINeMA.py to analyze the confidence of a match between the observed analytes and library hits. In addition, since the lowest mass spectral acquisition ion
was m/z 50, the manual review of matches ignores all ions below m/z 50 present in the library hit. CINeMA.py parsed all the files in the given data directory into the data structure required to train, test, or make predictions, using either the algorithmic model or the machine learning models [21–23]. Depending on the user action (predict, train, or test), CINeMA.py requires the data directory to have a specific organizational structure (Fig. 2). CINeMA.py results were benchmarked against those obtained via manual analysis to establish the reliability of the results and the effectiveness of CINeMA.py in reducing GC×GC/TOF-MS data analysis time. The peak_true.msp file contains data for all the observed analytes together, as shown in Fig. S2. To verify the completeness of the analyte data, the script parses the peak_true.msp file using a state machine, as shown in Fig. S3. Finally, each compound’s library hit is output to an individual file, as shown in Fig. S4.

2.3. Algorithmic model

The algorithmic model, outlined in Fig. 3, begins by checking the similarity score against a threshold, which is set to 600 (out of 999) by default in this study but can be changed. This similarity score from NIST is an output from the LECO® ChromaTOF® software describing the measure of similarity between the observed analyte mass spectrum and the matching library hit from the 2011 NIST EI-MS library. The user can alter this similarity score threshold using the command line inputs for CINeMA.py. The algorithm compares the library hit mass spectra with the observed mass spectra from the LECO® ChromaTOF® software. A match is deemed a “high confidence” match if all of the following are true: the similarity score is greater than or equal to the user-provided similarity score threshold, the three most abundant ions of the library hit are present in the observed mass spectra (and vice versa), the molecular ion is present, and the correlation percentage between the spectra of the library hit and the observed mass spectra is at least 80%.

2.4. Machine learning models

Two types of machine learning approaches were used to determine if the best library hit is a high- or low-confidence match to the observed mass spectra: a random forest algorithm and a neural network. Both were selected for this study primarily because of their effectiveness on classification problems such as this one. Neural networks can analyze complex relationships between inputs, which makes them a good choice for detecting differences in mass spectra that can contain large amounts of ion intensity data. However, neural networks usually require vast numbers of samples for training. Conversely, random forest works well with smaller amounts of data with more clearly defined features, such as the spectral features a reviewer looks for during a manual review. In addition, random forest readily provides feature importance, allowing the user to visualize the aspects of their manual review that the machine considers the most important.

Fig. 2. Data directory structures. (A) Under the sample directory there is a subdirectory called ‘hits’ and the peak_true.msp file that contains the data for observed analytes. The user should use the ‘hits’ directory to save all the library hit files obtained using chromaTOF_auto.py. Each sample subdirectory should contain a compounds.tsv file, which contains the m/z ratio for the molecular ion in the library hit file. (B) For training or testing the accuracy of a machine learning model with a new data set, the root directory should contain subdirectories named after the samples. Each sample subdirectory should contain a ground_truth.tsv file, which contains the manual interpretation of the confidence of a match between observed analytes and library hits obtained from GC×GC/TOF-MS data analysis by the LECO® ChromaTOF® software.

Fig. 3. Algorithmic model. If the similarity
score from the LECO® ChromaTOF® software is less than the similarity score threshold, the algorithm classifies the match as a low confidence match. If the similarity score is higher, then the model normalizes the spectrum data for both the observed analyte (SA) and the library hit (LH) and checks the following set of conditions: (1) presence of the three most abundant ions (Top ions) of the library hit in the observed analyte, (2) presence of the molecular ion of the library hit in the observed analyte, (3) presence of the top three ions of the observed analyte in the library hit, and (4) correlation (≥ 80%) between the spectra of the library hit and the observed analyte. If all these conditions are met, it interprets the match as a “high confidence match.”

Fig. 4. Neural network model’s structure. The first 10 0 inputs are the library hit ion intensities and the next 10 0 are the observed analytes’ ion intensities. There are three hidden layers of size 10 0, 10 and 10 neurons, which have softsign activation functions. The last layer of the network uses a softmax activation function and is composed of two neurons for high or low predictions. The model was trained with epochs and a batch size of 128.

The input data for random forest consisted of the same mass spectra features checked when using the algorithmic model: similarity score, correlation percentage, molecular ion presence, and the number of top ions present in the hit that are also present in the observed analyte (and vice versa). The random forest model was built in Python using the Scikit-Learn package [24,25]. The hyperparameters for the model were tuned by optimizing the accuracy metric, resulting in 100 trees and a max depth of .

The input data for the neural network consisted of the ion intensities for each observed analyte and its best hit, used to detect whether the two spectra are similar enough to be considered a high-confidence match. This model was
built in Python using the Keras and TensorFlow packages [26,27]. Fig. 4 illustrates the structure of the neural network model. Activation functions, the number of epochs, and the batch size were selected for the neural network based on the accuracy metric, aided by GridSearchCV in the Scikit-Learn package.

Model performance was examined through confusion matrices, receiver operating characteristic (ROC) curves, and 10-fold cross validation. All models were trained on one of the two data sets and tested on the other, using an expert’s manual review of high and low confidence as the data labels. Additionally, to provide more data for model training, these data sets were also combined into one large data set and then trained and tested in three ways: (1) train on 80% of the combined set and test on the remaining 20%; (2) train on 80% of the EPA data set plus 100% of the Stormwater data set, and then test on the remaining 20% of the EPA data set; (3) train on 80% of the Stormwater data set plus 100% of the EPA data set, and then test on the remaining 20% of the Stormwater data set. All train/test splits were random. CINeMA.py also allows the analyst to train and save their own machine learning model on a given data set. The saved models can then be used for testing or making predictions on new data sets.

Fig. 5. Mirror plots comparing observed analyte and library hit mass spectra. The mirror plots are provided by CINeMA.py for all matches from the non-targeted analysis to the given library spectra, allowing straightforward manual confirmation. The top spectrum (positive values in blue) is the spectrum from the observed analyte in the sample, while the bottom mirrored spectrum (negative values in red) is the spectrum of the corresponding library hit for the observed analyte. (A) An example of a high confidence library match. (B) An example of a low confidence match. (For interpretation of the references to color in this figure legend, the reader is referred to
the web version of this article.)

2.5. Report generation

CINeMA.py generates reports as two files, report.tsv and report.pdf. The report.tsv file contains the peak number, the name of the observed analyte, and the predicted match between the library hit and the observed analyte. The report.pdf file contains mirror plots between each observed analyte’s mass spectrum and its corresponding library hit’s mass spectrum [28]. Fig. 5 shows example plots of high and low confidence matches. The plots allow the analyst to visually compare the observed analyte’s mass spectrum and the corresponding library hit’s mass spectrum if desired. The mirror plot of the two mass spectra makes visual comparison easier than comparing the two separate plots produced by the LECO® ChromaTOF® software. When training a neural network model, CINeMA.py produces model_performance.pdf containing loss curves for each fold during cross-validation, shown in Fig. 6. When testing either of the machine learning models, the script produces measures.pdf containing the confusion matrix and the ROC curve, as in Fig. 7 [29]. By considering low-confidence matches as “negatives” and high-confidence matches as “positives,” the user can use the confusion matrix to calculate performance metrics such as accuracy, sensitivity, specificity, and balanced accuracy. When training the random forest model with feature input data, the script produces importance.pdf containing a bar plot with the relative importance of each feature (Fig. 8).

Source code for chromaTOF_auto.py and CINeMA.py, along with tutorials and test data, is available on GitHub at https://github.com/sharmaricha200/thesis.git

Fig. 8. Example feature importance for the random forest model trained on the EPA data set and tested on the Stormwater data set.

Fig. 6. Example training loss generated during one fold of the 10-fold cross-validation on the EPA data set. The blue
curve (top) indicates the loss on the training samples, and the orange curve indicates the loss on the samples held out for validation in that fold. This result shows the neural network model loss using ion intensity data, trained for epochs and a batch size of 128. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

3. Results and discussion

The automated data collection workflow implemented in chromaTOF_auto.py needed only a few minutes on an Intel® Core™ i7-6700 quad-core CPU with GB RAM running Windows® 10, 64-bit, to download library hit data (*.msp) files from a given GC×GC/TOF-MS data output analysis from the LECO® ChromaTOF® software. Because of its computational speed, chromaTOF_auto.py initially caused the ChromaTOF® GUI to crash. To overcome this issue, we included a delay timer in the chromaTOF_auto.py script, allowing the user to set up the screen as described above before the automation takes over to download the library hit files.

For generating predictions, CINeMA.py was able to produce results within a minute. When testing the algorithmic model on the complete data sets, an accuracy of 81.54% was achieved on the 986 compounds in the EPA set and an accuracy of 78.70% was achieved on the 892 compounds in the Stormwater set. For the machine learning models, the highest accuracy and area under the ROC curve (AUC) score on the complete EPA set were achieved using the random forest model on the algorithm’s feature data. This model had an accuracy of 85.60% and an AUC score of 0.887. The highest accuracy and AUC score on the Stormwater set were also achieved using the random forest model on the algorithm’s feature data. This model had an accuracy of 82.85% and an AUC score of 0.899 (Table 1). The neural network did not perform as well as the other models based on the testing accuracies, AUC scores, and cross-validation accuracies (Tables 1 and 2). Combining data sets did
somewhat improve the testing accuracy and AUC score for this model, however (Table 1). Agreement rates between the human user’s decision and a model’s decision were similar for “High” and “Low” confidence, with slightly higher agreement by the algorithmic model in “High” than in “Low” (Table S1). This demonstrates that the models work equally well for compound identification regardless of “High” or “Low” confidence matching.

Fig. 7. Example efficacy outputs following random forest model training on the EPA data set and testing on the Stormwater data set. (A) Confusion matrix. (B) Receiver operating characteristic (ROC) curve.

Table 1. Random forest and neural network model performances across the EPA dust and Stormwater data sets. Includes the number of compounds present in the training and test sets, the accuracy on the test set, and the area under the ROC curve (AUC) score.

Model and data split                                       #Training  #Test   Accuracy  AUC
RF Features
Train EPA, Test Stormwater                                 986        892     82.85%    0.899
Train Stormwater, Test EPA                                 892        986     85.60%    0.887
Train 80% EPA, Test 20% EPA                                788        198     89.39%    0.943
Train 80% Stormwater, Test 20% Stormwater                  713        179     81.01%    0.883
Train 80% (EPA + Stormwater), Test 20% (EPA + Stormwater)  1502       376     82.45%    0.873
Train EPA + 80% Stormwater, Test 20% Stormwater            1699       179     79.33%    0.887
Train Stormwater + 80% EPA, Test 20% EPA                   1680       198     89.39%    0.936
NN Intensities
Train EPA, Test Stormwater                                 986        892     66.82%    0.706
Train Stormwater, Test EPA                                 892        986     63.49%    0.641
Train 80% EPA, Test 20% EPA                                788        198     75.25%    0.846
Train 80% Stormwater, Test 20% Stormwater                  713        179     69.83%    0.742
Train 80% (EPA + Stormwater), Test 20% (EPA + Stormwater)  1502       376     70.48%    0.761
Train EPA + 80% Stormwater, Test 20% Stormwater            1699       179     69.83%    0.732
Train Stormwater + 80% EPA, Test 20% EPA                   1680       198     77.78%    0.836

Table 2. 10-fold cross validation mean accuracy +/- standard deviation on the two neural network models across the EPA dust and Stormwater data sets.

NN Intensities                                             Mean accuracy (+/- SD)
Train EPA, Test Stormwater                                 74.44% (+/- 3.28%)
Train Stormwater, Test EPA                                 71.30% (+/- 4.46%)
Train 80% EPA, Test 20% EPA                                72.21% (+/- 5.31%)
Train 80% Stormwater, Test 20% Stormwater                  70.83% (+/- 4.79%)
Train 80% (EPA + Stormwater), Test 20% (EPA + Stormwater)  73.50% (+/- 2.68%)
Train EPA + 80% Stormwater, Test 20% Stormwater            75.40% (+/- 2.99%)
Train Stormwater + 80% EPA, Test 20% EPA                   73.10% (+/- 2.70%)

To identify reasons for discrepancy between classifications (human vs. computer), we manually reviewed “incorrect” classifications. The main source of discrepancy between human classifications and the algorithm’s classifications appeared to be instances in which the observed mass spectra and library hit mass spectra were very similar but were on the cusp of high or low confidence. This often occurred when the library hit mass spectra contained numerous ions with low relative abundance. Since NTA of environmental samples often involves the detection of trace contaminants, compounds present at low concentrations may not produce enough low abundance ions to be detected by the mass spectrometer. As the algorithm is confined by a strict set of rules (e.g., correlation percentage ≥ 80%), some compounds may be classified as “low” while a human user may take additional factors into account and classify them as “high”.

Additionally, both the algorithm and the random forest model corrected human errors. As shown in Fig. S5, some compounds for which the observed mass spectra and library hit mass spectra were near perfect matches were erroneously classified as “not a match (low)” by the human user but classified correctly as a “match (high)” by the algorithm. Conversely, there were instances in which the observed mass spectra and library hit mass spectra did not match well but were classified as “high” by the human user and as “low” by the algorithm. Such errors were due to fatigue experienced by the human user comparing hundreds of mass spectral matches in succession.

While the random forest model had the highest accuracy scores, there are still some benefits to the use of a simplified algorithm over the machine learning techniques. The simplified algorithm is capable of working with extremely small data sets and does not require an outside source of data for training. Both types of machine learning techniques require data for training and, especially in the case of neural networks, large amounts of data may be necessary. The algorithmic approach avoids this issue, meaning users may prefer this method over training their own machine learning model. This data requirement may also explain the low performance metrics of the neural network compared to the other models, as the number of samples contained in the data sets was relatively small for this type of model. Furthermore, the algorithm is easily tunable, allowing the user to specify their own similarity score and correlation percentage thresholds when testing their own data sets. This ability to easily tune the algorithm makes it applicable for use with programs other than ChromaTOF, as their spectral matching components may use a scale different from ChromaTOF’s similarity score (0–999).

4. Conclusions

Overall, the random forest model provided the best accuracy for both data sets, and we showed that compounds missed by the algorithm were often recognized by machine learning. Furthermore, by ranking feature importance, a machine learning approach can highlight ways to improve the algorithmic approach by illustrating which feature thresholds can be tuned in the algorithm. The neural network model with intensities has the potential to predict unknown rules and patterns for analyzing the data set, which the feature-based models lack. Feature models are based on man-made rules and likely have room for improvement, since it could be difficult to hardcode all possible rules. Thus, in principle, with larger data sets a neural network approach using ion intensities has the
potential to find patterns and rules that cannot be coded via an algorithm. Furthermore, it can be improved by increasing the size and accuracy of training data sets. In future work, we will continue to explore the potential of neural networks with intensity data to enhance the accuracy of NTA.

In terms of speed, CINeMA.py is able to provide prediction results within a minute, whereas manual data analysis by multiple people required hours or even days for the same data sets of observed analytes. CINeMA.py’s capacity to rapidly evaluate the confidence of a match between observed analytes and library matches represents a significant improvement over manual analysis, which can take substantial time depending on the data size and can be error-prone during heavy data handling. CINeMA.py gives the user the flexibility not only to automate the interpretation of the confidence of the match between observed analytes and their corresponding library matches, but also to experiment with various test parameters to study their effects on the analysis. In addition, the user can choose to use the algorithmic model, any of the machine learning models, or both to analyze their data and compare their predictions. The user can also train the machine learning models with relevant data sets to improve predictions on new data sets. Because the machine learning approaches can find rules and patterns that cannot be coded via standard algorithmic approaches, these techniques have potential for compound identification in the future. Although our study was conducted primarily on LECO’s ChromaTOF platform, these approaches are applicable to other GC–MS based nontargeted and/or suspect screening analyses for high matching compound identification by EI mass spectral similarity comparison.
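As an illustration of the tunable thresholds and the side-by-side use of the algorithmic and machine learning predictions discussed above, the following is a minimal sketch and not CINeMA.py's actual code. The names MatchThresholds, rescale_similarity, is_confident_match, and disagreements are hypothetical, the cutoff values are placeholders rather than values from this study, and the linear rescaling is only an assumption about how a score reported on another program's scale might be mapped onto the 0–999 range.

```python
from dataclasses import dataclass


@dataclass
class MatchThresholds:
    """User-tunable cutoffs. Defaults are placeholders, not the paper's values."""
    min_similarity: float = 700.0       # similarity score on a 0-999 scale
    min_correlation_pct: float = 80.0   # correlation percentage (0-100)


def rescale_similarity(score: float, source_max: float,
                       target_max: float = 999.0) -> float:
    """Linearly map a score from another program's scale onto target_max.

    Assumption: both scales start at 0 and are comparable after scaling.
    """
    if source_max <= 0:
        raise ValueError("source_max must be positive")
    return score / source_max * target_max


def is_confident_match(similarity: float, correlation_pct: float,
                       thresholds: MatchThresholds) -> bool:
    """Call a library hit confident only if both cutoffs are met."""
    return (similarity >= thresholds.min_similarity
            and correlation_pct >= thresholds.min_correlation_pct)


def disagreements(algorithm_calls, model_calls):
    """Return indices where the threshold rule and an ML model disagree."""
    return [i for i, (a, m) in enumerate(zip(algorithm_calls, model_calls))
            if a != m]


# Hypothetical peaks as (similarity score, correlation percentage) pairs.
thresholds = MatchThresholds(min_similarity=750, min_correlation_pct=85)
peaks = [(912, 93.0), (760, 70.0), (640, 95.0)]
algorithm_calls = [is_confident_match(s, c, thresholds) for s, c in peaks]

# Hypothetical predictions from a separately trained ML model.
model_calls = [True, True, False]
review = disagreements(algorithm_calls, model_calls)
```

Peaks on which the two approaches disagree are natural candidates for manual review, and adjusting the two cutoffs shows how sensitive the calls are to the chosen thresholds.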
Declaration of Competing Interest

The authors declare they have no known competing financial interests.

CRediT authorship contribution statement

Joseph Bendik: Software, Investigation, Formal analysis, Validation, Visualization, Writing – original draft. Richa Kalia: Software, Investigation, Visualization, Formal analysis, Writing – original draft. Jeet Sukumaran: Software, Methodology. William H. Richardot: Validation, Data curation, Resources. Eunha Hoh: Methodology, Validation, Funding acquisition, Writing – review & editing. Scott T. Kelley: Conceptualization, Writing – original draft, Writing – review & editing, Supervision, Project administration.

Funding

This work was funded in part by a California Tobacco Related Disease Research Program grant (27IP-0028C).

Acknowledgments

We would like to thank Dr. Nathan Dodder, Ying Xu, Bryan Ho, and Basilin Benson for their valuable insights during the study design.

Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.chroma.2021.462656.
... number”. The “peak true” (deconvoluted mass spectra) data of all compounds of interest were then exported in MSP format (peak_true.msp). Next, the mass spectra of each compound’s assigned name from the... well as mass spectral similarity. In order to identify compounds of interest, each peak is compared against the National Institute of Standards and Technology electron ionization mass spectral...
mass spectra (after deconvolution), known as the Peak True mass spectra in ChromaTOF. Currently, the observed mass spectra and library hit mass spectra are either manually reviewed in ChromaTOF,