Support vector regression models are created and used to predict the retention times of oligonucleotides separated using gradient ion-pair chromatography with high accuracy. The experimental dataset consisted of fully phosphorothioated oligonucleotides. Two models were trained and validated using two pseudoorthogonal gradient modes and three gradient slopes.
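A model of this kind needs each oligonucleotide sequence turned into numbers before any regression can be fitted. As a minimal sketch only (not the authors' code), the composition and dinucleotide feature groups that the article adopts from Sturm et al. [22] (COUNT, CONTACT, and SCONTACT) could be computed as shown here. The function names are hypothetical, the HAIRPIN group is omitted because it requires a secondary-structure prediction, and the example string is the S16A sequence as printed in the article, read left to right as an assumption.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"  # alphabetical, so sorted dinucleotides match the SCONTACT keys


def count_features(seq):
    """COUNT group: occurrences of A, C, G, T (in that order)."""
    counts = Counter(seq)
    return [counts[b] for b in BASES]


def contact_features(seq):
    """CONTACT group: occurrences of each ordered dinucleotide (16 values)."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    return [pairs[a + b] for a, b in product(BASES, repeat=2)]


def scontact_features(seq):
    """SCONTACT group: dinucleotide counts ignoring order (10 values)."""
    pairs = Counter("".join(sorted(seq[i:i + 2])) for i in range(len(seq) - 1))
    keys = [a + b for i, a in enumerate(BASES) for b in BASES[i:]]
    return [pairs[k] for k in keys]


def encode(seq):
    """Concatenate the three feature groups into one numeric vector."""
    seq = seq.upper()
    if set(seq) - set(BASES):
        raise ValueError("sequence may only contain A, C, G, and T")
    return count_features(seq) + contact_features(seq) + scontact_features(seq)


example = "ACGACCGGGCGGAGTC"  # S16A as printed; reading direction is assumed
print(count_features(example))  # [3, 5, 7, 1]
print(len(encode(example)))     # 30
```

A vector like this, or only its first four entries when just the COUNT group is used, would then be passed to a regression model together with the measured retention time of the sequence.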
Journal of Chromatography A 1671 (2022) 462999
Journal homepage: www.elsevier.com/locate/chroma

Building machine-learning-based models for retention time and resolution predictions in ion pair chromatography of oligonucleotides

Martin Enmark, Jakob Häggström, Jörgen Samuelsson∗, Torgny Fornstedt∗
Department of Engineering and Chemical Sciences, Karlstad University, SE-651 88 Karlstad, Sweden

Article info
Article history: Received December 2021; Revised 22 March 2022; Accepted 25 March 2022; Available online 27 March 2022
Keywords: Machine-learning; Support vector regression (SVR) model; Oligonucleotides; Ion-pair chromatography; Resolution

Abstract
Support vector regression models are created and used to predict the retention times of oligonucleotides separated using gradient ion-pair chromatography with high accuracy. The experimental dataset consisted of fully phosphorothioated oligonucleotides. Two models were trained and validated using two pseudoorthogonal gradient modes and three gradient slopes. The results show that the spread in retention time differs between the two gradient modes, which indicates a varying degree of sequence-dependent separation. Peak widths from the experimental dataset were calculated and correlated with the guanine-cytosine content and retention time of the sequence for each gradient slope. These data were used to predict the resolution of the n – 1 impurity among 250,000 random 12- and 16-mer sequences, showing that one of the investigated gradient modes has a much higher probability of exceeding a resolution of 1.5, particularly for the 16-mer sequences. Sequences having a high guanine-cytosine content and a terminal C are more likely to not reach critical resolution. The trained SVR models can be used both to identify characteristics of different separation methods and to assist in the choice of method conditions, i.e., to optimize resolution for arbitrary sequences. The
methodology presented in this study can be expected to be applicable to predicting the retention times of other oligonucleotide synthesis and degradation impurities, provided that enough training data are available.

© 2022 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

∗ Corresponding authors. E-mail addresses: Jorgen.Samuelsson@kau.se (J. Samuelsson), Torgny.Fornstedt@kau.se (T. Fornstedt).

1. Introduction

Ion-pair chromatography (IPC) is an important technique for separating synthetic oligonucleotides, which are a class of DNA- or RNA-based molecules with widespread and well-known applications in diagnostics [1,2], research [3], and, recently, therapeutics [4,5]. Oligonucleotides used for antisense therapy [6] are typically produced using stepwise solid-phase synthesis via the β-cyanoethyl phosphoramidite method [7]. Depending on the length, sequence, and miscellaneous chemical modifications of these antisense active pharmaceutical ingredients (APIs) [8], the final synthesis product will contain a large fraction of impurities. The polymeric nature of the oligonucleotides and the many impurities challenge analytical separations, and phosphorothioated (PS) oligonucleotides are especially difficult to analyze [9–12]. In this study, we focus on the shortmer impurities with respect to the parent full-length product (FLP), and in particular on the n – 1 impurity generated by, e.g., failed coupling in the last coupling step, i.e., trityl-off.

Amphipathic [13] oligonucleotides are predominantly separated and analyzed using IPC [9,14,15]. The most-used stationary phase is the C18 column, typically pH-stable variants such as the XBridge C18 and other reversed-phase chemistries [11,12,15,16]. Many different combinations of ion-pairing reagents (IPRs) have been evaluated [9,15]. For the separation of PS oligonucleotides, methods using tributyl ammonium acetate (TBuAA) as the IPR have been
proven successful [11,15,17]. In this study, we will use TBuAA in two previously investigated gradient modes [18]. In that earlier study, we showed that a phenyl column gave slightly better n – 1 selectivity than a C18 column in the IPR gradient mode. In the co-solvent gradient elution mode, the co-solvent fraction increases over time, while the IPR concentration typically remains constant. In the IPR gradient mode, the IPR concentration decreases over time while the co-solvent fraction remains constant. Both modes elute oligonucleotides by decreasing the apparent electrostatic potential generated by the adsorption of the IPR. We have previously shown that the IPR gradient increases the selectivity for oligonucleotide impurities of the same charge, for example phosphodiester (P=O)1 impurities of fully phosphorothioated oligonucleotides, especially using a phenyl column [18]. Other chromatographic modes not using IPRs, such as HILIC, have also been investigated for the separation of PS-modified oligonucleotides [19].

https://doi.org/10.1016/j.chroma.2022.462999
0021-9673/© 2022 The Authors. Published by Elsevier B.V.
M. Enmark, J. Häggström, J. Samuelsson et al., Journal of Chromatography A 1671 (2022) 462999

Retention time prediction models for the IPC separation of oligonucleotides are few; noteworthy works include those of Gilar et al. [20], Studzinska and Buszewski [21], Sturm et al. [22], Liang et al. [23], and Kohlbacher et al. [24]. Such models are well established for peptides and are routinely employed, for example, in shotgun proteomics to design targeted proteomics experiments and to reduce false-positive hits in mass spectrometry analysis. The many different approaches used can roughly be divided into (i) index-based, (ii) modeling-based, and (iii) machine-learning (ML)-based methods [25]. In index-based methods, the effect of each amino acid
in a sequence is estimated using the multilinear regression of a large set of peptides with known retention times [26,27]. In modeling-based methods, the physicochemical properties of the peptide are used to predict the retention times [27]. In ML-based methods, a training set of peptides is used to estimate the parameters of a predefined mathematical model; many different approaches have been used for this, such as artificial neural networks [28] and support vector regression (SVR) [29,30].

Gilar et al. developed an empirical logarithmic model (hereafter denoted as LM) to predict the retention of synthetic oligonucleotides [20]. Their modeling-based method has five input variables, i.e., the amount of each nucleotide (T, C, G, and A) as well as the total number of nucleotides in the oligonucleotide. Studzinska and Buszewski used quantitative structure–retention relationships (QSRRs) to predict the retention based on descriptors such as van der Waals surface area, solvent-accessible area, dipole moment, total energy, and hydration energy [21]. All these parameters were numerically estimated and fitted to simple functions. Neither of these methods delivers excellent predictivity. The great advantage of the LM is that it is easy to use, requires few data points for calibration, and has been shown to be rather good for predicting the retention of non-phosphorothioated oligonucleotides. However, due to the selection of descriptors, the model cannot address potential structural changes such as groove and hairpin formation, nor whether the retention depends on sequence and not just on composition. The same could also be true of the QSRR method, which shares the problems of descriptor selection and of finding accurate descriptors of more complicated molecules such as oligonucleotides.

Sturm et al. used SVR for retention predictions [22], mainly using sequence-based descriptors as well as descriptors correlating to stacking energies occurring in hairpin formation. Sturm et al.
showed that their model had better predictive power than the LM and could also predict the retention change due to hairpin formation. Since the experimental systems, including the solutes, investigated by Gilar et al. and Sturm et al. are similar, it is relevant to compare both approaches for phosphorothioated oligonucleotides separated in different experimental systems. Later, Liang et al. used a similar SVR model to investigate how to optimize the selectivity in gradient elution [23]. In all of the above studies, the authors investigated non-phosphorothioated oligonucleotides using triethylamine as the IPR and the co-solvent gradient mode. Given the successful use of SVR models in [22,23], we decided to investigate whether such models can also be used to predict the retention of phosphorothioated oligonucleotides eluted using tributylamine as the IPR.

The aim of this study is to build SVR IPC retention time prediction models based on the oligonucleotide sequence for two different gradient modes, i.e., the conventional co-solvent gradient and the IPR gradient modes. As training and testing solutes, around 100 heteromeric, fully phosphorothioated oligonucleotides will be used. As the IPR, TBuAA will be used to reduce diastereomer separation. Finally, and most importantly, the retention time prediction models will be used to predict the probability of successfully separating the impurities from synthetic oligonucleotides and to compare the two gradient modes, (i) the co-solvent gradient and (ii) the IPR gradient, at three gradient slopes.

2. Materials and methods

2.1 Chemicals and materials

The IPRs TBuAA and triethylammonium acetate (TEtAA) were prepared from tributylamine (≥99.5%, CAS number 102-82-9) and triethylamine (≥99.5%, CAS number 121-44-8) with acetic acid (≥99.8%, CAS number 64-19-7), all purchased from Sigma-Aldrich (St. Louis, MO, USA). The mobile phases were prepared using HPLC gradient-grade acetonitrile (CAS number 75-05-8) from VWR (Radnor,
PA, USA) and deionized water with a resistivity of 18.2 MΩ·cm from a Milli-Q water purification system (Merck Millipore, Darmstadt, Germany). An XBridge Phenyl column, 150 × 3.0 mm, 3.5 μm, 100 Å pore size, from Waters (Milford, MA, USA) was used in all experiments. Fully phosphorothioated oligonucleotides were purchased in 0.25-μmol scale from Integrated DNA Technologies (Leuven, Belgium) and delivered desalted and lyophilized. The purchased FLP oligonucleotides were not purified before use. A list of all oligonucleotide sequences can be found in Supplementary material Table S1.

2.2 Instrumentation

Experiments were conducted on an Agilent 1260 Infinity II HPLC system (Agilent Technologies, Palo Alto, CA, USA), configured with a binary pump, a 100-μL injection loop, a diode-array UV detector, a single quadrupole MS, and a column thermostat.

2.3 Procedures

2.3.1 Selection of oligonucleotides

The first part of the dataset was selected to explore the effects of length, nucleobase composition, and sequence. It contains three different sequences each of 8-, 12-, and 16-mer oligonucleotides. These were designed in silico by first generating one million sequences of length 8, 12, and 16 by randomly picking adenine (A), thymine (T), cytosine (C), or guanine (G) at each position in the sequence. The retention time of all sequences was then calculated using the LM described by Gilar et al. [20]. This allowed us to estimate the variance in retention time for each population of 8-, 12-, and 16-mers. Then, we randomly picked three sequences of each length from each population at the mean – standard deviations, the mean, and finally the mean + standard deviations, labeled SnA, SnB, or SnC, where n = 8, 12, or 16, respectively. These sequences can be found in Supplementary material Table S1. Since the LM predicts that the contribution to retention time increases according to the nucleobase in the order C < G < A < T, the base composition of the sequences will vary from high proportions of guanine-cytosine content
(GC-content) in the SnA sequences to high proportions of A and T in the SnC sequences, respectively.

The second part of the dataset was selected to test whether the secondary oligonucleotide structure influences the retention time. The 16-mer sequences referred to as reference hairpin (RHA) and model hairpin (MHA) by Stellwagen et al. [31] were selected; Stellwagen et al. investigated the effect of monovalent cations on the thermal stability of MHA, as measured by capillary electrophoresis. In this case the MHA should contain more than 10% hairpin structures at 50 °C, at least in a solution containing 100 mM tetrabutyl ammonium, no organic solvent, and a high amount of other background electrolytes. They also found that the DNA melting point decreases with increasing lipophilicity of the IPR [31]. In our study, we therefore included permutated variants of RHA and MHA that minimize hairpin formation (i.e., RHB and MHB). Finally, a sequence mimicking the MALAT-1 transcript targeting ASO described by Nilsson et al. [32] was included in the dataset. The 8-, 12-, and 16-mer sequences synthesized are hereafter referred to as FLPs of length n.

2.3.2 Experimental

All samples were prepared by dissolving the lyophilized oligonucleotides by vortexing them in deionized water prepared using a Milli-Q water purification system (Merck Millipore). The stock concentration was mg mL–1 and the injection concentration was 0.2 mg mL–1. μL of this solution was injected into the column. Mobile phases were prepared by weight using the density of water and acetonitrile (MeCN) at room temperature. For the co-solvent gradient experiments, 10 and 80 v% MeCN solutions were prepared, while for the IPR concentration gradient experiments, two 41.5 v% solutions were made. During stirring, acetic acid was added followed by tributylamine (to both eluents for co-solvent gradient experiments) or tributylamine or triethylamine separately for IPR concentration gradient experiments. All mobile phases were stirred for at least 12 hours before use to ensure that all IPR molecules were fully dissolved. Before use, the sw pH of all mobile phases (solvent/water) was determined using a pH electrode calibrated in aqueous buffer. The measured pH value of the mobile phase ranged between 7 and 8 depending on the mobile phase composition; at low concentration of MeCN and at high concentration of MeCN.

All experiments were performed using still-air column temperature control at 50 °C. The flow rate was 0.5 mL min–1, which provided sufficiently good MS signals, i.e., good enough nebulization in the spray chamber. Three gradient slopes were evaluated for each of the two gradient methods, and their details can be found in Table 1. A re-equilibration time of about three column volumes was used after the end of each gradient. A 0.01 mg mL–1 sample of uracil was prepared in deionized water and used as the void volume marker. The UV signal was recorded at 260 nm. Mass spectrometry analysis was performed using negative polarity in API-ES ionization mode. More details of the mass spectrometry settings can be found in Roussis et al. [33].

Retention times were obtained from both UV and MS signals. The retention time of the full-length sequence was determined from the peak apex of the UV signal. Retention times of shortmer impurity sequences were obtained by the selective ion monitoring of charge states and . For the 8-mer samples, retention times of n = 8, 7, 6, 5, or 4 were obtained in a single injection, whereas for the 16-mer samples, retention times of n = 16, …, 12 and 11, …, were obtained in two separate injections. This allowed the repeatability of experiments to be monitored. Retention times were adjusted for the additional dwell volume introduced by the tubing to the MS. To determine the correct time for samples having overlapping m/z values for different charge states, it was assumed that the retention time of the n – x-mer was always less than that of the n-mer.

Regarding the amount of data used: in total, retention times for 98 unique sequences were determined at all gradient slopes in the IPR gradient experiments, and for 96 (gradient slopes G1 and G2) and 91 (gradient slope G3) unique sequences in the co-solvent gradient experiments. A list of all oligonucleotide sequences as well as their retention times can be found in Supplementary material Table S1; the peak widths were obtained from the n – 1, n – 2, and n – peaks by first interpolating the actual peak and then determining the corresponding width at half height.

Table 1. Summary of experimental gradient conditions.

Gradient   Co-solvent gradient                                      IPR gradient
           Initial MeCN (v%)   TBuAA (mM)   Slope (v% MeCN min–1)   Initial TEtAA (mM)   MeCN (v%)   Slope (mM TEtAA min–1)
G1         38                               2.22                    0.1                  41.5        0.32
G2                                          1.23                                                     0.16
G3                                          0.81                                                     0.08

3. Calculations

All general computations were performed using Python with the NumPy supporting libraries, and all graphics were generated using Matplotlib.

The first step in finding an ML model is processing the data. Our dataset consists of the output data, i.e., the retention times, and the corresponding oligonucleotides, represented by a string of different combinations of A, T, G, and C, serving as input data. Since ML models require numerical input, the oligonucleotides must be encoded. In our implementation, we encoded the oligonucleotides in terms of different frequencies based on their primary and secondary structural properties, as described by Sturm et al. [22]. These features were divided into groups, as done by Sturm et al., where COUNT contains the frequency of each nucleotide in the sequence, CONTACT contains the frequencies of all possible dinucleotides in terms of their order (e.g., the numbers of CG, CA, CT, CC, etc. occurring in the sequence), SCONTACT contains the frequencies of all dinucleotides disregarding their order (e.g., the numbers of CG + GC, CA + AC, CC, etc.), and finally HAIRPIN contains the numbers of stem, loop, and free bases [22]. The secondary structure of the sequences was calculated using the seqfold module [34], assuming a temperature of 50 °C.

The next step in the search for a model was the training, followed by finding the best-performing features and hyperparameters. This was done by performing a nested cross-validation, the purpose of which was to estimate how well the model responded to new data and thereby to reduce the risk of model overfitting. First, the dataset was split into k subsets. Then, one subset was omitted from the training to act as validation data (1/3 of all data), while the rest of the dataset was used for training (2/3 of all data). The chosen training set was then further split into n subsets, and the same procedure was repeated. This approach is visualized in Fig. 1. The model that performed best on average in the inner cross-validation was chosen to be tested on the outer validation set. The result was then evaluated based on the average performance in the outer validation; the main metric used in this implementation was the root mean squared error (RMSE). This procedure was performed for each sub-dataset, and every unique combination of the described feature groups was evaluated.

The inner cross-validation was done using GridSearchCV from the sklearn ML library, which performs a k-fold cross-validation for a given model (SVR) and lists of hyperparameters (regularization parameter C, epsilon tube ε, and kernel coefficient γ). Once GridSearchCV had found a fit for each combination of hyperparameters, the best-performing model was chosen and further evaluated on the outer validation set, which was randomly split using the sklearn function KFold [35]. The number of folds in both the outer and inner cross-validations was chosen to be three. Furthermore, results might vary due to the stochastic nature of the algorithms when performing a fit and due to the randomized split of the datasets, so the process was performed another three times to reduce the variance of the results.

As a comparison, the LM developed by Gilar et al. (equation in [20]) was fitted to each sub-dataset. A nonlinear least-squares regression was performed to find the optimal weights using the lmfit module [36]. The LM requires no hyperparameter optimization and was therefore only evaluated on the outer validation split.

When the best-performing features were found, a final training was done using the best-performing features on two thirds of the dataset to visualize the results in plots. Also, the models that were trained on the whole dataset were saved for later use.

To evaluate the characteristics of the SVR model, we generated 250,000 unique random sequences with n = 12 and 16. We then calculated their retention times and fitted them using a normal distribution. The peak width at half height (w0.5,i) was assumed to be described by the GC-content (sum of fractions of C and G) of the sequence and its retention time plus a constant. The solution to the resulting linear matrix equation (Supplementary material S4) was determined using the least-squares method. The half-height widths of the UV trace of the FLP and of the mass traces of the n – 1 to n – 7-mers of the 16-mer FLPs in the dataset, as well as of the n – 1 to n – 3-mers of the 12-mer FLPs, were used as input. The SVR model can be downloaded from the Supplementary material.

Fig. 1. Flowchart showing the steps required to train an SVR model to predict retention times.

4. Results and discussion

The shortmer population (n – 1, n – 2, …, n – n + 1) constitutes the largest number of impurities generated by the solid-phase synthesis. Successful separation and quantification of the individual shortmers are necessary for the quality control of APIs. Generally, the separation of the n – 1-mer is the most relevant and most challenging problem. Therefore, it is beneficial to have a tool that can assist in the selection of chromatographic methods and the corresponding conditions necessary to achieve critical resolution of the pair, here defined as ≥ 1.5. In Section 4.1, we will present experimental retention data obtained using two methods for three gradient slopes and discuss the characteristics of the two systems. The determined retention data will then be used to train ML models, whose performance and characteristics will be discussed in Section 4.2. Finally, in Section 4.3, we will use the ML model to estimate the probability of resolving an arbitrary oligonucleotide from its n – 1 impurity. We will also demonstrate how the choice of elution method, conditions, and sequence characteristics affects the probability of success.

4.1 Retention times

The first observation for both the co-solvent gradient and the IPR gradient was that the retention times of sequences with n = 8, 12, and 16 increased with increasing proportions of A and T (samples SnA through SnC in Supplementary material Table S1). The retention time also increased with decreasing gradient slope. Very short oligonucleotides, i.e., n < 5, were only marginally affected by the gradient compared with longer sequences, i.e., n = 16, as the system dwell volume had less of an effect on strongly retained oligonucleotides. The oligonucleotide 3′-ACGACCGGGCGGAGTC-5′ (S16A) had similar retention times using either method for all three gradients, as it was used to normalize the effects of gradient slope and starting point between the methods. This normalization had the unexpected effect that the shorter oligonucleotides, i.e., the S8x and S12x samples, were eluted significantly earlier using the IPR gradient than the co-solvent gradient. Clearly, the two methods cannot be normalized for oligonucleotides of different lengths without also changing the shape of the gradient. 16-mer sequences other than S16A had different retention times in the two modes, indicating that there were different sequence-specific contributions to retention. The hairpin-forming sequence MHA had about a 0.15-min shorter retention time than its permutated sequence MHB in the co-solvent gradient system and about a 0.3-min difference in the IPR gradient system using the shallowest gradient (G3). The second hairpin-forming sequence, RHA, had retention almost identical to that of its permutated variant RHB in both systems at the same gradient slope.

In Fig. 2a, we can see the difference between the two gradient modes. The shortest oligonucleotides display better selectivity, i.e., a large change in the y-direction with the addition or removal of a nucleobase subunit, in the co-solvent gradient method, whereas the opposite trend holds for the longest oligonucleotides in the IPR gradient method (the larger change is in the x-direction). However, as can also be seen in Fig. 2b, the eluted peaks in the IPR gradient are wider than in the co-solvent gradient. How this affects resolution will be investigated further; see Section 4.3 below.

Fig. 2. (a) Normalized experimental retention times obtained in co-solvent and IPR gradients using gradient G3 (Table 1) and (b) chromatogram showing the separation of sequence MHB (Supplementary material Table S1) at gradient G3 (Table 1).

4.2 Machine learning model to predict retention times

The first step in finding the best ML model was to evaluate the model performance as a function of the number of features, i.e., count, contact, scontact, and hairpin (see Section 3 for more details about the features). Across the six combinations of gradient mode and slope, count alone gave the smallest RMSE for three systems (for a summary of all models, see Supplementary material Table S2). For the remaining three systems, different combinations of features gave only marginally improved model RMSE. This result could already be anticipated from the retention data, with permutations of the strong hairpin structures MHA and RHA only marginally affecting the retention time. We therefore decided to continue using the model but with only the count feature.

In the study by Sturm et al. [22], all features were found to be required to properly predict the retention times. However, this finding cannot be directly extrapolated to our study since there are two main experimental differences between the experiments conducted by Sturm et al. and by us. Firstly, they used another IPR (TEA) and, secondly, they used unmodified oligonucleotides, whereas we used TBuAA as the IPR and fully phosphorothioated oligonucleotides as solutes. As a consequence, Sturm et al. conducted their separations with much lower amounts of acetonitrile (MeCN), a 0–16% MeCN gradient, as compared to 38–70% in this study. Previously it was shown that in separations conducted at higher amounts of MeCN, the separation system's ability to separate charge differences is increased while the system's ability to separate compounds with the same charge is decreased [18]. As a result, the count feature becomes more important and the features describing next-neighbor effects, contact and scontact, contribute less to the model, which was also observed in our study.

Table 2. Summary of model performance on the training and validation sets.

Gradient mode         Gradient slope   Model   RMSE training set (min)   RMSE validation set (min)   R² training set   Q² validation set
IPR gradient          G1               SVR     0.055                     0.076                       0.999             0.998
IPR gradient          G1               LM      0.280                     0.278                       0.977             0.974
IPR gradient          G2               SVR     0.037                     0.120                       0.999             0.998
IPR gradient          G2               LM      0.583                     0.640                       0.949             0.937
IPR gradient          G3               SVR     0.105                     0.181                       0.999             0.997
IPR gradient          G3               LM      0.976                     1.270                       0.925             0.852
Co-solvent gradient   G1               SVR     0.073                     0.091                       0.998             0.997
Co-solvent gradient   G1               LM      0.130                     0.129                       0.993             0.993
Co-solvent gradient   G2               SVR     0.088                     0.123                       0.999             0.996
Co-solvent gradient   G2               LM      0.178                     0.260                       0.995             0.984
Co-solvent gradient   G3               SVR     0.073                     0.127                       0.999             0.998
Co-solvent gradient   G3               LM      0.383                     0.478                       0.985             0.975

We also compared the SVR model with the LM. The results indicated that the SVR model gave lower RMSE in all cases (see Table 2). The relative difference in RMSE between the SVR and the LM models increased with
decreasing gradient slope for both gradient modes. SVR was also markedly better at accurately predicting retention times for the IPR gradient at all gradient slopes. This could be expected since the LM was developed for co-solvent gradient elution, native oligonucleotide samples, and different IPR and stationary phases. Furthermore, the LM was developed to give a rough estimate of the amount of acetonitrile required to elute an oligonucleotide based on its length and relative proportions of nucleobases, for which it would still be useful given the current datasets.

Another way to estimate the model fit is to calculate the correlation coefficients R² and Q², where R² is estimated from the training set and Q² is estimated from data not used in the training set; R² therefore estimates the goodness of fit and Q² estimates the goodness of prediction. From Table 2, we can see that: (i) R² was always greater than Q², as expected; (ii) both R² and Q² were substantially larger for the SVR model than for the LM; (iii) the LM was much worse in predicting the IPR gradient than the co-solvent gradient; and (iv) the SVR model was only slightly worse in predicting the IPR gradient than the co-solvent gradient.

Plots of predicted versus experimental retention times for the validation subset of the experimental data obtained at gradient G3 are shown in Fig. 3a and c for the co-solvent and IPR gradient modes, respectively. The validation subset shown in this plot contains one third of the sequences in the complete dataset. The corresponding box plots of the relative error for the SVR and LM models are shown in Fig. 3b and d.

The characteristics of the SVR models were evaluated by calculating the retention times of 250,000 unique random 12- and 16-mers. The distribution of retention times can be found in Fig. 4. The spread of the distributions increased with increasing oligonucleotide length and decreasing gradient slope for both gradient modes, which could be expected. In
general, the spread of retention times was higher for the IPR gradient mode, suggesting that the hydrophobicity of the base pairs has a larger impact in this mode. The larger variance observed for 16-mers could already be predicted from Fig. 2a. Analyzing the base composition of sequences by fitting a normal distribution to the retention data shows that, for both gradient modes, 12-mer sequences with retention times more than 1.5 standard deviations below the mean had a higher proportion of G and especially C compared with the baseline of 25% each (see Supplementary material Table S3). On the other hand, 12-mer sequences with retention times more than 1.5 standard deviations above the mean had larger than baseline (25%) proportions of A and especially T for both gradient modes. For the 16-mer sequences, the differences in base composition were less pronounced below 1.5 standard deviations for both gradient modes, but the GC-content remained above 50%. Among the most strongly retained 16-mers, over 40% of nucleobases in the sequence are T for both modes and all gradient slopes.

Fig. 3. Experimental (tR,exp) and predicted (tR,pred) retention times in the validation dataset obtained using the SVR model (dots) or LM (crosses) for the co-solvent gradient mode (a, b) and the IPR gradient mode (c, d), respectively. In b) and d), the relative errors of the predictions are summarized in boxplots: the line in the boxplot is the median and the whiskers are the first and third quantiles.

Fig. 4. Distributions of the predicted retention times for 250,000 unique 12- and 16-mer sequences (blue and orange fill, respectively) calculated using the SVR model. Subplots a)–c) show the co-solvent gradient and d)–f) the IPR gradient. Gradient slope G1 (a, d), G2 (b, e), and G3 (c, f).

4.3 Predictions of the probability of resolving the FLP from the n – 1 impurity

Of particular interest for the quality control of synthetic oligonucleotides is determining the purity of the FLP, which requires sufficient (i.e., Rs > 1.5) resolution when using UV detection. To calculate the resolution, we need accurate predictions of retention time and peak width. In addition to retention times, we therefore also investigated how the peak widths correlated with the retention times in each gradient mode; we found there was only a weak correlation for the co-solvent gradient but a more pronounced correlation for the IPR gradient. In both gradient modes, the peak widths increased with increasing retention time (see Supplementary material Fig. S1). The peak widths obtained in the IPR gradient mode were greater than in the co-solvent gradient mode, both in absolute terms and by having a larger sequence variance.

One possible explanation is that the gradient compression experienced by each solute differed because the solutes have different sensitivities to the gradient change. Also, the effective gradient slope (G) could be different between the two gradient modes. However, since the retention time shift of sample S16A was shown to be about the same for gradient slopes G1, G2, and G3 between the two modes, a significant difference in effective gradient slope was unlikely. Another explanation could be that the peak broadening due to partial diastereomer separation was greater in the IPR gradient mode than in the co-solvent mode. This explanation is plausible since we have shown that the diastereomer separation increased at lower and constant co-solvent concentration in the IPR gradient mode as compared with co-solvent gradient elution [18]. This would explain why the peak width increased with both decreasing gradient slope and increasing retention time. We have previously shown that the diastereomer separation involving C and G was greater than that involving A and T [17] and therefore attempted to correlate the GC-content, together with the retention time and a constant, to the observed peak
width. This simple correlation provides a reasonable approximation of peak width, as summarized in Supplementary material S2. The predicted versus experimentally calculated resolutions for 12- and 16-mer samples are presented in Table 4. Except for the steepest gradient slope, in both gradient modes, the prediction error is less than 10%. We also observe that the absolute mean error of prediction decreases with decreasing gradient slope. Investigating the details, we see that the n – 1 impurities of samples S12A and S12C are always resolved at a resolution of more than 1.5 regardless of the investigated gradient slope or mode. For the 16-mer sequences, the critical resolution is reached at a steeper gradient slope using the IPR gradient mode compared to the co-solvent gradient mode. Interestingly, the two 12-mer samples always have higher resolution than the 16-mers using the co-solvent gradient at any gradient slope, whereas the GC-rich sample S12A has a lower resolution than some of the investigated 16-mers using the IPR gradient mode. This again highlights that the IPR gradient mode has a higher degree of separation based on sequence rather than length as compared to the co-solvent gradient.

An accurate estimation of resolution based on sequence composition and retention time allowed us to calculate the peak widths of all 250,000 random unique 12- and 16-mers as well as their n – 1 impurities at each gradient slope. The resulting distributions of calculated resolutions are shown in the figure of predicted n – 1 resolutions. For the co-solvent gradient mode, the 12/11-mer separation always has a higher resolution than the 16/15-mer separation regardless of the sequences. In addition, all 12-mer sequences are predicted to reach a resolution of 1.5 at all investigated gradient slopes. The resolution of the 12-mer sequences using the co-solvent gradient mode was generally similar to or slightly better than what could be achieved with the IPR gradient mode. This could also be anticipated from Fig. 2a, where the selectivity between shorter
oligonucleotides is greater for the co-solvent gradient than the IPR gradient. For the 16/15-mer separation, no sequences could be separated with a resolution of at least 1.5 using the steepest co-solvent gradient investigated. At the second and third steepest gradients, i.e., G2 and G3, 42 and 28% of the random sequences, respectively, could not be separated (see Table 3). For the IPR gradient mode, the resolution distributions of the 16/15-mer and 12/11-mer separations overlap at all gradient slopes, with the overlap increasing with decreasing gradient slope. The results indicate that some 12/11-mers are more difficult to resolve than some 16/15-mers using IPR gradients. This could be expected from the experimental resolution data showing that a GC-rich 12-mer can have lower resolution compared to some 16-mers (Table 4). For the 16-mer FLPs, 31, 9, and 4% of all random unique sequences are expected not to reach the critical resolution of 1.5 at gradient slopes of G1, G2, and G3, respectively.

Investigating the characteristics of the 16-mer sequences that do not reach a resolution of at least 1.5, we found, for the co-solvent gradient, that they had a marginally higher frequency of C, both throughout the sequence and in the terminal nucleobase, which when missing creates the n – 1-mer (see Table 3). For the IPR gradient, there was a similar but more pronounced trend. The sequences that do not reach critical resolution at G2 and G3 contained 27 and 40% C, respectively, as well as above average A. For the terminal nucleobase, there was a 41 or 82% probability that it was a C at gradient slopes G2 and G3, respectively. This could be understood from two earlier observations: first, a sequence containing a large proportion of C will lead to a wider peak; second, the loss of a terminal C will give a smaller than average relative decrease in retention time. These effects combined lead to difficulties obtaining sufficient resolution. Investigating the FLPs of the experimental dataset (Supplementary material Table S1), we found
that one of the sequences that did not reach the critical resolution using the IPR gradient at the steepest gradient slope G1 was the RHB sample (3′-CGCGTGGTCCTGGTCC-5′). This sequence has a composition of 37.5% C, 37.5% G, 25% T, and 0% A as well as a terminal C. The experimental resolution for the n – 1-mer was calculated to be about 1.3 at G1, see Table 4. Decreasing the gradient slope to G3 increased the resolution to about 1.9. The resolution using the co-solvent gradient was even lower, about 1 at G1 and just 1.5 at G3. Experimental and simulated chromatograms of RHB are shown in the corresponding figure. The simulated peaks were constructed by generating a normal distribution with a variance calculated from the nucleobase composition and retention time. The areas of the FLP and n – 1 peaks were manually normalized by adjusting the height separately for each peak, and the peaks were then stitched together to get the final chromatogram. The retention times and peak widths of the experimental and simulated chromatograms are in good agreement, although there is a slight underestimation of the calculated resolution in the co-solvent gradient at gradient slope G1, as also indicated in Table 4.

Fig. Distributions of the predicted n – 1 resolutions for 250,000 unique 12- and 16-mer sequences (blue and orange fill) calculated using the SVR model. Subplots a)–c) show the co-solvent gradient and d)–f) the IPR gradient. Gradient slope G1 (a, d), G2 (b, e), and G3 (c, f). Vertical dashed line at a resolution of 1.5.

Fig. Experimental and simulated chromatograms of the RHB sample at the steep gradient slope G1 (a, c) and the shallow gradient slope G3 (b, d), respectively. Co-solvent gradient mode (a, b) and IPR gradient mode (c, d).

Table 3. Details of predicted 16-mer failure sequences (Rs < 1.5); fx is the percentage of nucleobase x.
| Gradient mode | Gradient slope | Below critical resolution, Rs < 1.5: frequency (%) | Sequence fA | Sequence fT | Sequence fC | Sequence fG | Terminal fA | Terminal fT | Terminal fC | Terminal fG |
|---|---|---|---|---|---|---|---|---|---|---|
| Co-solvent gradient | G1 | 100 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 |
| Co-solvent gradient | G2 | 42 | 22 | 22 | 28 | 28 | 24 | 24 | 26 | 26 |
| Co-solvent gradient | G3 | 28 | 23 | 23 | 27 | 27 | 24 | 24 | 26 | 26 |
| IPR gradient | G1 | 31 | 28 | 23 | 24 | 25 | 27 | | 34 | 37 |
| IPR gradient | G2 | 9 | 32 | 22 | 27 | 19 | 31 | 0 | 41 | 28 |
| IPR gradient | G3 | 4 | 21 | 22 | 40 | 17 | | | 82 | 15 |

Table 4. Experimentally measured resolutions vs. predictions for FLP and n – 1 using the SVR model for retention times and the linear model for peak widths, respectively.

| Sample | -end | Co-solvent G1 Exp | Co-solvent G1 Pred | Co-solvent G2 Exp | Co-solvent G2 Pred | Co-solvent G3 Exp | Co-solvent G3 Pred | IPR G1 Exp | IPR G1 Pred | IPR G2 Exp | IPR G2 Pred | IPR G3 Exp | IPR G3 Pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S12A | C | 1.81 | 1.81 | 2.40 | 2.53 | 2.76 | 2.77 | 2.17 | 2.17 | 2.45 | 2.44 | 2.38 | 2.38 |
| S12C | C | 1.73 | 1.71 | 2.45 | 2.63 | 2.92 | 2.92 | 2.26 | 2.39 | 2.63 | 2.70 | 2.82 | 2.81 |
| S16A | C | 1.10 | 1.30 | 1.43 | 1.63 | 1.69 | 2.06 | 1.56 | 1.89 | 1.97 | 2.33 | 2.21 | 2.29 |
| S16B | A | 1.03 | 0.73 | 1.49 | 1.57 | 1.75 | 1.76 | 1.49 | 1.49 | 1.90 | 1.98 | 2.52 | 2.52 |
| S16C | A | 1.02 | 1.03 | 1.45 | 1.69 | 1.75 | 1.83 | 1.45 | 1.91 | 1.84 | 2.14 | 2.21 | 2.44 |
| MALAT | G | 0.86 | 0.61 | 1.23 | 1.48 | 1.51 | 1.50 | 1.27 | 1.59 | 1.62 | 1.72 | 1.92 | 2.15 |
| MHA | A | 1.05 | 0.83 | 1.57 | 1.59 | 1.85 | 1.96 | 1.54 | 1.76 | 1.98 | 2.35 | 2.35 | 2.78 |
| MHB | A | 1.14 | 0.83 | 1.65 | 1.59 | 1.93 | 1.96 | 1.85 | 1.76 | 2.31 | 2.35 | 2.74 | 2.78 |
| RHA | G | 1.09 | 0.82 | 1.54 | 1.49 | 1.77 | 1.68 | 1.63 | 1.55 | 2.06 | 2.06 | 2.30 | 2.32 |
| RHB | C | 0.98 | 0.92 | 1.32 | 1.34 | 1.53 | 1.51 | 1.29 | 1.31 | 1.68 | 1.66 | 1.90 | 1.91 |

Abs. mean error (%): co-solvent gradient G1 15.8, G2 7.9, G3 4.2; IPR gradient G1 10.9, G2 7.0, G3 4.8.

Conclusions

This study aimed at constructing an ML model capable of predicting the retention times of phosphorothioated oligonucleotides with high accuracy. The model was shown to predict retention times with low RMSE as well as high Q² and R² for all investigated conditions. For the investigated experimental systems, the effect of secondary oligonucleotide structure was shown to be minimal, allowing us to construct a simpler model. The ML models were used for predicting the chromatographic characteristics of 250,000 random 12- and 16-mers. It was found that the variance in retention time was higher when using the IPR gradient mode than the co-solvent gradient mode. However, a slight skewness in the distribution of retention times for a uniform distribution of A, T, G, and C indicates that the SVR model has captured sequence-specific contributions to the retention time, which could indicate the presence of next-neighbor effects. Sequences containing high proportions of C and G gave the shortest retention times, whereas high proportions of A and T gave the longest retention times in both gradient modes. Finally, the resolution of each of the 250,000 random sequences from its n – 1-mer was calculated using the retention time from the ML model and the peak width from the linear combination of oligonucleotide GC-content and retention time. The results indicate that the co-solvent gradient mode can be expected to easily resolve all 12-mer sequences from the 11-mers, typically with greater resolution than the IPR gradient. On the other hand, the probability of successfully resolving longer 16-mer sequences from 15-mers was significantly higher using the IPR gradient mode. For both methods, decreasing the gradient slope increased the probability of achieving critical resolution. Among the 16-mers that still could not be resolved using the IPR gradient mode, the frequency of C was very high, especially at the terminal nucleobase.

The ML models constructed in this study could help select the appropriate gradient mode and gradient slope that would lead to successful separation before performing an experiment. The models could be expanded to account for retention shifts introduced by other oligonucleotide modifications such as 2′-MOE, methyl-C, or LNAs if sufficient data is provided. They could also cover other impurities related to the FLP if trained with such retention data; such impurities could, for example, include (P=O) or abasics. Other chromatographic systems, including other column chemistries, particle sizes, temperatures, and mobile phases, could also be added to provide an even greater number of possible systems to choose from. The methodology could also be used to optimize the method run time in silico before running experiments.

Availability

Implementations and code used in this study can be found at: https://github.com/jakobhaggstrom/JCA-21-1579

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Martin Enmark: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing – original draft, Writing – review & editing, Visualization, Supervision. Jakob Häggström: Methodology, Software, Formal analysis, Investigation, Data curation, Writing – original draft. Jörgen Samuelsson: Conceptualization, Validation, Writing – original draft, Writing – review & editing, Supervision. Torgny Fornstedt: Conceptualization, Writing – review & editing, Supervision, Project administration, Funding acquisition.

Acknowledgements

This work was supported by the Swedish Knowledge Foundation via the project “Improved Methods for Process and Quality Controls using Digital Tools” (grant number 20210021) and by the Swedish Research Council (VR) via the project “Fundamental Studies on Molecular Interactions aimed at Preparative Separations and Biospecific Measurements” (grant number 2015-04627).

Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.chroma.2022.462999

References

[1] S. Yang, R.E. Rothman, PCR-based diagnostics for infectious diseases: uses, limitations, and future applications in acute-care settings, Lancet Infect. Dis. (2004) 337–348, doi:10.1016/S1473-3099(04)01044-8
[2] L. Becherer, N. Borst, M. Bakheit, S. Frischmann, et al., Loop-mediated isothermal amplification (LAMP) – review and classification of methods for sequence-specific detection, Anal. Methods 12 (2020) 717–746, doi:10.1039/C9AY02246E
[3] M.J. Heller, DNA Microarray Technology: Devices, Systems, and Applications, Annu. Rev. Biomed. Eng. (2002) 129–153, doi:10.1146/annurev.bioeng.4.020702.153438
[4] W. Yin, M. Rogge, Targeting RNA: A Transformative Therapeutic Strategy, Clin. Translat. Sci. 12 (2019) 98–112, doi:10.1111/cts.12624
[5] T.C. Roberts, R. Langer, M.J.A. Wood, Advances in oligonucleotide drug delivery, Nat. Rev. Drug Discovery 19 (2020) 673–694, doi:10.1038/s41573-020-0075-
[6] C.F. Bennett, E.E. Swayze, RNA targeting therapeutics: molecular mechanisms of antisense oligonucleotides as a therapeutic platform, Annu. Rev. Pharmacol. Toxicol. 50 (2010) 259–293, doi:10.1146/annurev.pharmtox.010909.105654
[7] E. Paredes, V. Aduda, K.L. Ackley, H. Cramer, 6.11 - Manufacturing of Oligonucleotides, in: S. Chackalamannil, D. Rotella, S.E. Ward (Eds.), Comprehensive Medicinal Chemistry III, Elsevier, Oxford, 2017, pp. 233–279, doi:10.1016/B978-0-12-409547-2.12423-
[8] S. Benizri, A. Gissot, A. Martin, B. Vialet, et al., Bioconjugated oligonucleotides: recent developments and therapeutic applications, Bioconjugate Chem. 30 (2019) 366–383, doi:10.1021/acs.bioconjchem.8b00761
[9] N.M. El Zahar, N. Magdy, A.M. El-Kosasy, M.G. Bartlett, Chromatographic approaches for the characterization and quality control of therapeutic oligonucleotide impurities, Biomed. Chromatogr. 32 (2018), doi:10.1002/bmc.4088
[10] D. Capaldi, A. Teasdale, S. Henry, N. Akhtar, et al., Impurities in Oligonucleotide Drug Substances and Drug Products, Nucleic Acid Ther. 27 (2017) 309–322, doi:10.1089/nat.2017.0691
[11] M. Enmark, J. Bagge, J. Samuelsson, L. Thunberg, et al., Analytical and preparative separation of phosphorothioated oligonucleotides: columns and ion-pair reagents, Anal. Bioanal. Chem. 412 (2020) 299–309, doi:10.1007/s00216-019-02236-
[12] S.G. Roussis, M. Pearce, C. Rentel, Small alkyl amines as ion-pair reagents for the separation of positional isomers of impurities in phosphate diester oligonucleotides, J. Chromatogr. A 1594 (2019) 105–111, doi:10.1016/j.chroma.2019.02.026
[13] S.T. Crooke, J.L. Witztum, C.F. Bennett, B.F. Baker, RNA-Targeted Therapeutics, Cell Metab. 27 (2018) 714–739, doi:10.1016/j.cmet.2018.03.004
[14] M. Catani, C.D. Luca, J.M.G. Alcântara, N. Manfredini, et al., Oligonucleotides: current trends and innovative applications in the synthesis, characterization, and purification, Biotechnol. J. (2022) 1900226, doi:10.1002/biot.201900226
[15] A. Goyon, P. Yehl, K. Zhang, Characterization of therapeutic oligonucleotides by liquid chromatography, J. Pharm. Biomed. Anal. 182 (2020) 113105, doi:10.1016/j.jpba.2020.113105
[16] S. Studzińska, S. Bocian, L. Siecińska, B. Buszewski, Application of phenyl-based stationary phases for the study of retention and separation of oligonucleotides, J. Chromatogr. B 1060 (2017) 36–43, doi:10.1016/j.jchromb.2017.05.033
[17] M. Enmark, M. Rova, J. Samuelsson, E. Örnskov, et al., Investigation of factors influencing the separation of diastereomers of phosphorothioated oligonucleotides, Anal. Bioanal. Chem. 411 (2019) 3383–3394, doi:10.1007/s00216-019-01813-
[18] M. Enmark, S. Harun, J. Samuelsson, E. Örnskov, et al., Selectivity limits of and opportunities for ion pair chromatographic separation of oligonucleotides, J. Chromatogr. A 1651 (2021) 462269, doi:10.1016/j.chroma.2021.462269
[19] A. Demelenne, M.-J. Gou, G. Nys, C. Parulski, et al., Evaluation of hydrophilic interaction liquid chromatography, capillary zone electrophoresis and drift tube ion-mobility quadrupole time of flight mass spectrometry for the characterization of phosphodiester and phosphorothioate oligonucleotides, J. Chromatogr. A 1614 (2020) 460716, doi:10.1016/j.chroma.2019.460716
[20] M. Gilar, K.J. Fountain, Y. Budman, U.D. Neue, et al., Ion-pair reversed-phase high-performance liquid chromatography analysis of oligonucleotides: Retention prediction, J. Chromatogr. A 958 (2002) 167–182, doi:10.1016/S0021-9673(02)00306-0
[21] S. Studzińska, B. Buszewski, Different approaches to quantitative structure–retention relationships in the prediction of oligonucleotide retention, J. Sep. Sci. 38 (2015) 2076–2084, doi:10.1002/jssc.201401395
[22] M. Sturm, S. Quinten, C.G. Huber, O. Kohlbacher, A statistical learning approach to the modeling of chromatographic retention of oligonucleotides incorporating sequence and secondary structure data, Nucleic Acids Res. 35 (2007) 4195–4202, doi:10.1093/nar/gkm338
[23] C. Liang, J.-Q. Qiao, H.-Z. Lian, A novel strategy for retention prediction of nucleic acids with their sequence information in ion-pair reversed phase liquid chromatography, Talanta 185 (2018) 592–601, doi:10.1016/j.talanta.2018.04.030
[24] O. Kohlbacher, S. Quinten, M. Sturm, B.M. Mayr, et al., Structure–Activity Relationships in Chromatography: Retention Prediction of Oligonucleotides with Support Vector Regression, Angew. Chem. Int. Ed. 45 (2006) 7009–7012, doi:10.1002/anie.200602561
[25] L. Moruz, L. Käll, Peptide retention time prediction, Mass Spec. Rev. 36 (2017) 615–623, doi:10.1002/mas.21488
[26] M. Gilar, A. Jaworski, P. Olivova, J.C. Gebler, Peptide retention prediction applied to proteomic data analysis, Rapid Commun. Mass Spectrom. 21 (2007) 2813–2821, doi:10.1002/rcm.3150
[27] O.V. Krokhin, R. Craig, V. Spicer, W. Ens, et al., An improved model for prediction of retention times of tryptic peptides in ion pair reversed-phase HPLC: its application to protein peptide mapping by off-line HPLC-MALDI MS, Mol. Cell. Proteomics (2004) 908–919, doi:10.1074/mcp.M400031-MCP200
[28] K. Petritis, L.J. Kangas, P.L. Ferguson, G.A. Anderson, et al., Use of Artificial Neural Networks for the Accurate Prediction of Peptide Liquid Chromatography Elution Times in Proteome Analyses, Anal. Chem. 75 (2003) 1039–1048, doi:10.1021/ac0205154
[29] A.A. Klammer, X. Yi, M.J. MacCoss, W.S. Noble, Improving Tandem Mass Spectrum Identification Using Peptide Retention Time Prediction across Diverse Chromatography Conditions, Anal. Chem. 79 (2007) 6111–6118, doi:10.1021/ac070262k
[30] J. Samuelsson, F.F. Eiriksson, D. Åsberg, M. Thorsteinsdóttir, et al., Determining gradient conditions for peptide purification in RPLC with machine-learning-based retention time predictions, J. Chromatogr. A 1598 (2019) 92–100, doi:10.1016/j.chroma.2019.03.043
[31] E. Stellwagen, J.M. Muse, N.C. Stellwagen, Monovalent Cation Size and DNA Conformational Stability, Biochemistry 50 (2011) 3084–3094, doi:10.1021/bi1015524
[32] J.R. Nilsson, T. Baladi, A. Gallud, D. Baždarević, et al., Fluorescent base analogues in gapmers enable stealth labeling of antisense oligonucleotide therapeutics, Sci. Rep. 11 (2021) 11365, doi:10.1038/s41598-021-90629-
[33] S.G. Roussis, C. Koch, D. Capaldi, C. Rentel, Rapid oligonucleotide drug impurity determination by direct spectral comparison of ion-pair reversed-phase high-performance liquid chromatography electrospray ionization mass spectrometry data, Rapid Commun. Mass Spectrom. 32 (2018) 1099–1106, doi:10.1002/rcm.8125
[34] J. Timmons, leshane, Lattice-Automation/seqfold 0.7.7, Zenodo (2021), doi:10.5281/zenodo.4579886
[35] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, et al., Scikit-learn: machine learning in python, J. Mach. Learn. Res. 12 (2011) 2825–2830
[36] M. Newville, T. Stensitzki, D.B. Allen, A. Ingargiola, LMFIT: non-linear least-square minimization and curve-fitting for python, Zenodo (2014), doi:10.5281/zenodo.11813
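As an illustration of the resolution calculation described in Section 4.3, the following is a minimal sketch and not the authors' implementation. It assumes the common base-width definition Rs = 2·|tR,1 − tR,2| / (w1 + w2), which this section does not state explicitly, and it assumes that the n – 1 impurity lacks the last nucleobase of the sequence string as written. The names `LinearWidthModel`, `predicted_resolution`, and `toy_rt`, as well as all coefficients, are hypothetical placeholders; in practice the retention time would come from the trained SVR model and the width coefficients from the fit summarized in the Supplementary material.

```python
from dataclasses import dataclass
from typing import Callable

CRITICAL_RS = 1.5  # critical resolution used throughout the text


def gc_content(sequence: str) -> float:
    """Fraction (0 to 1) of G and C nucleobases in a sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    seq = sequence.upper()
    return sum(base in "GC" for base in seq) / len(seq)


@dataclass(frozen=True)
class LinearWidthModel:
    """Peak width as a linear combination of GC-content, retention time, and a constant.

    The coefficients are hypothetical placeholders, not fitted values from the paper.
    """
    k_gc: float
    k_tr: float
    k_0: float

    def width(self, sequence: str, t_r: float) -> float:
        return self.k_gc * gc_content(sequence) + self.k_tr * t_r + self.k_0


def resolution(t_r_a: float, w_a: float, t_r_b: float, w_b: float) -> float:
    """Assumed definition: Rs = 2 * |tR,a - tR,b| / (w_a + w_b), base widths."""
    return 2.0 * abs(t_r_a - t_r_b) / (w_a + w_b)


def predicted_resolution(
    flp: str,
    predict_rt: Callable[[str], float],
    width_model: LinearWidthModel,
) -> float:
    """Resolution between a full-length product (FLP) and its n - 1 impurity.

    Assumption: the n - 1 impurity is the FLP without its last character.
    """
    if len(flp) < 2:
        raise ValueError("FLP must contain at least two nucleobases")
    shortmer = flp[:-1]
    t_flp, t_imp = predict_rt(flp), predict_rt(shortmer)
    return resolution(
        t_flp, width_model.width(flp, t_flp),
        t_imp, width_model.width(shortmer, t_imp),
    )


def toy_rt(sequence: str) -> float:
    """Hypothetical stand-in for the SVR model: 0.5 time units per nucleobase."""
    return 0.5 * len(sequence)


if __name__ == "__main__":
    model = LinearWidthModel(k_gc=0.2, k_tr=0.02, k_0=0.1)
    rs = predicted_resolution("CGCGTGGTCCTGGTCC", toy_rt, model)
    print(f"Rs = {rs:.2f}, reaches critical resolution: {rs >= CRITICAL_RS}")
```

Passing the retention-time predictor in as a callable keeps the sketch independent of any particular regression library; a fitted model's prediction function could be wrapped and supplied in place of `toy_rt`.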