HO CHI MINH CITY DEPARTMENT OF SCIENCE AND TECHNOLOGY
INSTITUTE FOR COMPUTATIONAL SCIENCE AND TECHNOLOGY

FINAL REPORT

DEVELOPMENT OF A DETAILED KINETIC MODEL FOR SILICON THIN-FILM FORMATION

Director of the Institute: Nguyễn Kỳ Phùng
Implementing unit: Molecular Science and Nano-Materials Laboratory
Principal investigator: Assoc. Prof. Dr. Huỳnh Kim Lâm

HO CHI MINH CITY, DECEMBER 2017

TABLE OF CONTENTS

INTRODUCTION
ACKNOWLEDGEMENTS
RESEARCH TEAM
RESEARCH RESULTS
I SCIENTIFIC REPORT
   Overview
   Computational methods
   Results and discussion
   Conclusions
II PUBLISHED SCIENTIFIC PAPERS
III EDUCATION AND TRAINING PROGRAM
IV CONFERENCES AND WORKSHOPS
V DATA FILES
REFERENCES
APPENDICES
   Appendix 1: A Computational Study on the Adsorption Configurations and Reactions of SiHx(x=1-4) on Clean and H-covered Si(100) Surfaces
   Appendix 2: Ab Initio Dynamics of Unimolecular Decomposition of β-propiolactone and β-propiolactam
   Appendix 3: Identification of hindered internal rotational mode for complex chemical species: A data mining approach with multivariate logistic regression model
   Appendix 4: Version of the Surfkin program

INTRODUCTION

The possible adsorption configurations of H and SiHx (x = 1-4) on clean and H-covered Si(100) surfaces were determined using spin-polarized DFT calculations. On the clean surface, the results show that gas-phase hydrogen atoms and SiH3 radicals adsorb effectively on the top sites, whereas SiH and SiH2 prefer the bridge sites of the first layer. Another possibility for SiH is to reside on a hollow site in a triply bonded configuration. For a partially H-covered Si(100) surface, the mechanism is similar, but the adsorption energies are higher in most cases. This suggests that the surface-bound species become more stable in the presence of surface hydrogen. The minimum energy paths for the adsorption/migration and reactions of H/SiHx species on the surfaces were explored using the climbing image-nudged elastic band method. The competing surface processes for Si thin-film formation from SiHx precursors were also predicted. The study shows that migration of hydrogen adatoms is unimportant for leaving open (vacant) surface sites because of its high barriers (> 29.0 kcal/mol). Instead, abstraction of hydrogen adatoms by H/SiHx radicals is more favorable. Moreover, the removal of hydrogen atoms from adsorbed SiHx, an essential step in forming Si layers, is dominated by abstraction rather than by decomposition.

ACKNOWLEDGEMENTS

This research was supported by the scientific research fund of the Institute for Computational Science and Technology at Ho Chi Minh City and the Ho Chi Minh City Department of Science and Technology (No. 172/2015/HĐ-SKHCN). The authors thank the National Center for High-performance Computing of Taiwan, Hsinchu Science Park, Hsinchu City, Taiwan, for computing resources. The authors also thank the high-performance computing systems of ICST and of the International University, Vietnam National University, Ho Chi Minh City, for computing time.

RESEARCH TEAM

Laboratory: Molecular Science and Nano-Materials
Principal investigator: Assoc. Prof. Dr. Huỳnh Kim Lâm
Team members: MSc Lê Nguyễn Minh Thông, MSc Lê Thanh Xuân, MSc Nguyễn Thanh Hiếu, BSc Huỳnh Chơn Thiện

RESEARCH RESULTS

I SCIENTIFIC REPORT

Overview

Silicon plays an important role in the fabrication of integrated circuits. Advances from conventional silicon to silicon thin films have made a wide range of future microelectronic applications and photovoltaic devices feasible. Plasma-enhanced chemical vapor deposition (PECVD) is a widely used technique for growing amorphous [5-8] and crystalline [2, 8-10] silicon thin films from silane precursors. This technique has been reported to increase the silicon growth rate significantly compared with thermal deposition (CVD) under identical conditions, although the processes that control the surface kinetics are similar. In plasma-enhanced deposition, reactive radicals are generated by collisions of energetic electrons with SiH4 precursor molecules. Coexisting in the gas phase alongside SiH4 (silane), SiH2 (silylene), SiH (silylidyne), H (hydrogen) and SiH3 (silyl) are known to be the most common radicals at steady state [5]. Fast, continuous growth is achieved through the adsorption and reaction of the generated radicals on the silicon substrate surface.

Although deposition techniques have been developed and applied in large-scale production, the key processes of silicon thin-film growth are still not fully understood at the molecular level, even though the reactions of SiHx on silicon surfaces have been studied extensively both experimentally [8, 10-12] and computationally [13-21]. Based on experimental studies, Gates and co-workers [11, 12] proposed a silicon growth mechanism that became the foundation for most subsequent work. In that work, a steady-state surface kinetic model was developed to determine the silicon growth rate. The model describes the kinetic mechanism of the CVD process well. The Gates mechanism contributes to the understanding not only of CVD but also of PECVD, because the two share similar surface kinetic events; PECVD is more complex than CVD in that plasma chemistry generates the activated species that take part in the surface reactions. A drawback of the model, however, is its assumption that the overall growth rate equals the rate of dissociative adsorption of SiH4 [12]. Srinivasan et al. [10] showed that hydrogen abstraction and hydrogen etching via the Eley-Rideal (ER) mechanism [22] are important channels for forming crystalline silicon at temperatures from 25 to 300 °C. In another experiment, Srinivasan observed a hydrogen-induced transition from amorphous to crystalline silicon at a temperature as low as 250 °C. Clearly, these experiments focused on hydrogen abstraction as a way to create active adsorption sites on the surface, which is essential in plasma-enhanced deposition.

Computational studies, on the other hand, have provided deeper insight, ranging from classical to quantum (ab initio) calculations carried out on both cluster and slab models. These calculations cover a wide range of surface processes under realistic deposition conditions, including abstraction of surface hydrogen, adsorption of gas-phase radicals, diffusion of radicals on the surface, decomposition of adsorbed species, and abstraction of hydrogen from adsorbed radicals. Ramalingam et al. [13] showed that the energetics of hydrogen abstraction by SiH3 do not differ between crystalline Si(100)-(2x1) and amorphous surfaces. However, radicals are more mobile on the amorphous surface than on the crystalline one. In another study, Ramalingam and co-workers [14] described a "valley-filling mechanism" for surface roughness, in which the mobile SiH3 precursor diffuses and reacts with dangling bonds in the valleys. Cereda et al. [15] estimated a 60% probability for barrierless hydrogen abstraction by the silyl radical via the ER mechanism. Bakos et al. [16] proposed three pathways for hydrogen abstraction by SiH3, namely the ER, Langmuir-Hinshelwood (LH) [22] and precursor-mediated (PM) [22] mechanisms; ER is barrierless, whereas the barriers for LH and PM are 17.5-18.0 and 9.0 kcal/mol, respectively. Calculations by Kang et al. [17] gave a barrier of 1.0 kcal/mol for surface hydrogen abstraction. As this summary shows, the SiH3 radical can diffuse on the surface and then abstract a surface hydrogen via the Eley-Rideal mechanism without a significant barrier.

Once dangling bonds are available, the next step is the adsorption and reaction of radicals on the surface. The dissociative adsorption of silane was calculated by Kang et al. [17] to have a barrier of 7.4 kcal/mol. The barrier found by Brown and co-workers [18, 19] is 12-14 kcal/mol. Smardon et al. [20] confirmed that silane adsorbs dissociatively on Si(100)-(2x2); the fragments sit either on the same dimer or on neighboring dimers, with an energy difference of 4.2 kcal/mol. Kang et al. [17] determined the barriers for hydrogen removal from adsorbed SiH3, with conversion to a bridging SiH2 structure, to be 5.7 and 32.9 kcal/mol for the cases with and without gas-phase H, respectively. The complete decomposition of the SiH3 radical into Si and H was addressed in the calculations of Ceriotti et al. [21], which agree with the experimental data of Gates and co-workers [11].

Overall, deposition combines many competing surface processes, including diffusion, abstraction and decomposition. All of these surface processes must be considered to construct a complete thin-film growth mechanism, which is known to be still unclear. Moreover, the kinetics of most of these processes have not been adequately described by either experiment or computation. The aim of this study is to use density functional theory (DFT) to investigate the adsorption and decomposition mechanisms of SiHx (x = 1-4) species on the Si(100) surface that lead to thin-film growth. The scope of the study includes: (1) hydrogen migration on the clean surface, (2) hydrogen abstraction from the surface by hydrogen atoms and silicon-containing radicals, (3) decomposition of silicon hydride species on the surface, and (4) hydrogen abstraction from surface silicon hydrides. Surface reactions involving both the ER and LH mechanisms contribute to thin-film growth by creating dangling bonds on the surface and dehydrogenating adsorbed species. A microscopic understanding of the growth mechanism is needed to build kinetic models and develop realistic simulation models, which in turn help predict the formation and evolution of surface species under realistic conditions. Once validated, the mechanism opens up significant opportunities to optimize the efficiency of silicon thin-film production on an industrial scale.

Computational methods

All calculations were performed with the Vienna Ab initio Simulation Package (VASP) [23-26], based on periodic density functional theory (DFT). The frozen ionic cores were described by the projector augmented wave (PAW) method [27], and the Kohn-Sham valence states were expanded in a plane-wave basis up to 380 eV. Exchange-correlation energy was described by the generalized gradient approximation with the Perdew-Burke-Ernzerhof (PBE) functional [28-30]. The p(2 × 2) unit cell of the Si(100) surface was modeled as a periodically repeated slab with six atomic layers and four Si atoms per layer. For surface calculations, the slabs were separated by a vacuum gap of more than 17 Å in the direction perpendicular to the surface, ensuring that they do not interact. The top three layers were allowed to relax in all geometry optimizations, while the bottom three layers were held fixed at bulk positions with the experimental lattice constant (5.43 Å) [31]. The surface Brillouin zone was sampled with the Monkhorst-Pack scheme [32] using a (6 × × 1) k-point grid, converged with respect to the electronic energy. Ionic relaxation was stopped when the forces on all free atoms were below 0.02 eV/Å. The Gaussian smearing method [26, 33] with a smearing parameter

Figure 3: Calculated rate coefficients for the unimolecular decomposition of β-propiolactone (a) and β-propiolactam (b-c) as a function of pressure at different temperatures (i.e., 500, 800, 1000, 1500 and 2000 K). Only the important reaction pathways are shown here.

[Figure 4 plot: log k1 (1/s) versus 1000/T (1/K) for β-propiolactone → C2H4 + CO2 (P1). Series: Expt'l (James 1969), Expt'l (Frey 1985), Expt'l (Santioste 1987), This work (P = 10 torr), This work (P = 100 torr), This work (P = 200 torr).]

Figure 4: Comparison between calculated and experimental rate coefficients as a function of temperature at different pressures for β-propiolactone → C2H4 + CO2 (Rxn 1). Experimental data are from the work of James and coworkers [12] ("James 1969", P = 3.6 – 15.5 torr), Frey and coworkers [13] ("Frey 1985", P = 0.1 – 6.0 torr) and Santioste Bermejo [14] ("Santiuste 1987", P = 30.0 – 272.3 torr).

The branching ratio of the two decomposition channels of β-propiolactone (P1:P2) remains essentially unchanged with temperature (e.g., ~ 99.9:0.1 and 99.8:0.2 at T = 500 K and T = 2000 K, respectively, for P = 0.001 torr, cf. Figure S6), while that of β-propiolactam (P3:P4) decreases noticeably with temperature (e.g., ~ 99.3:0.7 and 96.7:3.3 at T = 500 K and T = 2000 K, respectively, for P = 0.001 torr). Note that the high-pressure branching ratios are also plotted in Figure S7. Pressure does not affect the P1:P2 ratio, but it does affect the P3:P4 ratio (e.g., the P3:P4 ratio is
99.3:0.7 and 97.8:2.2 at P = 0.001 and 100 torr, respectively, for T = 500 K, cf. Figure S6). These observations on the effects of temperature and pressure on the mechanisms are consistent with the detailed PES figure, in which the barrier-height difference between the two channels differs noticeably between the two systems, as discussed previously.

Our calculated results, including the branching ratios and the time-resolved species profiles, confirm the previous experimental observation [11] that only C2H4 + CO2 (P1) was detected from the decomposition of β-propiolactone, whereas C2H4 + HNCO (P3) and CH2CO + CH2NH (P4) were detected from that of β-propiolactam.

Conclusions

In this study, the unimolecular decomposition reactions of β-propiolactone and β-propiolactam were investigated using the highly accurate composite W1U method and modern deterministic/stochastic RRKM/ME statistical rate models. The results show that β-propiolactone can only form C2H4 + CO2 (P1) (e.g., accounting for more than 99.9 % at T < 2000 K and P = 100 torr), while β-propiolactam decomposes to C2H4 + HNCO (P3) as the dominant channel (accounting for more than 97.8 % at T < 500 K and P = 100 torr) and to a minor product, CH2CO + CH2NH (P4) (accounting for less than 8.2 % at T < 2000 K and P = 100 torr). The temperature- and pressure-dependent rate constants calculated from the RRKM/ME solutions agree very well with the limited experimental data, which supports both the experimental observations and the available measurements. The calculated rates can therefore be used with confidence to describe the evolution of the two systems over a broad range of conditions (T = 500 – 2000 K and P = 0.001 – 760.0 torr).

Author Information

Corresponding Author: Lam K. Huynh
*Email: lamhuynh.us@gmail.com or hklam@hcmiu.edu.vn

Acknowledgements

The authors are highly thankful to the Institute for Computational Science and Technology (ICST) at Ho Chi Minh City and the International University,
Vietnam National University for the computer time. LKH also acknowledges the Department of Science and Technology – Ho Chi Minh City for funding (Grant No. 1161/QĐ-SKHCN). In addition, we would like to express our gratitude to Tuyn Phan and Tri Pham (ICST) for great technical support.

References

[1] G. Wuitschik, M. Rogers-Evans, K. Müller, H. Fischer, B. Wagner, F. Schuler, L. Polonchuk, E.M. Carreira, Angew. Chem., Int. Ed. 118 (2006) 7900.
[2] T. Bach, K. Kather, O. Krämer, J. Org. Chem. 63 (1998) 1910.
[3] Y. Dejaegher, N.M. Kuz'menok, A.M. Zvonok, N. De Kimpe, Chem. Rev. 102 (2002) 29.
[4] E.M. Wright, B.J. Warner, H.E. Foreman, L.R. McCunn, K.N. Urness, J. Phys. Chem. A 119 (2015) 7966.
[5] X.L. Zheng, H.Y. Sun, C.K. Law, J. Phys. Chem. A 109 (2005) 9044.
[6] I. Auzmendi-Murua, S. Charaya, J.W. Bozzelli, J. Phys. Chem. A 117 (2013) 378.
[7] J.W. Bozzelli, A.M. Dean, J. Phys. Chem. 97 (1993) 4427.
[8] C.A. Taatjes, J. Phys. Chem. A 110 (2006) 4299.
[9] T.H. McGee, A. Schleifer, J. Phys. Chem. 76 (1972) 963.
[10] P. Vansteenkiste, V. Van Speybroeck, G. Verniest, N. De Kimpe, M. Waroquier, J. Phys. Chem. A 111 (2007) 2797.
[11] C.C. Lim, Z.P. Xu, H.H. Huang, C.Y. Mok, W.S. Chin, Chem. Phys. Lett. 325 (2000) 433.
[12] T.L. James, C.A. Wellington, J. Am. Chem. Soc. 91 (1969) 7743.
[13] H.M. Frey, I.M. Pidgeon, J. Chem. Soc., Faraday Trans. 81 (1985) 1087.
[14] Y.M. Santioste Bermejo, Afinidad 44 (1987) 424–427.
[15] A. Moyano, M.A. Pericas, E. Valenti, J. Org. Chem. 54 (1989) 573.
[16] I. Morao, B. Lecea, A. Arrieta, F.P. Cossío, J. Am. Chem. Soc. 119 (1997) 816.
[17] V.S. Safont, J. Andrés, L.R. Domingo, Chem. Phys. Lett. 288 (1998) 261.
[18] X.-F. Ren, M.I. Konaklieva, H. Shi, S. Dickey, D.V. Lim, J. Gonzalez, E. Turos, J. Org. Chem. 63 (1998) 8898.
[19] K. Mogilaiah, R.B. Rao, K.N. Reddy, Ind. J. Chem. B 38 (1999) 818.
[20] P.J. Stephens, F.J. Devlin, C.F. Chabalowski, M.J. Frisch, J. Phys. Chem. 98 (1994) 11623.
[21] J. Poater, M. Solà, M. Duran, J. Robles, Phys. Chem. Chem. Phys. (2002) 722.
[22] T.H. Dunning, J. Chem. Phys. 90 (1989) 1007.
[23] K. Andersson, P.-Å. Malmqvist, B.O. Roos, J. Chem. Phys. 96 (1992) 1218.
[24] P. Celani, H.-J. Werner, J. Chem. Phys. 112 (2000) 5546.
[25] H.-J. Werner, Mol. Phys. 89 (2010) 645.
[26] M. Szori, I.G. Csizmadia, C. Fittschen, B. Viskolcz, J. Phys. Chem. A 113 (2009) 9981.
[27] C. Gonzalez, H.B. Schlegel, J. Chem. Phys. 90 (1989) 2154.
[28] C. Gonzalez, H.B. Schlegel, J. Phys. Chem. 94 (1990) 5523.
[29] J.A. Montgomery Jr., M.J. Frisch, J.W. Ochterski, G.A. Petersson, J. Chem. Phys. 110 (1999) 2822.
[30] J.W. Ochterski, G.A. Petersson, J.A. Montgomery, J. Chem. Phys. 104 (1996) 2598.
[31] L.A. Curtiss, K. Raghavachari, P.C. Redfern, V. Rassolov, J.A. Pople, J. Chem. Phys. 109 (1998) 7764.
[32] L.A. Curtiss, P.C. Redfern, K. Raghavachari, J. Chem. Phys. 126 (2007) 084108.
[33] M.J. Frisch, G.W. Trucks, H.B. Schlegel, G.E. Scuseria, M.A. Robb, J.R. Cheeseman, G. Scalmani, V. Barone, B. Mennucci, G.A. Petersson, et al., Gaussian 09, Revision A.1, Gaussian, Inc., Wallingford CT, 2009.
[34] M.V. Duong, H.T. Nguyen, N. Truong, T.N.M. Le, L.K. Huynh, Int. J. Chem. Kinet. 47 (2015) 564.
[35] T.H.M. Le, S.T. Do, L.K. Huynh, Comput. Theor. Chem. 1100 (2017) 61.
[36] C. Eckart, Phys. Rev. 35 (1930) 1303.
[37] T. Tan, X. Yang, Y. Ju, E.A. Carter, J. Phys. Chem. B 120 (2016) 1590.
[38] B.E. Poling, J.M. Prausnitz, J.P. O'Connell, The Properties of Gases and Liquids, 5th ed., McGraw-Hill Education, New York, 2004.
[39] Z. Tian, T. Yuan, R. Fournet, P.A. Glaude, B. Sirjean, F. Battin-Leclerc, K. Zhang, F. Qi, Combust. Flame 158 (2011) 756.
[40] D.G. Truhlar, B.C. Garrett, Annu. Rev. Phys. Chem. 35 (1984) 159.
[41] J.M.L. Martin, G. de Oliveira, J. Chem. Phys. 111 (1999) 1843.
[42] B. Ruscic, R.E. Pinzon, G. von Laszewski, D. Kodeboyina, A. Burcat, D. Leahy, D. Montoya, A.F. Wagner, J. Phys.: Conf. Ser. 16 (2005) 561.
[43] B. Ruscic, R.E. Pinzon, M.L. Morton, N.K. Srinivasan, M.C. Su, J.W. Sutherland, J.V. Michael, J. Phys. Chem. A 110 (2006) 6592.
[44] Y.G. Dmitriev, K.Z. Kotovich, V.V. Kochubei, L.O. Mineravina, Vestn. L'vov Politekhn. Inst. 221 (1988) 34.
[45] M. Frenkel, K.N. Marsh, R.C. Wilhoit, G.J. Kabo, G.N. Roganov, Thermodynamics of Organic Compounds in the Gas State, Thermodynamics Research Center, College Station, TX, 1994.
[46] M.W. Chase Jr., J. Phys. Chem. Ref. Data, Monograph (1998).
[47] R.L. Nuttall, A.H. Laufer, M.V. Kilday, J. Chem. Thermodyn. (1971) 167.
[48] R.A.L. Peerboom, S. Ingemann, N.M.M. Nibbering, J.F. Liebman, J. Chem. Soc., Perkin Trans. (1990) 1825.
[49] D.T. Gillespie, J. Comput. Phys. 22 (1976) 403.
[50] T.V. Mai, M.V. Duong, H.T. Nguyen, K.C. Lin, L.K. Huynh, J. Phys. Chem. A 121 (2017) 3028.

Highlights

• Thermal decomposition mechanisms of β-propiolactone and β-propiolactam were studied with the accurate composite W1U method.
• Time-resolved temperature- and pressure-dependent behaviors of the title reactions were characterized using the integrated deterministic and stochastic model within the framework of master equation/Rice–Ramsperger–Kassel–Marcus (ME/RRKM).
• Calculated numbers are in excellent agreement with scattered experimental data.
• A detailed kinetic sub-mechanism, consisting of thermodynamic and kinetic data in Chemkin format, was provided for the range of 500–2000 K and 0.001–760 torr.

Chemometrics and Intelligent Laboratory Systems 172 (2018) 10–16
Journal homepage: www.elsevier.com/locate/chemometrics

Identification of hindered internal rotational mode for complex chemical species: A data mining approach with multivariate logistic regression model

Triet H.M. Le a, Tung T. Tran a, Lam K. Huynh b,*
a School of Computer Science and Engineering, International University, VNU-HCM, Quarter 6, Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam
b School of Biotechnology, International University, VNU-HCM, Quarter 6, Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam

Keywords: Data mining; Machine learning; Multivariate data analysis; Classification; Thermodynamics; Hindered internal rotation

Abstract

Thermodynamic
properties are essential to understand and describe many chemical/biological processes in the real environment. To obtain correct thermodynamic data of chemical species over a wide range of temperatures, a rigorous Hindered Internal Rotation (HIR) treatment must be considered. Such a treatment requires detailed information about the internal rotation (i.e., rotational axis, group, frequency, symmetry and hindrance potential). However, preparing the input parameters for such a treatment is tedious and error-prone for chemists. Among the HIR parameters, the rotational frequency (or mode) is the most difficult to determine because of the complex molecular structure and the mixing of vibrational modes of chemical species. Recently, a rule-based framework has been proposed to help chemists with this tedious process (Le et al., Comput. Theor. Chem., 2017, 61). This approach has been demonstrated to work well for simple species; however, it could not handle more complex cases. Therefore, in this study, a data mining approach is proposed to overcome the limitations of the previous algorithm. Within this framework, the HIR pattern was found using features extracted from existing data provided by chemists. More specifically, multivariate logistic regression was implemented to analyze the chemical data, both to better predict the rotational frequency (mode) of chemical species and to highlight the effect of each attribute of the rotation. The experimental results were shown to improve on the previous study in terms of both accuracy and completeness. They also give meaningful insights into the HIR itself. The proposed approach will be integrated into MSMC-GUI (https://sites.google.com/site/msmccode/manual/gui-1) to provide chemists with an interactive and robust tool for preparing the data for their thermodynamic calculations on-the-fly.

1. Introduction

Computational chemistry focuses on the development of novel computational methods and
tools to resolve long-standing issues in chemistry. Thanks to the power of contemporary computers, this field has been used to address the challenge of determining thermodynamic properties of chemical systems at extreme conditions that cannot be reproduced in the laboratory. With the theories of statistical mechanics and quantum chemistry, the thermodynamic properties of a chemical species can be calculated in terms of the partition function [1]. According to the Born-Oppenheimer approximation [2] in molecular spectroscopy, the statistical properties of chemical systems can be estimated using the (1) electronic, (2) vibrational, (3) rotational, and (4) translational partition functions and their derivatives. Among them, the vibrational partition function is usually estimated with the harmonic oscillator approximation. However, at very high temperatures (i.e., when the thermal energy is much higher than the hindrance barrier of the system of interest), the free rotor is a more reasonable approximation [3,4]. Therefore, there have been several efforts to make a smooth transition from the harmonic oscillator to the free rotor domain [5–8], among which the approach of Ayala et al. was integrated into the Gaussian program [9]. Over the last decade, the determination of the partition function for internal rotors has continued to interest chemists, with much research on both analytical and approximate solutions [10–12]. The most rigorous 1-D HIR treatment was implemented in the Multi-Species Multi-Channel (MSMC) code [13] and was successfully applied to determine the thermal rate constants of the CH4 + O2 → CH3 + HO2 reaction [14]. The formulation of this treatment will be presented in Section 2. However, the numerical input parameters for this 1-D treatment of HIR are numerous and tedious for users, and even experts, to handle manually.

* Corresponding author. E-mail addresses: lamhuynh.us@gmail.com, hklam@hcmiu.edu.vn (L.K. Huynh). https://doi.org/10.1016/j.chemolab.2017.11.006 Received 27 September
2017; Received in revised form November 2017; Accepted November 2017; Available online November 2017. 0169-7439/© 2017 Elsevier B.V. All rights reserved.

Therefore, an automatic framework was proposed by Le et al. [15] to generate these parameters. The program has been demonstrated to work well for many different chemical systems, especially simple ones. However, the previous rule-based algorithm still had difficulty predicting the rotational frequency (mode) of complex cases, since it was rigidly based on pre-defined rules deduced from the prior assumptions and experience of domain experts. Such fixed conditions could differ from the ground truth in general cases. Additionally, the approach proposed in the previous study did not show the contribution of each component of the rotation. Such knowledge would be of great value for chemists seeking a deeper understanding of the intrinsic nature of HIR. Furthermore, the optimal frequencies are selected from a list of all possible vibrational modes of a molecule, which can be cast as a binary classification problem. Therefore, in this study, we propose a data mining approach to improve the determination of HIR for a wider range of cases. Such an idea is possible thanks to adaptive learning of patterns from regularly updated data, which makes this approach more flexible than the rule-based method. In fact, machine learning algorithms have been applied widely in the field of chemoinformatics to determine physicochemical properties such as aqueous solubility, logP, melting point and so forth [16]. Such properties can hardly be solved analytically using theoretical chemical methods (e.g., density functional theory or molecular dynamics).

The remainder of this paper is organized as follows. Section 2 introduces the Hindered Internal Rotation treatment along with its prerequisite parameters. Section 3 presents the proposed data mining
approach using logistic regression modeling combined with feature selection for the determination of the rotational frequency (mode). The evaluation results of the optimal model are discussed in Section 4. Section 5 draws conclusions from these findings and suggests ongoing and future work.

2. Background of Hindered Internal Rotation treatment

Within the HIR scheme, some low-frequency vibrational modes are better treated as internal rotations around single bonds. These modes are replaced in the thermodynamic calculations (e.g., via partition functions) by an explicit evaluation of the internal rotations. The main idea of the HIR treatment proposed by Mai et al. [14] and implemented in the Multi-Species Multi-Channel code [13] is to solve the 1-D Schrödinger equation for a HIR to obtain its energy levels. The 1-D Schrödinger equation for this HIR treatment is given in Eq. (1),

$$-\frac{\hbar^2}{2I_{red}}\frac{d^2\Psi_{hir}}{d\theta^2} + V(\theta)\Psi_{hir} = E\Psi_{hir} \qquad (1)$$

where $\Psi_{hir}$, $E$ and $I_{red}$ are the wave function, energy and reduced moment of inertia of the internal rotor of interest, respectively. Note that $I_{red}$ is defined as $I^{(2,3)}$ in [17], derived from the original work of Kilpatrick and Pitzer [18]. The hindrance potential, $V(\theta)$, is computed directly as a function of the dihedral angle (i.e., torsional angle), $\theta$. More specifically, $V(\theta)$ is obtained using sequential relaxed potential energy surface scans with a 10-degree step size for the torsional angles of the considered rotation. The above HIR equation was solved by transforming it into a Mathieu-type equation, expressing the hindrance potential in terms of a Fourier series,

$$V(\theta) = \sum_{l=-L}^{L} c_l e^{il\theta} \qquad (2)$$

where $L$ is a cut-off number that depends on the nature of the potential energy. Simultaneously, the wave function can be expanded using a harmonic series as follows,

$$|m\rangle = \frac{1}{\sqrt{2\pi}} e^{im\theta} \qquad (3)$$

and then plugged into the HIR equation in Eq. (1). The matrix elements of the Hamiltonian are found to be

$$H_{mn} \equiv \langle m|H|n\rangle = \frac{1}{2\pi}\int_0^{2\pi} e^{-im\theta}\left(-\frac{\hbar^2}{2I_{red}}\frac{\partial^2}{\partial\theta^2} + \sum_{l=-L}^{L} c_l e^{il\theta}\right) e^{in\theta}\, d\theta = \frac{\hbar^2 m^2}{2I_{red}}\delta_{mn} + c_{m-n} \qquad (4)$$

This matrix can then be diagonalized to obtain its eigenvalue spectrum, which represents the energy levels of the internal rotor of interest. This information is used to calculate the partition function and determine the contributions to the thermodynamic functions, which in turn helps to obtain correct thermodynamic data. Within this rigorous 1-D HIR treatment, five parameters are required: (1) the vibrational modes to be treated separately as hindered internal rotors, (2) the rotational axis, (3) the rotational group, (4) the rotational symmetry and (5) the hindrance potential over the period of 2π (360°). The rotational group is defined as a collection of one or more atoms (e.g., the two groups of atoms C1, H2, H3, H4 and C5, H6, H7, H8 in Fig. 1). The rotational axis is the bond between the two atoms that connect the rotational groups (e.g., atoms C1 and C5 in Fig. 1).

Fig. 1. Rotational group (within the dashed oval) and axis (long-dashed line) in the case of C2H6 with the frequency = 308 cm⁻¹.

3. Chemometric approach

In our proposed approach, the determination of the Hindered Internal Rotation mode starts with the extraction and processing of the raw electronic structure input. It then performs machine learning modeling using multivariate logistic regression and refines the recognized pattern. Finally, the performance of the optimal model is evaluated and compared with the previous rule-based method [15]. More specifically, the workflow of this study consists of five main steps (cf. Fig. 2). Firstly, the raw features are extracted from the electronic structure calculation of the chemical species supplied by the user. Secondly, the raw data is analyzed using the preprocessing module implemented in Java. Thirdly, logistic regression is applied to derive the pattern for the determination of the rotational frequency (mode). In the fourth step, the 10-fold cross-validation technique, along with a customized best-first search and backward elimination, is applied to select the classification algorithms. Finally, the performance metrics are used to evaluate the algorithms on the validation and testing sets. In the future, when data for new chemical species are added to the dataset, the data mining approach proposed in this study will be repeated to update the pattern and re-capture the generality of the new data. Note that new data must be carefully verified by experts before being incorporated into the existing dataset to ensure the integrity of the prediction results.

Fig. 2. Overall workflow of the proposed data mining approach to detect the rotational frequency.

3.1 Electronic structure calculation input data and feature selection for the determination of hindered internal rotational mode

There were 127 chemical compounds used in this study [19], consisting of simple hydrocarbons, transition states and more complex species with cyclic bonds and multiple types of atoms (e.g., oxygen, nitrogen, etc.). Most of the data were validated and retrieved from the previous study [15]. The electronic structure calculation input data of these species were generated by the Gaussian program [20]. In this work, the low-frequency vibrational modes (i.e., frequencies below 700 cm⁻¹) of each species were extracted from the input data to produce 2165 records in total. The dataset contained two independent sets: (i) a training set and (ii) a testing set, with similar sizes of 1110 records (79 species) and 1055 records (48 species), respectively.

For each record in this dataset, fifteen features are proposed in the current study to help determine the rotational frequency of chemical species. The list of proposed features is shown in Table 1. Eight features are adapted from the previous study [15]. They are the average values and standard deviations of the changes of the dihedral angles (features 3, 4), the angles between every pair of atoms (features 6, 7), the directional angles (features 8, 9) and the bending angles (features 10, 11). The new attributes are features 1, 2, 5 and 12–15. A brief description of the new features is given hereafter. Firstly, the out-of-axis angle and the out-of-axis vibrational vector are calculated using Eqs. (5) and (6), respectively.

The direction of the rotation is determined by taking the algebraic product of STorque (feature 12) and OTorque (feature 13). The binary feature SameDir (feature 15) then takes the value zero when this product is negative or zero, meaning that the two rotational groups rotate in opposite directions relative to each other. SameDir is one when the product is positive, corresponding to rotation of the two rotational groups in the same direction. The values of the above fifteen features, along with the correct rotational frequency of each rotor of each species as verified by chemists, were stored in a semi-structured text file (e.g., a Comma-Separated Values file).
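As a hedged illustration of the SameDir rule and the CSV storage described above, the following Python sketch may help. It assumes that STorque and OTorque are available as signed scalars, so that the sign of their product carries the relative direction of rotation; the function names, column names and sample values are hypothetical and are not taken from the authors' implementation (whose preprocessing module is written in Java).

```python
import csv
import io


def same_dir(s_torque: float, o_torque: float) -> int:
    """Return 1 if both rotational groups turn the same way, else 0.

    Assumption: STorque and OTorque are signed scalars, so a positive
    product means same-direction rotation; a zero or negative product
    is mapped to 0, as stated in the text.
    """
    return 1 if s_torque * o_torque > 0 else 0


def write_records(records, handle):
    """Write (frequency, STorque, OTorque) tuples as CSV rows with SameDir."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Frequency", "STorque", "OTorque", "SameDir"])
    for freq, s_torque, o_torque in records:
        writer.writerow([freq, s_torque, o_torque, same_dir(s_torque, o_torque)])


# Hypothetical records: (frequency in cm^-1, STorque, OTorque).
records = [(308.0, 0.42, -0.40), (512.3, 0.10, 0.25), (655.1, 0.0, 0.30)]
buffer = io.StringIO()
write_records(records, buffer)
print(buffer.getvalue())
```

Only the sign logic is taken from the text; how the torques themselves are computed from the vibrational vectors is outside the scope of this sketch.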
Since the features in this problem were on different scales of measurement, standardization (i.e., a type of feature scaling) was utilized to help the classification algorithm converge faster as well as provide a solid baseline for result interpretation in later steps The formulas of feature standardization are defined in Eqs (9)–(11) Table List of selected features of this research and the corresponding abbreviations (italics in parentheses) Out of axis angle (AxisAngle) Magnitude of the out-of-axis vibrational vector (AxisMag) Average value of the changes of the dihedral angles in the rotational group (DAngle) Standard deviation of the changes of the dihedral angles in the rotational group (StDAngle) Magnitude of the net vibrational vector of the rotational group (VecMag) Average value of the changes of angle between every pair of atoms (BAngle) Standard deviation of the changes of the angles between every pair of atoms (StBAngle) Average value of the directional angles (DirAngle) Standard deviation of the directional angles (StDirAngle) 10 Average value of the bending angles (StrAngle) 11 Standard deviation of the bending angles (StStrAngle) 12 Magnitude of the total torque of the current considered rotational group (STorque) 13 Magnitude of the total torque of the opposite side with respect to the considered rotational group (Torque) 14 Magnitude of the difference between (12) and (13) (TTorque) 15 Boolean feature to check whether two sides are rotating in the same direction (SameDir) x0 ¼ xμ σ N P α ¼ cos 1 μ¼ v1 ⋅v2 kv1 kkv2 k vnetaxis ẳ v1 ỵ v2 (5) (6) ẳ where α, vnetaxis are the angle and the net vibrational vector between two axis atoms 1, (cf Fig 1), respectively, with vibrational vectors v1 , v2 and their magnitudes kv1 k, kv2 k The out-of-axis angle (AxisAngle – feature 1) shows the direction of the vibration between two axis atoms, while its magnitude is described by the magnitude of the out-of-axis vibrational vector (AxisMag – feature 2) The units of 
AxisAngle and AxisMag are degree ( ) and angstrom (Å), respectively The net vibrational vector of all atoms (VecMag – feature 5) in the selected rotational group (group with fewer atoms) is determined using Eq (7) vnetrot:group ¼ n X vi xi i¼1 (10) N vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi uN uP u xi ị2 tiẳ1 (11) N1 where x and x are the original and standardized values of the features, respectively; N is the number of data samples; μ is the mean value of the considered feature as in Eq (10); and σ is the sample standard deviation of the considered feature as in Eq (11) 3.2 Machine learning model building and selection In this study, multivariate logistic regression was utilized to classify the internal rotational modes from normal vibrational ones Since this is a binary classification problem, there are two outputs: (True) and (False), where is assigned to the mode(s) which should be treated as HIR, and for the others which can be reasonably approximated by the harmonic oscillation model This linear model has the function form of y ẳ wT x ỵ b (where x is the matrix of input features, y is classification output, w is the vector of feature weights, b is the bias or intercept), which is appropriate for this problem because it offers an insight into the nature of HIR thanks to the assigned weight of each rotational attribute The modeling step was carried out using the scikit-learn library [22] in Python This library is very efficient in handling semi-structured dataset (e.g., CSV file) and performing scientific calculations with large array manipulation Besides the goal to improve the performance of the classification, this study also aims to determine the most important subset of features contributing to the internal rotation Therefore, in the model selection step, backward elimination was used for variable selection based on the value of the negative log loss function of logistic regression as given in Eq (12), (7) i¼1 where 
vnetrot:group and n are the net vibrational vector all atoms and the number of atoms of the selected rotational group, respectively; whereas, vi is the vibrational vector of atom i The magnitude of vnetrot:group represents the extent of other types of movement (e.g., translation, vibration) other than rotation of the considered rotational group When only rotation exists (i.e., perfect internal rotation), vnetrot:group is zero The unit of VecMag is angstrom (Å) The torque is also a useful feature to characterize the rotation [21], which helps represent the rotation of the atoms in the rotational group Torque is calculated using Eq (8), ẳrF ẳ krkkFksinr; Fị (9) (8) where τ, r, F are the magnitudes of torque, the position vector (a vector from the coordinate origin to the position on which the force is exerted), and the force vector, respectively Since the vibrational vector is applied on each atom, it can also be considered as the force vector F exerted on the atom of interest The magnitude of torque is N.Å The torque of the considered rotational group, opposite group and the magnitude of their difference are STorque (feature 12), OTorque (feature 13), TTorque (feature 14), respectively Jnegativelogloss w; bị ẳ n X iẳ1 zi ẳ Tx ỵ ew yi log zi ỵ yi ịlog1 zi ị (12) i ỵb where zi is the logistic function for the data point ith in the sample, mapping the value of the linear function into the probability (i.e., within the range of (0,1)) of a data point taking the class one (i.e., the True class 13 T.H.M Le et al Chemometrics and Intelligent Laboratory Systems 172 (2018) 10–16 for binary classification.) 
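As a rough illustration of Eqs. (9)–(12), the sketch below standardizes two feature columns with the sample standard deviation, evaluates the logistic function, and computes the negative log loss. It uses only the Python standard library. The toy data, the weights `w`, the bias `b`, and all function names are illustrative assumptions, not data or code from this study, whose modeling was done with scikit-learn [22].

```python
import math
from statistics import mean, stdev


def standardize(column, mu=None, sigma=None):
    """Z-score one feature column, Eqs. (9)-(11).

    stdev() uses the N-1 denominator, matching Eq. (11). Passing mu and
    sigma reuses statistics computed elsewhere (e.g., on a training set).
    """
    if mu is None:
        mu = mean(column)
    if sigma is None:
        sigma = stdev(column)
    return [(x - mu) / sigma for x in column], mu, sigma


def logistic(x, w, b):
    """z = 1 / (1 + exp(-(w^T x + b))) for one record x."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-score))


def negative_log_loss(X, y, w, b, eps=1e-12):
    """Eq. (12). The eps clamp that avoids log(0) is an implementation choice."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = min(max(logistic(xi, w, b), eps), 1.0 - eps)
        total += yi * math.log(z) + (1 - yi) * math.log(1.0 - z)
    return -total


# Hypothetical toy data: two feature columns, four records.
col_a = [1.0, 2.0, 3.0, 4.0]
col_b = [10.0, 20.0, 30.0, 40.0]
y = [0, 0, 1, 1]  # 1 = treat as HIR, 0 = harmonic oscillator

std_a, mu_a, sigma_a = standardize(col_a)
std_b, mu_b, sigma_b = standardize(col_b)
X = [list(row) for row in zip(std_a, std_b)]

w, b = [1.0, 1.0], 0.0  # illustrative weights, not fitted values
loss = negative_log_loss(X, y, w, b)
baseline = negative_log_loss(X, y, [0.0, 0.0], 0.0)  # every z_i = 0.5
```

With all weights and the bias at zero, every $z_i$ equals 0.5 and the loss is $n \ln 2$; weights that separate the two classes give a smaller value, which is the quantity the model selection step seeks to minimize.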
More specifically, one feature (i.e., a column in the dataset) was removed after each model selection round, and the lists of features of each step were stored in a tree-like structure. Best-first search [23] was then used to retrieve the list of features giving the current minimum value of the loss function in each step. However, since the search space is large (i.e., $2^{15} - 1 = 32767$ subsets, excluding the empty set, for fifteen features), a predefined maximum number of expanded nodes (e.g., 3000–5000) was adopted as an early-stopping criterion for the best-first search. The optimal model, along with its subset of features, was chosen based on the minimum average value of the loss function using the 10-fold cross-validation technique. The pseudocode for the model selection step is given in the following algorithm.

Algorithm 1. Customized best-first search combined with backward elimination and 10-fold cross-validation to select the optimal subset of features for logistic regression.

3.3 Model evaluation on the validation and testing sets

In this study, the Accuracy, Precision, Recall, and F-Score metrics were introduced to evaluate the performance of the classifier. Normally, Accuracy is a good measure for binary classification. However, in this problem, the data points of the False class were many times more numerous than those of the True class. In fact, for each rotor of a chemical species, there were only one or two satisfactory frequencies (i.e., mixing modes) out of $3N - 6$ total modes ($N$ is the number of atoms). Therefore, Precision and Recall were required to evaluate the positive class. The formulas used to calculate Accuracy, Precision, Recall, and F-Score, based on the definitions in Table 2, are given in Eqs. (13)–(16).

Table 2. Definitions of the results for the binary classification problem.
Prediction result    Actual value: True      Actual value: False
Positive             True positives (TP)     False positives (FP)
Negative             False negatives (FN)    True negatives (TN)

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (13)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (14)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (15)$$

$$\mathrm{F\text{-}Score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (16)$$

The aforementioned performance metrics were used for two purposes. Firstly, in the model selection step, after the algorithm was carried out, each model with its corresponding optimal subset of features was re-evaluated using the 10-fold cross-validation technique based on the performance metrics (i.e., Accuracy, Precision, Recall, and F-Score). This step was used to demonstrate whether the current linear approach would be appropriate for the determination of the rotational frequency (mode). If the validation results were sufficiently good, the optimal model would then be evaluated on the testing set to estimate its performance on unseen data. Secondly, in the testing phase, the above performance metrics were applied to the testing set containing 1055 records, which were also extracted and pre-processed using the preprocessing module described in Section 3.1. Moreover, the descriptive statistics (i.e., mean and standard deviation of each feature) of the training set found in the preprocessing step were applied to the testing data to produce consistent results. In addition, the testing set was held out during the training phase, which guarantees that the trained model did not learn the pattern of the testing set and hence did not bias the results. As a result, the performance found in this step could be used to draw conclusions on the generality and appropriateness of the model on new and unseen data.

4 Results and discussion

After the model selection step in Section 3.2, the optimal logistic regression model with the corresponding selected subset of features was recorded in Table 3. It is worth noting that a positive weight means that a larger value of the feature raises the predicted probability that the mode is an internal rotation. On the contrary, a feature with a negative weight lowers that probability, i.e., its presence hinders the internal rotation of a molecule. Additionally, since all features were standardized to the same scale using the z-score (cf. Section 3.1), the magnitudes of their weights can be interpreted as their importance in the classification results.

Table 3. Weights of optimal features of logistic regression after the model selection step.
Feature       Weight
DAngle         1.813
BAngle        −1.047
DirAngle      −2.210
StStrAngle    −0.808
STorque        1.100
TTorque        0.730
Bias          −6.076

After the model selection step, six features along with a bias (intercept) were found to best represent the Hindered Internal Rotation of a chemical species. Three features had positive weights: (1) the average value of the changes in the dihedral angles of the rotational group (DAngle, feature 3), (2) the torque of the current rotational group of interest (STorque, feature 12), and (3) the difference between the torque of the current group and that of the opposite side (TTorque, feature 14). These weights are consistent with the definitions of the features, since all three characterize a good internal rotation. Among these positive features, DAngle was the most important one, with the largest weight magnitude. This finding was predictable since the dihedral angle is one of the most important elements of the HIR, and is usually identified right after the rotational group and axis.

On the other hand, the three features with negative weights, (1) the average changes of the bond angles (BAngle, feature 6), (2) the average of the directional angles (DirAngle, feature 8), and (3) the standard deviation of the bending angles (StStrAngle, feature 11), made the internal rotation less feasible. This result was also reasonable because these three features measure the extent of types of movement of a molecule other than rotation (e.g., stretching, bending, etc.). Furthermore, the directional angle had the most remarkable negative contribution because it approximates how the atoms move around the rotational axis in general. Last but not least, the bias is what remains when all of the other features are zero, which means that the molecule remains stationary without any movement. In this case, no internal rotation whatsoever can exist. This statement was strongly supported by the large negative value of the bias.

The optimal logistic regression model was then evaluated using the metrics described in Section 3.3, and the results were recorded in Table 4. The performance of this optimal model was sufficiently high (i.e., over 90%), which supports the argument that a linear model is a good approach to unveil the pattern of the rotational frequency, as stated at the beginning of Section 3.2.

Table 4. Performance of 10-fold cross-validation of the optimal logistic regression model using Accuracy, Precision, Recall, F-Score and their corresponding standard deviations (Std.).
Metric       Value    Std.
Accuracy     0.969    0.026
Precision    0.911    0.079
Recall       0.933    0.080
F-Score      0.920    0.069

Because of the promising results demonstrated in the cross-validation step, logistic regression was tested on the testing set containing 1055 data records of 48 chemical species. The species were of various forms, ranging from simple hydrocarbons to more complex molecular structures with multiple bonds, cyclic bonds, and even the active bonds of transition states. The values of the selected features for the classifier, as listed in Table 3, were used as the input of the testing phase. It is noted that the descriptive statistics of these features found during the training phase were applied to the testing data. This was done to keep the standardization consistent in the testing phase, since a testing sample is sometimes not sufficiently large and representative, so that its mean and standard deviation can diverge considerably from those of the training set. Logistic regression was then used to classify the testing data, and the following results (cf. Table 5) were obtained after the testing phase had been completed.

Table 5. Performance of logistic regression on the testing set using Accuracy, Precision, Recall, and F-Score.
Algorithm                 Accuracy    Precision          Recall    F-Score
Logistic regression       0.976       0.958† (0.904‡)    0.953     0.928
Rule-based method [15]    0.975       0.939              0.905     0.922
Note: Results with (†) and without (‡) the consideration of only one vibrational mode with the highest probability in the result list. More details are explained in the following text.

Compared with the rule-based method proposed in the previous study [15], the Accuracy was nearly the same for the two approaches because this metric is not informative for an imbalanced classification problem, as explained in Section 3.3. Instead, more attention should be paid to Precision and Recall. More specifically, the Precision of the rule-based method was roughly 3.8% higher than that of the data mining approach, but it must be noted that the rule-based method returns only a single promising frequency per rotor. In contrast, the data mining approach may classify more than one mode as a promising candidate for a rotor. Under such circumstances, some of the promising modes may be counted as incorrect, which decreased the Precision. In fact, if we also take into account the information that all of the promising modes belong to a single rotor and then select only the one frequency with the highest probability, the Precision of the data mining approach is improved dramatically. A simple investigation into this issue led to the discovery that ten cases in the testing set fell into this category. Therefore, if such information is considered, the Precision of the data mining approach rises to 0.958, approximately 2% higher than that of the previous method. A remarkable contribution of the data mining approach over the previous study was the improvement in Recall, an increase of 5.3%. Such an improvement is very beneficial for domain experts since they can retrieve as many correct frequencies as possible out of the list of available normal vibrational modes. Overall, the F-Score of the proposed data mining approach showed an improvement in performance compared with the previous rule-based method.

5 Conclusions

In this study, the data mining approach has solved the remaining challenges of a rule-based approach for the determination of the rotational frequency (mode) (i.e., the most important yet complex parameter of the Hindered Internal Rotation) of a chemical species of interest. This approach has been shown to improve both the completeness and the overall performance of the solution. More specifically, multivariate logistic regression is found to be appropriate for this problem. It delivers meaningful results in terms of the weights of the features to explain the nature of the HIR. In more detail, the most important features are (1) the average changes of the dihedral angles (feature 3), (2) the average value of the directional angles (feature 8), and (3) the torque of the rotational group of interest (feature 12). Features (1) and (3) support the internal rotation, while feature (2) hinders it. It is noted that logistic regression associates a probability with each result, which offers domain experts a good means to justify the classification results. Hopefully, this study can encourage other applications of data mining in the field of computational chemistry and in other computational sciences as well. In the future, the model of this study will be incorporated into MSMCGUI (https://sites.google.com/site/msmccode/manual/gui-1) to display the HIR parameters for chemists in an interactive and user-friendly interface.

Despite the above advantages, logistic regression still has difficulty with complex cases whose patterns did not exist in the training set, due to its simple linear decision boundary. Therefore, to enhance the robustness of the proposed approach, more electronic structure calculation data of different chemical systems and new features will be used. Other machine learning algorithms such as decision trees, support vector machines, or neural networks can also be considered.

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this article.

Acknowledgments

We would like to express our gratitude to Dang H. Nguyen from the International University, VNU-HCM, and Tam V.-T. Mai and Xuan T. Le from the Institute for Computational Science and Technology, Ho Chi Minh City, for helping with the chemical data and test cases. This research was funded by the Department of Science and Technology – Ho Chi Minh City under grant number 1161/QĐ-SKHCN.

References

[1] D.A. McQuarrie, Statistical Thermodynamics, University Science Books, Michigan, 1985.
[2] M. Born, R. Oppenheimer, Zur Quantentheorie der Molekeln, Ann. Phys. 389 (1927) 457–484.
[3] K.S. Pitzer, W.D. Gwinn, Thermodynamic functions for molecules with internal rotation, J. Chem. Phys. 9 (1941) 485–486.
[4] K.S. Pitzer, Thermodynamic functions for molecules having restricted internal rotations, J. Chem. Phys. 5 (1937) 469–472.
[5] Y.-Y. Chuang, D.G. Truhlar, Statistical thermodynamics of bond torsional modes, J. Chem. Phys. 112 (2000) 1221–1228.
[6] P.Y. Ayala, H.B. Schlegel, Identification and treatment of internal rotation in normal mode vibrational analysis, J. Chem. Phys. 108 (1998) 2314–2325.
[7] R.B. McClurg, R.C. Flagan, W.A. Goddard, The hindered rotor density-of-states interpolation function, J. Chem. Phys. 106 (1997) 6675–6680.
[8] D.G. Truhlar, A simple approximation for the vibrational partition function of a hindered internal rotation, J. Comput. Chem. 12 (1991) 266–270.
[9] J.W. Ochterski, Thermochemistry in Gaussian, Gaussian whitenote, 2000.
[10] M.L. Strekalov, Energy levels and partition functions of internal rotation: analytical approximations, Chem. Phys. 362 (2009) 75–81.
[11] M.L. Strekalov, Partition function of the hindered rotor: analytical solutions, Chem. Phys. 355 (2008) 62–66.
[12] J. Pfaendtner, X. Yu, L.J. Broadbelt, The 1-D hindered rotor approximation, Theor. Chem. Acc. 118 (2007) 881–898.
[13] M.V. Duong, H.T. Nguyen, N. Truong, T.N.M. Le, L.K. Huynh, Multi-Species Multi-Channel (MSMC): an ab initio-based parallel thermodynamic and kinetic code for complex chemical systems, Int. J. Chem. Kinet. 47 (2015) 564–575.
[14] T.V.T. Mai, M.V. Duong, X.T. Le, L.K. Huynh, A. Ratkiewicz, Direct ab initio dynamics calculations of thermal rate constants for the CH4 + O2 → CH3 + HO2 reaction, Struct. Chem. 25 (2014) 1495–1503.
[15] T.H.M. Le, S.T. Do, L.K. Huynh, Algorithm for auto-generation of hindered internal rotation parameters for complex chemical systems, Comp. Theor. Chem. 1100 (2017) 61–69.
[16] J.B. Mitchell, Machine learning methods in chemoinformatics, Wiley Interdiscip. Rev. Comput. Mol. Sci. 4 (2014) 468–481.
[17] A.L. East, L. Radom, Ab initio statistical thermodynamical models for the computation of third-law entropies, J. Chem. Phys. 106 (1997) 6655–6674.
[18] J.E. Kilpatrick, K.S. Pitzer, Energy levels and thermodynamic functions for molecules with internal rotation. III. Compound rotation, J. Chem. Phys. 17 (1949) 1064–1075.
[19] T.H.M. Le, T.T. Tran, L.K. Huynh, Data for: Identification of hindered internal rotational mode for complex chemical species: A data mining approach with multivariate logistic regression model, Mendeley Data, v2 (2017), https://doi.org/10.17632/d37mzs3b3m.2
[20] M.J. Frisch, G.W. Trucks, H.B. Schlegel, G.E. Scuseria, M.A. Robb, J.R. Cheeseman, G. Scalmani, V. Barone, B. Mennucci, G.A. Petersson, H. Nakatsuji, M. Caricato, X. Li, H.P. Hratchian, A.F. Izmaylov, J. Bloino, G. Zheng, J.L. Sonnenberg, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao, H. Nakai, T. Vreven, J.A. Montgomery Jr., J.E. Peralta, F. Ogliaro, M. Bearpark, J.J. Heyd, E. Brothers, K.N. Kudin, V.N. Staroverov, R. Kobayashi, J. Normand, K. Raghavachari, A. Rendell, J.C. Burant, S.S. Iyengar, J. Tomasi, M. Cossi, N. Rega, J.M. Millam, M. Klene, J.E. Knox, J.B. Cross, V. Bakken, C. Adamo, J. Jaramillo, R. Gomperts, R.E. Stratmann, O. Yazyev, A.J. Austin, R. Cammi, C. Pomelli, J.W. Ochterski, R.L. Martin, K. Morokuma, V.G. Zakrzewski, G.A. Voth, P. Salvador, J.J. Dannenberg, S. Dapprich, A.D. Daniels, Ö. Farkas, J.B. Foresman, J.V. Ortiz, J. Cioslowski, D.J. Fox, Gaussian 09, Gaussian, Inc., Wallingford CT, 2009.
[21] P.A. Tipler, G. Mosca, Physics for Scientists and Engineers, Macmillan, 2007.
[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, Scikit-learn: machine learning in Python, J. Mach. Learn. Res. 12 (2011) 2825–2830.
[23] R. Dechter, J. Pearl, Generalized best-first search strategies and the optimality of A*, J. ACM 32 (1985) 505–536.