PubChemQC PM6: Data Sets of 221 Million Molecules with Optimized Molecular Geometries and Electronic Properties
2020.10.26
Maho Nakata, Tomomi Shimazaki, Masatomo Hashimoto, and Toshiyuki Maeda. PubChemQC PM6: Data Sets of 221 Million Molecules with Optimized Molecular Geometries and Electronic Properties. Journal of Chemical Information and Modeling. DOI: 10.1021/acs.jcim.0c00740
We report on optimized molecular geometries and electronic properties calculated by the PM6 method for 94.0% of the 91.6 million molecules cataloged in PubChem Compounds retrieved on August 29, 2016. In addition to neutral states, we also calculated cationic, anionic, and spin-flipped electronic states for 56.2%, 49.7%, and 41.3% of the molecules, respectively. Thus, the grand total of the PM6 calculations amounted to 221 million. We compared the resulting molecular geometries with B3LYP/6-31G* optimized geometries for 2.6 million molecules. The root-mean-square deviations in bond length and bond angle were approximately 0.016 Å and 1.7°, respectively. Then, using linear regression to examine the HOMO energy levels $E(\mathrm{HOMO})$ in the B3LYP and PM6 calculations, we found that $E_{\mathrm{B3LYP}}(\mathrm{HOMO}) = 0.876\,E_{\mathrm{PM6}}(\mathrm{HOMO}) + 1.975$ (eV) and calculated the coefficient of determination to be 0.803. Likewise, we examined the LUMO energy levels and found $E_{\mathrm{B3LYP}}(\mathrm{LUMO}) = 1.069\,E_{\mathrm{PM6}}(\mathrm{LUMO}) - 0.420$ (eV); the coefficient of determination was 0.842. We also generated four subdata sets, each of which was composed of molecules with molecular weights less than 500. Subdata set i contained C, H, O, and N; ii contained C, H, N, O, P, and S; iii contained C, H, N, O, P, S, F, and Cl; and iv contained C, H, N, O, P, S, F, Cl, Na, K, Mg, and Ca. The data sets are available at http://pubchemqc.riken.jp/pm6_datasets.html under a Creative Commons Attribution 4.0 International license.
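The two regressions and the subdata set criteria in the abstract can be illustrated with a short Python sketch. The slopes, intercepts, element lists, and the molecular weight limit of 500 are taken from the abstract. Everything else is hypothetical and is not part of the published data sets or their tooling: the function names, the reading that a molecule qualifies when all of its elements fall within a set's element list, and the choice to report only the smallest matching set (the four lists are nested, so a molecule in set i also satisfies ii to iv).

```python
# Regression coefficients reported in the abstract (all energies in eV).
HOMO_SLOPE, HOMO_INTERCEPT = 0.876, 1.975
LUMO_SLOPE, LUMO_INTERCEPT = 1.069, -0.420

# Element lists of the four subdata sets, as given in the abstract.
SUBSET_ELEMENTS = {
    "i": frozenset({"C", "H", "O", "N"}),
    "ii": frozenset({"C", "H", "N", "O", "P", "S"}),
    "iii": frozenset({"C", "H", "N", "O", "P", "S", "F", "Cl"}),
    "iv": frozenset({"C", "H", "N", "O", "P", "S", "F", "Cl",
                     "Na", "K", "Mg", "Ca"}),
}
MAX_MOLECULAR_WEIGHT = 500.0  # subdata sets hold molecules lighter than this


def estimate_b3lyp_homo(e_pm6_homo_ev: float) -> float:
    """Map a PM6 HOMO energy to a B3LYP estimate via the reported fit."""
    return HOMO_SLOPE * e_pm6_homo_ev + HOMO_INTERCEPT


def estimate_b3lyp_lumo(e_pm6_lumo_ev: float) -> float:
    """Map a PM6 LUMO energy to a B3LYP estimate via the reported fit."""
    return LUMO_SLOPE * e_pm6_lumo_ev + LUMO_INTERCEPT


def smallest_subset(elements, molecular_weight: float):
    """Return the label of the smallest subdata set admitting the molecule.

    Assumption (not stated in the abstract): a molecule qualifies when
    every element it contains appears in the set's element list.
    Returns None when the weight is 500 or more, or no list fits.
    """
    if molecular_weight >= MAX_MOLECULAR_WEIGHT:
        return None
    present = set(elements)
    for label in ("i", "ii", "iii", "iv"):  # ordered smallest to largest
        if present <= SUBSET_ELEMENTS[label]:
            return label
    return None


if __name__ == "__main__":
    # Illustrative inputs only; these are not values from the data sets.
    print(estimate_b3lyp_homo(-9.0))
    print(estimate_b3lyp_lumo(0.5))
    print(smallest_subset({"C", "H", "Cl"}, 112.56))
```

Because the coefficients of determination are 0.803 and 0.842, values mapped this way are rough estimates and not substitutes for B3LYP calculations.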