At the 2019 ACS conference in Orlando, I saw several talks on analyzing publicly available chemistry datasets. However, there doesn’t seem to be a central repository of chemistry datasets and tools, so I have taken it upon myself to start one here. Please let me know if you know of any datasets or tools that I have missed (my apologies!).

Datasets:

  • UniProt - database containing protein sequence and functional information
  • The Harvard organic photovoltaic dataset - experimental and quantum-mechanical PV data
  • phonondb - database containing phonon band structures, DOS, and thermal properties for hundreds of materials
  • PubChemQC PM6 dataset - database containing 221 million molecules with optimized molecular geometries and electronic properties

Tools:

  • PubChemPy - Python tool for interacting with the PubChem database
  • ChemSpiPy - Python tool for interacting with the ChemSpider database
  • mendeleev - Python tool for accessing elemental properties
  • Open Babel - general-purpose toolbox for converting, searching, and analyzing chemical data
  • pyEQL - Python tool for accessing the properties of aqueous electrolyte solutions
  • RDKit - Python tool for cheminformatics and ML
  • Neural Graph Fingerprints - specialized CNNs for predicting the properties of molecules
  • Covariant Compositional Networks - another specialized CNN architecture for molecules
  • ChemML - another Python tool for cheminformatics and ML
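
To give a feel for what the database wrappers in this list do, here is a minimal sketch of a property lookup against PubChem's PUG REST web service (the service that tools like PubChemPy wrap), using only the Python standard library. The function names `build_property_url` and `fetch_properties` are my own hypothetical helpers, not part of any listed package, and the sketch assumes you have network access and that the compound name resolves to at least one record.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base address of PubChem's PUG REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_property_url(name, properties):
    """Build a PUG REST URL requesting `properties` for a compound name.

    Hypothetical helper for illustration; not part of PubChemPy.
    """
    encoded_name = quote(name, safe="")  # e.g. spaces become %20
    property_list = ",".join(properties)
    return (
        f"{PUG_REST_BASE}/compound/name/{encoded_name}"
        f"/property/{property_list}/JSON"
    )


def fetch_properties(name, properties, timeout=10.0):
    """Query PubChem and return a list of property dicts (needs network).

    Assumes the JSON response has the usual
    {"PropertyTable": {"Properties": [...]}} layout.
    """
    url = build_property_url(name, properties)
    with urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return payload["PropertyTable"]["Properties"]


# Example usage (performs a live request, so it is left commented out):
# records = fetch_properties("aspirin", ["MolecularFormula", "MolecularWeight"])
# for record in records:
#     print(record)
```

In practice you would probably reach for PubChemPy itself rather than hand-rolling requests, since it handles errors and alternative search types for you. The sketch is only meant to show that these tools are thin, convenient layers over public web APIs.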