Datasets
At the 2019 ACS conference in Orlando, I saw several talks on analyzing publicly available chemistry datasets. However, there doesn’t seem to be a central repository of chemistry datasets and tools, so I have taken it upon myself to compile one. Please let me know of any datasets or tools that I have missed (my apologies!).
Datasets
- UniProt - database containing protein sequence and functional information
- The Harvard organic photovoltaic dataset - experimental and quantum-mechanical PV data
- phonondb - database containing phonon band structures, DOS, and thermal properties for hundreds of materials
- PubChemQC PM6 dataset - database containing 221 million molecules with optimized molecular geometries and electronic properties
Tools
- PubChemPy - Python tool for interacting with the PubChem database
- ChemSpiPy - Python tool for interacting with the ChemSpider database
- mendeleev - Python tool for accessing elemental properties
- Open Babel - general-purpose tool for analyzing chemical data
- pyEQL - Python tool for accessing the properties of aqueous electrolyte solutions
- RDKit - Python tool for cheminformatics and ML
- Neural Graph Fingerprints - specialized CNNs for predicting the properties of molecules
- Covariant Compositional Networks - another family of specialized CNNs for molecules
- ChemML - another Python tool for cheminformatics and ML
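
To give a feel for what "interacting with the PubChem database" involves, here is a minimal sketch that talks to PubChem's PUG REST web service directly, using only the Python standard library. It is not PubChemPy's API: the helper names `property_url` and `fetch_properties` are hypothetical and made up for this illustration, and the two requested properties are just an example choice. A library such as PubChemPy wraps this kind of request behind a friendlier interface.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base address of PubChem's PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(name, properties=("MolecularFormula", "MolecularWeight")):
    """Build a PUG REST URL that looks up properties by compound name.

    Hypothetical helper for illustration; not part of any library.
    """
    # Percent-encode the name so spaces and slashes are safe in the URL path.
    encoded_name = quote(name, safe="")
    property_list = ",".join(properties)
    return f"{PUG_REST}/compound/name/{encoded_name}/property/{property_list}/JSON"


def fetch_properties(name, properties=("MolecularFormula", "MolecularWeight"), timeout=30):
    """Download the requested properties (requires network access).

    Returns the list of property records from the JSON response.
    """
    with urlopen(property_url(name, properties), timeout=timeout) as response:
        payload = json.load(response)
    return payload["PropertyTable"]["Properties"]
```

Calling `fetch_properties("aspirin")` should return a list of dictionaries, one per matching compound, though it needs a working internet connection and PubChem may throttle frequent requests. For anything beyond a quick lookup, the dedicated tools listed above are likely the better choice.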