openff.qcsubmit.datasets.BasicDataset

pydantic model openff.qcsubmit.datasets.BasicDataset

The general QCFractal dataset class which contains all of the molecules and information about them prior to submission.

The class is a simple holder for the dataset and information about it. It can run simple checks on the data before submission, such as ensuring that every molecule has CMILES information and a unique index by which it can be identified.
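The kind of pre-submission check described above can be sketched in plain Python. This is a minimal illustration only: the entries are ordinary dicts, and the `check_entries` helper and the `index` and `cmiles` keys are hypothetical names chosen for this example, not the qcsubmit API.

```python
from collections import Counter


def check_entries(entries):
    """Return a list of human-readable problems found in the entries.

    Hypothetical helper: each entry is a plain dict with an "index" key and
    a "cmiles" key, standing in for the real dataset entry model.
    """
    problems = []
    # Count how often each index occurs so duplicates can be reported.
    counts = Counter(entry.get("index") for entry in entries)
    for position, entry in enumerate(entries):
        index = entry.get("index")
        if not index:
            problems.append(f"entry {position}: missing index")
        elif counts[index] > 1:
            problems.append(f"entry {position}: duplicate index {index!r}")
        if not entry.get("cmiles"):
            problems.append(f"entry {position}: missing cmiles")
    return problems


# Two conformers of water with distinct indices pass the check.
good = [
    {"index": "water-0", "cmiles": "[H:2][O:1][H:3]"},
    {"index": "water-1", "cmiles": "[H:2][O:1][H:3]"},
]
# A repeated index and an empty CMILES string are both reported.
bad = [
    {"index": "water-0", "cmiles": "[H:2][O:1][H:3]"},
    {"index": "water-0", "cmiles": ""},
]

print(check_entries(good))
print(check_entries(bad))
```

Collecting every problem in a list, rather than raising on the first one, lets a submitter fix a dataset in a single pass.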

Note:

The molecules in this dataset are all expanded so that each conformer is a unique submission.
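The expansion described in this note can be sketched as follows. The `expand_conformers` helper, the `name-number` index scheme, and the placeholder coordinates are assumptions made for illustration; the library's actual indexing may differ.

```python
def expand_conformers(molecules):
    """Give every conformer of every molecule its own entry and index.

    Hypothetical helper: `molecules` maps a molecule name to a list of
    conformer geometries, and the result maps a unique index to one geometry.
    """
    expanded = {}
    for name, conformers in molecules.items():
        for number, geometry in enumerate(conformers):
            # Assumed index scheme: molecule name plus conformer number.
            expanded[f"{name}-{number}"] = geometry
    return expanded


molecules = {
    "ethanol": [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]],  # placeholder coordinates
    "water": [[0.0, 0.0, 0.0]],
}
expanded = expand_conformers(molecules)
print(sorted(expanded))  # ['ethanol-0', 'ethanol-1', 'water-0']
```

Two molecules with three conformers in total become three separate entries, each identified by its own index.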

JSON schema
{
   "title": "BasicDataset",
   "description": "The general QCFractal dataset class which contains all of the molecules and information about them prior to\nsubmission.\n\nThe class is a simple holder of the dataset and information about it and can do simple checks on the data before\nsubmitting it such as ensuring that the molecules have cmiles information\nand a unique index to be identified by.\n\nNote:\n    The molecules in this dataset are all expanded so that different conformers are unique submissions.",
   "type": "object",
   "properties": {
      "qc_specifications": {
         "title": "Qc Specifications",
         "description": "The QCSpecifications which will be computed for this dataset.",
         "default": {
            "default": {
               "method": "B3LYP-D3BJ",
               "basis": "DZVP",
               "program": "psi4",
               "spec_name": "default",
               "spec_description": "Standard OpenFF optimization quantum chemistry specification.",
               "store_wavefunction": "none",
               "implicit_solvent": null,
               "maxiter": 200,
               "scf_properties": [
                  "dipole",
                  "quadrupole",
                  "wiberg_lowdin_indices",
                  "mayer_indices"
               ],
               "keywords": {}
            }
         },
         "type": "object",
         "additionalProperties": {
            "$ref": "#/definitions/QCSpec"
         }
      },
      "driver": {
         "description": "The type of single point calculations which will be computed. Note some services require certain calculations for example optimizations require graident calculations.",
         "default": "energy",
         "allOf": [
            {
               "$ref": "#/definitions/SinglepointDriver"
            }
         ]
      },
      "priority": {
         "title": "Priority",
         "description": "The priority the dataset should be computed at compared to other datasets currently running.",
         "default": "normal",
         "type": "string"
      },
      "dataset_tags": {
         "title": "Dataset Tags",
         "description": "The dataset tags which help identify the dataset.",
         "default": [
            "openff"
         ],
         "type": "array",
         "items": {
            "type": "string"
         }
      },
      "compute_tag": {
         "title": "Compute Tag",
         "description": "The tag the computes tasks will be assigned to, managers wishing to execute these tasks should use this compute tag.",
         "default": "openff",
         "type": "string"
      },
      "dataset_name": {
         "title": "Dataset Name",
         "description": "The name of the dataset, this will be the name given to the collection in QCArchive.",
         "type": "string"
      },
      "dataset_tagline": {
         "title": "Dataset Tagline",
         "description": "The tagline should be a short description of the dataset which will be displayed by the QCArchive client when the datasets are listed.",
         "minLength": 8,
         "pattern": "[a-zA-Z]",
         "type": "string"
      },
      "type": {
         "title": "Type",
         "default": "DataSet",
         "enum": [
            "DataSet"
         ],
         "type": "string"
      },
      "description": {
         "title": "Description",
         "description": "A long description of the datasets purpose and details about the molecules within.",
         "minLength": 8,
         "pattern": "[a-zA-Z]",
         "type": "string"
      },
      "metadata": {
         "title": "Metadata",
         "description": "The metadata describing the dataset.",
         "default": {
            "submitter": "docs",
            "creation_date": "2024-01-18",
            "collection_type": null,
            "dataset_name": null,
            "short_description": null,
            "long_description_url": null,
            "long_description": null,
            "elements": []
         },
         "allOf": [
            {
               "$ref": "#/definitions/Metadata"
            }
         ]
      },
      "provenance": {
         "title": "Provenance",
         "description": "A dictionary of the software and versions used to generate the dataset.",
         "default": {},
         "type": "object",
         "additionalProperties": {
            "type": "string"
         }
      },
      "dataset": {
         "title": "Dataset",
         "description": "The actual dataset to be stored in QCArchive.",
         "default": {},
         "type": "object",
         "additionalProperties": {
            "$ref": "#/definitions/DatasetEntry"
         }
      },
      "filtered_molecules": {
         "title": "Filtered Molecules",
         "description": "The set of workflow components used to generate the dataset with any filtered molecules.",
         "default": {},
         "type": "object",
         "additionalProperties": {
            "$ref": "#/definitions/FilterEntry"
         }
      }
   },
   "required": [
      "dataset_name",
      "dataset_tagline",
      "description"
   ],
   "definitions": {
      "WavefunctionProtocolEnum": {
         "title": "WavefunctionProtocolEnum",
         "description": "Wavefunction to keep from a computation.",
         "enum": [
            "all",
            "orbitals_and_eigenvalues",
            "occupations_and_eigenvalues",
            "return_results",
            "none"
         ],
         "type": "string"
      },
      "PCMSettings": {
         "title": "PCMSettings",
         "description": "A class to handle PCM settings which can be used with PSi4.",
         "type": "object",
         "properties": {
            "units": {
               "title": "Units",
               "description": "The units used in the input options atomic units are used by default.",
               "type": "string"
            },
            "codata": {
               "title": "Codata",
               "description": "The set of fundamental physical constants to be used in the module.",
               "default": 2010,
               "type": "integer"
            },
            "cavity_Type": {
               "title": "Cavity Type",
               "description": "Completely specifies type of molecular surface and its discretization.",
               "default": "GePol",
               "type": "string"
            },
            "cavity_Area": {
               "title": "Cavity Area",
               "description": "Average area (weight) of the surface partition for the GePol cavity in the specified units. By default this is in AU.",
               "default": 0.3,
               "type": "number"
            },
            "cavity_Scaling": {
               "title": "Cavity Scaling",
               "description": "If true, the radii for the spheres will be scaled by 1.2. For finer control on the scaling factor for each sphere, select explicit creation mode.",
               "default": true,
               "type": "boolean"
            },
            "cavity_RadiiSet": {
               "title": "Cavity Radiiset",
               "description": "Select set of atomic radii to be used. Currently Bondi-Mantina Bondi, UFF  and Allinger\u2019s MM3 sets available. Radii in Allinger\u2019s MM3 set are obtained by dividing the value in the original paper by 1.2, as done in the ADF COSMO implementation We advise to turn off scaling of the radii by 1.2 when using this set.",
               "default": "Bondi",
               "type": "string"
            },
            "cavity_MinRadius": {
               "title": "Cavity Minradius",
               "description": "Minimal radius for additional spheres not centered on atoms. An arbitrarily big value is equivalent to switching off the use of added spheres, which is the default in AU.",
               "default": 100,
               "type": "number"
            },
            "cavity_Mode": {
               "title": "Cavity Mode",
               "description": "How to create the list of spheres for the generation of the molecular surface.",
               "default": "Implicit",
               "type": "string"
            },
            "medium_SolverType": {
               "title": "Medium Solvertype",
               "description": "Type of solver to be used. All solvers are based on the Integral Equation Formulation of the Polarizable Continuum Model.",
               "default": "IEFPCM",
               "type": "string"
            },
            "medium_Nonequilibrium": {
               "title": "Medium Nonequilibrium",
               "description": "Initializes an additional solver using the dynamic permittivity. To be used in response calculations.",
               "default": false,
               "type": "boolean"
            },
            "medium_Solvent": {
               "title": "Medium Solvent",
               "description": "Specification of the dielectric medium outside the cavity. Note this will always be converted to the molecular formula to aid parsing via PCM.",
               "type": "string"
            },
            "medium_MatrixSymm": {
               "title": "Medium Matrixsymm",
               "description": "If True, the PCM matrix obtained by the IEFPCM collocation solver is symmetrized.",
               "default": true,
               "type": "boolean"
            },
            "medium_Correction": {
               "title": "Medium Correction",
               "description": "Correction, k for the apparent surface charge scaling factor in the CPCM solver.",
               "default": 0.0,
               "minimum": 0,
               "type": "number"
            },
            "medium_DiagonalScaling": {
               "title": "Medium Diagonalscaling",
               "description": "Scaling factor for diagonal of collocation matrices, values commonly used in the literature are 1.07 and 1.0694.",
               "default": 1.07,
               "minimum": 0,
               "type": "number"
            },
            "medium_ProbeRadius": {
               "title": "Medium Proberadius",
               "description": "Radius of the spherical probe approximating a solvent molecule. Used for generating the solvent-excluded surface (SES) or an approximation of it. Overridden by the built-in value for the chosen solvent. Default in AU.",
               "default": 1.0,
               "type": "number"
            }
         },
         "required": [
            "units",
            "medium_Solvent"
         ]
      },
      "SCFProperties": {
         "title": "SCFProperties",
         "description": "The type of SCF property that should be extracted from a single point calculation.",
         "enum": [
            "dipole",
            "quadrupole",
            "mulliken_charges",
            "lowdin_charges",
            "wiberg_lowdin_indices",
            "mayer_indices",
            "mbis_charges"
         ],
         "type": "string"
      },
      "QCSpec": {
         "title": "QCSpec",
         "description": "A basic config class for results structures.",
         "type": "object",
         "properties": {
            "method": {
               "title": "Method",
               "description": "The name of the computational model used to execute the calculation. This could be the QC method or the forcefield name.",
               "default": "B3LYP-D3BJ",
               "type": "string"
            },
            "basis": {
               "title": "Basis",
               "description": "The name of the basis that should be used with the given method, outside of QC this can be the parameterization ie antechamber or None.",
               "default": "DZVP",
               "type": "string"
            },
            "program": {
               "title": "Program",
               "description": "The name of the program that will be used to perform the calculation.",
               "default": "psi4",
               "type": "string"
            },
            "spec_name": {
               "title": "Spec Name",
               "description": "The name the specification will be stored under in QCArchive.",
               "default": "default",
               "type": "string"
            },
            "spec_description": {
               "title": "Spec Description",
               "description": "The description of the specification which will be stored in QCArchive.",
               "default": "Standard OpenFF optimization quantum chemistry specification.",
               "type": "string"
            },
            "store_wavefunction": {
               "description": "The level of wavefunction detail that should be saved in QCArchive. Note that this is done for every calculation and should not be used with optimizations.",
               "default": "none",
               "allOf": [
                  {
                     "$ref": "#/definitions/WavefunctionProtocolEnum"
                  }
               ]
            },
            "implicit_solvent": {
               "title": "Implicit Solvent",
               "description": "If PCM is to be used with psi4 this is the full description of the settings that should be used.",
               "allOf": [
                  {
                     "$ref": "#/definitions/PCMSettings"
                  }
               ]
            },
            "maxiter": {
               "title": "Maxiter",
               "description": "The maximum number of SCF iterations in QM calculations this will be ignored by programs where this does not make sense.",
               "default": 200,
               "exclusiveMinimum": 0,
               "type": "integer"
            },
            "scf_properties": {
               "description": "The SCF properties which should be extracted after every single point calculation.",
               "default": [
                  "dipole",
                  "quadrupole",
                  "wiberg_lowdin_indices",
                  "mayer_indices"
               ],
               "type": "array",
               "items": {
                  "$ref": "#/definitions/SCFProperties"
               }
            },
            "keywords": {
               "title": "Keywords",
               "description": "An optional set of program specific computational keywords that should be passed to the program. These may include, for example, DFT grid settings.",
               "default": {},
               "type": "object",
               "additionalProperties": {
                  "anyOf": [
                     {
                        "type": "string"
                     },
                     {
                        "type": "integer"
                     },
                     {
                        "type": "number"
                     },
                     {
                        "type": "boolean"
                     },
                     {
                        "type": "array",
                        "items": {
                           "type": "number"
                        }
                     }
                  ]
               }
            }
         }
      },
      "SinglepointDriver": {
         "title": "SinglepointDriver",
         "description": "An enumeration.",
         "enum": [
            "energy",
            "gradient",
            "hessian",
            "properties",
            "deferred"
         ],
         "type": "string"
      },
      "Metadata": {
         "title": "Metadata",
         "description": "A general metadata class which is required to be filled in before submitting a dataset to the qcarchive.",
         "type": "object",
         "properties": {
            "submitter": {
               "title": "Submitter",
               "description": "The name of the submitter/creator of the dataset, this is automatically generated but can be changed.",
               "default": "docs",
               "type": "string"
            },
            "creation_date": {
               "title": "Creation Date",
               "description": "The date the dataset was created on, this is automatically generated.",
               "default": "2024-01-18",
               "type": "string",
               "format": "date"
            },
            "collection_type": {
               "title": "Collection Type",
               "description": "The type of collection that will be created in QCArchive this is automatically updated when attached to a dataset.",
               "type": "string"
            },
            "dataset_name": {
               "title": "Dataset Name",
               "description": "The name that will be given to the collection once it is put into QCArchive, this is updated when attached to a dataset.",
               "type": "string"
            },
            "short_description": {
               "title": "Short Description",
               "description": "A short informative description of the dataset.",
               "minLength": 8,
               "pattern": "[a-zA-Z]",
               "type": "string"
            },
            "long_description_url": {
               "title": "Long Description Url",
               "description": "The url which links to more information about the submission normally a github repo with scripts showing how the dataset was created.",
               "minLength": 1,
               "maxLength": 2083,
               "format": "uri",
               "type": "string"
            },
            "long_description": {
               "title": "Long Description",
               "description": "A long description of the purpose of the dataset and the molecules within.",
               "minLength": 8,
               "pattern": "[a-zA-Z]",
               "type": "string"
            },
            "elements": {
               "title": "Elements",
               "description": "The unique set of elements present in the dataset",
               "default": [],
               "type": "array",
               "items": {
                  "type": "string"
               },
               "uniqueItems": true
            }
         }
      },
      "Identifiers": {
         "title": "Identifiers",
         "description": "Canonical chemical identifiers",
         "type": "object",
         "properties": {
            "molecule_hash": {
               "title": "Molecule Hash",
               "type": "string"
            },
            "molecular_formula": {
               "title": "Molecular Formula",
               "type": "string"
            },
            "smiles": {
               "title": "Smiles",
               "type": "string"
            },
            "inchi": {
               "title": "Inchi",
               "type": "string"
            },
            "inchikey": {
               "title": "Inchikey",
               "type": "string"
            },
            "canonical_explicit_hydrogen_smiles": {
               "title": "Canonical Explicit Hydrogen Smiles",
               "type": "string"
            },
            "canonical_isomeric_explicit_hydrogen_mapped_smiles": {
               "title": "Canonical Isomeric Explicit Hydrogen Mapped Smiles",
               "type": "string"
            },
            "canonical_isomeric_explicit_hydrogen_smiles": {
               "title": "Canonical Isomeric Explicit Hydrogen Smiles",
               "type": "string"
            },
            "canonical_isomeric_smiles": {
               "title": "Canonical Isomeric Smiles",
               "type": "string"
            },
            "canonical_smiles": {
               "title": "Canonical Smiles",
               "type": "string"
            },
            "pubchem_cid": {
               "title": "Pubchem Cid",
               "description": "PubChem Compound ID",
               "type": "string"
            },
            "pubchem_sid": {
               "title": "Pubchem Sid",
               "description": "PubChem Substance ID",
               "type": "string"
            },
            "pubchem_conformerid": {
               "title": "Pubchem Conformerid",
               "description": "PubChem Conformer ID",
               "type": "string"
            }
         },
         "additionalProperties": false
      },
      "Provenance": {
         "title": "Provenance",
         "description": "Provenance information.",
         "type": "object",
         "properties": {
            "creator": {
               "title": "Creator",
               "description": "The name of the program, library, or person who created the object.",
               "type": "string"
            },
            "version": {
               "title": "Version",
               "description": "The version of the creator, blank otherwise. This should be sortable by the very broad `PEP 440 <https://www.python.org/dev/peps/pep-0440/>`_.",
               "default": "",
               "type": "string"
            },
            "routine": {
               "title": "Routine",
               "description": "The name of the routine or function within the creator, blank otherwise.",
               "default": "",
               "type": "string"
            }
         },
         "required": [
            "creator"
         ],
         "$schema": "http://json-schema.org/draft-04/schema#"
      },
      "Molecule": {
         "title": "Molecule",
         "description": "The physical Cartesian representation of the molecular system.\n\nA QCSchema representation of a Molecule. This model contains\ndata for symbols, geometry, connectivity, charges, fragmentation, etc while also supporting a wide array of I/O and manipulation capabilities.\n\nMolecule objects geometry, masses, and charges are truncated to 8, 6, and 4 decimal places respectively to assist with duplicate detection.\n\nNotes\n-----\nAll arrays are stored flat but must be reshapable into the dimensions in attribute ``shape``, with abbreviations as follows:\n\n  * nat: number of atomic = calcinfo_natom\n  * nfr: number of fragments\n  * <varies>: irregular dimension not systematically reshapable",
         "type": "object",
         "properties": {
            "schema_name": {
               "title": "Schema Name",
               "description": "The QCSchema specification to which this model conforms. Explicitly fixed as qcschema_molecule.",
               "default": "qcschema_molecule",
               "pattern": "^(qcschema_molecule)$",
               "type": "string"
            },
            "schema_version": {
               "title": "Schema Version",
               "description": "The version number of :attr:`~qcelemental.models.Molecule.schema_name` to which this model conforms.",
               "default": 2,
               "type": "integer"
            },
            "validated": {
               "title": "Validated",
               "description": "A boolean indicator (for speed purposes) that the input Molecule data has been previously checked for schema (data layout and type) and physics (e.g., non-overlapping atoms, feasible multiplicity) compliance. This should be False in most cases. A ``True`` setting should only ever be set by the constructor for this class itself or other trusted sources such as a Fractal Server or previously serialized Molecules.",
               "default": false,
               "type": "boolean"
            },
            "symbols": {
               "title": "Symbols",
               "description": "The ordered array of atomic elemental symbols in title case. This field's index sets atomic order for all other per-atom fields like :attr:`~qcelemental.models.Molecule.real` and the first dimension of :attr:`~qcelemental.models.Molecule.geometry`. Ghost/virtual atoms must have an entry here in :attr:`~qcelemental.models.Molecule.symbols`; ghostedness is indicated through the :attr:`~qcelemental.models.Molecule.real` field.",
               "shape": [
                  "nat"
               ],
               "type": "array",
               "items": {
                  "type": "string"
               }
            },
            "geometry": {
               "title": "Geometry",
               "description": "The ordered array for Cartesian XYZ atomic coordinates [a0]. Atom ordering is fixed; that is, a consumer who shuffles atoms must not reattach the input (pre-shuffling) molecule schema instance to any output (post-shuffling) per-atom results (e.g., gradient). Index of the first dimension matches the 0-indexed indices of all other per-atom settings like :attr:`~qcelemental.models.Molecule.symbols` and :attr:`~qcelemental.models.Molecule.real`.\nSerialized storage is always flat, (3*nat,), but QCSchema implementations will want to reshape it. QCElemental can also accept array-likes which can be mapped to (nat,3) such as a 1-D list of length 3*nat, or the serialized version of the array in (3*nat,) shape; all forms will be reshaped to (nat,3) for this attribute.",
               "shape": [
                  "nat",
                  3
               ],
               "units": "a0",
               "type": "array",
               "items": {
                  "type": "number"
               }
            },
            "name": {
               "title": "Name",
               "description": "Common or human-readable name to assign to this molecule. This field can be arbitrary; see :attr:`~qcelemental.models.Molecule.identifiers` for well-defined labels.",
               "type": "string"
            },
            "identifiers": {
               "title": "Identifiers",
               "description": "An optional dictionary of additional identifiers by which this molecule can be referenced, such as INCHI, canonical SMILES, etc. See the :class:`~qcelemental.models.results.Identifiers` model for more details.",
               "allOf": [
                  {
                     "$ref": "#/definitions/Identifiers"
                  }
               ]
            },
            "comment": {
               "title": "Comment",
               "description": "Additional comments for this molecule. Intended for pure human/user consumption and clarity.",
               "type": "string"
            },
            "molecular_charge": {
               "title": "Molecular Charge",
               "description": "The net electrostatic charge of the molecule.",
               "default": 0.0,
               "type": "number"
            },
            "molecular_multiplicity": {
               "title": "Molecular Multiplicity",
               "description": "The total multiplicity of the molecule.",
               "default": 1,
               "type": "integer"
            },
            "masses": {
               "title": "Masses",
               "description": "The ordered array of atomic masses. Index order matches the 0-indexed indices of all other per-atom fields like :attr:`~qcelemental.models.Molecule.symbols` and :attr:`~qcelemental.models.Molecule.real`. If this is not provided, the mass of each atom is inferred from its most common isotope. If this is provided, it must be the same length as :attr:`~qcelemental.models.Molecule.symbols` but can accept ``None`` entries for standard masses to infer from the same index in the :attr:`~qcelemental.models.Molecule.symbols` field.",
               "shape": [
                  "nat"
               ],
               "units": "u",
               "type": "array",
               "items": {
                  "type": "number"
               }
            },
            "real": {
               "title": "Real",
               "description": "The ordered array indicating if each atom is real (``True``) or ghost/virtual (``False``). Index matches the 0-indexed indices of all other per-atom settings like :attr:`~qcelemental.models.Molecule.symbols` and the first dimension of :attr:`~qcelemental.models.Molecule.geometry`. If this is not provided, all atoms are assumed to be real (``True``).If this is provided, the reality or ghostedness of every atom must be specified.",
               "shape": [
                  "nat"
               ],
               "type": "array",
               "items": {
                  "type": "boolean"
               }
            },
            "atom_labels": {
               "title": "Atom Labels",
               "description": "Additional per-atom labels as an array of strings. Typical use is in model conversions, such as Elemental <-> Molpro and not typically something which should be user assigned. See the :attr:`~qcelemental.models.Molecule.comment` field for general human-consumable text to affix to the molecule.",
               "shape": [
                  "nat"
               ],
               "type": "array",
               "items": {
                  "type": "string"
               }
            },
            "atomic_numbers": {
               "title": "Atomic Numbers",
               "description": "An optional ordered 1-D array-like object of atomic numbers of shape (nat,). Index matches the 0-indexed indices of all other per-atom settings like :attr:`~qcelemental.models.Molecule.symbols` and :attr:`~qcelemental.models.Molecule.real`. Values are inferred from the :attr:`~qcelemental.models.Molecule.symbols` list if not explicitly set. Ghostedness should be indicated through :attr:`~qcelemental.models.Molecule.real` field, not zeros here.",
               "shape": [
                  "nat"
               ],
               "type": "array",
               "items": {
                  "type": "number",
                  "multipleOf": 1.0
               }
            },
            "mass_numbers": {
               "title": "Mass Numbers",
               "description": "An optional ordered 1-D array-like object of atomic *mass* numbers of shape (nat). Index matches the 0-indexed indices of all other per-atom settings like :attr:`~qcelemental.models.Molecule.symbols` and :attr:`~qcelemental.models.Molecule.real`. Values are inferred from the most common isotopes of the :attr:`~qcelemental.models.Molecule.symbols` list if not explicitly set. If single isotope not (yet) known for an atom, -1 is placeholder.",
               "shape": [
                  "nat"
               ],
               "type": "array",
               "items": {
                  "type": "number",
                  "multipleOf": 1.0
               }
            },
            "connectivity": {
               "title": "Connectivity",
               "description": "A list of bonds within the molecule. Each entry is a tuple of ``(atom_index_A, atom_index_B, bond_order)`` where the ``atom_index`` matches the 0-indexed indices of all other per-atom settings like :attr:`~qcelemental.models.Molecule.symbols` and :attr:`~qcelemental.models.Molecule.real`. Bonds may be freely reordered and inverted.",
               "minItems": 1,
               "type": "array",
               "items": {
                  "type": "array",
                  "minItems": 3,
                  "maxItems": 3,
                  "items": [
                     {
                        "type": "integer",
                        "minimum": 0
                     },
                     {
                        "type": "integer",
                        "minimum": 0
                     },
                     {
                        "type": "number",
                        "minimum": 0,
                        "maximum": 5
                     }
                  ]
               }
            },
            "fragments": {
               "title": "Fragments",
               "description": "List of indices grouping atoms (0-indexed) into molecular fragments within the molecule. Each entry in the outer list is a new fragment; index matches the ordering in :attr:`~qcelemental.models.Molecule.fragment_charges` and :attr:`~qcelemental.models.Molecule.fragment_multiplicities`. Inner lists are 0-indexed atoms which compose the fragment; every atom must be in exactly one inner list. Noncontiguous fragments are allowed, though no QM program is known to support them. Fragment ordering is fixed; that is, a consumer who shuffles fragments must not reattach the input (pre-shuffling) molecule schema instance to any output (post-shuffling) per-fragment results (e.g., n-body energy arrays).",
               "shape": [
                  "nfr",
                  "<varies>"
               ],
               "type": "array",
               "items": {
                  "type": "array",
                  "items": {
                     "type": "number",
                     "multipleOf": 1.0
                  }
               }
            },
            "fragment_charges": {
               "title": "Fragment Charges",
               "description": "The total charge of each fragment in the :attr:`~qcelemental.models.Molecule.fragments` list. The index of this list matches the 0-index indices of :attr:`~qcelemental.models.Molecule.fragments` list. Will be filled in based on a set of rules if not provided (and :attr:`~qcelemental.models.Molecule.fragments` are specified).",
               "shape": [
                  "nfr"
               ],
               "type": "array",
               "items": {
                  "type": "number"
               }
            },
            "fragment_multiplicities": {
               "title": "Fragment Multiplicities",
               "description": "The multiplicity of each fragment in the :attr:`~qcelemental.models.Molecule.fragments` list. The index of this list matches the 0-index indices of :attr:`~qcelemental.models.Molecule.fragments` list. Will be filled in based on a set of rules if not provided (and :attr:`~qcelemental.models.Molecule.fragments` are specified).",
               "shape": [
                  "nfr"
               ],
               "type": "array",
               "items": {
                  "type": "integer"
               }
            },
            "fix_com": {
               "title": "Fix Com",
               "description": "Whether translation of geometry is allowed (fix F) or disallowed (fix T). When False, QCElemental will pre-process the Molecule object to translate the center of mass to (0,0,0) in Euclidean coordinate space, resulting in a different :attr:`~qcelemental.models.Molecule.geometry` than the one provided. 'Fix' is used in the sense of 'specify': that is, `fix_com=True` signals that the origin in `geometry` is a deliberate part of the Molecule spec, whereas `fix_com=False` (default) allows that the origin is happenstance and may be adjusted. guidance: A consumer who translates the geometry must not reattach the input (pre-translation) molecule schema instance to any output (post-translation) origin-sensitive results (e.g., an ordinary energy when EFP present).",
               "default": false,
               "type": "boolean"
            },
            "fix_orientation": {
               "title": "Fix Orientation",
               "description": "Whether rotation of geometry is allowed (fix F) or disallowed (fix T). When False, QCElemental will pre-process the Molecule object to orient via the inertial tensor, resulting in a different :attr:`~qcelemental.models.Molecule.geometry` than the one provided. 'Fix' is used in the sense of 'specify': that is, `fix_orientation=True` signals that the frame orientation in `geometry` is a deliberate part of the Molecule spec, whereas `fix_orientation=False` (default) allows that the frame is happenstance and may be adjusted. guidance: A consumer who rotates the geometry must not reattach the input (pre-rotation) molecule schema instance to any output (post-rotation) frame-sensitive results (e.g., molecular vibrations).",
               "default": false,
               "type": "boolean"
            },
            "fix_symmetry": {
               "title": "Fix Symmetry",
               "description": "Maximal point group symmetry which :attr:`~qcelemental.models.Molecule.geometry` should be treated. Lowercase.",
               "type": "string"
            },
            "provenance": {
               "title": "Provenance",
               "description": "The provenance information about how this Molecule (and its attributes) were generated, provided, and manipulated.",
               "allOf": [
                  {
                     "$ref": "#/definitions/Provenance"
                  }
               ]
            },
            "id": {
               "title": "Id",
               "description": "A unique identifier for this Molecule object. This field exists primarily for Databases (e.g. Fractal's Server) to track and lookup this specific object and should virtually never need to be manually set."
            },
            "extras": {
               "title": "Extras",
               "description": "Additional information to bundle with the molecule. Use for schema development and scratch space.",
               "type": "object"
            }
         },
         "required": [
            "symbols",
            "geometry"
         ],
         "additionalProperties": false,
         "$schema": "http://json-schema.org/draft-04/schema#"
      },
      "MoleculeAttributes": {
         "title": "MoleculeAttributes",
         "description": "A class to hold and validate the molecule attributes associated with a QCArchive entry. All attributes are required\nto be entered into a dataset.\n\nNote:\n    The attributes here are not exhaustive but are based on those given by cmiles and can all be obtained through the openforcefield toolkit Molecule class.",
         "type": "object",
         "properties": {
            "canonical_smiles": {
               "title": "Canonical Smiles",
               "type": "string"
            },
            "canonical_isomeric_smiles": {
               "title": "Canonical Isomeric Smiles",
               "type": "string"
            },
            "canonical_explicit_hydrogen_smiles": {
               "title": "Canonical Explicit Hydrogen Smiles",
               "type": "string"
            },
            "canonical_isomeric_explicit_hydrogen_smiles": {
               "title": "Canonical Isomeric Explicit Hydrogen Smiles",
               "type": "string"
            },
            "canonical_isomeric_explicit_hydrogen_mapped_smiles": {
               "title": "Canonical Isomeric Explicit Hydrogen Mapped Smiles",
               "description": "The fully mapped smiles where every atom should have a numerical tag so that the molecule can be rebuilt to match the order of the coordinates.",
               "type": "string"
            },
            "molecular_formula": {
               "title": "Molecular Formula",
               "description": "The hill formula of the molecule as given by the openfftoolkit.",
               "type": "string"
            },
            "standard_inchi": {
               "title": "Standard Inchi",
               "description": "The standard inchi given by the inchi program ie not fixed hydrogen layer.",
               "type": "string"
            },
            "inchi_key": {
               "title": "Inchi Key",
               "description": "The standard inchi key given by the inchi program.",
               "type": "string"
            },
            "fixed_hydrogen_inchi": {
               "title": "Fixed Hydrogen Inchi",
               "description": "The non-standard inchi with a fixed hydrogen layer to distinguish tautomers.",
               "type": "string"
            },
            "fixed_hydrogen_inchi_key": {
               "title": "Fixed Hydrogen Inchi Key",
               "description": "The non-standard inchikey with a fixed hydrogen layer.",
               "type": "string"
            },
            "unique_fixed_hydrogen_inchi_keys": {
               "title": "Unique Fixed Hydrogen Inchi Keys",
               "description": "The list of unique non-standard inchikey with a fixed hydrogen layer.",
               "type": "array",
               "items": {
                  "type": "string"
               },
               "uniqueItems": true
            }
         },
         "required": [
            "canonical_smiles",
            "canonical_isomeric_smiles",
            "canonical_explicit_hydrogen_smiles",
            "canonical_isomeric_explicit_hydrogen_smiles",
            "canonical_isomeric_explicit_hydrogen_mapped_smiles",
            "molecular_formula",
            "standard_inchi",
            "inchi_key"
         ]
      },
      "DatasetEntry": {
         "title": "DatasetEntry",
         "description": "A basic data class to construct the datasets which holds any information about the molecule and options used in\nthe qcarchive calculation.\n\nNote:\n    * ``extras`` are passed into the qcelemental.models.Molecule on creation.\n    * any extras that should be passed to the calculation like extra constraints should be passed to ``keywords``.",
         "type": "object",
         "properties": {
            "index": {
               "title": "Index",
               "description": "The index name the molecule will be stored under in QCArchive. Note that if multiple geometries are provided the index will be augmented with a value indicating the conformer number, so -0, -1.",
               "type": "string"
            },
            "initial_molecules": {
               "title": "Initial Molecules",
               "description": "A list of QCElemental Molecule objects which contain the geometries to be used as inputs for the calculation.",
               "type": "array",
               "items": {
                  "$ref": "#/definitions/Molecule"
               }
            },
            "attributes": {
               "title": "Attributes",
               "description": "The complete set of required cmiles attributes for the molecule.",
               "allOf": [
                  {
                     "$ref": "#/definitions/MoleculeAttributes"
                  }
               ]
            },
            "extras": {
               "title": "Extras",
               "description": "Any extra information that should be injected into the QCElemental models before being submitted like the cmiles information.",
               "default": {},
               "type": "object"
            },
            "keywords": {
               "title": "Keywords",
               "description": "Any extra keywords that should be used in the QCArchive calculation should be passed here.",
               "default": {},
               "type": "object"
            }
         },
         "required": [
            "index",
            "initial_molecules",
            "attributes"
         ]
      },
      "FilterEntry": {
         "title": "FilterEntry",
         "description": "A basic data class that contains information on components run in a workflow and the associated molecules which were\nremoved by it.",
         "type": "object",
         "properties": {
            "component": {
               "title": "Component",
               "description": "The name of the component that was run; this should be one of the components registered with qcsubmit.",
               "type": "string"
            },
            "component_settings": {
               "title": "Component Settings",
               "description": "The run time settings of the component used to filter the molecules.",
               "type": "object"
            },
            "component_provenance": {
               "title": "Component Provenance",
               "description": "A dictionary of the version information of all dependencies of the component.",
               "type": "object",
               "additionalProperties": {
                  "type": "string"
               }
            },
            "molecules": {
               "title": "Molecules",
               "type": "array",
               "items": {
                  "type": "string"
               }
            }
         },
         "required": [
            "component",
            "component_settings",
            "component_provenance",
            "molecules"
         ]
      }
   }
}
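
The Molecule definition in the schema above describes `fragments` as a partition of the atom indices and `connectivity` as a list of `(atom_index_A, atom_index_B, bond_order)` entries with a bond order between 0 and 5. The following is a minimal sketch of those two rules using only the standard library. The helper names `check_fragments` and `check_connectivity` are hypothetical and are not part of QCElemental or QCSubmit, and the upper bound on atom indices is an assumption drawn from the statement that indices match the other per-atom arrays.

```python
from typing import Sequence, Tuple


def check_fragments(n_atoms: int, fragments: Sequence[Sequence[int]]) -> bool:
    """Return True if every atom index 0..n_atoms-1 is in exactly one fragment."""
    seen = [index for fragment in fragments for index in fragment]
    # A valid partition lists each index once; fragments may be noncontiguous.
    return sorted(seen) == list(range(n_atoms))


def check_connectivity(
    n_atoms: int, bonds: Sequence[Tuple[int, int, float]]
) -> bool:
    """Return True if each bond is (atom_a, atom_b, order) within the schema limits."""
    if len(bonds) < 1:  # schema: "minItems": 1
        return False
    for bond in bonds:
        if len(bond) != 3:  # schema: exactly three items per entry
            return False
        atom_a, atom_b, order = bond
        # Assumption: indices must also be smaller than the number of atoms.
        if not (0 <= atom_a < n_atoms and 0 <= atom_b < n_atoms):
            return False
        if not 0 <= order <= 5:  # schema: bond order between 0 and 5
            return False
    return True
```

For example, `check_fragments(4, [[0, 3], [1, 2]])` accepts a noncontiguous grouping, while a grouping that repeats or omits an atom is rejected.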

Config:
  • allow_mutation: bool = True

  • arbitrary_types_allowed: bool = True

  • json_encoders: dict = {<class ‘numpy.ndarray’>: <function DatasetConfig.Config.<lambda> at 0x7f6888389800>, <enum ‘Enum’>: <function DatasetConfig.Config.<lambda> at 0x7f68883b82c0>}

  • validate_assignment: bool = True

Fields:
  • compute_tag (str)

  • dataset (Dict[str, openff.qcsubmit.datasets.entries.DatasetEntry])

  • dataset_name (str)

  • dataset_tagline (openff.qcsubmit.datasets.datasets.ConstrainedStrValue)

  • dataset_tags (List[str])

  • description (openff.qcsubmit.datasets.datasets.ConstrainedStrValue)

  • driver (qcportal.singlepoint.record_models.SinglepointDriver)

  • filtered_molecules (Dict[str, openff.qcsubmit.datasets.entries.FilterEntry])

  • metadata (openff.qcsubmit.common_structures.Metadata)

  • priority (str)

  • provenance (Dict[str, str])

  • qc_specifications (Dict[str, openff.qcsubmit.common_structures.QCSpec])

  • type (Literal['DataSet'])

field type: Literal['DataSet'] = 'DataSet'
to_tasks() Dict[str, List[AtomicInput]][source]

Build a dictionary of single QCEngine tasks that correspond to this dataset organised by program name. The tasks can be passed directly to qcengine.compute.

add_molecule(index: str, molecule: Optional[Molecule], extras: Optional[Dict[str, Any]] = None, keywords: Optional[Dict[str, Any]] = None, **kwargs) None

Add a molecule to the dataset under the given index with the passed cmiles.

Args:
index:

The index that should be associated with the molecule in QCArchive.

molecule:

The instance of the molecule which contains its conformer information.

extras:

The extras that should be supplied into the qcportal.models.Molecule.

keywords:

Any extra keywords which are required for the calculation.

Note:

Each molecule in this basic dataset should have all of its conformers expanded out into separate entries. Thus here we take the general molecule index and increment it.
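
The index expansion described in this note can be pictured with a short sketch. It assumes the suffix convention given in the `index` field description of DatasetEntry (`-0`, `-1`, and so on when multiple geometries are provided). The helper `expand_index` is hypothetical and is not a QCSubmit function.

```python
from typing import List


def expand_index(index: str, n_conformers: int) -> List[str]:
    """Hypothetical helper: derive one record name per conformer."""
    if n_conformers <= 1:
        # Assumption: a single geometry keeps the plain index.
        return [index]
    # Multiple geometries: append the conformer number to the index.
    return [f"{index}-{conformer}" for conformer in range(n_conformers)]
```

Under these assumptions a molecule added under the index `ethane` with three conformers would yield the names `ethane-0`, `ethane-1` and `ethane-2`.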

add_qc_spec(method: str, basis: Optional[str], program: str, spec_name: str, spec_description: str, store_wavefunction: str = 'none', overwrite: bool = False, implicit_solvent: Optional[PCMSettings] = None, maxiter: PositiveInt = 200, scf_properties: Optional[List[SCFProperties]] = None, keywords: Optional[Dict[str, Union[StrictStr, StrictInt, StrictFloat, StrictBool, List[StrictFloat]]]] = None) None

Add a new qcspecification to the factory which will be applied to the dataset.

Parameters:

method: The name of the method to use, e.g. B3LYP-D3BJ.

basis: The name of the basis to use; can also be None.

program: The name of the program to execute the computation.

spec_name: The name the spec should be stored under.

spec_description: The description of the spec.

store_wavefunction: What parts of the wavefunction should be saved.

overwrite: If there is a spec under this name already, overwrite it.

implicit_solvent: The implicit solvent settings if it is to be used.

maxiter: The maximum number of SCF iterations that should be done.

scf_properties: The list of SCF properties that should be extracted from the calculation.

keywords: Program specific computational keywords that should be passed to the program.

clear_qcspecs() None

Clear out any current QCSpecs.

property components: List[Dict[str, Union[str, Dict[str, str]]]]

Gather the details of the components that were run during the creation of this dataset.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = ‘allow’ was set since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters:
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns:

new model instance

coverage_report(force_field: ForceField, verbose: bool = False) Dict[str, Dict[str, int]]

Returns a summary of how many molecules within this dataset would be assigned each of the parameters in a force field.

Notes:
  • Parameters which would not be assigned to any molecules in the dataset will not be included in the returned summary.

Args:

force_field: The force field containing the parameters to summarize.

verbose: If true a progress bar will be shown on screen.

Returns:

A dictionary of the form coverage[handler_name][parameter_smirks] = count which stores the number of molecules within this dataset that would be assigned to each parameter.
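
As an illustration of how the returned mapping might be consumed, the sketch below flattens a dictionary of the form `coverage[handler_name][parameter_smirks] = count` and picks out the least-covered parameters. The values in `coverage` are made up purely to show the shape and are not output from a real force field, and `least_covered` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# Made-up numbers that only illustrate the documented shape of the report.
coverage: Dict[str, Dict[str, int]] = {
    "Bonds": {"[#6X4:1]-[#6X4:2]": 12, "[#6X4:1]-[#1:2]": 30},
    "Angles": {"[*:1]~[#6X4:2]-[*:3]": 25},
}


def least_covered(
    report: Dict[str, Dict[str, int]], top: int = 3
) -> List[Tuple[str, str, int]]:
    """Return up to `top` (handler, smirks, count) rows with the lowest counts."""
    rows = [
        (handler, smirks, count)
        for handler, parameters in report.items()
        for smirks, count in parameters.items()
    ]
    # Sort by the molecule count so sparsely covered parameters come first.
    return sorted(rows, key=lambda row: row[2])[:top]
```

Because parameters that match no molecule are omitted from the report, a parameter missing from this listing has zero coverage rather than a low count.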

dict(*args, **kwargs)

Overwrite the dict method to handle any enums when saving to yaml/json via a dict call.

export_dataset(file_name: str, compression: Optional[str] = None) None

Export the dataset to file so that it can be used to make another dataset quickly.

Args:
file_name:

The name of the file the dataset should be written to.

compression:

The type of compression that should be added to the export.

Raises:

UnsupportedFiletypeError: If the requested file type is not supported.

Note:

The supported file types are:

  • json

Additionally, the file will be automatically compressed depending on the final extension if compression is not explicitly supplied:

  • json.xz

  • json.gz

  • json.bz2

Check serializers.py for more details. Right now bz2 seems to produce the smallest files.
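
The extension-based behaviour described in this note can be sketched with the standard library alone. This is not the code in serializers.py; the names `infer_compression`, `write_json` and `read_json` are hypothetical, and the sketch only assumes that the final extension selects between xz, gz and bz2.

```python
import bz2
import gzip
import json
import lzma
from typing import Any, Callable, Dict, Optional

# Map each supported final extension to a file opener from the standard library.
_OPENERS: Dict[str, Callable[..., Any]] = {
    "xz": lzma.open,
    "gz": gzip.open,
    "bz2": bz2.open,
}


def infer_compression(file_name: str) -> Optional[str]:
    """Return the compression implied by the final extension, or None."""
    suffix = file_name.rsplit(".", 1)[-1].lower()
    return suffix if suffix in _OPENERS else None


def _opener(file_name: str, compression: Optional[str]) -> Callable[..., Any]:
    """Pick an opener; an explicit compression takes priority over the extension."""
    compression = compression or infer_compression(file_name)
    if compression is None:
        return open
    if compression not in _OPENERS:
        raise ValueError(f"unsupported compression: {compression!r}")
    return _OPENERS[compression]


def write_json(data: Any, file_name: str, compression: Optional[str] = None) -> None:
    """Write data as JSON text, compressing when requested or implied."""
    with _opener(file_name, compression)(file_name, "wt", encoding="utf-8") as handle:
        json.dump(data, handle)


def read_json(file_name: str, compression: Optional[str] = None) -> Any:
    """Read JSON text back, decompressing when requested or implied."""
    with _opener(file_name, compression)(file_name, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

With these helpers a name such as `dataset.json.bz2` would be written through `bz2`, while a plain `dataset.json` is left uncompressed.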

filter_molecules(molecules: Union[Molecule, List[Molecule]], component: str, component_settings: Dict[str, Any], component_provenance: Dict[str, str]) None

Filter a molecule or list of molecules by the component they failed.

Args:
molecules:

A molecule or list of molecules to be filtered.

component_settings:

The dictionary representation of the component that filtered this set of molecules.

component:

The name of the component.

component_provenance:

The dictionary representation of the component provenance.

property filtered: Molecule

A generator which yields an OpenFF molecule representation for each molecule filtered while creating this dataset.

Note:

Modifying the molecule will have no effect on the data stored.

classmethod from_orm(obj: Any) Model
get_molecule_entry(molecule: Union[Molecule, str]) List[str]

Search through the dataset for a molecule and return the dataset index of any exact molecule matches.

Args:

molecule: The smiles string for the molecule or an openforcefield.topology.Molecule that is to be searched for.

Returns:

A list of dataset indices which contain the target molecule.

json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

property molecules: Generator[Molecule, None, None]

A generator that creates an openforcefield.topology.Molecule one by one from the dataset.

Note:

Editing the molecule will not affect the data stored in the dataset as it is immutable.

molecules_to_file(file_name: str, file_type: str) None

Write the molecules to the requested file type.

Args:
file_name:

The name of the file the molecules should be stored in.

file_type:

The file format that should be used to store the molecules.

Important:

The supported file types are:

  • SMI

  • INCHI

  • INCHIKEY

property n_components: int

Return the number of components that have been run while generating the dataset.

property n_filtered: int

Calculate the total number of molecules filtered by the components used in a workflow to create this dataset.

property n_molecules: int

Calculate the number of unique molecules to be submitted.

Notes:
  • This method has been improved for better performance on large datasets and has been tested on an optimization dataset of over 10500 molecules.

  • This function does not calculate the total number of entries in the dataset; see n_records for that.

property n_qc_specs: int

Return the number of QCSpecs on this dataset.

property n_records: int

Return the total number of records that will be created on submission of the dataset.

Note:
  • The number returned will be different depending on the dataset used.

  • The number of unique molecules can be found using n_molecules.

classmethod parse_file(file_name: str)

Create a Dataset object from a compressed json file.

Args:

file_name: The name of the file the dataset should be created from.

classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
remove_qcspec(spec_name: str) None

Remove a QCSpec from the dataset.

Parameters:

spec_name: The name of the spec that should be removed.

Note:

The QCSpec settings are not mutable and so they must be removed and a new one added to ensure they are fully validated.

classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
submit(client: PortalClient, ignore_errors: bool = False, verbose: bool = False) Dict

Submit the dataset to a QCFractal server.

Args:
client:

Instance of a portal client

ignore_errors:

Set this to True to submit the compute regardless of errors. This is mainly used to override the basis coverage check.

verbose:

Whether progress bars and submission statistics should be printed (True) or not (False).

Returns:

A dictionary of the compute response from the client for each specification submitted.

Raises:
MissingBasisCoverageError:

If the chosen basis set does not cover some of the elements in the dataset.

classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
visualize(file_name: str, columns: int = 4, toolkit: Optional[Literal['openeye', 'rdkit']] = None) None

Create a pdf file of the molecules with any torsions highlighted using either openeye or rdkit.

Args:
file_name:

The name of the pdf file which will be produced.

columns:

The number of molecules per row.

toolkit:

The option to specify the backend toolkit used to produce the pdf file.

field dataset_name: str [Required]

The name of the dataset; this will be the name given to the collection in QCArchive.

field dataset_tagline: constr(min_length=8, regex='[a-zA-Z]') [Required]

The tagline should be a short description of the dataset which will be displayed by the QCArchive client when the datasets are listed.

Constraints:
  • minLength = 8

  • pattern = [a-zA-Z]

field description: constr(min_length=8, regex='[a-zA-Z]') [Required]

A long description of the dataset's purpose and details about the molecules within.

Constraints:
  • minLength = 8

  • pattern = [a-zA-Z]
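
The two constraints listed for dataset_tagline and description can be read as a length check plus a pattern check. The sketch below applies them with JSON Schema semantics, where a pattern may match anywhere in the string. The helper `satisfies_constraints` is hypothetical, and pydantic's own validation may be stricter (for instance by anchoring the pattern at the start of the string), so treat this only as a quick pre-check.

```python
import re

MIN_LENGTH = 8  # minLength = 8
PATTERN = re.compile(r"[a-zA-Z]")  # pattern = [a-zA-Z]


def satisfies_constraints(text: str) -> bool:
    """Hypothetical pre-check: long enough and containing at least one letter."""
    return len(text) >= MIN_LENGTH and PATTERN.search(text) is not None
```

A tagline such as `Torsion scans` passes both checks, while `short` fails on length and `12345678` fails on the pattern.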

field metadata: Metadata = Metadata(submitter='docs', creation_date=datetime.date(2024, 1, 18), collection_type=None, dataset_name=None, short_description=None, long_description_url=None, long_description=None, elements=set())

The metadata describing the dataset.

field provenance: Dict[str, str] = {}

A dictionary of the software and versions used to generate the dataset.

field dataset: Dict[str, DatasetEntry] = {}

The actual dataset to be stored in QCArchive.

field filtered_molecules: Dict[str, FilterEntry] = {}

The set of workflow components used to generate the dataset with any filtered molecules.

field driver: SinglepointDriver = SinglepointDriver.energy

The type of single point calculations which will be computed. Note that some services require certain calculations; for example, optimizations require gradient calculations.

field priority: str = 'normal'

The priority the dataset should be computed at compared to other datasets currently running.

field dataset_tags: List[str] = ['openff']

The dataset tags which help identify the dataset.

field compute_tag: str = 'openff'

The tag the compute tasks will be assigned to; managers wishing to execute these tasks should use this compute tag.

field qc_specifications: Dict[str, QCSpec] = {'default': QCSpec(method='B3LYP-D3BJ', basis='DZVP', program='psi4', spec_name='default', spec_description='Standard OpenFF optimization quantum chemistry specification.', store_wavefunction=<WavefunctionProtocolEnum.none: 'none'>, implicit_solvent=None, maxiter=200, scf_properties=[<SCFProperties.Dipole: 'dipole'>, <SCFProperties.Quadrupole: 'quadrupole'>, <SCFProperties.WibergLowdinIndices: 'wiberg_lowdin_indices'>, <SCFProperties.MayerIndices: 'mayer_indices'>], keywords={})}

The QCSpecifications which will be computed for this dataset.