This repository contains an mmCIF investigation dictionary. It provides a data representation that captures the relationships between macromolecular structures deposited in the worldwide Protein Data Bank (wwPDB) and data from other databases and databanks, enriched with additional metadata that describes an investigation, i.e. a series of related structures that were collected for one project and together provide insight.
This dictionary is an extension of the PDBx/mmCIF dictionary and provides the additional definitions required for investigation files. Investigation files are umbrella files for a set of coordinates / models and their corresponding experimental data files.
The primary example showcased here is for fragment screening investigations. Fragment screening experiments in structural biology involve the determination of multiple atomic-level models to analyze how small molecule fragments interact with protein targets. These experiments facilitate drug discovery efforts.
Traditional PDB entries represent individual structures, but many research projects generate collections of related structures whose relationships are not captured by the individual entries. InvestigationCIF addresses this by:
- Creating umbrella files that link multiple coordinate files and their experimental data
- Adding contextual metadata about the overall investigation goals and methods
- Enabling better discoverability and analysis of related structural data
- Supporting reproducible research through standardized metadata capture
Fragment Screening Investigation mmCIF files created from PDB group depositions are available at:
https://ftp.ebi.ac.uk/pub/databases/msd/fragment_screening/investigations/
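Once a file from this directory has been downloaded, a quick way to see what it contains is to list its mmCIF categories. The following is a minimal sketch and not part of this repository: `list_categories` is a hypothetical helper, the sample text uses standard PDBx/mmCIF items rather than investigation-specific ones, and the line-based matching ignores multi-line text fields. For real work a full mmCIF parser is a better choice.

```python
import re

# mmCIF data names look like "_category.item": underscore, category, dot, item.
DATA_NAME = re.compile(r"^_([A-Za-z0-9_\-\[\]]+)\.(\S+)")


def list_categories(cif_text):
    """Return the category names found in cif_text, in order of first appearance."""
    seen = []
    for line in cif_text.splitlines():
        match = DATA_NAME.match(line.strip())
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Illustrative sample only; real investigation files contain many more categories.
sample = """data_example
_entry.id example
loop_
_struct_keywords.entry_id
_struct_keywords.text
example 'fragment screening'
"""

print(list_categories(sample))
```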
An investigation mmCIF file can be created through mmcif-gen, which is a Python tool for generating mmCIF files.
mmcif-gen can be used to create an investigation mmCIF file from internal databases at research facilities, such as a synchrotron. Each facility stores its data internally in a different way, and the data is available in different formats (e.g. SQL files, JSON files, etc.). Consequently, each facility has a different configuration file (i.e. a different operation facility.json).
Alternatively, one can generate an investigation mmCIF file from a set of PDB IDs that correspond to fragment screen hits that have been deposited to the wwPDB, for example:
# Fetch configuration for PDB files
mmcif-gen fetch-facility-json pdbe_investigation
# Generate an investigation file
mmcif-gen make-mmcif --json pdbe_investigation.json --output-folder ./out --id I_321 pdbe --pdb-ids 5rvz 5rvy 5rvw
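When the list of PDB IDs comes from a script rather than being typed by hand, it can help to check the IDs and assemble the command programmatically. The sketch below only rebuilds the command shown above, using the same flags. `build_command` is a hypothetical helper, and the ID check assumes the classic four-character PDB ID format, which is an assumption of this example rather than something mmcif-gen is known to enforce.

```python
import re
import shlex

# Assumption: classic four-character PDB IDs (a digit followed by three alphanumerics).
PDB_ID = re.compile(r"^[1-9][A-Za-z0-9]{3}$")


def build_command(investigation_id, pdb_ids,
                  json_path="pdbe_investigation.json",
                  output_folder="./out"):
    """Return the mmcif-gen make-mmcif command as an argument list."""
    bad = [pdb_id for pdb_id in pdb_ids if not PDB_ID.match(pdb_id)]
    if bad:
        raise ValueError(f"Not valid four-character PDB IDs: {bad}")
    return [
        "mmcif-gen", "make-mmcif",
        "--json", json_path,
        "--output-folder", output_folder,
        "--id", investigation_id,
        "pdbe", "--pdb-ids", *pdb_ids,
    ]


cmd = build_command("I_321", ["5rvz", "5rvy", "5rvw"])
print(shlex.join(cmd))
```

The argument list can be passed to `subprocess.run(cmd, check=True)` once mmcif-gen is installed and the configuration file has been fetched.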
For more extensive documentation on using mmcif-gen, see the mmcif-gen PyPI page or the mmcif-gen GitHub repository.
- README.md - this file
- MMCIF investigation extension - the investigation dictionary, an extension to the wwPDB mmCIF dictionary (file name: mmcif_pdbx_v50.dic)
- MMCIF investigation combined with the wwPDB dictionary
- Examples - directory with examples of investigation mmCIF file(s) compliant with the MMCIF investigation dictionary
Fragment-based screening (FBS) is a complex and data-rich endeavour, wherein each stage of the process can generate different file types of complex data, in both raw and processed forms. The popularity of fragment screening in academic scientific research and the pharmaceutical industry is reflected by the increasing number of facilities, such as synchrotrons, that support fragment screening experiments.
Synchrotrons are central service facilities that support experimental data generation for structural biology, offering multiple options for X-ray crystallography.
Individuals from synchrotrons across Europe were involved in developing the data model for fragment screening in this repository. The Protein Data Bank in Europe, in collaboration with other organizations from the worldwide Protein Data Bank, has led the project.
Synchrotrons and associated facilities involved in developing this data model:
- The Crystallisation Facility at the European Molecular Biology Laboratory (EMBL) Grenoble and European Synchrotron Radiation Facility (ESRF) in France
- XChem: Diamond Fragment Screening at Diamond Light Source (DLS) in the United Kingdom
- Fragment Screening Facility at Berlin synchrotron BESSY-MX and Helmholtz-Zentrum Berlin/HZB in Germany
- FragMAX at Swedish synchrotron MAX IV in Sweden
- iNEXT-Discovery - a European Union funded project via Horizon 2020 (Grant agreement ID: 871037)
- FragmentScreen - a European Union funded project via Horizon Europe (Grant agreement ID: 101094131)
The contents of this repository are available to all under the Creative Commons Zero (CC0) designation.
We welcome contributions to improve the InvestigationCIF dictionary. For changes, please open an issue first to discuss what you would like to change.
For any feedback or suggestions, email us at pdbehelp@ebi.ac.uk. Please include 'InvestigationCIF' in your subject line.
