GitHub - aspuru-guzik-group/MERMaid: Multimodal aid for mining of chemical reactions from PDFs

MERMaid (Multimodal aid for Reaction Mining)

1. Overview

MERMaid is an end-to-end knowledge ingestion pipeline to automatically convert disparate information conveyed through figures, schemes, and tables across various PDFs into a coherent and machine-actionable knowledge graph. It integrates three sequential modules:

VisualHeist for table and figure segmentation from PDFs
DataRaider for multimodal analysis to extract relevant information as structured reaction schema
KGWizard for automated knowledge graph construction

You can run MERMaid directly or use VisualHeist and DataRaider as standalone tools for their specific functionality.

⚠️ MERMaid currently supports only the OpenAI provider. Make sure your account has sufficient credits; otherwise, you will encounter errors. (Note: running VisualHeist by itself does not require an API key.) We will extend MERMaid to support other providers and open-source VLMs in future updates.

VisualHeist works best on systems with high RAM. For optimal performance, ensure that your system has sufficient memory, as running out of memory may cause the process to be terminated prematurely.

If you use MERMaid and its submodules in your research, please cite our preprint. Note that this content is a preprint and has not been peer-reviewed.

@article{
    MERMaid,
    title = {MERMaid: Universal multimodal mining of chemical reactions from PDFs using vision-language models},
    author = {Shi Xuan Leong and Sergio Pablo-García and Brandon Wong and Alán Aspuru-Guzik},
    DOI = {10.26434/chemrxiv-2025-8z6h2},
    journal = {ChemRxiv},
    year = {2025},
}

2. Installation

2.1 Create a new virtual environment

The recommended Python version is 3.9.

Using Conda:

conda create -n mermaid-env python=3.9
conda activate mermaid-env

Using venv:

python3.9 -m venv mermaid-env
source mermaid-env/bin/activate

2.2 Install RxnScribe for Optical Chemical Structure Recognition

This module is required to extract the SMILES strings of reactants and products.

git clone https://github.com/thomas0809/RxnScribe.git
cd RxnScribe
pip install -r requirements.txt
python setup.py install
cd ..

⚠️ You may see a compatibility warning about MolScribe version 1.1.1 not being compatible with Torch versions >2.0. This can be safely ignored.

2.3 Install/Setup of JanusGraph server

In order to run KGWizard, a running local JanusGraph server is required.

Install Java 8 SE from Oracle.

Install JanusGraph, tested with version 1.1.0.

Unzip the JanusGraph zip file in the same folder that has RxnScribe.

unzip janusgraph-1.1.0.zip

2.4 Install MERMaid

Download the repository and install dependencies in the same folder that contains both RxnScribe and janusgraph-1.1.0:

git clone https://github.com/aspuru-guzik-group/MERMaid/
cd MERMaid
pip install -e .

For the full MERMaid pipeline:

pip install MERMaid[full]

For individual modules:

pip install MERMaid[visualheist]
pip install MERMaid[dataraider]
pip install MERMaid[kgwizard]

3. Usage

3.1 Setting Up Your Configuration File

Settings can be provided through the configuration file found in scripts/startup.json or through a newly generated configuration file (see 3.3.3 CFG Command). VisualHeist and DataRaider can be run via the configuration file or via the command line (see 3.4 Running Individual Modules), whereas the full MERMaid pipeline requires settings to be provided through a configuration file (see 3.3.2 RUN Command).

Define custom settings in scripts/startup.json:

pdf_dir: Full path to directory where PDFs are stored (required for running VisualHeist).
image_dir: Full path to directory to store extracted images or where images are currently stored (required for running DataRaider).
json_dir: Full path to directory to store JSON output (required for running DataRaider and/or KGWizard).
graph_dir: Full path to directory to store graph files (required for running KGWizard).
prompt_dir: Full path to directory containing prompt files (required for running DataRaider).
model_size: Choose between 'base' or 'large' (required for running VisualHeist).
keys: List of reaction parameter keys (required for running DataRaider).
new_keys: Additional keys for new reactions (required for running DataRaider).
graph_name: Name for the generated knowledge graph (required for running KGWizard).
schema: User-prepared schema for the knowledge graph (required for running KGWizard).
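
As a rough illustration, the settings listed above could be assembled and checked with a short Python script. The key names come from the list above; every value is a placeholder, and the value types (lists for keys and new_keys, a plain string for schema) are assumptions, so compare the result with the shipped scripts/startup.json before relying on it. The helper names `build_config` and `missing_keys` are hypothetical.

```python
import json
from pathlib import Path

# Key names taken from the settings list above.
REQUIRED_KEYS = [
    "pdf_dir", "image_dir", "json_dir", "graph_dir", "prompt_dir",
    "model_size", "keys", "new_keys", "graph_name", "schema",
]


def build_config(base_dir):
    """Return a settings dict with full paths under base_dir (placeholder values)."""
    base = Path(base_dir).resolve()
    return {
        "pdf_dir": str(base / "pdfs"),
        "image_dir": str(base / "images"),
        "json_dir": str(base / "json"),
        "graph_dir": str(base / "graphs"),
        "prompt_dir": str(base / "Prompts"),
        "model_size": "base",          # 'base' or 'large'
        "keys": ["placeholder_key"],   # assumption: a list of strings
        "new_keys": [],                # assumption: a list of strings
        "graph_name": "g",             # placeholder graph name
        "schema": "echem",             # assumption: a schema name or a path
    }


def missing_keys(config):
    """List the required setting names that are absent from config."""
    return [key for key in REQUIRED_KEYS if key not in config]


if __name__ == "__main__":
    config = build_config(".")
    problems = missing_keys(config)
    if problems:
        raise SystemExit(f"Missing settings: {problems}")
    print(json.dumps(config, indent=2))
```

Redirecting the printed JSON to a file gives a starting point that can then be edited by hand and passed to the commands described in section 3.3.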

Additional notes:

The in-built reaction parameter keys are in Prompts/inbuilt_keyvaluepairs.txt.
For post-processing extracted JSON reaction dictionaries:
- Modify COMMON_NAMES in dataraider/postprocess.py to add custom chemical names.
- Modify KEYS in dataraider/postprocess.py to clean specific key names.
Customize filter_prompt in Prompts/ to filter relevant images.
You can use one of our prepared schemas found in src/kgwizard/graphdb/schemas.

3.2 Setting Up API Key

The environment variable OPENAI_API_KEY is required for DataRaider and KGWizard. You can set this variable in your terminal session using the following command:

export OPENAI_API_KEY="your-openai-api-key"

This method sets the API key for the current terminal session, and the environment variable will be available to any processes started from that session.

Alternatively, you can create a .env file in the root directory of the MERMaid project (the same directory where README.md is located) and add the following line to it:

OPENAI_API_KEY="your-openai-api-key"

This will automatically set the OPENAI_API_KEY environment variable whenever you run the project.
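
Either option can be verified before starting a run. The sketch below is not MERMaid's own loading code: `read_env_file` and `find_api_key` are hypothetical helper names, the parser only handles simple KEY=VALUE lines, and letting the terminal variable take precedence over the .env file is an assumption of this sketch.

```python
import os
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def find_api_key(env_path=".env", environ=None):
    """Return the key from the environment, else from the .env file, else None."""
    environ = os.environ if environ is None else environ
    from_environment = environ.get("OPENAI_API_KEY")
    return from_environment or read_env_file(env_path).get("OPENAI_API_KEY")


if __name__ == "__main__":
    if find_api_key():
        print("OPENAI_API_KEY found")
    else:
        print("OPENAI_API_KEY is not set")
```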

3.3 Running the Full MERMaid Pipeline

3.3.1 Start JanusGraph Server

A running JanusGraph server is required for running the full MERMaid pipeline and KGWizard (see 3.4.3 KGWizard – Data-to-Knowledge Graph Translation).

Start the JanusGraph Server (Choose either option):

Note: Server requires 2–8 GB RAM.

Foreground:

Open a separate terminal and navigate into the janusgraph-1.1.0 folder.

To start the server:

./bin/janusgraph-server.sh ./conf/gremlin-server/gremlin-server.yaml

To terminate the server use Ctrl+C

Background:

To start the server:

cd janusgraph-1.1.0
./bin/janusgraph-server.sh start

To terminate the server:

./bin/janusgraph-server.sh stop

By default, the running JanusGraph server listens at address ws://localhost on port 8182.
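
Before launching the pipeline, it can help to confirm that something is listening on that port. The sketch below uses only the Python standard library; `server_is_listening` is a hypothetical helper name, and a successful TCP connection only shows that the port is open, not that the JanusGraph server is fully initialized.

```python
import socket


def server_is_listening(host="localhost", port=8182, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved host names.
        return False


if __name__ == "__main__":
    if server_is_listening():
        print("Port 8182 is open on localhost")
    else:
        print("Nothing is listening on localhost:8182")
```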

3.3.2 RUN Command

Run the mermaid pipeline (visualheist, dataraider, kgwizard sequentially)

mermaid RUN --config ./scripts/startup.json

Option	Description
`--config`	Path to the configuration file

Intermediate files will be saved in the Results/ directory.

3.3.3 CFG Command

Output a configuration file of the same form as scripts/startup.json

mermaid CFG --out_location ./

Option	Description
`--out_location`	Path to save new configuration file

3.4 Running Individual Modules

3.4.1 VisualHeist – Image Segmentation from PDFs

VisualHeist can be run using the settings provided in scripts/startup.json using:

visualheist

Or can be run using command line arguments with the following:

visualheist \
  --config ./scripts/startup.json \
  --pdf_dir /path/to/pdf \
  --image_dir /path/to/save/images \
  --model_size base

Option	Description
`--config`	Path to the configuration file. If specified, ignores other arguments
`--pdf_dir`	Path to the input PDF directory
`--image_dir`	Path to the output image directory
`--model_size`	Model size to use, either `base` or `large`

3.4.2 DataRaider – Image-to-Data Conversion

DataRaider can be run using the settings provided in scripts/startup.json using:

dataraider

Or can be run using command line arguments with the following:

dataraider \
  --config ./scripts/startup.json \
  --image_dir /path/to/save/images \
  --prompt_dir ./Prompts \
  --json_dir ./

Option	Description
`--config`	Path to the configuration file. If specified, ignores other arguments
`--image_dir`	Directory containing images to process
`--prompt_dir`	Directory containing prompt files (should point to `Prompts` directory)
`--json_dir`	Directory to save processed JSON data
`--keys`	List of keys to extract
`--new_keys`	List of new keys for data extraction

A sample output JSON is available in the Assets folder.

3.4.3 KGWizard – Data-to-Knowledge Graph Translation

KGWizard comes with two commands.

3.4.3.1 Transform Command

Converts raw JSON to intermediate format, optionally performs RAG lookup and updates database.

kgwizard transform ./input_data \
  --output_dir ./results \
  --output_file ./results/my_graph.graphml \
  --substitutions "material:Material" "atmosphere:Atmosphere" \
  --address ws://localhost \
  --port 8182 \
  --schema echem \
  --graph_name g

Option	Description
`input_dir` (positional argument)	Folder where the JSON files from DataRaider are stored
`--output_dir`	Folder where the generated intermediate JSON files will be stored. The folder will be created automatically. Defaults to ./results.
`--no_parallel`	If active, run the conversions sequentially instead of using the dynamic increase parallel algorithm. Overrides the --workers flag.
`--workers`	If defined, use this number of parallel workers instead of the dynamic increase algorithm.
`--substitutions`	Substitution to be made in the instructions file. The input format consists of a pair formed by the substitution keyword and the node label, separated by a colon (keyword:node_name). If substitutions are not given, the RAG module will not be executed.
`--dynamic_start`	Starting number of workers for the dynamic algorithm.
`--dynamic_steps`	Maximum number of steps of the dynamic parallelization algorithm.
`--dynamic_max_workers`	Maximum number of workers of the dynamic parallelization algorithm.
`--address`	JanusGraph server address. Defaults to ws://localhost.
`--port`	JanusGraph port. Defaults to 8182.
`--graph_name`	JanusGraph graph name. Defaults to g.
`--schema`	Node/Edge schema to be used during the json conversion. Can be either a path or any of the default schemas: photo,org,echem. Defaults to echem
`--output_file`	If set, save the generated graph into the specified file after updating the database.
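
To make the keyword:node_name format of `--substitutions` concrete, here is a minimal Python sketch that splits such pairs into a mapping. It illustrates the input format only and is not KGWizard's actual parsing code; `parse_substitutions` is a hypothetical name.

```python
def parse_substitutions(pairs):
    """Turn ["material:Material", ...] into {"material": "Material", ...}."""
    result = {}
    for pair in pairs:
        # Split on the first colon: keyword on the left, node label on the right.
        keyword, separator, node_label = pair.partition(":")
        if not separator or not keyword or not node_label:
            raise ValueError(f"expected keyword:node_name, got {pair!r}")
        result[keyword] = node_label
    return result


if __name__ == "__main__":
    print(parse_substitutions(["material:Material", "atmosphere:Atmosphere"]))
    # prints {'material': 'Material', 'atmosphere': 'Atmosphere'}
```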

3.4.3.2 Parse Command

Parses intermediate JSONs (from transform command) into schema-based graph and uploads to JanusGraph. It also saves a .graphml file representing the final graph state.

kgwizard parse ./results \
  --address ws://localhost \
  --port 8182 \
  --graph_name g \
  --schema /path/to/custom_schema.py \
  --output_file ./final_graph.graphml

Option	Description
`input_dir` (positional argument)	Folder where the JSON files from `transform` are stored
`--address`	JanusGraph server address. Defaults to ws://localhost
`--port`	JanusGraph port. Defaults to 8182
`--graph_name`	JanusGraph graph name. Defaults to g
`--schema`	Node/Edge schema to be used during the json conversion. Can be either a path or any of the default schemas: photo,org,echem. Defaults to echem
`--output_file`	If set, save the generated graph into the specified file after updating the database
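
A quick sanity check on the saved .graphml file is to count its nodes and edges. The sketch below uses only the Python standard library and assumes the file follows the usual GraphML layout with `node` and `edge` elements. It matches on local tag names so that it works with or without an XML namespace. `count_graphml_elements` is a hypothetical helper name.

```python
import sys
import xml.etree.ElementTree as ET


def count_graphml_elements(path):
    """Return (number_of_nodes, number_of_edges) found in a GraphML file."""
    counts = {"node": 0, "edge": 0}
    for element in ET.parse(path).getroot().iter():
        # Tags look like "{namespace}node"; keep only the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in counts:
            counts[local_name] += 1
    return counts["node"], counts["edge"]


if __name__ == "__main__":
    if len(sys.argv) > 1:
        nodes, edges = count_graphml_elements(sys.argv[1])
        print(f"{nodes} nodes, {edges} edges")
```

Saved under a file name of your choice, the script takes the path given to `--output_file` as its only argument.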

4. Running the MERMaid Web App (Recommended for New Users)

MERMaid comes with a web interface for running the modules interactively via a browser. You can configure your input folders, select extraction keys, and run modules with no coding required.

To launch the app locally:

./launch_webapp.sh

Then, open http://localhost:850x in your browser.

You must have your OPENAI_API_KEY set in your .env file (or terminal) before launching the app. You can follow the instructions in 3.2 Setting Up API Key.

A JanusGraph server is not required for the web interface to run, but it is required if using either KGWizard or the full MERMaid pipeline through the interface (see 3.3.1 Start JanusGraph Server).

5. Adapting MERMaid

For instructions on how to extend DataRaider and KGWizard for your target chemical domains, please check out the DataRaider README file and the KGWizard README file.
