This repository provides a simple way to parse academic papers in PDF format into XML using GROBID and then convert the resulting XML files into JSON. The tool is useful for extracting structured information from research papers for data analysis or machine learning.
- Converts PDF files of academic papers into XML format using GROBID.
- Parses XML files into structured JSON format.
- Extracts metadata such as title, authors, abstract, body content, and references.
- Python: Ensure you are using Python version 3.6 or above.
- GROBID: You need access to a GROBID service for the PDF-to-XML conversion, either through the GROBID cloud server or through a local instance.
Start by cloning the repository to your local machine:
git clone https://github.com/bayyy7/automatic_paperParser.git
Ensure you have Python installed (version 3.6 or higher). It's recommended to use a virtual environment to manage dependencies.
To convert PDF files into XML, you need to use the GROBID service:
- Cloud Option: Open the GROBID cloud server.
- Local Option: Alternatively, you can set up GROBID locally by following the instructions in the GROBID repository.
Once you have access to GROBID:
- Navigate to the GROBID homepage.
- Select TEI from the navigation bar.
- Choose Process Fulltext Document in the "Service to call" section.
- Check the Consolidate Header option to improve metadata extraction.
- Upload your PDF file and click Submit.
After processing, download the resulting TEI XML file and place it in the Dataset folder of this repository. If you have multiple PDFs, repeat the upload and download steps for each file.
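For many PDFs, the manual upload can be scripted against GROBID's REST interface. The following is a minimal sketch, not part of this repository: it assumes a local GROBID instance at `http://localhost:8070` and uses the `/api/processFulltextDocument` endpoint with `consolidateHeader=1`, which corresponds to the Consolidate Header checkbox. The helper names `build_multipart` and `pdf_to_tei` are hypothetical, and so is the `.xml` output naming, which may need adjusting to whatever parser.py expects.

```python
import uuid
import urllib.request
from pathlib import Path


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode plain form fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        field = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        parts.append(field.encode("utf-8"))
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    )
    parts.append(file_header.encode("utf-8"))
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def pdf_to_tei(pdf_path, out_dir="Dataset", base_url="http://localhost:8070"):
    """Send one PDF to GROBID and save the returned TEI XML in out_dir.

    base_url is an assumption (a local GROBID instance on its default port).
    """
    pdf_path = Path(pdf_path)
    body, content_type = build_multipart(
        {"consolidateHeader": "1"},  # same effect as the Consolidate Header option
        "input",                     # GROBID reads the PDF from the "input" field
        pdf_path.name,
        pdf_path.read_bytes(),
    )
    request = urllib.request.Request(
        f"{base_url}/api/processFulltextDocument",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        tei_xml = response.read()
    out_path = Path(out_dir) / (pdf_path.stem + ".xml")  # assumed naming scheme
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(tei_xml)
    return out_path
```

With GROBID running, calling `pdf_to_tei("paper.pdf")` for each file would write one TEI XML file per PDF into the Dataset folder. Only the standard library is used here, so nothing extra needs installing.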
With the XML files in place, run the parser.py script to convert the XML files into JSON format:
python parser.py
The parsed JSON files will be saved in the JSON_Parsed folder. Each JSON file will have a structure similar to the original XML file but formatted for easy data manipulation.
The output JSON files follow a structure similar to the XML file, with specific sections:
- teiHeader: Contains the title, publication date, and authors of the paper.
- profileDesc: Contains the abstract of the paper.
- body: Includes all sections of the paper, from the introduction to the conclusion.
- back: Contains all references cited in the paper.
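To illustrate how TEI elements can map onto these four sections, here is a small sketch using Python's standard `xml.etree.ElementTree`. It is not the code in parser.py: the function names, the inner key names (`title`, `date`, `authors`, `abstract`, `heading`, `paragraphs`), and the tiny `SAMPLE_TEI` document are all made up for illustration, and real GROBID output contains many more elements.

```python
import json
import xml.etree.ElementTree as ET

# TEI documents use this XML namespace, so every lookup must be qualified.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written miniature TEI document, for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A Sample Paper</title></titleStmt>
      <publicationStmt><date type="published" when="2020-05-01">1 May 2020</date></publicationStmt>
      <sourceDesc><biblStruct><analytic>
        <author><persName><forename>Ada</forename><surname>Lovelace</surname></persName></author>
      </analytic></biblStruct></sourceDesc>
    </fileDesc>
    <profileDesc><abstract><p>We study things.</p></abstract></profileDesc>
  </teiHeader>
  <text>
    <body><div><head>Introduction</head><p>First paragraph.</p></div></body>
    <back><div type="references"><listBibl>
      <biblStruct><analytic><title>Cited Work</title></analytic></biblStruct>
    </listBibl></div></back>
  </text>
</TEI>"""


def text_of(element):
    """Return the element's full text with whitespace collapsed, or ''."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def tei_to_dict(xml_string):
    """Convert a TEI XML string into a dict with the four sections above."""
    root = ET.fromstring(xml_string)
    header = root.find("tei:teiHeader", NS)
    date = header.find(".//tei:publicationStmt/tei:date", NS)
    authors = [
        " ".join(text_of(part) for part in pers_name)  # forename(s) + surname
        for pers_name in header.findall(".//tei:analytic/tei:author/tei:persName", NS)
    ]
    body = [
        {
            "heading": text_of(div.find("tei:head", NS)),
            "paragraphs": [text_of(p) for p in div.findall("tei:p", NS)],
        }
        for div in root.findall("tei:text/tei:body/tei:div", NS)
    ]
    back = root.find("tei:text/tei:back", NS)
    references = (
        [text_of(bibl.find(".//tei:title", NS)) for bibl in back.findall(".//tei:biblStruct", NS)]
        if back is not None
        else []
    )
    return {
        "teiHeader": {
            "title": text_of(header.find(".//tei:titleStmt/tei:title", NS)),
            "date": date.get("when", "") if date is not None else "",
            "authors": authors,
        },
        "profileDesc": {"abstract": text_of(header.find("tei:profileDesc/tei:abstract", NS))},
        "body": body,
        "back": references,
    }


if __name__ == "__main__":
    print(json.dumps(tei_to_dict(SAMPLE_TEI), indent=2))
```

Missing elements fall back to empty strings or empty lists instead of raising, which is one reason blank values can appear in the output when a paper's layout differs from what the lookups expect.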
- The parsed JSON may occasionally be incorrect or contain blank values, because formatting varies between authors and journals. You may need to modify the parser.py script to suit your specific requirements.
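Before editing the parser, it can help to see which fields came out blank. The helper below is a hypothetical sketch, not part of this repository: it walks any nested JSON-like structure and lists the paths of values that are `None`, blank strings, or empty containers. The keys in the usage example are illustrative only.

```python
def find_empty_fields(value, path=""):
    """Return paths of None, blank strings, and empty lists/dicts in nested data."""
    is_blank_string = isinstance(value, str) and not value.strip()
    is_empty_container = isinstance(value, (list, dict)) and not value
    if value is None or is_blank_string or is_empty_container:
        return [path or "<root>"]

    empty = []
    if isinstance(value, dict):
        for key, item in value.items():
            empty += find_empty_fields(item, f"{path}.{key}" if path else str(key))
    elif isinstance(value, list):
        for index, item in enumerate(value):
            empty += find_empty_fields(item, f"{path}[{index}]")
    return empty


if __name__ == "__main__":
    parsed = {
        "teiHeader": {"title": "A Sample Paper", "authors": []},
        "profileDesc": {"abstract": "   "},
        "body": [{"heading": "", "paragraphs": ["First paragraph."]}],
    }
    print(find_empty_fields(parsed))
    # ['teiHeader.authors', 'profileDesc.abstract', 'body[0].heading']
```

Running this over a file loaded with `json.load` from the JSON_Parsed folder gives a quick list of the spots worth checking against the original PDF.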
For additional papers to process, you can find academic articles on these platforms:
You can find my other projects and work in my repository.
Rizky Indrabayu