This project provides a pipeline to convert academic/research PDFs into cleaned Markdown files, with support for:
- Trimming margins and removing line numbers.
- YOLO-based region detection (text, tables, figures, formulas).
- Post-processing filters for noise, licenses, and references.
- Markdown export with embedded images for figures/tables.
- Optional raw JSONL outputs for downstream processing.
.
├── main.py
├── pdf_extractor.py
├── markdown_coverter.py
├── yolo_model/
│ ├── YoloModel.py
│ ├── YoloHelper.py
│ ├── YoloPipline.py
│ └── doclaynet.pt # YOLO model weights (DocLayNet)
├── pdf_processor/
│ ├── NumberPaper.py
│ └── PdfTrimmer.py
├── text_filters/
│ ├── LicenseFilter.py
│ └── ReferenceFilter.py
PdfTrimmer.py– trims page margins (top, bottom, left, right) while avoiding rotated pages.NumberPaper.py– detects and removes line numbers from PDF margins using heuristics.
YoloModel.py– runs YOLO (DocLayNet weights) on PDF pages to detect text, figures, tables, and formulas.YoloHelper.py– processes YOLO outputs: merges bounding boxes, removes noise, raises headers, and restructures text.YoloPipline.py– orchestrates detection + post-processing into JSONL outputs.
markdown_coverter.py– converts processed JSONL into Markdown format, embedding images when available.LicenseFilter.py– removes boilerplate license/rights text.ReferenceFilter.py– removes references/bibliographies and cleans extraneous text.
pdf_extractor.py– core pipeline that:- Trims PDFs and removes line numbers.
- Runs YOLO layout extraction.
- Applies license/reference filters.
- Saves
.md,.jsonl, and removed sections.
main.py– example script to run extraction on a folder of PDFs.
pip install -r requirements.txt
Dependencies include:
PyMuPDF(fitz)PyPDF2ultralytics(YOLOv8)Pillowtqdm
The first time you run the pipeline, weights_utils.py will automatically download the DocLayNet YOLOv8X checkpoint from Hugging Face
(malaysia-ai/YOLOv8X-DocLayNet-Full-1024-42)
and save it as: yolo_model/doclaynet.pt
On subsequent runs, the script will detect the existing file and reuse it, so you don’t need to manually manage the weights.
python main.py <input_folder> <output_folder> [--save-raw-json] [--save-removed]
Arguments
- input_folder – Path to the folder containing PDFs (e.g., paper/前列腺癌)
- output_folder – Name of the dataset/output folder (e.g., 前列腺癌).
- --save-raw-json – (optional) Save raw JSONL outputs from YOLO post-processing.
- --save-removed – (optional) Save removed license/reference sections.
After running, you’ll get:
output_folder/
│── outputs/
│ ├── <paper_name>.md # Final Markdown
│── outputs/images/ # Extracted figures/tables
│── raw_outputs/ # JSONL structured outputs
│── removed/ # Removed license/reference sections
- Place your PDFs into
paper/<topic>/. - Run
python main.py paper/<topic>/ outputs. - Check
outputs/for Markdown files. - Open
.mdfiles in any Markdown editor to view cleaned text + figures.
This project is provided for research and educational purposes.
Make sure you comply with dataset (DocLayNet) and paper license terms when processing PDFs.