and set the MinerU backend (`vlm-vllm-engine` or `vlm-transformers`) and the LLM max token length (recommended not to exceed 128000, to avoid the LLM forgetting details).
**Caution: The pipeline was only tested with the `vlm` backend; compatibility with the `pipeline` backend is uncertain due to format differences. Using the `vlm` backend is recommended.**
The `vlm-vllm-engine` backend requires GPU support.
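Since only `vlm-vllm-engine` needs a GPU, a small helper can fall back to `vlm-transformers` when none is available. This is a sketch, not part of the pipeline: it assumes PyTorch is available for GPU detection and that `vlm-transformers` is acceptable as a CPU-side fallback.

```python
def choose_mineru_backend() -> str:
    """Pick a MinerU backend string based on GPU availability.

    Sketch only: backend names come from the docs above; the assumption
    that `vlm-transformers` is the right non-GPU fallback is ours.
    """
    try:
        import torch  # optional dependency, used only for GPU detection
        has_gpu = torch.cuda.is_available()
    except ImportError:
        has_gpu = False
    return "vlm-vllm-engine" if has_gpu else "vlm-transformers"
```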
`VQAExtractor` chunks the layout JSON to respect token limits, builds subject-aware prompts (`QAExtractPrompt`), and batches LLM calls via `APILLMServing_request`. Key behaviors:
- Groups and pairs Q&A entries, and inserts images at the proper positions.
- Supports `question_pdf_path` + `answer_pdf_path`, or a single `pdf_path` (auto-detect interleaved mode).
- Copies rendered images into `output_dir/question_images` and/or `answer_images`.
- Parses `<qa_pair>`, `<question>`, `<answer>`, `<solution>`, `<chapter>`, and `<label>` tags from the LLM response.
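The tag format above can be illustrated with a minimal, hypothetical parser. This regex-based sketch is not the extractor's actual implementation; only the tag names come from the docs.

```python
import re

def parse_qa_pairs(llm_response: str) -> list[dict]:
    """Extract <qa_pair> blocks and their inner tags from an LLM response.

    Simplified sketch: tag names follow the docs, but the real
    extractor's parsing logic may differ.
    """
    pairs = []
    for block in re.findall(r"<qa_pair>(.*?)</qa_pair>", llm_response, re.DOTALL):
        entry = {}
        for tag in ("question", "answer", "solution", "chapter", "label"):
            m = re.search(rf"<{tag}>(.*?)</{tag}>", block, re.DOTALL)
            if m:  # tags are optional; keep only those present
                entry[tag] = m.group(1).strip()
        pairs.append(entry)
    return pairs
```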
### 4. Post-processing and outputs
Filtering keeps entries where the question exists and either `answer` or `solution` is present.
Each filtered record includes:
- `question`: question text and images
- `answer`: answer text and images (if extracted from the answer PDF)
- `solution`: optional worked solution (if present)
- `label`: original numbering (e.g., “Example 3”, “习题2” i.e. “Exercise 2”)
- `chapter_title`: chapter/section header detected on the same page
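The filtering rule described above (question must exist, plus at least one of `answer`/`solution`) can be sketched as a simple predicate. The field names follow the record layout above; the real pipeline's implementation may differ.

```python
def keep_record(record: dict) -> bool:
    """Keep a record only if it has a question and at least one of
    `answer` / `solution` (sketch of the documented filtering rule)."""
    if not record.get("question"):
        return False
    return bool(record.get("answer")) or bool(record.get("solution"))

records = [
    {"question": "Q1", "answer": "A1", "label": "Example 3"},
    {"question": "Q2"},           # no answer or solution -> dropped
    {"answer": "orphan answer"},  # no question -> dropped
    {"question": "Q3", "solution": "S3"},
]
filtered = [r for r in records if keep_record(r)]
```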
    input_pdf_path_key="pdf_path",  # for interleaved mode
    input_subject_key="subject",
    output_dir_key="output_dir",
    output_jsonl_key="output_jsonl_path",
    mineru_backend='vlm-vllm-engine',
)


if __name__ == "__main__":
    # Each line in the JSONL contains `question_pdf_path`, `answer_pdf_path`, `subject` (math, physics, chemistry, ...), and `output_dir`.
    # If the question and the answer are in the same PDF, set both `question_pdf_path` and `answer_pdf_path` to the same path; the pipeline will automatically switch to interleaved mode.
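The JSONL input described in the comments above could be generated like this. The keys follow those comments; the file paths are placeholders, not real inputs.

```python
import json

# Hypothetical input rows; keys follow the documented JSONL schema.
rows = [
    {   # separate question and answer PDFs
        "question_pdf_path": "data/math_questions.pdf",
        "answer_pdf_path": "data/math_answers.pdf",
        "subject": "math",
        "output_dir": "out/math",
    },
    {   # same PDF for both -> pipeline switches to interleaved mode
        "question_pdf_path": "data/physics_book.pdf",
        "answer_pdf_path": "data/physics_book.pdf",
        "subject": "physics",
        "output_dir": "out/physics",
    },
]

with open("input.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```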