Merged
103 commits
cbf7538
First new outlines-partial
AxisAlexNT Sep 30, 2025
fd9843d
Qwen does not work great for this task right now
AxisAlexNT Sep 30, 2025
50845f6
Added multiple options support into pipeline
AxisAlexNT Sep 30, 2025
2d238de
Before launching on all texts
AxisAlexNT Sep 30, 2025
71b76b8
Launching on massive data
AxisAlexNT Sep 30, 2025
3c07822
Reduce timeout
AxisAlexNT Sep 30, 2025
235980d
Will now start with micro-C steps
AxisAlexNT Oct 1, 2025
d9fa7d1
Add all schemas
AxisAlexNT Oct 1, 2025
0866a2e
Update schema to try pass Ollama limitations
AxisAlexNT Oct 1, 2025
a58254c
Will retry with the smaller outputs
AxisAlexNT Oct 1, 2025
319b296
Looks better for schema C
AxisAlexNT Oct 1, 2025
2ed1f6d
Attempt to test schema C with longer context
AxisAlexNT Oct 1, 2025
6b8da65
Tried to update schema C more strictly specifying types
AxisAlexNT Oct 1, 2025
b5d2dbe
C5 schema created
AxisAlexNT Oct 1, 2025
ffd5bc6
C5 schema works, but not ideal
AxisAlexNT Oct 1, 2025
8c40a1c
Added descriptions to all fields in C5
AxisAlexNT Oct 1, 2025
5bd0b1b
Temporarily switch off tools
AxisAlexNT Oct 1, 2025
1b4444b
Try to run all schemas on all models
AxisAlexNT Oct 1, 2025
8dbe352
Adding Marker-PDF
AxisAlexNT Oct 2, 2025
56aafdb
Updated marker-pdf script to stream progress
AxisAlexNT Oct 2, 2025
62ed81f
Reordered models for bench
AxisAlexNT Oct 3, 2025
ad14cb2
Added all models and Ollama base URL to config, but requires a lot of…
AxisAlexNT Oct 3, 2025
5585682
Removed excessive models
AxisAlexNT Oct 3, 2025
fa6803a
Update prompts for Chinese models to ask results in English
AxisAlexNT Oct 3, 2025
f05cbf5
Slightly updated schemas and added full run
AxisAlexNT Oct 3, 2025
9e6b257
Schem supports per-step timeouts
AxisAlexNT Oct 3, 2025
9108abe
Split B schema into B and B_types
AxisAlexNT Oct 3, 2025
5ed7f4c
Frozen requirements and environment
AxisAlexNT Oct 3, 2025
3b37c41
Added support for API_TOKEN at Vast.AI
AxisAlexNT Oct 3, 2025
428b3e4
A lot of articles were parsed
AxisAlexNT Oct 3, 2025
8f5c610
Change models
AxisAlexNT Oct 3, 2025
6758291
Will start adding chat mode
AxisAlexNT Oct 5, 2025
e85425c
Added pre-pass schemas
AxisAlexNT Oct 6, 2025
1c2aca3
Add models, enforce schema
AxisAlexNT Oct 6, 2025
5f3c3f3
Try implement pre-step
AxisAlexNT Oct 6, 2025
f69cd91
Adding brutal check for thinking
AxisAlexNT Oct 6, 2025
c6da7e1
Fixed chatting function and added call to it
AxisAlexNT Oct 6, 2025
dbfc1d9
Update parameters to change model and step sets
AxisAlexNT Oct 6, 2025
0595c8b
Running on small set of articles
AxisAlexNT Oct 6, 2025
fdd554f
Fixed script
AxisAlexNT Oct 6, 2025
87fa67c
Seems that per-sequence schema is broken
AxisAlexNT Oct 6, 2025
ed87b10
Fixed single experiment schema started at least
AxisAlexNT Oct 6, 2025
ed7fcc9
Set global timeout for 10 minutes
AxisAlexNT Oct 6, 2025
e344736
Added new construction schemas and logic, lower timeout for debugging
AxisAlexNT Oct 6, 2025
20fe9db
Fixed copypaste bugs
AxisAlexNT Oct 6, 2025
5b197a9
Fixed errorneous schema
AxisAlexNT Oct 6, 2025
449cf9a
Update timeout
AxisAlexNT Oct 6, 2025
f6f3312
Added sequence descriptors calling
AxisAlexNT Oct 7, 2025
68fdaae
Bug fixed
AxisAlexNT Oct 7, 2025
1387851
One more typo fixed
AxisAlexNT Oct 7, 2025
8281a9c
More typos fixed
AxisAlexNT Oct 7, 2025
11148fd
Should have fixed regex error
AxisAlexNT Oct 7, 2025
1408f92
Yet again update prompts
AxisAlexNT Oct 7, 2025
d78d6ac
Make strict schema less strict
AxisAlexNT Oct 7, 2025
d182f48
Update schema and logging
AxisAlexNT Oct 7, 2025
2e6fb6f
Fixed missing logger
AxisAlexNT Oct 7, 2025
9536ad2
Roll-back schema, remove bools from schemas in queries
AxisAlexNT Oct 7, 2025
2ca6087
Temporarily removed construction prompts
AxisAlexNT Oct 7, 2025
9dd2704
Fixed bug in ask with schema
AxisAlexNT Oct 7, 2025
2dcfa32
Make logging easier
AxisAlexNT Oct 7, 2025
1b36bbb
Bug fixed
AxisAlexNT Oct 7, 2025
1992711
Fixed passing the wrong chat
AxisAlexNT Oct 7, 2025
7c34f42
Added query for the modifications in the prompt, as well as fluoropho…
AxisAlexNT Oct 7, 2025
8b6608d
Updated schema ans queries, it now works
AxisAlexNT Oct 7, 2025
844312d
More choice for LLM to parse sequences
AxisAlexNT Oct 7, 2025
e79fc0b
Improve tqdm progressbar
AxisAlexNT Oct 7, 2025
0968894
tqdm progress bar didn't really go well
AxisAlexNT Oct 7, 2025
4fa0f74
Now looks like logging works with tqdm just fine
AxisAlexNT Oct 7, 2025
36e7122
Pretty-print fix
AxisAlexNT Oct 7, 2025
8917f0b
Update regexp for primers and target sequence
AxisAlexNT Oct 7, 2025
318b5e7
Update pretty-print
AxisAlexNT Oct 7, 2025
7747827
Lower context size and num predict
AxisAlexNT Oct 7, 2025
e6e635c
Trying new approach with insert into DB
AxisAlexNT Oct 8, 2025
5629df1
Faster generation for chat mode
AxisAlexNT Oct 9, 2025
7316a53
Should have added format fixer
AxisAlexNT Oct 9, 2025
5a89f89
Tiny typo fixed
AxisAlexNT Oct 9, 2025
5df5357
Update seq
AxisAlexNT Oct 9, 2025
116d5b7
Prompt slightly changed
AxisAlexNT Oct 9, 2025
c939969
Fixed errors in tuples
AxisAlexNT Oct 9, 2025
b1c0b4d
f-string bug fixed
AxisAlexNT Oct 9, 2025
6ed6443
Bugfix messages
AxisAlexNT Oct 9, 2025
6e045fd
Update params to include more models
AxisAlexNT Oct 9, 2025
bd4e8c9
Start run for all articles
AxisAlexNT Oct 10, 2025
3ffa95c
Re-run try faster
AxisAlexNT Oct 11, 2025
a668f07
Will pass the whole article text and lower the temperature for the f…
AxisAlexNT Oct 11, 2025
75c7a1d
Validator bug fixed
AxisAlexNT Oct 11, 2025
915aa22
Return back to snippets
AxisAlexNT Oct 11, 2025
bdeafa9
Try to easy-fix an object
AxisAlexNT Oct 11, 2025
2bef036
Easy validation fixed
AxisAlexNT Oct 11, 2025
98c6ba3
Easy-fix sould be even more applicable
AxisAlexNT Oct 11, 2025
83d05af
Try adding full article, not snippets
AxisAlexNT Oct 11, 2025
9772942
Try making simple fixer even more robust
AxisAlexNT Oct 11, 2025
8a68061
Remove non-strict seq pass
AxisAlexNT Oct 11, 2025
dfcab0d
Seems that easy fixer should take first mapping
AxisAlexNT Oct 11, 2025
51d2309
Add other non-qwen-1m models
AxisAlexNT Oct 15, 2025
3e3d409
Add more passes
AxisAlexNT Oct 15, 2025
72d686b
Intermediate hyb_db perfmetrics added
AxisAlexNT Oct 30, 2025
e99266a
Upgraded solution for continuation and performance metrics
AxisAlexNT Oct 30, 2025
5ac2aaf
Added more models and steps for benchmark
AxisAlexNT Oct 30, 2025
5f4d279
Changed input directory to benchmarks in markdown
AxisAlexNT Oct 30, 2025
f22c21b
Merge PROBEst history before migrating selected extraction files
AxisAlexNT Dec 12, 2025
778ecc2
Add MIT license header to Python files
AxisAlexNT Dec 12, 2025
02b61ba
Added README for the extraction module
AxisAlexNT Dec 12, 2025
26 changes: 26 additions & 0 deletions app/app.py
@@ -1,3 +1,29 @@
# MIT License
#
# Copyright (c) 2025 CTLab-ITMO
#
# Authors: Daniil Smutin, Aleksandr Serdiukov, Vitalii Dravgelis, Artem Ivanov,
# Aleksei Zabashta, Sergey Muravyov, and the CTLab-ITMO university team.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.


import os
import sys
import re
26 changes: 26 additions & 0 deletions app/database.py
@@ -1,4 +1,30 @@
#!/usr/bin/env python3
# MIT License
#
# Copyright (c) 2025 CTLab-ITMO
#
# Authors: Daniil Smutin, Aleksandr Serdiukov, Vitalii Dravgelis, Artem Ivanov,
# Aleksei Zabashta, Sergey Muravyov, and the CTLab-ITMO university team.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.


"""
PROBESt Database Search Application

26 changes: 26 additions & 0 deletions benchmark/make_primer3_template.py
@@ -1,3 +1,29 @@
# MIT License
#
# Copyright (c) 2025 CTLab-ITMO
#
# Authors: Daniil Smutin, Aleksandr Serdiukov, Vitalii Dravgelis, Artem Ivanov,
# Aleksei Zabashta, Sergey Muravyov, and the CTLab-ITMO university team.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.


from Bio import SeqIO
import os
import sys
26 changes: 26 additions & 0 deletions benchmark/parse_blast_stats.py
@@ -1,3 +1,29 @@
# MIT License
#
# Copyright (c) 2025 CTLab-ITMO
#
# Authors: Daniil Smutin, Aleksandr Serdiukov, Vitalii Dravgelis, Artem Ivanov,
# Aleksei Zabashta, Sergey Muravyov, and the CTLab-ITMO university team.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.


from Bio import SeqIO
import pandas as pd
import sys
114 changes: 114 additions & 0 deletions extraction/README.MD
@@ -0,0 +1,114 @@
# Extraction submodule

This folder contains a **file-driven, multi-pass extraction pipeline** built on **Outlines** (JSON-guided generation) and **Ollama** (local model serving). The pipeline reads its configuration, prompts, and JSON Schemas from disk and runs a configured sequence of passes. It writes versioned artifacts to an output directory without overwriting earlier ones, and can optionally persist results and timing metadata to an SQLite database.

## What it does

- **Pass-based extraction**: runs a configurable sequence of passes (e.g., `A_core`, `B_index`, `C_sequences`, ...) using Outlines JSON schema guidance.
- **Artifacts on disk**: writes raw text, pretty JSON, and error logs for each pass without overwriting prior runs.
- **Final stitching + validation**: stitches per-pass outputs into a final “FULL” object and can validate it against a “full schema” if configured.
- **SQLite optional**: can insert stitched results into SQLite via `hyb_db.insert_article_object(...)`.
- **Perf sidecars + continuation**: each JSON artifact can have a `*.perf.json` sidecar; the same metrics can also be mirrored into SQLite (`pipeline_artifacts`) and used for “resume” mode.

## Repository layout

The pipeline expects a “project directory” that contains:

- `config/pipeline.json` (main configuration)
- `passes/<pass_name>/schema.json` and `passes/<pass_name>/prompt.txt` (per-pass assets)
- `passes/common.txt` (shared prompt prefix, optional)
- `schema/json/article.json` (full schema for final validation, optional)
- input directory with source files (configured in `pipeline.json`)
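
As a rough illustration of the layout above, the following sketch checks a project directory before a run. It is not part of the pipeline: the function name `check_project_dir` is hypothetical, and treating only `config/pipeline.json` as mandatory is an assumption drawn from the "optional" labels in the list.

```python
from pathlib import Path

# Paths taken from the layout list above. Which ones are mandatory is an assumption.
REQUIRED = ["config/pipeline.json"]
OPTIONAL = ["passes/common.txt", "schema/json/article.json"]


def check_project_dir(project_dir):
    """Return (missing_required, missing_optional, pass_problems) for a project directory."""
    root = Path(project_dir)
    missing_required = [p for p in REQUIRED if not (root / p).is_file()]
    missing_optional = [p for p in OPTIONAL if not (root / p).is_file()]

    # Every passes/<pass_name>/ folder should carry both schema.json and prompt.txt.
    pass_problems = []
    passes_dir = root / "passes"
    if passes_dir.is_dir():
        for pass_dir in sorted(d for d in passes_dir.iterdir() if d.is_dir()):
            for name in ("schema.json", "prompt.txt"):
                if not (pass_dir / name).is_file():
                    pass_problems.append(f"{pass_dir.name}/{name}")
    return missing_required, missing_optional, pass_problems
```

Calling `check_project_dir("extraction")` from the repository root would report which of the listed files are absent.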

The config shown in `config/pipeline.json` includes keys such as:
- `model_names`, `ollama_base_url`, `ollama_parameters`, `timeout_s`
- `input_dir`, `out_dir`, `article_glob`
- `pre_passes`, `construct_single_experiment_passes`, `passes`

## Installation

Install the Python dependencies (the minimum set used by the pipeline):

```bash
pip install -r requirements.txt
```

Alternatively, use conda or mamba to create the environment from `environment.yml`.

You also need:
- **Ollama** running locally (or reachable over HTTP), matching `ollama_base_url` in `config/pipeline.json`.

Optional:
- If `db_path` is set in the config, results are stored in SQLite and the database schema is created automatically.

### Environment variables

- `OPEN_BUTTON_TOKEN` (optional): if set, it is passed as a Bearer token in Ollama client headers.
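
A minimal sketch of how such a token could be turned into request headers is shown here. The helper name `build_auth_headers` is hypothetical, and how the pipeline actually hands the headers to its Ollama client is not covered by this README.

```python
import os


def build_auth_headers(env=os.environ):
    """Return headers carrying the token as a Bearer credential, or {} if it is unset."""
    token = env.get("OPEN_BUTTON_TOKEN")
    if not token:
        # No token configured: send no Authorization header at all.
        return {}
    return {"Authorization": f"Bearer {token}"}
```

Passing the environment mapping as a parameter keeps the helper easy to test without touching the real process environment.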

## How to run

### 1) Configure `config/pipeline.json`

Edit the paths to match your machine. In the example config, `input_dir` is an absolute path and `article_glob` uses a recursive pattern.

Key fields you typically tune:
- `model_names`: list of Ollama model identifiers to run.
- `ollama_parameters`: e.g. `num_ctx`, `num_predict`, `temperature`, `seed`.
- `timeout_s`, `ollama_base_url`
- `out_dir`, `db_path`
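
To make the shape of the file more concrete, here is a sketch that builds a config with the keys named in this README and prints it as JSON. Only the key names come from the README: every value is a made-up placeholder, the flat list-of-names form for `passes` is an assumption, and the choice of which keys count as required in `missing_keys` is also an assumption.

```python
import json

# Key names come from this README; every value below is a hypothetical placeholder.
EXAMPLE_CONFIG = {
    "model_names": ["example-model:latest"],
    "ollama_base_url": "http://localhost:11434",  # Ollama's default local address
    "ollama_parameters": {
        "num_ctx": 8192,
        "num_predict": 1024,
        "temperature": 0.0,
        "seed": 42,
    },
    "timeout_s": 600,
    "input_dir": "/path/to/articles",
    "out_dir": "out",
    "db_path": "out/results.sqlite",
    "article_glob": "**/*.md",
    "pre_passes": [],
    "construct_single_experiment_passes": [],
    "passes": ["A_core", "B_index", "C_sequences"],  # entry format is assumed
}


def missing_keys(config, required=("model_names", "ollama_base_url",
                                   "input_dir", "out_dir", "passes")):
    """Return the required keys absent from config, in the order given."""
    return [key for key in required if key not in config]


if __name__ == "__main__":
    print(json.dumps(EXAMPLE_CONFIG, indent=2))
```

A check like `missing_keys` can catch an incomplete config before any model is loaded.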

### 2) Run the pipeline

#### CLI

From the repository root (or anywhere, as long as you pass the correct project directory):

```bash
python extraction/pipeline_filedriven.py extraction --fresh
```

- The positional argument (`extraction` in the command above) is `project_dir`, the folder containing `config/`, `passes/`, etc.
- Omit `--fresh` to enable continuation/resume behavior.

#### Python

```python
from extraction.pipeline_filedriven import run_project
run_project("extraction", fresh=False)
```

## Outputs

Artifacts are written under `out_dir` (from `pipeline.json`).

The pipeline writes, per pass and per model/article:
- raw text: `*.txt`
- JSON outputs: `*.json`
- log JSON: `*.log.json`
- error logs: `logs/*.log`
- perf sidecars: `*.perf.json` (one per emitted JSON artifact)

Perf sidecars include timestamps, wallclock duration, and (when Ollama reports it) token counts.
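
The sketch below illustrates the sidecar idea under stated assumptions: the file naming scheme, the field names inside the sidecar, and the helper `write_with_perf` are all hypothetical and may differ from what the pipeline emits.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_with_perf(out_dir, stem, payload, started_at, finished_at, token_counts=None):
    """Write <stem>__<timestamp>.json and a matching .perf.json without overwriting."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # A UTC timestamp in the file name keeps artifacts from different runs apart.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    artifact = out / f"{stem}__{stamp}.json"
    sidecar = out / f"{stem}__{stamp}.perf.json"

    # Mode "x" raises FileExistsError instead of replacing an existing file.
    with open(artifact, "x", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2, ensure_ascii=False)

    perf = {
        "started_at": started_at,
        "finished_at": finished_at,
        "wallclock_s": finished_at - started_at,
        "tokens": token_counts,  # None when the backend reports no counts
    }
    with open(sidecar, "x", encoding="utf-8") as fh:
        json.dump(perf, fh, indent=2)
    return artifact, sidecar
```

Opening files in exclusive-creation mode is one simple way to enforce the "never overwrite" rule at the file-system level.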

## Continuation / resume mode

When `db_path` is configured, the pipeline can skip already completed work:

- default `fresh=False`: for each `(model_name, article_name)`, if a successful `pass_name="FULL"` is recorded in the DB, the article can be skipped.
- `--fresh`: disables skipping and forces re-processing.

Implementation note:
- completion is tracked in `pipeline_artifacts` and queried via `hyb_db.get_completed_passes(...)`.
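
The skip rule can be sketched with the standard `sqlite3` module as follows. This is an illustration only: the column layout of the table is assumed, and the helpers `make_demo_db`, `record`, and `should_skip` are hypothetical stand-ins rather than the `hyb_db` API.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory table with an assumed, simplified column layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pipeline_artifacts ("
        " model_name TEXT, article_name TEXT, pass_name TEXT, success INTEGER)"
    )
    return conn


def record(conn, model_name, article_name, pass_name, success):
    """Store the outcome of one pass for one (model, article) pair."""
    conn.execute(
        "INSERT INTO pipeline_artifacts VALUES (?, ?, ?, ?)",
        (model_name, article_name, pass_name, int(success)),
    )


def should_skip(conn, model_name, article_name, fresh=False):
    """Skip only if a successful FULL pass is recorded and fresh mode is off."""
    if fresh:
        return False
    row = conn.execute(
        "SELECT 1 FROM pipeline_artifacts"
        " WHERE model_name = ? AND article_name = ?"
        " AND pass_name = 'FULL' AND success = 1 LIMIT 1",
        (model_name, article_name),
    ).fetchone()
    return row is not None
```

Keying the check on the final `FULL` pass means an article that failed midway is picked up again on the next run.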

## Database schema (optional)

If `db_path` is set, `hyb_db` auto-creates tables and views and inserts:
- stitched article objects (`insert_article_object`)
- artifact-level perf bookkeeping (`pipeline_artifacts`)

## Overall design (short)

- **Config-first**: a project is a directory of config + prompts + schemas, making experiments easy to reproduce and version-control.
- **Multi-pass extraction**: each pass targets a specific sub-problem and produces a structured JSON artifact.
- **Immutable artifacts**: outputs are timestamped and never overwritten, enabling auditing and comparisons across runs.
- **Optional persistence**: results and metrics can be stored in SQLite for analysis and “resume” behavior.