The ProfOlaf tool was built to help researchers with literature reviews. It automates the process of snowballing articles from an initial seed set and guides raters through the screening process.
If you use our tool, please cite it:
```bibtex
@article{afonso2025profolaf,
  title={ProfOlaf: Semi-Automated Tool for Systematic Literature Reviews},
  author={Afonso, Martim and Saavedra, Nuno and Louren{\c{c}}o, Bruno and Mendes, Alexandra and Ferreira, Jo{\~a}o},
  journal={arXiv preprint arXiv:2510.26750},
  year={2025}
}
```
generate_search_conf.py is used to interactively create a search_conf.json file that stores all configuration parameters needed for scraping and data collection.
- Prompts the user for:
  - Year interval (`start_year`, `end_year`)
  - Accepted venue ranks (comma-separated, e.g. `A, B1, B2`)
  - Proxy key or environment variable name (optional)
  - Initial file (input seed file)
  - Path to the database
  - Path to the final CSV file
- Saves all parameters into a JSON file: `search_conf.json`
- Provides an easy way to customize and reuse scraping settings.
Example Usage
Run the script:

```bash
python generate_search_conf.py
```
You will be asked step-by-step:
```text
Enter the starting year: 2020
Enter the ending year: 2025
Enter the accepted venue ranks (stops with empty input): A, B1
Enter the proxy key (or the env variable name): MY_PROXY_KEY
Enter the initial file: seed.txt
Enter the db path: ./data/database.db
Enter the path to the final csv file: ./results/output.csv
Enter the search method used: google scholar
```

Example Output (`search_conf.json`)

```json
{
  "start_year": 2020,
  "end_year": 2025,
  "venue_rank_list": ["A", "B1"],
  "proxy_key": "MY_PROXY_KEY",
  "initial_file": "seed.txt",
  "db_path": "./data/database.db",
  "csv_path": "./results/output.csv",
  "search_method": "google_scholar"
}
```

Note
- The proxy key is only required when using Google Scholar as the search method.
- Ensure that the initial file, DB path, and CSV path are accessible from your environment.
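The later scripts load this file at startup. A minimal sketch of reading it with plain `json` (the keys match the example output above):

```python
import json

# Load the configuration written by generate_search_conf.py
with open("search_conf.json") as f:
    search_conf = json.load(f)

print(search_conf["start_year"], search_conf["venue_rank_list"])
```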
0_generate_snowball_start.py reads paper titles from a file, looks them up in a search database (via the scholarly library for Google Scholar or the Semantic Scholar API for Semantic Scholar), and writes the resulting initial publications into your database for iteration 0 of the snowballing process.
- Loads config from `search_conf.json` (created in Step 1).
- Reads titles from:
  - JSON: `{"papers": [{"title": "..."}, ...]}`
  - TXT: one title per line
- Queries Google Scholar for each title and builds a normalized record (sketched below).
- Respects a delay between requests to reduce rate limiting.
Example input (JSON):

```json
{
  "papers": [
    { "title": "Awesome Paper Title 1" },
    { "title": "Another Great Title 2" }
  ]
}
```

Example input (TXT):

```text
Awesome Paper Title 1
Another Great Title 2
```

With defaults from `search_conf.json`:

```bash
python 0_generate_snowball_start.py
```

Override paths and delay:

```bash
python 0_generate_snowball_start.py \
  --input_file ./data/accepted_papers.json \
  --db_path ./data/database.db \
  --delay 2.5
```

- `--input_file` Path to `.json` or `.txt` with titles (default: `search_conf["initial_file"]`)
- `--db_path` Path to database (default: `search_conf["db_path"]`)
- `--delay` Seconds to sleep between queries (default: `2.0`)
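As a rough illustration of the lookup loop (the `read_titles` helper below is made up for this sketch; the DB write and proxy setup are omitted):

```python
import json
import time

from scholarly import scholarly

def read_titles(path):
    """Read titles from a .json ({"papers": [{"title": ...}, ...]}) or .txt (one per line) file."""
    if path.endswith(".json"):
        with open(path) as f:
            return [p["title"] for p in json.load(f)["papers"]]
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

for title in read_titles("seed.txt"):
    pub = scholarly.search_single_pub(title)  # first Google Scholar hit for this title
    print(pub["bib"]["title"])
    time.sleep(2.0)  # --delay: pause between queries to reduce rate limiting
```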
- Step 1: Generate `search_conf.json` with `generate_search_conf.py`
- Step 2 (this step): `0_generate_snowball_start.py` → seeds iteration 0 in the DB
- Next: Continue with your snowballing/expansion scripts using the stored iteration 0 results
1_start_iteration.py takes the seed publications from the previous iteration and fetches the citations (forward snowballing) and references (backward snowballing) for each one.
Note
The Google Scholar search method only supports forward snowballing. For both backward and forward snowballing, use Semantic Scholar as the `search_method`.
- Loads config from `search_conf.json` (proxy, DB path)
- Opens the database for the target `--iteration`
- Pulls the seed set from the previous iteration: `get_iteration_data(iteration=ITERATION-1, selected=SelectionStage.NOT_SELECTED)`
- For each seed paper, queries `scholarly.search_citedby(<citedby_id>)`
- Normalizes each result with `get_article_data(...)` and writes:
  - `insert_iteration_data(articles)` for the current iteration
  - `insert_seen_titles_data([(title, id), ...])` for deduping
- Uses exponential backoff (starts at 30s) on failures to reduce rate limiting (see the sketch below)
- If a paper has no `citedby_url`, falls back to a SHA-256 hash of the title as its ID
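The backoff pattern, roughly (a simplified sketch; the actual script also normalizes and stores each hit, and the retry limit here is an assumption):

```python
import time

from scholarly import scholarly

def fetch_citations(citedby_id, max_retries=5):
    """Fetch the publications citing a paper, retrying with exponential backoff."""
    wait = 30  # backoff starts at 30 seconds
    for _ in range(max_retries):
        try:
            return list(scholarly.search_citedby(citedby_id))
        except Exception:
            time.sleep(wait)  # likely throttled by Google Scholar; wait and retry
            wait *= 2         # 30s -> 60s -> 120s -> ...
    return []
```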
Typical: expand from iteration 0 → 1
```bash
python 1_start_iteration.py --iteration 1
```

Custom DB path:

```bash
python 1_start_iteration.py --iteration 2 --db_path ./data/database.db
```

Arguments

- `--iteration` Target iteration to generate (int). Seeds are read from `iteration-1`
- `--db_path` Path to the SQLite DB (default: `search_conf["db_path"]`)
Input
- DB must already contain iteration N-1 data (e.g., created by `0_generate_snowball_start.py` for iteration 0)
Writes to DB
- Current iteration’s articles (normalized records)
- `seen_titles` pairs `(title, id)` used for deduplication
- "No citations found": The seed's `citedby` page has zero results; this is normal for some papers.
- Captcha / throttling: Ensure a working proxy and let the backoff run; rerun later if needed.
- Seed count is zero: Verify that the previous iteration exists in the DB and that items are marked with `SelectionStage.NOT_SELECTED`.
2_get_bibtex.py enriches the papers in iteration N by fetching their BibTeX from Google Scholar (via scholarly) and updating your database. The information present in the BibTeX is necessary for the metadata screening step.
- Loads config from `search_conf.json` (proxy, DB path)
- Reads all articles for the target iteration from the DB
- For each article:
  - Looks up the publication by title (`scholarly.search_single_pub` → `scholarly.bibtex`)
  - Parses the BibTeX to extract the venue (`booktitle` or `journal`), as sketched below
  - If the venue looks like arXiv/CoRR, tries to find a non-arXiv version by checking all versions (`scholarly.get_all_versions`) and selecting one with a proper venue (conference/journal)
  - Writes the chosen BibTeX back to the DB (`update_iteration_data`)
- Uses exponential backoff (starting at 30s) on errors to reduce throttling
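The venue extraction mentioned above can be pictured as follows (a sketch using the `bibtexparser` package; ProfOlaf's own parsing may differ):

```python
import bibtexparser

def extract_venue(bibtex_str):
    """Return the venue (booktitle or journal) of a BibTeX record, or 'NA' if neither is present."""
    entry = bibtexparser.loads(bibtex_str).entries[0]
    return entry.get("booktitle") or entry.get("journal") or "NA"

bib = """@inproceedings{example2024,
  title = {Cool Paper on X},
  booktitle = {IEEE Symposium on Example Security},
  year = {2024},
}"""
print(extract_venue(bib))  # -> IEEE Symposium on Example Security
```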
Fetch BibTeX for iteration 1:
```bash
python 2_get_bibtex.py --iteration 1
```

Custom DB path:

```bash
python 2_get_bibtex.py --iteration 1 --db_path ./data/database.db
```

Arguments

- `--iteration` (required) Target iteration number (int)
- `--db_path` (optional) Path to the SQLite DB (default: `search_conf["db_path"]`)
Input
- DB entries for iteration N (e.g., produced by `1_start_iteration.py`)
Writes to DB
- Updates each article in iteration N with a `bibtex` string
- Proxy session is initialized via `get_proxy(search_conf["proxy_key"])`
- Google Scholar may throttle; the script retries with exponential backoff (30s → 60s → 120s ...)
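`get_proxy` is ProfOlaf's own helper; for reference, a generic scholarly proxy setup looks roughly like this (not necessarily what the helper does internally):

```python
from scholarly import ProxyGenerator, scholarly

pg = ProxyGenerator()
pg.ScraperAPI("MY_PROXY_KEY")  # or pg.FreeProxies() for a keyless (but less reliable) setup
scholarly.use_proxy(pg)        # subsequent scholarly calls go through the proxy
```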
- Repeated retries / never finishes on arXiv-only papers: The script is strict about replacing arXiv/CoRR with a non-arXiv venue and will keep trying. Consider relaxing this logic if arXiv should be accepted.
- Captcha / throttling: Use a reliable proxy; give the backoff time to proceed; rerun later if needed.
- Venue not detected: The venue is extracted from `booktitle` or `journal`. Some BibTeX records lack these fields; alternative versions are attempted.
3_generate_conf_rank.py scans the iteration N articles’ BibTeX, extracts their venues (conference/journal), and lets you assign a rank to any venue that isn’t already in your DB. Results are written to the conf_rank table as you go.
To assist this manual process, the tool searches both Scimago and a local CORE ranking database for the venues.
Note
Run Step 4 (2_get_bibtex.py) first so venues can be read from BibTeX.
Rank venues for iteration 1:
```bash
python 3_generate_conf_rank.py --iteration 1
```

Custom DB path:

```bash
python 3_generate_conf_rank.py --iteration 1 --db_path ./data/database.db
```

- `--iteration` (required) Target iteration number (int)
- `--db_path` (optional) Path to the SQLite DB (default: from `search_conf.json`)
```text
(1/5) IEEE Symposium on Example Security
What is the rank of this venue? A
(2/5) Journal of Hypothetical Research
What is the rank of this venue? Q1
(3/5) arXiv
-> auto-assigned NA
...
```

Each answer is immediately stored:

```python
db_manager.insert_conf_rank_data([(venue, rank)])
```
Input
- DB entries for iteration N, each with a BibTeX string (from Step 4)
Writes to DB
- Table with venue–rank pairs (queried via `db_manager.get_conf_rank_data()`)
- No venues found → Ensure Step 4 populated BibTeX for this iteration
- Invalid rank → The script will reprompt until you enter a valid label
- arXiv/SSRN assigned as NA → This is by design; override later by updating the DB if you need a different policy
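The reprompt loop can be pictured like this (the accepted rank labels below are assumptions for illustration, not the tool's actual list):

```python
VALID_RANKS = {"A*", "A", "B", "B1", "B2", "C", "Q1", "Q2", "Q3", "Q4", "NA"}  # assumed labels

def ask_rank(venue):
    """Keep reprompting until the rater enters a recognized rank label."""
    while True:
        rank = input(f"What is the rank of this venue? ({venue}) ").strip()
        if rank in VALID_RANKS:
            return rank
        print("Invalid rank, please try again.")
```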
4_filter_by_metadata.py reviews iteration N records and decides whether each paper is selected or filtered out based on venue/peer-review, year window, language, and download availability. It writes the results back to the DB in a single batch.
- Venue & peer-review
  - Parses the article's BibTeX and extracts `booktitle` or `journal`
  - Automatically rejects if the BibTeX `ENTRYTYPE` is `book`, `phdthesis`, or `mastersthesis`, or if the venue is `NA`/missing
  - Looks up the venue's rank in the DB and compares it against `search_conf["venue_rank_list"]`
  - If the venue isn't known in the DB, it asks you: `Is the publication peer-reviewed and A or B or ... (y/n)`
- Year window (see the sketch below)
  - Accepts if `pub_year` is between `search_conf["start_year"]` and `search_conf["end_year"]`
  - If the year is unknown/non-numeric, it asks you to confirm
- Language (English)
  - If the venue check already passed (peer-reviewed + ranked OK), it auto-assumes English
  - Otherwise, it asks: `Is the publication in English (y/n)`
- Download availability
  - Accepts if an `eprint_url` is present; otherwise asks: `Is the publication available for download (y/n)`
If all checks pass → Selected. Otherwise the first failing reason is recorded.
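As an example of one of these checks, the year-window logic amounts to something like this (a hedged sketch with a hypothetical helper name; the script's actual code may differ):

```python
def check_year(pub_year, search_conf):
    """Accept only if pub_year lies inside the configured [start_year, end_year] window."""
    try:
        return search_conf["start_year"] <= int(pub_year) <= search_conf["end_year"]
    except (TypeError, ValueError):
        # Unknown or non-numeric year: the script asks the rater instead of guessing.
        answer = input(f"Is the publication year between "
                       f"{search_conf['start_year']} and {search_conf['end_year']} (y/n): ")
        return answer.strip().lower() == "y"
```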
For each article, one of the following fields is updated (via `update_batch_iteration_data`):
| Outcome | Field set on the article |
|---|---|
| Venue/peer-review failed | venue_filtered_out = True |
| Year outside window | year_filtered_out = True |
| Not English | language_filtered_out = True |
| No downloadable copy | download_filtered_out = True |
| All checks passed | selected = SelectionStage.SELECTED |
Filter iteration 1:
```bash
python 4_filter_by_metadata.py --iteration 1
```

Custom DB path:

```bash
python 4_filter_by_metadata.py --iteration 1 --db_path ./data/database.db
```

- `--iteration` (required) Target iteration (int)
- `--db_path` (optional) SQLite DB path (default: from `search_conf.json`)
```text
Element 3 out of 42
ID: 123456
Title: Cool Paper on X
Venue: IEEE S&P
Url: https://example.org/paper.pdf
Is the publication peer-reviewed and A or B or Q1 (y/n): y
Is the publication year between 2018 and 2024 (y/n): y
Selected
```

Note

- Auto-logic shortcut: If the venue and rank already prove peer review and the venue is in your allowed list (`venue_rank_list`), `check_english` returns True without asking.
- Unknown year: You are prompted to confirm it is within the configured window.
- Interactive prompts: The script is designed to be conservative; if metadata is incomplete, it asks you rather than guessing.
5_filter_by_title.py iteratively asks the user whether to keep each paper, based solely on the title. Along with the user's choice (yes, no, or skip), ProfOlaf also prompts the user for the reasoning behind their choice. This is used to help each rater remember their thought process in the following steps.
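Conceptually, each prompt amounts to something like this (prompt wording is illustrative, not the tool's exact text):

```python
def screen_by_title(title):
    """Ask the rater for a keep/drop/skip decision and the reasoning behind it."""
    choice = ""
    while choice not in {"yes", "no", "skip"}:
        choice = input(f"Keep '{title}'? (yes/no/skip): ").strip().lower()
    reasoning = input("Reasoning: ").strip()
    return choice, reasoning
```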
Filter iteration 1:
```bash
python 5_filter_by_title.py --iteration 1
```

Custom DB path:

```bash
python 5_filter_by_title.py --iteration 1 --db_path ./data/database.db
```

- `--iteration` (required) Target iteration (int)
- `--db_path` (optional) SQLite DB path (default: from `search_conf.json`)
6_8_solve_disagreements.py is used to collect the search databases of the different users/raters and reach a consensus on the articles where there was a decision conflict. The script goes over the articles where at least two raters disagreed during screening and presents each rater's reasoning. After a careful discussion, the raters can make their final decision on the relevance of the paper.
This script is used twice: after Step 5 and Step 7.
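The disagreement detection boils down to comparing each rater's recorded decision per article; a toy illustration (the in-memory dictionaries below stand in for the per-rater DBs):

```python
# Hypothetical per-rater decisions keyed by article id; the real script reads them from each rater's DB.
rater_decisions = {
    "rater1.db": {"paper-42": "yes", "paper-43": "no"},
    "rater2.db": {"paper-42": "no", "paper-43": "no"},
}

# An article needs discussion when the raters did not all make the same choice.
article_ids = set().union(*(d.keys() for d in rater_decisions.values()))
conflicts = [a for a in sorted(article_ids)
             if len({d.get(a) for d in rater_decisions.values()}) > 1]
print(conflicts)  # -> ['paper-42']
```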
Solve disagreements for iteration 1:
```bash
python 6_8_solve_disagreements.py --iteration 1 --search_dbs rater1.db rater2.db ... --selection_stage TITLE
```

- `--iteration` (required) Target iteration (int)
- `--search_dbs` (required) SQLite DB paths (one for each rater involved)
- `--selection_stage` (required: TITLE or CONTENT) String representing which disagreements are being processed
7_filter_by_content.py iteratively asks the user whether to keep each paper, based on the content of the full paper. For each article, the tool presents the URL of the article. Along with the user's choice (yes, no, or skip), ProfOlaf also prompts the user for the reasoning behind their choice. This is used to help each rater remember their thought process in the following steps.
Filter iteration 1:
```bash
python 7_filter_by_content.py --iteration 1
```

Custom DB path:

```bash
python 7_filter_by_content.py --iteration 1 --db_path ./data/database.db
```

- `--iteration` (required) Target iteration (int)
- `--db_path` (optional) SQLite DB path (default: from `search_conf.json`)
6_8_solve_disagreements.py is used again, in the same way as before, this time for the content-screening disagreements.
### Usage
```bash
python 6_8_solve_disagreements.py --iteration 1 --search_dbs rater1.db rater2.db ... --selection_stage CONTENT
```

- `--iteration` (required) Target iteration (int)
- `--search_dbs` (required) SQLite DB paths (one for each rater involved)
- `--selection_stage` (required: TITLE or CONTENT) String representing which disagreements are being processed
9_remove_duplicates.py is used to remove potential duplicates from the results. The same article is often present in a database under a slightly different title. This script catches possible duplicate articles in the database and asks the user which one to keep (or whether it is a false positive and both should be kept).
```bash
python 9_remove_duplicates.py --iterations 1 2 3 4
```

- `--iterations` (required) Iterations to include
- `--db_path` (optional) Final SQLite DB path
- `--auto-remove` (optional) Automatically removes duplicates without user confirmation
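Near-duplicate titles can be caught with a simple string-similarity test; a sketch of one common approach (ProfOlaf's actual matching heuristic may differ):

```python
from difflib import SequenceMatcher

def is_probable_duplicate(title_a, title_b, threshold=0.9):
    """Flag two titles as likely duplicates when their normalized similarity exceeds a threshold."""
    a, b = title_a.casefold().strip(), title_b.casefold().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(is_probable_duplicate("Awesome Paper Title 1", "Awesome paper title 1."))  # -> True
```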
10_generate_csv.py takes the information in the final search database and exports it to a CSV file.
```bash
python 10_generate_csv.py --iterations 1 2 3 4
```

- `--iterations` (required) Iterations to include
- `--db_path` (optional) Final SQLite DB path
- `--output_path` (optional) Path to the output CSV file
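A minimal sketch of what such an export involves (the table and column names here are hypothetical; ProfOlaf's schema may differ):

```python
import csv
import sqlite3

con = sqlite3.connect("./data/database.db")
# Hypothetical table/column names for illustration only.
rows = con.execute("SELECT title, venue, pub_year FROM articles WHERE selected = 1")

with open("./results/output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "venue", "pub_year"])
    writer.writerows(rows)
```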