The ProfOlaf tool was built to help researchers with literature reviews. It automates the process of snowballing articles from an initial seed set and guides raters through the screening process.
If you use our tool, please cite it:
```bibtex
@article{afonso2025profolaf,
  title={ProfOlaf: Semi-Automated Tool for Systematic Literature Reviews},
  author={Afonso, Martim and Saavedra, Nuno and Louren{\c{c}}o, Bruno and Mendes, Alexandra and Ferreira, Jo{\~a}o},
  journal={arXiv preprint arXiv:2510.26750},
  year={2025}
}
```
generate_search_conf.py is used to interactively create a search_conf.json file that stores all configuration parameters needed for scraping and data collection.
- Prompts the user for:
  - Year interval (`start_year`, `end_year`)
  - Accepted venue ranks (comma-separated, e.g. `A, B1, B2`)
  - Proxy key or environment variable name (optional)
  - Initial file (input seed file)
  - Path to the database
  - Path to the final CSV file
- Saves all parameters into a JSON file: `search_conf.json`
- Provides an easy way to customize and reuse scraping settings.
Example Usage
Run the script:

```bash
python generate_search_conf.py
```
You will be asked step-by-step:
```text
Enter the starting year: 2020
Enter the ending year: 2025
Enter the accepted venue ranks (stops with empty input): A, B1
Enter the proxy key (or the env variable name): MY_PROXY_KEY
Enter the initial file: seed.txt
Enter the db path: ./data/database.db
Enter the path to the final csv file: ./results/output.csv
Enter the search method used: google scholar
```

Example Output (`search_conf.json`)

```json
{
  "start_year": 2020,
  "end_year": 2025,
  "venue_rank_list": ["A", "B1"],
  "proxy_key": "MY_PROXY_KEY",
  "initial_file": "seed.txt",
  "db_path": "./data/database.db",
  "csv_path": "./results/output.csv",
  "search_method": "google_scholar"
}
```

Note
- The proxy key is only required when using Google Scholar as the search method.
- Ensure that the initial file, DB path, and CSV path are accessible from your environment.
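The later scripts load this file at startup. A minimal sketch of reading it with plain `json` (the keys match the example output above):

```python
import json

# Load the configuration written by generate_search_conf.py
with open("search_conf.json") as f:
    search_conf = json.load(f)

print(search_conf["start_year"], search_conf["venue_rank_list"])
```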
0_generate_snowball_start.py reads paper titles from a file, looks them up in a search database (via the scholarly library for Google Scholar or the Semantic Scholar API for Semantic Scholar), and writes the resulting initial publications into your database for iteration 0 of the snowballing process.
- Loads config from `search_conf.json` (created in Step 1).
- Reads titles from:
  - JSON: `{"papers": [{"title": "..."}, ...]}`
  - TXT: one title per line
- Queries Google Scholar for each title and builds a normalized record (sketched below).
- Respects a delay between requests to reduce rate limiting.
Example input (JSON):

```json
{
  "papers": [
    { "title": "Awesome Paper Title 1" },
    { "title": "Another Great Title 2" }
  ]
}
```

Example input (TXT):

```text
Awesome Paper Title 1
Another Great Title 2
```

With defaults from `search_conf.json`:

```bash
python 0_generate_snowball_start.py
```

Override paths and delay:

```bash
python 0_generate_snowball_start.py \
  --input_file ./data/accepted_papers.json \
  --db_path ./data/database.db \
  --delay 2.5
```

- `--input_file` Path to `.json` or `.txt` with titles (default: `search_conf["initial_file"]`)
- `--db_path` Path to database (default: `search_conf["db_path"]`)
- `--delay` Seconds to sleep between queries (default: `2.0`)
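As a rough illustration of the lookup loop (the `read_titles` helper below is made up for this sketch; the DB write and proxy setup are omitted):

```python
import json
import time

from scholarly import scholarly

def read_titles(path):
    """Read titles from a .json ({"papers": [{"title": ...}, ...]}) or .txt (one per line) file."""
    if path.endswith(".json"):
        with open(path) as f:
            return [p["title"] for p in json.load(f)["papers"]]
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

for title in read_titles("seed.txt"):
    pub = scholarly.search_single_pub(title)  # first Google Scholar hit for this title
    print(pub["bib"]["title"])
    time.sleep(2.0)  # --delay: pause between queries to reduce rate limiting
```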
- Step 1: Generate `search_conf.json` with `generate_search_conf.py`
- Step 2 (this step): `0_generate_snowball_start.py` → seeds iteration 0 in the DB
- Next: Continue with your snowballing/expansion scripts using the stored iteration 0 results
1_start_iteration.py takes the seed publications from the previous iteration and fetches the citations (forward snowballing) and references (backward snowballing) for each one.
Note
The Google Scholar search method only supports forward snowballing. For both backward and forward snowballing, use Semantic Scholar as the `search_method`.
- Loads config from `search_conf.json` (proxy, DB path)
- Opens the database for the target `--iteration`
- Pulls the seed set from the previous iteration: `get_iteration_data(iteration=ITERATION-1, selected=SelectionStage.NOT_SELECTED)`
- For each seed paper, queries `scholarly.search_citedby(<citedby_id>)`
- Normalizes each result with `get_article_data(...)` and writes:
  - `insert_iteration_data(articles)` for the current iteration
  - `insert_seen_titles_data([(title, id), ...])` for deduping
- Uses exponential backoff (starts at 30s) on failures to reduce rate limiting (see the sketch below)
- If a paper has no `citedby_url`, falls back to a SHA-256 hash of the title as its ID
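The backoff pattern, roughly (a simplified sketch; the actual script also normalizes and stores each hit, and the retry limit here is an assumption):

```python
import time

from scholarly import scholarly

def fetch_citations(citedby_id, max_retries=5):
    """Fetch the publications citing a paper, retrying with exponential backoff."""
    wait = 30  # backoff starts at 30 seconds
    for _ in range(max_retries):
        try:
            return list(scholarly.search_citedby(citedby_id))
        except Exception:
            time.sleep(wait)  # likely throttled by Google Scholar; wait and retry
            wait *= 2         # 30s -> 60s -> 120s -> ...
    return []
```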
Typical: expand from iteration 0 → 1
```bash
python 1_start_iteration.py --iteration 1
```

Custom DB path:

```bash
python 1_start_iteration.py --iteration 2 --db_path ./data/database.db
```

Arguments

- `--iteration` Target iteration to generate (int). Seeds are read from `iteration-1`
- `--db_path` Path to the SQLite DB (default: `search_conf["db_path"]`)
Input
- DB must already contain iteration N-1 data (e.g., created by `0_generate_snowball_start.py` for iteration 0)
Writes to DB
- Current iteration’s articles (normalized records)
- `seen_titles` pairs `(title, id)` used for deduplication
- "No citations found": The seed's `citedby` page has zero results; this is normal for some papers.
- Captcha / throttling: Ensure a working proxy and let the backoff run; rerun later if needed.
- Seed count is zero: Verify that the previous iteration exists in the DB and that items are marked with `SelectionStage.NOT_SELECTED`.
2_get_bibtex.py enriches the papers in iteration N by fetching their BibTeX from Google Scholar (via scholarly) and updating your database. The information present in the BibTeX is necessary for the metadata screening step.
- Loads config from `search_conf.json` (proxy, DB path)
- Reads all articles for the target iteration from the DB
- For each article:
  - Looks up the publication by title (`scholarly.search_single_pub` → `scholarly.bibtex`)
  - Parses the BibTeX to extract the venue (`booktitle` or `journal`), as sketched below
  - If the venue looks like arXiv/CoRR, tries to find a non-arXiv version by checking all versions (`scholarly.get_all_versions`) and selecting one with a proper venue (conference/journal)
  - Writes the chosen BibTeX back to the DB (`update_iteration_data`)
- Uses exponential backoff (starting at 30s) on errors to reduce throttling
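The venue extraction mentioned above can be pictured as follows (a sketch using the `bibtexparser` package; ProfOlaf's own parsing may differ):

```python
import bibtexparser

def extract_venue(bibtex_str):
    """Return the venue (booktitle or journal) of a BibTeX record, or 'NA' if neither is present."""
    entry = bibtexparser.loads(bibtex_str).entries[0]
    return entry.get("booktitle") or entry.get("journal") or "NA"

bib = """@inproceedings{example2024,
  title = {Cool Paper on X},
  booktitle = {IEEE Symposium on Example Security},
  year = {2024},
}"""
print(extract_venue(bib))  # -> IEEE Symposium on Example Security
```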
Fetch BibTeX for iteration 1:
```bash
python 2_get_bibtex.py --iteration 1
```

Custom DB path:

```bash
python 2_get_bibtex.py --iteration 1 --db_path ./data/database.db
```

Arguments

- `--iteration` (required) Target iteration number (int)
- `--db_path` (optional) Path to the SQLite DB (default: `search_conf["db_path"]`)
Input
- DB entries for iteration N (e.g., produced by `1_start_iteration.py`)
Writes to DB
- Updates each article in iteration N with a `bibtex` string
- Proxy session is initialized via `get_proxy(search_conf["proxy_key"])`
- Google Scholar may throttle; the script retries with exponential backoff (30s → 60s → 120s ...)
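`get_proxy` is ProfOlaf's own helper; for reference, a generic scholarly proxy setup looks roughly like this (not necessarily what the helper does internally):

```python
from scholarly import ProxyGenerator, scholarly

pg = ProxyGenerator()
pg.ScraperAPI("MY_PROXY_KEY")  # or pg.FreeProxies() for a keyless (but less reliable) setup
scholarly.use_proxy(pg)        # subsequent scholarly calls go through the proxy
```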
- Repeated retries / never finishes on arXiv-only papers: The script is strict about replacing arXiv/CoRR with a non-arXiv venue and will keep trying. Consider relaxing this logic if arXiv should be accepted.
- Captcha / throttling: Use a reliable proxy; give the backoff time to proceed; rerun later if needed.
- Venue not detected: The venue is extracted from `booktitle` or `journal`. Some BibTeX records lack these fields; alternative versions are attempted.
3_generate_conf_rank.py scans the iteration N articles’ BibTeX, extracts their venues (conference/journal), and lets you assign a rank to any venue that isn’t already in your DB. Results are written to the conf_rank table as you go.
To assist this manual process, the tool searches both Scimago and a local CORE ranking database for the venues.
Note
Run Step 4 (2_get_bibtex.py) first so venues can be read from BibTeX.
Rank venues for iteration 1:
```bash
python 3_generate_conf_rank.py --iteration 1
```

Custom DB path:

```bash
python 3_generate_conf_rank.py --iteration 1 --db_path ./data/database.db
```

- `--iteration` (required) Target iteration number (int)
- `--db_path` (optional) Path to the SQLite DB (default: from `search_conf.json`)
```text
(1/5) IEEE Symposium on Example Security
What is the rank of this venue? A
(2/5) Journal of Hypothetical Research
What is the rank of this venue? Q1
(3/5) arXiv
-> auto-assigned NA
...
```

Each answer is immediately stored:

```python
db_manager.insert_conf_rank_data([(venue, rank)])
```
Input
- DB entries for iteration N, each with a BibTeX string (from Step 4)
Writes to DB
- Table with venue–rank pairs (queried via `db_manager.get_conf_rank_data()`)
- No venues found → Ensure Step 4 populated BibTeX for this iteration
- Invalid rank → The script will reprompt until you enter a valid label
- arXiv/SSRN assigned as NA → This is by design; override later by updating the DB if you need a different policy
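The reprompt loop can be pictured like this (the accepted rank labels below are assumptions for illustration, not the tool's actual list):

```python
VALID_RANKS = {"A*", "A", "B", "B1", "B2", "C", "Q1", "Q2", "Q3", "Q4", "NA"}  # assumed labels

def ask_rank(venue):
    """Keep reprompting until the rater enters a recognized rank label."""
    while True:
        rank = input(f"What is the rank of this venue? ({venue}) ").strip()
        if rank in VALID_RANKS:
            return rank
        print("Invalid rank, please try again.")
```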
4_filter_by_metadata.py reviews iteration N records and decides whether each paper is selected or filtered out based on venue/peer-review, year window, language, and download availability. It writes the results back to the DB in a single batch.
- Venue & peer-review
  - Parses the article's BibTeX and extracts `booktitle` or `journal`
  - Automatically rejects if the BibTeX `ENTRYTYPE` is `book`, `phdthesis`, or `mastersthesis`, or if the venue is `NA`/missing
  - Looks up the venue's rank in the DB and compares it against `search_conf["venue_rank_list"]`
  - If the venue isn't known in the DB, it asks you: `Is the publication peer-reviewed and A or B or ... (y/n)`
- Year window (see the sketch below)
  - Accepts if `pub_year` is between `search_conf["start_year"]` and `search_conf["end_year"]`
  - If the year is unknown/non-numeric, it asks you to confirm
- Language (English)
  - If the venue check already passed (peer-reviewed + ranked OK), it auto-assumes English
  - Otherwise, it asks: `Is the publication in English (y/n)`
- Download availability
  - Accepts if an `eprint_url` is present; otherwise asks: `Is the publication available for download (y/n)`
If all checks pass → Selected. Otherwise the first failing reason is recorded.
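As an example of one of these checks, the year-window logic amounts to something like this (a hedged sketch with a hypothetical helper name; the script's actual code may differ):

```python
def check_year(pub_year, search_conf):
    """Accept only if pub_year lies inside the configured [start_year, end_year] window."""
    try:
        return search_conf["start_year"] <= int(pub_year) <= search_conf["end_year"]
    except (TypeError, ValueError):
        # Unknown or non-numeric year: the script asks the rater instead of guessing.
        answer = input(f"Is the publication year between "
                       f"{search_conf['start_year']} and {search_conf['end_year']} (y/n): ")
        return answer.strip().lower() == "y"
```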
For each article, one of the following fields is updated (via `update_batch_iteration_data`):
| Outcome | Field set on the article |
|---|---|
| Venue/peer-review failed | venue_filtered_out = True |
| Year outside window | year_filtered_out = True |
| Not English | language_filtered_out = True |
| No downloadable copy | download_filtered_out = True |
| All checks passed | selected = SelectionStage.SELECTED |
Filter iteration 1:
```bash
python 4_filter_by_metadata.py --iteration 1
```

Custom DB path:

```bash
python 4_filter_by_metadata.py --iteration 1 --db_path ./data/database.db
```

- `--iteration` (required) Target iteration (int)
- `--db_path` (optional) SQLite DB path (default: from `search_conf.json`)
```text
Element 3 out of 42
ID: 123456
Title: Cool Paper on X
Venue: IEEE S&P
Url: https://example.org/paper.pdf
Is the publication peer-reviewed and A or B or Q1 (y/n): y
Is the publication year between 2018 and 2024 (y/n): y
Selected
```

Note

- Auto-logic shortcut: If the venue and rank already prove peer review and the venue is in your allowed list (`venue_rank_list`), `check_english` returns True without asking.
- Unknown year: You are prompted to confirm it is within the configured window.
- Interactive prompts: The script is designed to be conservative; if metadata is incomplete, it asks you rather than guessing.
5_filter_by_title.py iteratively asks the user whether to keep each paper, based solely on the title. Along with the user's choice (yes, no, or skip), ProfOlaf also prompts the user for the reasoning behind their choice. This is used to help each rater remember their thought process in the following steps.
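Conceptually, each prompt amounts to something like this (prompt wording is illustrative, not the tool's exact text):

```python
def screen_by_title(title):
    """Ask the rater for a keep/drop/skip decision and the reasoning behind it."""
    choice = ""
    while choice not in {"yes", "no", "skip"}:
        choice = input(f"Keep '{title}'? (yes/no/skip): ").strip().lower()
    reasoning = input("Reasoning: ").strip()
    return choice, reasoning
```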
Filter iteration 1:
```bash
python 5_filter_by_title.py --iteration 1
```

Custom DB path:

```bash
python 5_filter_by_title.py --iteration 1 --db_path ./data/database.db
```

- `--iteration` (required) Target iteration (int)
- `--db_path` (optional) SQLite DB path (default: from `search_conf.json`)
6_8_solve_disagreements.py is used to collect the search databases of the different users/raters and reach a consensus on the articles where there was a decision conflict. The script goes over the articles where at least two raters disagreed during screening and presents each rater's reasoning. After a careful discussion, the raters can make their final decision on the relevance of the paper.
This script is used twice: after Step 5 and Step 7.
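The disagreement detection boils down to comparing each rater's recorded decision per article; a toy illustration (the in-memory dictionaries below stand in for the per-rater DBs):

```python
# Hypothetical per-rater decisions keyed by article id; the real script reads them from each rater's DB.
rater_decisions = {
    "rater1.db": {"paper-42": "yes", "paper-43": "no"},
    "rater2.db": {"paper-42": "no", "paper-43": "no"},
}

# An article needs discussion when the raters did not all make the same choice.
article_ids = set().union(*(d.keys() for d in rater_decisions.values()))
conflicts = [a for a in sorted(article_ids)
             if len({d.get(a) for d in rater_decisions.values()}) > 1]
print(conflicts)  # -> ['paper-42']
```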
Solve disagreements for iteration 1:
```bash
python 6_8_solve_disagreements.py --iteration 1 --search_dbs rater1.db rater2.db ... --selection_stage TITLE
```

- `--iteration` (required) Target iteration (int)
- `--search_dbs` (required) SQLite DB paths (one for each rater involved)
- `--selection_stage` (required: TITLE or CONTENT) String representing which disagreements are being processed
7_filter_by_content.py iteratively asks the user whether to keep each paper, based on the content of the full paper. For each article, the tool presents the URL of the article. Along with the user's choice (yes, no, or skip), ProfOlaf also prompts the user for the reasoning behind their choice. This is used to help each rater remember their thought process in the following steps.
Filter iteration 1:
```bash
python 7_filter_by_content.py --iteration 1
```

Custom DB path:

```bash
python 7_filter_by_content.py --iteration 1 --db_path ./data/database.db
```

- `--iteration` (required) Target iteration (int)
- `--db_path` (optional) SQLite DB path (default: from `search_conf.json`)
6_8_solve_disagreements.py is used again, in the same way as before, this time for the content-screening disagreements.
### Usage
```bash
python 6_8_solve_disagreements.py --iteration 1 --search_dbs rater1.db rater2.db ... --selection_stage CONTENT
```

- `--iteration` (required) Target iteration (int)
- `--search_dbs` (required) SQLite DB paths (one for each rater involved)
- `--selection_stage` (required: TITLE or CONTENT) String representing which disagreements are being processed
9_remove_duplicates.py is used to remove potential duplicates from the results. The same article is often present in a database under a slightly different title. This script catches possible duplicate articles in the database and asks the user which one to keep (or whether it is a false positive and both should be kept).
```bash
python 9_remove_duplicates.py --iterations 1 2 3 4
```

- `--iterations` (required) Iterations to include
- `--db_path` (optional) Final SQLite DB path
- `--auto-remove` (optional) Automatically removes duplicates without user confirmation
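Near-duplicate titles can be caught with a simple string-similarity test; a sketch of one common approach (ProfOlaf's actual matching heuristic may differ):

```python
from difflib import SequenceMatcher

def is_probable_duplicate(title_a, title_b, threshold=0.9):
    """Flag two titles as likely duplicates when their normalized similarity exceeds a threshold."""
    a, b = title_a.casefold().strip(), title_b.casefold().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(is_probable_duplicate("Awesome Paper Title 1", "Awesome paper title 1."))  # -> True
```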
10_generate_csv.py takes the information in the final search database and exports it to a CSV file.
```bash
python 10_generate_csv.py --iterations 1 2 3 4
```

- `--iterations` (required) Iterations to include
- `--db_path` (optional) Final SQLite DB path
- `--output_path` (optional) Path to the output CSV file
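A minimal sketch of what such an export involves (the table and column names here are hypothetical; ProfOlaf's schema may differ):

```python
import csv
import sqlite3

con = sqlite3.connect("./data/database.db")
# Hypothetical table/column names for illustration only.
rows = con.execute("SELECT title, venue, pub_year FROM articles WHERE selected = 1")

with open("./results/output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "venue", "pub_year"])
    writer.writerows(rows)
```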