Version of record repository for A retrospective analysis of 400 publications reveals patterns of irreproducibility across an entire life sciences research field
Because of the sensitive nature of the topic, all author names were anonymized in this version of the record. Each entry containing an author name was replaced by its MD5 hash using the script anonymize_authors.py. The de-anonymized data is available upon request.
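As an illustration of the hashing step, here is a minimal sketch of replacing author names with their MD5 hex digests. It is not the content of anonymize_authors.py: the function names, the `author` key, and the UTF-8 encoding without any name normalization are assumptions made for this example.

```python
import hashlib


def anonymize_name(name: str) -> str:
    """Return the hexadecimal MD5 digest of an author name (UTF-8 encoded)."""
    return hashlib.md5(name.encode("utf-8")).hexdigest()


def anonymize_records(records, author_key="author"):
    """Return copies of the records with the author field replaced by its hash.

    `author_key` is a hypothetical field name chosen for this sketch.
    """
    return [{**rec, author_key: anonymize_name(rec[author_key])} for rec in records]


# Toy records with made-up names, not data from the study.
example = [
    {"author": "Jane Doe", "claim_id": 1},
    {"author": "Jane Doe", "claim_id": 2},
    {"author": "John Roe", "claim_id": 3},
]
anonymized = anonymize_records(example)
for row in anonymized:
    print(row)
```

Because the same name always maps to the same digest, claims can still be grouped by author after anonymization.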
This repository contains the complete analysis code and data for a comprehensive study examining reproducibility patterns across 400 publications in Drosophila research. It generates all figures (see example below) and all statistical analyses.
├── preprocessed_data/ # Processed datasets ready for analysis
├── analysis_claims.py # Main claim analysis (Figures 1-3)
├── analysis_authors_first.py # First author analysis (Figures 4-5)
├── analysis_authors_last.py # Last author analysis (Figures 6-7, 9)
├── statistical_analysis.py # Multivariate model (Figure 10)
├── plot_info.py # Plotting utilities
├── wrangling.py # Data processing functions
├── stat_lib.py # Statistical analysis functions
├── preprocess_db.ipynb # Database preprocessing
└── preprocess_xlsx.ipynb # Excel file preprocessing
The simplest way to reproduce all figures and analyses is to use the preprocessed data stored in preprocessed_data/:
# Run the main analyses
python analysis_claims.py # Generates Main Text Figures 1-3
python analysis_authors_first.py # Generates Main Text Figures 4-5
python analysis_authors_last.py # Generates Main Text Figures 6-7, 9
python statistical_analysis.py # Generates Main Text Figure 10
This generates all paper figures and tables, all numbers reported in the text with Wilson confidence intervals, all odds ratios for categorical variable comparisons with significance tests, and the main model, a random-effects mixed regression (fitted with Bambi) with diagnostic checks (rhat, posterior predictive checks) ...
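For readers unfamiliar with the two summary statistics mentioned above, the following is a minimal sketch of a Wilson score interval for a proportion and of an odds ratio from a 2x2 table. It is not taken from stat_lib.py. The function names are hypothetical, and the Wald interval on the log odds ratio is an assumption chosen for simplicity; the repository's own significance test may differ.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% normal quantile


def wilson_interval(successes, n, z=Z_95):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def odds_ratio_wald(a, b, c, d, z=Z_95):
    """Odds ratio (a*d)/(b*c) of a 2x2 table with a Wald interval on the log scale.

    Table layout assumed here: rows are the two groups, columns are
    outcome present (a, c) and outcome absent (b, d).
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    estimate = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(estimate) - z * se)
    high = math.exp(math.log(estimate) + z * se)
    return estimate, low, high


# Toy counts for illustration only, not results from the study.
print(wilson_interval(5, 10))
print(odds_ratio_wald(10, 20, 5, 40))
```

Unlike the plain normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for small samples and proportions near 0 or 1.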
To reproduce the complete preprocessing pipeline:
- Download the SQL dump from the ReproSci database (https://reprosci.epfl.ch)
- Run preprocess_db.ipynb to process the database
- Run preprocess_xlsx.ipynb to extract manual covariates from Excel files
- Execute the analysis scripts as above
- article_db.csv: Article metadata (journal, year) from ReproSci database
- author_db.csv: Author information (sex, etc.) from ReproSci database
- claims_db_truncated.csv: Main dataset with one row per claim, merged with article and author data
- first_author_claims.csv: First author covariates merged with claim data
- last_author_claims.csv: Last author covariates merged with claim data
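Before running the analysis scripts, it can help to confirm that the files listed above are present. The sketch below assumes they all live directly under preprocessed_data/; the function name is hypothetical and is not part of the repository.

```python
from pathlib import Path

# File names as listed above; their location under preprocessed_data/ is an assumption.
EXPECTED_FILES = [
    "article_db.csv",
    "author_db.csv",
    "claims_db_truncated.csv",
    "first_author_claims.csv",
    "last_author_claims.csv",
]


def missing_preprocessed_files(data_dir="preprocessed_data"):
    """Return the expected file names that are not found in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_preprocessed_files()
    if missing:
        print("Missing preprocessed files:", ", ".join(missing))
    else:
        print("All preprocessed files found.")
```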
The analysis uses data from the ReproSci database (https://reprosci.epfl.ch). It is built from an SQL dump, but all dataframes extracted from the database are included in this repository.
- PyMC (for Bayesian modeling)
- Bambi (model building)
- pandas, numpy (data manipulation)
- matplotlib, seaborn (visualization)
- scipy (statistical tests)
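A quick way to check that these dependencies are installed in the current environment is sketched below. The import names are the standard ones for these packages; the helper function itself is hypothetical, and no version requirements are checked because the README does not state any.

```python
import importlib.util

# Import names of the packages listed above.
REQUIRED_MODULES = ["pymc", "bambi", "pandas", "numpy", "matplotlib", "seaborn", "scipy"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```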