Version of record repository for A retrospective analysis of 400 publications reveals patterns of irreproducibility across an entire life sciences research field

Because of the sensitive nature of the topic, all author names were anonymized in this version of the record. Each entry containing an author name was replaced by its MD5 hash using the script anonymize_authors.py. The de-anonymized data is available upon request.
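
To illustrate the anonymization step, the sketch below hashes a name with MD5 using Python's hashlib. It is not the content of anonymize_authors.py: the function name, the UTF-8 encoding, and the whitespace stripping are assumptions made for this example.

```python
import hashlib


def md5_pseudonym(name: str) -> str:
    """Return the hexadecimal MD5 digest of an author name.

    Assumption: the name is stripped of surrounding whitespace and
    encoded as UTF-8 before hashing; the actual script may differ.
    """
    return hashlib.md5(name.strip().encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # "Jane Doe" is a made-up placeholder, not a name from the dataset.
    print(md5_pseudonym("Jane Doe"))
```

Because the same input always yields the same digest, entries belonging to one author remain linkable across files after hashing.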

Overview

This repository contains the complete analysis code and data for a comprehensive study examining reproducibility patterns across 400 publications in Drosophila research. The code generates all figures and all statistical analyses reported in the paper.

Repository Structure

├── preprocessed_data/          # Processed datasets ready for analysis
├── analysis_claims.py         # Main claim analysis (Figures 1-3)
├── analysis_authors_first.py  # First author analysis (Figures 4-5)
├── analysis_authors_last.py   # Last author analysis (Figures 6-7, 9)
├── statistical_analysis.py    # Multivariate model (Figure 10)
├── plot_info.py              # Plotting utilities
├── wrangling.py              # Data processing functions
├── stat_lib.py               # Statistical analysis functions
├── preprocess_db.ipynb       # Database preprocessing
└── preprocess_xlsx.ipynb     # Excel file preprocessing

Quick Start

Option 1: Using Preprocessed Data (Recommended)

The simplest way to reproduce all figures and analyses is to use the preprocessed data stored in preprocessed_data/:

# Run the main analyses
python analysis_claims.py        # Generates Main Text Figures 1-3
python analysis_authors_first.py # Generates Main Text Figures 4-5
python analysis_authors_last.py  # Generates Main Text Figures 6-7, 9
python statistical_analysis.py   # Generates Main Text Figure 10

Together, these scripts generate all figures and tables of the paper, all numbers reported in the text with their Wilson confidence intervals, all odds ratios for categorical variable comparisons with their significance tests, and the main model: a random-effects mixed regression (fitted with Bambi) with diagnostic checks (R-hat, posterior predictive checks, ...).
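
For readers unfamiliar with these quantities, here is a minimal standard-library sketch of a Wilson score interval and of an odds ratio for a 2x2 table. It is not the code in stat_lib.py: the function names are hypothetical, and the Wald interval and z-test used for the odds ratio are a simple stand-in for whichever significance test the scripts actually apply.

```python
import math
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


def odds_ratio_wald(a: int, b: int, c: int, d: int, confidence: float = 0.95):
    """Odds ratio for the 2x2 table [[a, b], [c, d]] with a Wald interval.

    Returns (odds_ratio, (low, high), two_sided_p_value). All four cells
    must be positive for the Wald approximation to be defined.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all cells must be positive")
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(log_or - z_crit * se)
    high = math.exp(log_or + z_crit * se)
    # Two-sided p-value of the z-test on the log odds ratio.
    p_value = math.erfc(abs(log_or) / se / math.sqrt(2))
    return math.exp(log_or), (low, high), p_value


if __name__ == "__main__":
    # The counts below are invented for illustration only.
    print(wilson_interval(5, 10))
    print(odds_ratio_wald(10, 20, 5, 40))
```

The Wilson interval stays within [0, 1] and behaves well for small samples and proportions near 0 or 1, which is why it is commonly preferred over the normal-approximation interval.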

Option 2: Full Preprocessing Pipeline

To reproduce the complete preprocessing pipeline:

1. Download the SQL dump from the ReproSci database (https://reprosci.epfl.ch)
2. Run preprocess_db.ipynb to process the database
3. Run preprocess_xlsx.ipynb to extract manual covariates from Excel files
4. Execute the analysis scripts as above

Data Files Stored in the Repository

Core Datasets (`preprocessed_data/`)

article_db.csv: Article metadata (journal, year) from ReproSci database
author_db.csv: Author information (sex, etc.) from ReproSci database
claims_db_truncated.csv: Main dataset with one row per claim, merged with article and author data
first_author_claims.csv: First author covariates merged with claim data
last_author_claims.csv: Last author covariates merged with claim data
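
Before running the analysis scripts, it can help to confirm that the core datasets are present. The sketch below assumes the files live in preprocessed_data/, as shown in the repository structure; the helper name is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names as listed in this README.
EXPECTED_FILES = [
    "article_db.csv",
    "author_db.csv",
    "claims_db_truncated.csv",
    "first_author_claims.csv",
    "last_author_claims.csv",
]


def missing_data_files(data_dir="preprocessed_data"):
    """Return the expected CSV files that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All core datasets found.")
```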

Data Sources

The analysis uses data from the ReproSci database (https://reprosci.epfl.ch). The preprocessing notebooks start from an SQL dump of that database, but all dataframes extracted from it are included in this repository, so the dump is not needed to rerun the analyses.

Dependencies

PyMC (Bayesian modeling)
Bambi (model building)
pandas, numpy (data manipulation)
matplotlib, seaborn (visualization)
scipy (statistical tests)
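
This README does not pin package versions. As a quick sanity check, the sketch below reports which of these packages cannot be found in the current environment. The helper name is hypothetical; the module names are the usual import names of the listed packages.

```python
from importlib.util import find_spec

# Import names of the packages listed in the Dependencies section.
REQUIRED_MODULES = [
    "pymc",
    "bambi",
    "pandas",
    "numpy",
    "matplotlib",
    "seaborn",
    "scipy",
]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All dependencies found.")
```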

jcblemai/drosophila-reproducibility-VOR
