Note: Data updates paused as of June 12th, due to breaking changes on the e-Cinepramaan portal
Dataset and related analysis of modifications or cuts made by the Central Board of Film Certification (CBFC), India.
The dataset consists of two main components:
- Raw Data: Raw category and certificate data from the CBFC website, stored in `data/raw/`
- Processed Data: Cleaned-up data enhanced with code-based and LLM-based analysis of cuts, stored in `data/data.csv`
- Modifications (~20MB)
- Metadata (~40MB)
- Categories (~100MB)
- Processed Dataset (~100MB)
Further data is available in the `data/` directory.
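For exploring the processed dataset, a minimal sketch using pandas (no column names are assumed here; inspect the CSV header for the actual schema):

```python
import pandas as pd

# Load the processed dataset (metadata, modifications, and analysis joined into one table).
df = pd.read_csv("data/data.csv")

# Inspect the schema and a few rows before relying on any particular column.
print(df.columns.tolist())
print(df.head())
```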
The following scripts fetch data from the CBFC website:
- `scripts/certificates/`: Film metadata, modifications
- `scripts/categories/`: Film categories
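As the paragraph below notes, these scripts fetch incrementally and append new films to CSV files. A minimal sketch of that append-only pattern, assuming a hypothetical file name, column names, and a stand-in fetch function (none of these are the scripts' actual code):

```python
import pandas as pd

# Hypothetical path and schema; the repository's actual raw files may differ.
CSV_PATH = "data/raw/certificates.csv"

def fetch_certificates_after(last_id):
    """Stand-in for the portal-scraping step performed by the real scripts."""
    return pd.DataFrame(columns=["certificate_id", "title", "modifications"])

existing = pd.read_csv(CSV_PATH)
new_rows = fetch_certificates_after(existing["certificate_id"].max())

# Append only newly fetched films so repeated runs stay incremental.
if not new_rows.empty:
    new_rows.to_csv(CSV_PATH, mode="a", header=False, index=False)
```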
The above scripts incrementally fetch new films and append them to the relevant CSV files. After fetching the data from the CBFC website, code-based analysis of the metadata and modifications is done in `scripts/analysis/` and LLM-based analysis is done in `scripts/llm/`. Next, `scripts/imdb/` further enriches the metadata, and all the fetched data is joined together by `scripts/join/`, which saves the final dataset to `data/data.csv`.
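A rough sketch of what the final join step might look like; apart from `data/data.csv`, the file names and the join key below are assumptions, not the repository's actual schema:

```python
import pandas as pd

# Hypothetical intermediate files produced by the earlier steps.
metadata = pd.read_csv("data/raw/metadata.csv")
modifications = pd.read_csv("data/raw/modifications.csv")
categories = pd.read_csv("data/raw/categories.csv")

# Join everything on an assumed shared film identifier.
merged = (
    metadata
    .merge(modifications, on="certificate_id", how="left")
    .merge(categories, on="certificate_id", how="left")
)

merged.to_csv("data/data.csv", index=False)
```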
The code-based analysis is done by a Python script, `scripts/analysis/main.py`, that cleans and processes the raw data:
- Standardizes duration formats and attempts to extract timestamps from the descriptions (see the sketch after this list).
- Categorizes modifications based on type (audio, visual, deletion, etc.) and the basic type of content (violence, nudity, etc.) using an LLM.
- Creates a dashboard for exploring the data.
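As an illustration of the duration and timestamp parsing described in the first bullet above, a hedged sketch (the regular expressions and function names are illustrative, not the ones in `scripts/analysis/main.py`):

```python
import re

# Illustrative patterns for durations like "2 mins 30 secs" and timestamps like "01:23:45".
MINUTES_RE = re.compile(r"(\d+)\s*min(?:ute)?s?", re.IGNORECASE)
SECONDS_RE = re.compile(r"(\d+)\s*sec(?:ond)?s?", re.IGNORECASE)
TIMESTAMP_RE = re.compile(r"\b(\d{1,2}):(\d{2})(?::(\d{2}))?\b")

def duration_to_seconds(text):
    """Normalize a free-text duration such as '2 mins 30 secs' into seconds."""
    minutes = MINUTES_RE.search(text)
    seconds = SECONDS_RE.search(text)
    if not minutes and not seconds:
        return None
    return (int(minutes.group(1)) if minutes else 0) * 60 + (int(seconds.group(1)) if seconds else 0)

def extract_timestamps(description):
    """Pull hh:mm[:ss]-style timestamps out of a modification description."""
    return [m.group(0) for m in TIMESTAMP_RE.finditer(description)]
```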