Note: Data updates paused as of June 12th, due to breaking changes on the e-Cinepramaan portal
Dataset and related analysis of modifications or cuts made by the Central Board of Film Certification (CBFC), India.
The dataset consists of two main components:
- Raw Data: Raw category and certificate data from the CBFC website, stored in `data/raw/`
- Processed Data: Cleaned-up data enhanced with code-based and LLM-based analysis of cuts, stored in `data/data.csv`
- Modifications (~20MB)
- Metadata (~40MB)
- Categories (~100MB)
- Processed Dataset (~100MB)
Further data is available in the `data/` directory.
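For exploring the processed dataset, a minimal sketch using pandas (no column names are assumed here; inspect the CSV header for the actual schema):

```python
import pandas as pd

# Load the processed dataset (metadata, modifications, and analysis joined into one table).
df = pd.read_csv("data/data.csv")

# Inspect the schema and a few rows before relying on any particular column.
print(df.columns.tolist())
print(df.head())
```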
The following scripts fetch data from the CBFC website:
- `scripts/certificates/`: Film metadata, modifications
- `scripts/categories/`: Film categories
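As the paragraph below notes, these scripts fetch incrementally and append new films to CSV files. A minimal sketch of that append-only pattern, assuming a hypothetical file name, column names, and a stand-in fetch function (none of these are the scripts' actual code):

```python
import pandas as pd

# Hypothetical path and schema; the repository's actual raw files may differ.
CSV_PATH = "data/raw/certificates.csv"

def fetch_certificates_after(last_id):
    """Stand-in for the portal-scraping step performed by the real scripts."""
    return pd.DataFrame(columns=["certificate_id", "title", "modifications"])

existing = pd.read_csv(CSV_PATH)
new_rows = fetch_certificates_after(existing["certificate_id"].max())

# Append only newly fetched films so repeated runs stay incremental.
if not new_rows.empty:
    new_rows.to_csv(CSV_PATH, mode="a", header=False, index=False)
```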
The above scripts incrementally fetch new films and append them to the relevant CSV files. After fetching the data from the CBFC website, code-based analysis of the metadata and modifications is done in `scripts/analysis/` and LLM-based analysis is done in `scripts/llm/`. Next, `scripts/imdb/` further enriches the metadata, and all the fetched data is joined together by `scripts/join/`, which saves the final dataset to `data/data.csv`.
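A rough sketch of what the final join step might look like; apart from `data/data.csv`, the file names and the join key below are assumptions, not the repository's actual schema:

```python
import pandas as pd

# Hypothetical intermediate files produced by the earlier steps.
metadata = pd.read_csv("data/raw/metadata.csv")
modifications = pd.read_csv("data/raw/modifications.csv")
categories = pd.read_csv("data/raw/categories.csv")

# Join everything on an assumed shared film identifier.
merged = (
    metadata
    .merge(modifications, on="certificate_id", how="left")
    .merge(categories, on="certificate_id", how="left")
)

merged.to_csv("data/data.csv", index=False)
```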
The code-based analysis is done by a Python script, `scripts/analysis/main.py`, that cleans and processes the raw data:
- Standardizes duration formats and attempts to extract timestamps from the descriptions (see the sketch after this list).
- Categorizes modifications based on type (audio, visual, deletion, etc.) and the basic type of content (violence, nudity, etc.) using an LLM.
- Creates a dashboard for exploring the data.
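As an illustration of the duration and timestamp parsing described in the first bullet above, a hedged sketch (the regular expressions and function names are illustrative, not the ones in `scripts/analysis/main.py`):

```python
import re

# Illustrative patterns for durations like "2 mins 30 secs" and timestamps like "01:23:45".
MINUTES_RE = re.compile(r"(\d+)\s*min(?:ute)?s?", re.IGNORECASE)
SECONDS_RE = re.compile(r"(\d+)\s*sec(?:ond)?s?", re.IGNORECASE)
TIMESTAMP_RE = re.compile(r"\b(\d{1,2}):(\d{2})(?::(\d{2}))?\b")

def duration_to_seconds(text):
    """Normalize a free-text duration such as '2 mins 30 secs' into seconds."""
    minutes = MINUTES_RE.search(text)
    seconds = SECONDS_RE.search(text)
    if not minutes and not seconds:
        return None
    return (int(minutes.group(1)) if minutes else 0) * 60 + (int(seconds.group(1)) if seconds else 0)

def extract_timestamps(description):
    """Pull hh:mm[:ss]-style timestamps out of a modification description."""
    return [m.group(0) for m in TIMESTAMP_RE.finditer(description)]
```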