This repository holds the code to process a subreddit dataset.
The repository contains many files and scripts, but you only need to look at a few of them to get to know the data:
- unpopularopinion_comments.10000.jsonl. The first 10,000 lines of the cleaned comments file of the unpopularopinion subreddit. The unsampled cleaned comments file (47,519,950 lines) can be found in Google Drive.
- unpopularopinion_submissions.10000.jsonl. The first 10,000 lines of the cleaned submissions file of the unpopularopinion subreddit. The unsampled cleaned submissions file (2,394,871 lines) can be found in Google Drive.
- unpopular_user_summary.tsv. Per-user counts of #posts, #comments, and #comments_on_unique_posts.
You can find a similar but more comprehensive file structure under chinesefood, because the chinesefood subreddit is much smaller and its data files fit on GitHub more easily.
Here are the explanations of the keys in the comments file and in the submissions file, written by ChatGPT :)
Here is the full file structure on my computer. The GitHub version contains all the Python scripts, but not all the data files, due to size limits.
.
├── chinesefood
│ ├── chinesefood_comments # raw comments data
│ ├── chinesefood_submissions # raw submission data
│ ├── chinesefood_comments.jsonl # processed comments (json line)
│ ├── chinesefood_submissions.jsonl # processed submissions (json line)
│ ├── chinesefood_comments.db # processed comments in database
│ ├── chinesefood_submissions.db # processed submissions in database
│ ├── chinesefood_user_comment_count.tsv # #comments and #comments_on_unique_posts per user
│ ├── chinesefood_user_post_count.tsv # #posts per user
│ └── chinesefood_user_summary.tsv # Summary of user activity (comments & posts)
├── unpopularopinion
│ ├── unpopularopinion_comments # raw comments data
│ ├── unpopularopinion_submissions # raw submission data
│ ├── unpopularopinion_comments.jsonl # processed comments (json line)
│ ├── unpopularopinion_submissions.jsonl # processed submissions (json line)
│ ├── unpopularopinion_comments.db # processed comments in database
│ ├── unpopularopinion_submissions.db # processed submissions in database
│ ├── unpopularopinion_user_comment_count.tsv # #comments and #comments_on_unique_posts per user
│ ├── unpopularopinion_user_post_count.tsv # #posts per user
│ ├── unpopularopinion_user_summary.tsv # Summary of user activity (comments & posts)
│ ├── unpopularopinion_comments.10000.jsonl # Sample of 10,000 comments
│ └── unpopularopinion_submissions.10000.jsonl # Sample of 10,000 submissions
├── comment-filter-fields-chunk.py # Script for filtering necessary fields in comments (chunked processing)
├── comment-filter-fields.py # Script for filtering necessary fields in comments (full processing)
├── comments-db.py # Script to store comments in a database for efficient querying
├── submission-filter-chunk.py # Script for filtering necessary fields in submissions (chunked processing)
├── submission-filter-fields.py # Script for filtering necessary fields in submissions (full processing)
├── submissions-db.py # Script to store submissions in a database for efficient querying
├── count-comment.py # Script to count comments for each user
├── count-submission.py # Script to count posts for each user
├── count-summary.py # Script to generate a user activity summary from counts
├── user-summary-db.py # Script to calculate user activity summary from the database (could be very slow)
├── user-summary.py # Script to generate user activity summary from json line files (could face memory capability problems)
├── reddit-1614740ac8c94505e4ecb9d88be8bed7b6afddd4.torrent # Torrent file for downloading Reddit dataset
└── readme.md
The subreddit dataset is from https://www.reddit.com/r/pushshift/comments/1itme1k/separate_dump_files_for_the_top_40k_subreddits/. The torrent file ./reddit-1614740ac8c94505e4ecb9d88be8bed7b6afddd4.torrent is included in the repository; a torrent downloader is needed to download the files from it.
After downloading the selected <theme> from the torrent, we get <theme>_comments.zst and <theme>_submissions.zst. Decompressing the .zst files yields <theme>_comments (the comments on the subreddit posts) and <theme>_submissions (the subreddit posts). Both the comments and the submissions files consist of JSON lines: each line in a file is one JSON object, i.e., one comment or one post, respectively.
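For reference, the .zst dumps can be streamed line by line without decompressing them to disk first. This is a minimal sketch, assuming the zstandard package is installed; it is not one of the repository scripts, and the file name is just an example:

```python
# Minimal sketch: stream a .zst dump line by line without loading it into memory.
# Assumes the `zstandard` package (pip install zstandard); the file name is an example.
import io
import json
import zstandard as zstd

def iter_jsonl_zst(path):
    """Yield one parsed JSON object per line of a zstd-compressed JSON-lines file."""
    with open(path, "rb") as fh:
        # Reddit dumps are compressed with a large window, so raise the limit.
        dctx = zstd.ZstdDecompressor(max_window_size=2**31)
        with dctx.stream_reader(fh) as reader:
            for line in io.TextIOWrapper(reader, encoding="utf-8"):
                if line.strip():
                    yield json.loads(line)

# Example: peek at the first comment of a downloaded dump.
for obj in iter_jsonl_zst("unpopularopinion_comments.zst"):
    print(obj.get("author"), obj.get("body", "")[:80])
    break
```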
There are several challenges in processing the comments and the submissions files.
- Size. unpopularopinion_comments takes up 64G, while unpopularopinion_submissions takes up 5.4G.
- Non-uniform format. The JSON lines in a comments file generally share the same keys, as verified by having ChatGPT compare the keys of several lines (a scripted version of this check is sketched after this list). However, this is not the case for the submissions files, where the JSON lines do not share the same keys.
- Redundant keys. Each JSON line in the raw comments and submissions files has around 100 keys, and not all of them are necessary for our project. We used ChatGPT to choose the necessary fields from randomly selected lines of the comments and submissions files.
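The key-uniformity check mentioned above can also be scripted instead of eyeballed. This is a rough sketch, not one of the repository scripts, and the sample file name is only an example:

```python
# Rough sketch: sample a few lines from a JSON-lines file and compare their key sets,
# to check whether the lines share a uniform schema. The file name is an example.
import json

def compare_keys(path, sample_size=5):
    key_sets = []
    with open(path, encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            if i >= sample_size:
                break
            key_sets.append(set(json.loads(line).keys()))
    common = set.intersection(*key_sets)
    union = set.union(*key_sets)
    print(f"keys shared by all sampled lines: {len(common)}")
    print(f"keys appearing in only some lines: {sorted(union - common)}")

compare_keys("unpopularopinion_submissions.10000.jsonl")
```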
We used the Python scripts comment-filter-fields-chunk.py and submission-filter-chunk.py to select the necessary fields from the raw data, and exported the results to <theme>/<theme>_comments.jsonl and <theme>/<theme>_submissions.jsonl.
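For orientation, here is a simplified sketch of what the chunked filtering step does. The KEEP field set and the paths are illustrative only; the actual field selection lives in comment-filter-fields-chunk.py and submission-filter-chunk.py and may differ:

```python
# Simplified sketch of chunked field filtering; the real logic is in the repository
# scripts. The KEEP set and paths below are illustrative assumptions only.
import json

KEEP = {"id", "author", "body", "link_id", "parent_id", "created_utc", "score"}  # example fields
CHUNK = 100_000  # number of lines buffered before each write

def filter_fields(src, dst):
    buffer = []
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            obj = json.loads(line)
            # Keep only the selected fields that are present on this line.
            buffer.append(json.dumps({k: obj[k] for k in KEEP if k in obj}))
            if len(buffer) >= CHUNK:
                fout.write("\n".join(buffer) + "\n")
                buffer.clear()
        if buffer:
            fout.write("\n".join(buffer) + "\n")

filter_fields("unpopularopinion/unpopularopinion_comments",
              "unpopularopinion/unpopularopinion_comments.jsonl")
```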
After this processing, unpopularopinion_comments.jsonl and unpopularopinion_submissions.jsonl shrank to 14G and 1.9G, respectively. These files are much smaller, with a concise and uniform set of keys.
From now on, we work on <theme>/<theme>_comments.jsonl and <theme>/<theme>_submissions.jsonl rather than the raw data.
We can count #posts per user from <theme>/<theme>_submissions.jsonl and compute #comments and #comments_on_unique_posts per user from <theme>/<theme>_comments.jsonl, using count-submission.py and count-comment.py, respectively. This produces <theme>/<theme>_user_post_count.tsv and <theme>/<theme>_user_comment_count.tsv. Finally, we run count-summary.py on these two files to get <theme>/<theme>_user_summary.tsv.
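To illustrate the counting step, here is a rough sketch of the idea behind count-comment.py: #comments is the number of comment lines per author, and #comments_on_unique_posts is the number of distinct posts (link_id) the author commented on. The field names and output columns are assumptions based on standard Reddit dump fields, not a copy of the script:

```python
# Sketch of per-user comment counting: #comments per author, and the number of
# distinct posts (link_id) each author commented on. Field names are assumptions.
import csv
import json
from collections import defaultdict

comment_count = defaultdict(int)
post_ids = defaultdict(set)

with open("unpopularopinion/unpopularopinion_comments.jsonl", encoding="utf-8") as fh:
    for line in fh:
        obj = json.loads(line)
        author = obj.get("author")
        comment_count[author] += 1
        post_ids[author].add(obj.get("link_id"))

with open("unpopularopinion/unpopularopinion_user_comment_count.tsv", "w",
          newline="", encoding="utf-8") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["user", "comments", "comments_on_unique_posts"])
    for author in comment_count:
        writer.writerow([author, comment_count[author], len(post_ids[author])])
```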
As unpopularopinion_comments.jsonl (14G) and unpopularopinion_submissions.jsonl (1.9G) are still too big to operate on directly in memory, it is a good idea to put them into a database for quick retrieval. Running comments-db.py and submissions-db.py produces the database versions of the comments and submissions (unpopularopinion_comments.db and unpopularopinion_submissions.db), which is a much more practical option when the comments and submissions files are too big to fit in memory.
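Assuming the .db files are SQLite (an assumption; check the scripts for the actual schema), the database step boils down to loading the JSON lines into a table and indexing the columns you query often, such as author. A minimal sketch with illustrative table and column names:

```python
# Minimal sketch of the database idea: load the processed comments into SQLite and
# index by author so per-user lookups don't require scanning the whole 14G file.
# Assumes the .db files are SQLite; table and column names are illustrative.
import json
import sqlite3

conn = sqlite3.connect("unpopularopinion/unpopularopinion_comments.db")
conn.execute("""CREATE TABLE IF NOT EXISTS comments (
    id TEXT PRIMARY KEY, author TEXT, link_id TEXT, created_utc INTEGER, body TEXT)""")

with open("unpopularopinion/unpopularopinion_comments.jsonl", encoding="utf-8") as fh:
    rows = (
        (o.get("id"), o.get("author"), o.get("link_id"), o.get("created_utc"), o.get("body"))
        for o in map(json.loads, fh)
    )
    conn.executemany("INSERT OR IGNORE INTO comments VALUES (?, ?, ?, ?, ?)", rows)

conn.execute("CREATE INDEX IF NOT EXISTS idx_comments_author ON comments(author)")
conn.commit()

# Example query: all comments by one (hypothetical) user, served from the index.
for row in conn.execute("SELECT body FROM comments WHERE author = ?", ("example_user",)):
    print(row[0][:80])
```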