This repository contains 10,000 rows of data scraped from Python-related StackOverflow discussions. The dataset gives a view of common questions, programming challenges, and community engagement (upvotes, answers, and views) within the Python section of StackOverflow.
The repository includes the following files:
- Scraping Notebook: A Jupyter notebook detailing the process used to scrape StackOverflow discussions.
- 10k Dataset: The raw dataset comprising 10,000 rows of scraped data.
- Categorization Notebook: A Jupyter notebook that assigns each StackOverflow post a popularity category based on the scraped features.
- 10k Categorized Dataset: The same 10,000 rows after categorization, which uses features such as upvotes, views, and answers.
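As a rough illustration of what feature-based categorization can look like, here is a minimal sketch. The function name `categorize`, the labels `high`/`medium`/`low`, and every threshold are assumptions made for this example; the Categorization Notebook may use different rules entirely.

```python
def categorize(upvotes: int, views: int, answers: int) -> str:
    """Assign a hypothetical popularity label from three scraped features."""
    # Illustrative thresholds only; the notebook's real cut-offs may differ.
    if upvotes >= 10 and views >= 1000:
        return "high"
    if upvotes >= 1 or answers >= 1:
        return "medium"
    return "low"


# Made-up example posts, not rows from the dataset.
posts = [
    {"upvotes": 25, "views": 4200, "answers": 3},
    {"upvotes": 1, "views": 80, "answers": 0},
    {"upvotes": -2, "views": 15, "answers": 0},
]
labels = [categorize(p["upvotes"], p["views"], p["answers"]) for p in posts]
print(labels)
```

Fixed thresholds are easy to read and adjust; a quantile-based split would be an alternative if the categories should be balanced in size.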
The dataset includes the following columns:
- link: URL of the discussion.
- upvotes: Net vote score of the question (can be negative).
- answers: Number of answers in the discussion.
- views: Number of views for the discussion.
- content: The content of the question, excluding code and post notices.
- code_length: The character length of the code within the question.
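The columns above could be read and type-checked with the standard library, roughly as follows. This is a sketch that assumes the dataset is a CSV file with a header row and that the numeric columns are stored as plain integers; the helper name `read_posts` and the sample rows are made up for illustration.

```python
import csv
import io

EXPECTED_COLUMNS = ("link", "upvotes", "answers", "views", "content", "code_length")
INT_COLUMNS = ("upvotes", "answers", "views", "code_length")


def read_posts(handle):
    """Read rows from an open CSV handle and convert numeric columns to int."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        for name in INT_COLUMNS:
            # int() accepts a leading minus sign, so negative upvotes work.
            row[name] = int(row[name])
        rows.append(row)
    return rows


# Made-up sample in the documented column layout, used in place of the real file.
sample = io.StringIO(
    "link,upvotes,answers,views,content,code_length\n"
    "https://example.com/q/1,-1,0,12,How do I reverse a list?,0\n"
    "https://example.com/q/2,7,2,950,Why does this loop never end?,143\n"
)
posts = read_posts(sample)
print(posts[0]["upvotes"], posts[1]["code_length"])
```

To read the actual dataset, pass a handle opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.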
To get started:
- Clone the repository: Get a local copy of the dataset and notebooks for analysis.
- Explore the dataset: Use the provided Jupyter notebooks to understand the scraping and categorization processes.
- Analyze the data: Use the categorized dataset for further analysis, such as identifying trends and common questions, or measuring how factors like code length and answer count relate to post popularity.
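One possible starting point for such an analysis is to compare average views for questions with and without code. The sketch below uses only the documented `views` and `code_length` columns; the function name and the sample rows are hypothetical, and it assumes the rows have already been loaded as dictionaries with integer values.

```python
from statistics import mean


def mean_views_by_code_presence(posts):
    """Return mean views for posts with code and for posts without code."""
    with_code = [p["views"] for p in posts if p["code_length"] > 0]
    without_code = [p["views"] for p in posts if p["code_length"] == 0]
    return {
        # None marks an empty group, since mean() raises on empty input.
        "with_code": mean(with_code) if with_code else None,
        "without_code": mean(without_code) if without_code else None,
    }


# Made-up rows for illustration, not taken from the dataset.
sample_posts = [
    {"views": 100, "code_length": 0},
    {"views": 300, "code_length": 250},
    {"views": 500, "code_length": 40},
]
summary = mean_views_by_code_presence(sample_posts)
print(summary)
```

A difference in means on its own does not show that including code causes more views; it is only a first descriptive check.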
Contributions that improve the dataset or the scraping and categorization methods are welcome. Please submit a pull request or open an issue to discuss potential enhancements.