Major rewrite to v3 #4

Draft
wants to merge 104 commits into
base: master
de8fc46
Cruft
MarkusShepherd Nov 11, 2024
47415be
Adding scrapy to dependencies
MarkusShepherd Nov 11, 2024
84698f0
Move most old code to v2 dir
MarkusShepherd Nov 11, 2024
dae4907
Exclude v2 from pre-commits
MarkusShepherd Nov 11, 2024
1c4ea90
docker-compose.yaml -> compose.yaml
MarkusShepherd Nov 11, 2024
998d97c
Update shuffle.sh
MarkusShepherd Nov 11, 2024
7ad9f15
Scrapy scaffolding
MarkusShepherd Nov 11, 2024
0a9e205
Comment out middleware
MarkusShepherd Nov 11, 2024
13d4d95
Clean up shit rules
MarkusShepherd Nov 11, 2024
c687da5
First draft of BggSpider
MarkusShepherd Nov 11, 2024
0c40916
Added itemadapter, jmespath, pyyaml requests and w3lib as dependencies
MarkusShepherd Nov 15, 2024
2942d51
{v2 => src}/board_game_scraper/download_bgg_dump.py
MarkusShepherd Nov 15, 2024
02fe6fa
Require itemloaders>=1.1.0 which adds support for JMESPath
MarkusShepherd Nov 15, 2024
9798b41
Added python-dotenv as dependency and corrected BASE_DIR in download_…
MarkusShepherd Nov 16, 2024
dfb39fc
scrapy.responsetypes.Response -> scrapy.http.Response
MarkusShepherd Nov 16, 2024
2b6b225
Added new scrapy-extensions dependency
MarkusShepherd Nov 17, 2024
ca2f7e3
Settings
MarkusShepherd Nov 17, 2024
cfbaa90
scrapy-extensions v1.0.1
MarkusShepherd Nov 17, 2024
3ef4f54
Added attrs as dependency
MarkusShepherd Nov 19, 2024
78253da
Removed redundant TWISTED_REACTOR setting
MarkusShepherd Nov 19, 2024
48658c7
Addressing some minor mypy complaints
MarkusShepherd Nov 19, 2024
3ae1c88
GameItem
MarkusShepherd Nov 19, 2024
d4315d8
UserItem
MarkusShepherd Nov 19, 2024
f85d689
CollectionItem
MarkusShepherd Nov 19, 2024
5496de3
Added FEEDS setting
MarkusShepherd Nov 19, 2024
1c8f64a
Actually start scraping something
MarkusShepherd Nov 19, 2024
e978e3f
Default for scraped_at in items.py
MarkusShepherd Nov 23, 2024
c066072
Use ItemLoader in board_game_scraper/spiders/bgg.py
MarkusShepherd Nov 23, 2024
fe9e61b
Added board_game_scraper.utils
MarkusShepherd Nov 24, 2024
a02012e
Adding some converters to items
MarkusShepherd Nov 24, 2024
8a6e6b8
Added python-dateutil to dependencies
MarkusShepherd Nov 24, 2024
a3db058
Added parse_float and parse_date
MarkusShepherd Nov 24, 2024
67758f4
Added utils.now
MarkusShepherd Nov 24, 2024
e49b22b
Added to_str() and normalize_space() to utils
MarkusShepherd Nov 24, 2024
bab6f24
Move conversion from items to loaders
MarkusShepherd Nov 24, 2024
98cba4a
normalize_space_with_newline for descriptions and comments
MarkusShepherd Nov 24, 2024
9ac1cef
Added cron/games.recommend.curl_hotness.plist
MarkusShepherd Nov 25, 2024
3d1a421
Added cron/games.recommend.download_bgg_dump.plist
MarkusShepherd Nov 25, 2024
2e998ef
Added cron/games.recommend.bgg_ranked_csv.plist
MarkusShepherd Nov 25, 2024
b375fd2
Small changes
MarkusShepherd Nov 26, 2024
7b13ed7
Whitespace
MarkusShepherd Nov 26, 2024
698c63e
Introduced BggGameLoader; added more scraped fields
MarkusShepherd Nov 26, 2024
1d24b94
Added more (all?) input and output processors to GameLoader
MarkusShepherd Nov 26, 2024
02fc4f9
Added UserLoader
MarkusShepherd Nov 26, 2024
7f9a6e5
Added some input processors to CollectionLoader
MarkusShepherd Nov 26, 2024
2ff17c2
Added response_urljoin() to loaders
MarkusShepherd Nov 26, 2024
5a4dbc8
More fields scraped
MarkusShepherd Nov 26, 2024
6c9cb7c
More fields
MarkusShepherd Nov 26, 2024
d11856c
Better CollectionItem scraping
MarkusShepherd Nov 26, 2024
b6df965
Some cleanup
MarkusShepherd Nov 26, 2024
af21eba
Added more-itertools
MarkusShepherd Nov 27, 2024
49ccb45
Collect BGG IDs in chunks
MarkusShepherd Nov 27, 2024
ae5c5f5
Added BggSpider.has_seen_bgg_id()
MarkusShepherd Nov 27, 2024
34969f4
Warning if no state
MarkusShepherd Nov 27, 2024
e5ed156
Added RankingItem and RankingLoader
MarkusShepherd Nov 27, 2024
333e96d
Use RankingLoader in BggSpider
MarkusShepherd Nov 27, 2024
3c4c447
Added SparseJsonLinesItemExporter
MarkusShepherd Nov 28, 2024
55e4b83
Comments and TODOs
MarkusShepherd Nov 28, 2024
3b5b651
Set FEED_EXPORT_BATCH_ITEM_COUNT and add %(batch_id)05d to paths
MarkusShepherd Nov 28, 2024
89914e8
Small changes
MarkusShepherd Nov 28, 2024
f5e8200
Moving around things
MarkusShepherd Nov 28, 2024
3916de5
Added BggSpider.game_requests()
MarkusShepherd Nov 28, 2024
d4b0bf7
Remove BggSpider.sitemap_filter() and batch requests in _parse_sitema…
MarkusShepherd Nov 28, 2024
7d03cd3
Extracted BggSpider.scrape_game_item()
MarkusShepherd Nov 28, 2024
e4642f2
Factored out scrape_ranking_item()
MarkusShepherd Nov 28, 2024
6762f60
Factor out extract_collection_item()
MarkusShepherd Nov 28, 2024
a9ffc1c
Replace assert isinstance() with cast()
MarkusShepherd Nov 28, 2024
9913137
Added contract to _parse_sitemap()
MarkusShepherd Nov 28, 2024
9e35c76
Added bgg_ids to request.meta
MarkusShepherd Nov 28, 2024
2423307
Querying the next pages of a game item; added extract_page_number
MarkusShepherd Nov 28, 2024
75798f6
Remove debug message
MarkusShepherd Nov 28, 2024
27af0e9
Reduce priority of next pages
MarkusShepherd Nov 29, 2024
373d122
Lower priority for higher page numbers; corrected max_page_from_respo…
MarkusShepherd Nov 29, 2024
1350d00
Custom DOWNLOAD_DELAY
MarkusShepherd Nov 29, 2024
dd16e6b
Added BggSpider.collection_request()
MarkusShepherd Nov 29, 2024
0cf6b8b
Make BggSpider.collection_request()
MarkusShepherd Nov 30, 2024
d15f679
Added BggSpider.scrape_ratings, scrape_collections and scrape_users
MarkusShepherd Nov 30, 2024
6f9a5c6
Make scrape_ratings, scrape_collections and scrape_users explicit arg…
MarkusShepherd Nov 30, 2024
c36e4f7
Properly parse_collection()
MarkusShepherd Nov 30, 2024
a425ea7
Move around things; always parse scrape_* correctly
MarkusShepherd Nov 30, 2024
d4f131b
Added BggSpider.user_request(), parse_user() and extract_user_item()
MarkusShepherd Nov 30, 2024
a178c58
Tweak settings
MarkusShepherd Nov 30, 2024
15d335a
game_files and user_files as args
MarkusShepherd Nov 30, 2024
446812a
Stub game_requests_from_files() and user_and_collection_requests_from…
MarkusShepherd Nov 30, 2024
c9f0686
Added utils.files.extract_field_from_jsonlines_file()
MarkusShepherd Nov 30, 2024
3f68199
Added extract_field_from_csv_file()
MarkusShepherd Nov 30, 2024
22df276
Added load_premium_users()
MarkusShepherd Nov 30, 2024
c967b06
Added extract_field_from_files()
MarkusShepherd Nov 30, 2024
aec287e
Moved parse_file_paths() to utils and extract_field_from_files() in *…
MarkusShepherd Nov 30, 2024
45e7244
Scraping num_owned, num_trading, num_wanting, num_wishing, num_commen…
MarkusShepherd Nov 30, 2024
3dbc9aa
Better logging
MarkusShepherd Nov 30, 2024
6b12119
Longer DOWNLOAD_DELAY
MarkusShepherd Nov 30, 2024
9f0d683
Load premium users in BggSpider
MarkusShepherd Dec 1, 2024
bc9a8d3
Contracts
MarkusShepherd Dec 1, 2024
d4bab99
Added iterables.clear_iterable() and clear_list()
MarkusShepherd Dec 1, 2024
1a51f5d
Added utils.ids
MarkusShepherd Dec 1, 2024
75fd39f
Move extract_query_param() and parse_url() to utils.urls and other mi…
MarkusShepherd Dec 1, 2024
84be337
Extract min_players_rec, max_players_rec, min_players_best, max_playe…
MarkusShepherd Dec 1, 2024
5a60b2c
Extract min_age_rec
MarkusShepherd Dec 1, 2024
21f165b
Added language_dependency
MarkusShepherd Dec 1, 2024
3bc44b9
TODO
MarkusShepherd Dec 1, 2024
dafb2d2
Leaner Generators
MarkusShepherd Dec 1, 2024
a25ef6d
Added board_game_scraper.pipelines.LimitImagesPipeline
MarkusShepherd Dec 1, 2024
16057c3
Small corrections
MarkusShepherd Dec 1, 2024
18 changes: 18 additions & 0 deletions .cruft.json
@@ -0,0 +1,18 @@
{
  "template": "https://github.com/woltapp/wolt-python-package-cookiecutter",
  "commit": "b25684a2c63387153b83a1cc03ea332be8c8279a",
  "checkout": null,
  "context": {
    "cookiecutter": {
      "author_name": "Markus Shepherd",
      "author_email": "[email protected]",
      "github_username": "recommend-games",
      "project_name": "Board Game Scraper",
      "project_slug": "board-game-scraper",
      "package_name": "board_game_scraper",
      "project_short_description": "Scraping data about board games from the web",
      "_template": "https://github.com/woltapp/wolt-python-package-cookiecutter"
    }
  },
  "directory": null
}
21 changes: 21 additions & 0 deletions .github/actions/python-poetry-env/action.yml
@@ -0,0 +1,21 @@
name: 'Setup Python + Poetry environment'
description: 'Setup Python + Poetry environment'

inputs:
  python-version:
    required: false
    description: 'Python version'
    default: '3.12'
outputs: {}
runs:
  using: 'composite'
  steps:
    - uses: actions/setup-python@v5
      with:
        python-version: ${{inputs.python-version}}
    - name: Install poetry
      run: python -m pip install poetry
      shell: bash
    - name: Create virtual environment
      run: poetry install
      shell: bash
10 changes: 10 additions & 0 deletions .github/pull_request_template.md
@@ -0,0 +1,10 @@
## Description

A short description about the changes in this pull request. If the pull request is related to some issue, mention it
here.

## Checklist

- [ ] Tests covering the new functionality have been added
- [ ] Documentation has been updated OR the change is too minor to be documented
- [ ] Changes are listed in the `CHANGELOG.md` OR changes are insignificant
79 changes: 79 additions & 0 deletions .github/workflows/cookiecutter.yml
@@ -0,0 +1,79 @@
name: Autoupdate project structure
on:
  workflow_dispatch:
  schedule:
    - cron: "0 0 * * *" # at the end of every day

jobs:
  auto-update-project:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: python -m pip install cruft poetry jello tabulate

      - name: Update project structure
        run: |
          cruft update -y

      - name: Check if there are changes
        id: changes
        run: echo "::set-output name=changed::$(git status --porcelain | wc -l)"

      - name: apply additional changes and fixes
        if: steps.changes.outputs.changed > 0
        run: |
          poetry lock --no-update # add new dependencies
          poetry install
          poetry run pre-commit run -a || true # we have to fix other issues manually

      - name: Get template versions
        id: get_versions
        if: steps.changes.outputs.changed > 0
        shell: bash
        run: |
          CURRENT_VERSION=$(git show HEAD:.cruft.json | jello -r "_['commit'][:8]")
          NEXT_VERSION=$(jello -r "_['commit'][:8]" < .cruft.json)
          echo ::set-output name="current_version::$CURRENT_VERSION"
          echo ::set-output name="next_version::$NEXT_VERSION"

      - name: Get changelog
        id: get_changelog
        if: steps.changes.outputs.changed > 0
        shell: bash
        run: |
          TEMPLATE=$(jello -r "_['template']" < .cruft.json)
          git clone "$TEMPLATE" /tmp/template
          cd /tmp/template
          body=$( (echo "Date;Change;Hash"; git log --pretty=format:"%as;%s;%h" ${{ steps.get_versions.outputs.current_version }}..${{ steps.get_versions.outputs.next_version }}) | tabulate --header --format github -s ';' -)
          body=$(cat <<EOF
          Changes from $TEMPLATE

          $body
          EOF
          )
          body="${body//'%'/'%25'}"
          body="${body//$'\n'/'%0A'}"
          body="${body//$'\r'/'%0D'}"
          echo ::set-output name="changelog::$body"

      # behaviour if PR already exists: https://github.com/marketplace/actions/create-pull-request#action-behaviour
      - name: Create Pull Request
        env:
          # a PAT is required to be able to update workflows
          GITHUB_TOKEN: ${{ secrets.AUTO_UPDATE_GITHUB_TOKEN }}
        if: ${{ steps.changes.outputs.changed > 0 && env.GITHUB_TOKEN != 0 }}
        uses: peter-evans/create-pull-request@v3
        with:
          token: ${{ env.GITHUB_TOKEN }}
          commit-message: >-
            chore: update project structure to ${{ steps.get_versions.outputs.next_version }}
          title: "[Actions] Auto-Update cookiecutter template"
          body: ${{ steps.get_changelog.outputs.changelog }}
          branch: chore/auto-update-project-from-template
          delete-branch: true
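Review note: the "Get template versions" step in this workflow uses `jello -r "_['commit'][:8]"` to shorten the template commit recorded in `.cruft.json`. As a rough illustration of what that expression computes, here is a minimal Python sketch. The function name `short_template_commit` and the trimmed-down `EXAMPLE` document are hypothetical and not part of this PR.

```python
import json


def short_template_commit(cruft_json: str, length: int = 8) -> str:
    """Return the first `length` characters of the template commit hash.

    Hypothetical helper: mirrors the jello expression _['commit'][:8].
    """
    config = json.loads(cruft_json)
    return config["commit"][:length]


# Trimmed-down copy of the .cruft.json added in this PR (illustration only)
EXAMPLE = """
{
  "template": "https://github.com/woltapp/wolt-python-package-cookiecutter",
  "commit": "b25684a2c63387153b83a1cc03ea332be8c8279a",
  "checkout": null
}
"""

if __name__ == "__main__":
    print(short_template_commit(EXAMPLE))
```

The workflow runs this extraction twice, once against `git show HEAD:.cruft.json` and once against the updated file on disk, and uses the two short hashes as the range for `git log` in the changelog step.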
57 changes: 57 additions & 0 deletions .github/workflows/dependencies.yml
@@ -0,0 +1,57 @@
name: Autoupdate dependencies
on:
  workflow_dispatch:
  schedule:
    - cron: "0 0 1 * *"

jobs:
  auto-update-dependencies:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/python-poetry-env

      - name: Install tabulate
        run: python -m pip install tabulate

      - name: Gather outdated dependencies
        id: check_for_outdated_dependencies
        run: |
          body=$(poetry show -o -n)
          echo ::set-output name="body::$body"

      - name: Format PR message
        if: ${{ steps.check_for_outdated_dependencies.outputs.body != 0 }}
        id: get_outdated_dependencies
        shell: bash
        run: |
          body=$(poetry show -o -n | sed 's/(!)//' | awk 'BEGIN {print "Package","Used","Update"}; {print $1,$2,$3}' | tabulate --header --format github -)
          body=$(cat <<EOF
          The following packages are outdated

          $body
          EOF
          )
          body="${body//'%'/'%25'}"
          body="${body//$'\n'/'%0A'}"
          body="${body//$'\r'/'%0D'}"
          echo ::set-output name="body::$body"

      - name: Update outdated packages
        if: ${{ steps.check_for_outdated_dependencies.outputs.body != 0 }}
        run: poetry lock

      # behaviour if PR already exists: https://github.com/marketplace/actions/create-pull-request#action-behaviour
      - name: Create Pull Request
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        if: ${{ steps.check_for_outdated_dependencies.outputs.body != 0 }}
        uses: peter-evans/create-pull-request@v3
        with:
          token: ${{ env.GITHUB_TOKEN }}
          commit-message: >-
            chore: update dependencies
          title: "[Actions] Auto-Update dependencies"
          body: ${{ steps.get_outdated_dependencies.outputs.body }}
          branch: chore/update-dependencies
          delete-branch: true
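Review note: the three `body="${body//…}"` substitutions in the "Format PR message" step percent-encode the multi-line table so it fits on the single line that `::set-output` expects. The sketch below restates that escaping in Python purely as an illustration. The function name `escape_workflow_output` is hypothetical and nothing in this PR calls it.

```python
def escape_workflow_output(value: str) -> str:
    """Percent-encode '%', newline and carriage return, in that order.

    Hypothetical helper mirroring the three bash substitutions in the workflow.
    '%' is replaced first so the '%0A' and '%0D' sequences introduced by the
    later replacements are not themselves escaped a second time.
    """
    return value.replace("%", "%25").replace("\n", "%0A").replace("\r", "%0D")


if __name__ == "__main__":
    table = "| Package | Used |\n| scrapy | 100% |"
    print(escape_workflow_output(table))
```

The order of the replacements matters: escaping `%` last would turn an already encoded newline (`%0A`) into `%250A`.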
52 changes: 52 additions & 0 deletions .github/workflows/draft_release.yml
@@ -0,0 +1,52 @@
name: Draft a release

on:
  workflow_dispatch:
    inputs:
      version:
        description: 'The version number (e.g. 1.2.3) OR one of: patch|minor|major|prepatch|preminor|premajor|prerelease'
        required: true
        default: 'patch'

jobs:
  draft-release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/python-poetry-env
      - name: Update version
        id: updated_version
        shell: bash
        run: |
          poetry version ${{ github.event.inputs.version }}
          version=$(poetry version --short)
          echo ::set-output name="version::$version"
      - name: Update changelog
        id: changelog
        shell: bash
        run: |
          poetry run kacl-cli release ${{ steps.updated_version.outputs.version }} --modify --auto-link
          echo "" >> CHANGELOG.md
          body=$(poetry run kacl-cli get ${{ steps.updated_version.outputs.version }})
          body="${body//'%'/'%25'}"
          body="${body//$'\n'/'%0A'}"
          body="${body//$'\r'/'%0D'}"
          echo ::set-output name="body::$body"
      - name: Commit changes
        uses: EndBug/add-and-commit@v7
        with:
          add: 'CHANGELOG.md pyproject.toml'
          message: 'Release ${{ steps.updated_version.outputs.version }}'
      - name: Create tag
        run: |
          git tag ${{ steps.updated_version.outputs.version }}
          git push origin ${{ steps.updated_version.outputs.version }}
      - name: Create a draft release
        uses: actions/create-release@v1
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          tag_name: ${{ steps.updated_version.outputs.version }}
          release_name: Release ${{ steps.updated_version.outputs.version }}
          body: ${{ steps.changelog.outputs.body }}
          draft: true
18 changes: 18 additions & 0 deletions .github/workflows/release.yml
@@ -0,0 +1,18 @@
name: Release

on:
  release:
    types: [ published ]

jobs:
  build-and-publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/python-poetry-env
      - name: Publish to pypi
        run: |
          poetry config pypi-token.pypi ${{ secrets.PYPI_TOKEN }}
          poetry publish --build --no-interaction
      - name: Deploy docs
        run: poetry run mkdocs gh-deploy --force
54 changes: 54 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,54 @@
name: Test

on:
  pull_request:
  push:
    branches:
      - "**"

jobs:
  actionlint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download actionlint
        run: bash <(curl https://raw.githubusercontent.com/rhysd/actionlint/main/scripts/download-actionlint.bash) 1.6.21
        shell: bash
      - name: Check workflow files
        run: ./actionlint -color
        shell: bash

  lint-cruft:
    name: Check if automatic project update was successful
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Fail if .rej files exist as structure update was not successful
        run: test -z "$(find . -iname '*.rej')"

  pre-commit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/python-poetry-env
      - run: poetry run pre-commit run --all-files

  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12", "3.13"]
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/python-poetry-env
        with:
          python-version: ${{ matrix.python-version }}
      - run: poetry run pytest

  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/python-poetry-env
      - run: poetry run mkdocs build
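Review note: the `lint-cruft` job fails the build when `find . -iname '*.rej'` prints anything, that is, when a template update left rejected patch hunks behind. For anyone who wants to run the same check locally without a shell, a minimal Python sketch could look like the following. The function name `find_reject_files` is hypothetical and not part of this PR.

```python
from pathlib import Path


def find_reject_files(root: str = ".") -> list[Path]:
    """Return every path under `root` whose name ends in .rej, ignoring case.

    Hypothetical local equivalent of `find . -iname '*.rej'`.
    """
    return sorted(
        path for path in Path(root).rglob("*") if path.name.lower().endswith(".rej")
    )


if __name__ == "__main__":
    leftovers = find_reject_files(".")
    for path in leftovers:
        print(path)
    # Mirror `test -z "$(find ...)"`: a non-zero exit status when anything was found
    raise SystemExit(1 if leftovers else 0)
```

The case-insensitive suffix comparison stands in for `-iname`. Like `find`, the sketch does not distinguish files from directories.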