[WIP] 2023 updates #76

Open
wants to merge 21 commits into
base: master
1 change: 1 addition & 0 deletions .gitignore
@@ -1,3 +1,4 @@
*/data
.idea/
*.pyc
*.swp
4 changes: 2 additions & 2 deletions README.md
@@ -1,7 +1,7 @@
FarmSubsidy.org Scrapers
========================

[FarmSubsidy](http://farmsubsidy.openspending.org/) is a website that collects the payment data of the Common Agriculture Policy (CAP) which represents about a third of the EU budget. It was run by a group of journalists and activists for the past years. In 2013 the [OpenSpending project](http://openspending.org/) of the [Open Knowledge Foundation](http://okfn.org/) took over responsibility of the website.
[FarmSubsidy](https://farmsubsidy.org) is a website that collects the payment data of the Common Agriculture Policy (CAP) which represents about a third of the EU budget. It was run by a group of journalists and activists for the past years. In 2013 the [OpenSpending project](http://openspending.org/) of the [Open Knowledge Foundation](http://okfn.org/) took over responsibility of the website.

The FarmSubsidy data is mostly scraped from member state websites. The old scrapers were working well, but were running in costly and proprietary software. This year we need Free and Open Source scrapers and this repository will collect these scrapers and coordinate the effort.

@@ -13,7 +13,7 @@ Developer Documentation

Developer documentation for both website and scrapers can be found at http://farmsubsidy.readthedocs.org.

[Member states data sites](http://ec.europa.eu/agriculture/cap-funding/beneficiaries/shared/index_en.htm)
[Member states data sites](https://agriculture.ec.europa.eu/common-agricultural-policy/financing-cap/beneficiaries_en)


[Financial Reports](http://ec.europa.eu/agriculture/cap-funding/financial-reports/index_en.htm)
4 changes: 3 additions & 1 deletion be/README.md
@@ -8,4 +8,6 @@ Scraper for Belgium

[Scheme Documentation](http://www.belpa.be/pub/PDF/73_2009_EN.pdf)

Scraper notebook available.
Scraper notebook available: use `./be_scraper.ipynb` (works 2023)

The scraper run might break from time to time due to IP rate limiting and/or too many requests.
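
One way to make a run more tolerant of rate limiting is to retry a failed request with a growing delay. The following is a minimal sketch, not code from the notebook: `fetch_with_backoff`, `fetch`, `max_attempts` and `base_delay` are hypothetical names, and `fetch` stands for whatever zero-argument callable performs a single page request and raises on failure.

```python
import time


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each failure.

    `fetch` is a hypothetical zero-argument callable that raises on failure;
    the notebook's own request code is not reproduced here.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the delay schedule can be checked without actually waiting; whether doubling delays is enough for this particular server is an assumption that would need to be tested against it.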
243 changes: 167 additions & 76 deletions be/be_scraper.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 19,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
@@ -16,18 +16,25 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"YEAR = 2018\n",
"YEAR = 2021\n",
"LIMIT = 1000\n",
"BASE_URL = 'https://www.belpa.be/wsExportDataTable?limit={limit}&offset={offset}&lg=fr&budget_year=54&sort=none&&sortType=ASC&'"
"# From select dropdown value\n",
"BUDGET_YEARS = {\n",
" 2018: 54,\n",
" 2019: 289,\n",
" 2020: 291,\n",
" 2021: 292,\n",
"}\n",
"BASE_URL = 'https://www.belpa.be/wsExportDataTable?limit={limit}&offset={offset}&lg=fr&budget_year=' + str(BUDGET_YEARS[YEAR]) + '&sort=none&&sortType=ASC&'"
]
},
{
"cell_type": "code",
"execution_count": 31,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
@@ -36,21 +43,51 @@
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0\n",
"1000\n",
"2000\n",
"3000\n",
"4000\n",
"5000\n",
"6000\n",
"7000\n",
"8000\n",
"9000\n",
"10000\n",
"11000\n",
"12000\n",
"13000\n",
"14000\n",
"15000\n",
"16000\n",
"17000\n",
"18000\n",
"19000\n",
"20000\n",
"21000\n",
"22000\n",
"23000\n",
"24000\n",
"25000\n",
"26000\n",
"27000\n",
"28000\n",
"29000\n",
"30000\n",
"31000\n",
"32000\n",
"33000\n",
"34000\n",
"35000\n",
"36000\n",
"37000\n",
"38000\n"
"37000\n"
]
}
],
@@ -76,7 +113,7 @@
},
{
"cell_type": "code",
"execution_count": 29,
"execution_count": 5,
"metadata": {},
"outputs": [
{
@@ -100,109 +137,107 @@
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>amount</th>\n",
" <th>country</th>\n",
" <th>currency</th>\n",
" <th>recipient_id</th>\n",
" <th>recipient_location</th>\n",
" <th>recipient_name</th>\n",
" <th>recipient_postcode</th>\n",
" <th>scheme</th>\n",
" <th>recipient_location</th>\n",
" <th>year</th>\n",
" <th>scheme</th>\n",
" <th>amount</th>\n",
" <th>currency</th>\n",
" <th>country</th>\n",
" <th>recipient_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-2505.23</td>\n",
" <td>BE</td>\n",
" <td>BLAISE FERNAND</td>\n",
" <td>BE-4800</td>\n",
" <td>Verviers</td>\n",
" <td>2021</td>\n",
" <td>ii1</td>\n",
" <td>1880.73</td>\n",
" <td>EUR</td>\n",
" <td>BE-2018-8045</td>\n",
" <td>Sombreffe</td>\n",
" <td>ELIARD ETIENNE - COULON SABINE EP.</td>\n",
" <td>5140</td>\n",
" <td>feader</td>\n",
" <td>2018</td>\n",
" <td>BE</td>\n",
" <td>BE-2021-7996</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-2505.23</td>\n",
" <td>BE</td>\n",
" <td>BLAISE FERNAND</td>\n",
" <td>BE-4800</td>\n",
" <td>Verviers</td>\n",
" <td>2021</td>\n",
" <td>ii3</td>\n",
" <td>3364.43</td>\n",
" <td>EUR</td>\n",
" <td>BE-2018-8045</td>\n",
" <td>Sombreffe</td>\n",
" <td>ELIARD ETIENNE - COULON SABINE EP.</td>\n",
" <td>5140</td>\n",
" <td>vb1_6</td>\n",
" <td>2018</td>\n",
" <td>BE</td>\n",
" <td>BE-2021-7996</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>5840.87</td>\n",
" <td>BE</td>\n",
" <td>BLAISE FERNAND</td>\n",
" <td>BE-4800</td>\n",
" <td>Verviers</td>\n",
" <td>2021</td>\n",
" <td>ii4</td>\n",
" <td>1741.11</td>\n",
" <td>EUR</td>\n",
" <td>BE-2018-8045</td>\n",
" <td>Sombreffe</td>\n",
" <td>LOSSON MICHEL</td>\n",
" <td>5140</td>\n",
" <td>ii1</td>\n",
" <td>2018</td>\n",
" <td>BE</td>\n",
" <td>BE-2021-7996</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>241.41</td>\n",
" <td>BE</td>\n",
" <td>BLAISE FERNAND</td>\n",
" <td>BE-4800</td>\n",
" <td>Verviers</td>\n",
" <td>2021</td>\n",
" <td>ii7</td>\n",
" <td>3079.15</td>\n",
" <td>EUR</td>\n",
" <td>BE-2018-8045</td>\n",
" <td>Sombreffe</td>\n",
" <td>LOSSON MICHEL</td>\n",
" <td>5140</td>\n",
" <td>ii10</td>\n",
" <td>2018</td>\n",
" <td>BE</td>\n",
" <td>BE-2021-7996</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3623.39</td>\n",
" <td>BE</td>\n",
" <td>BLAISE FERNAND</td>\n",
" <td>BE-4800</td>\n",
" <td>Verviers</td>\n",
" <td>2021</td>\n",
" <td>ii10</td>\n",
" <td>121.15</td>\n",
" <td>EUR</td>\n",
" <td>BE-2018-8045</td>\n",
" <td>Sombreffe</td>\n",
" <td>LOSSON MICHEL</td>\n",
" <td>5140</td>\n",
" <td>ii3</td>\n",
" <td>2018</td>\n",
" <td>BE</td>\n",
" <td>BE-2021-7996</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" amount country currency recipient_id recipient_location \\\n",
"0 -2505.23 BE EUR BE-2018-8045 Sombreffe \n",
"1 -2505.23 BE EUR BE-2018-8045 Sombreffe \n",
"2 5840.87 BE EUR BE-2018-8045 Sombreffe \n",
"3 241.41 BE EUR BE-2018-8045 Sombreffe \n",
"4 3623.39 BE EUR BE-2018-8045 Sombreffe \n",
" recipient_name recipient_postcode recipient_location year scheme amount \\\n",
"0 BLAISE FERNAND BE-4800 Verviers 2021 ii1 1880.73 \n",
"1 BLAISE FERNAND BE-4800 Verviers 2021 ii3 3364.43 \n",
"2 BLAISE FERNAND BE-4800 Verviers 2021 ii4 1741.11 \n",
"3 BLAISE FERNAND BE-4800 Verviers 2021 ii7 3079.15 \n",
"4 BLAISE FERNAND BE-4800 Verviers 2021 ii10 121.15 \n",
"\n",
" recipient_name recipient_postcode scheme year \n",
"0 ELIARD ETIENNE - COULON SABINE EP. 5140 feader 2018 \n",
"1 ELIARD ETIENNE - COULON SABINE EP. 5140 vb1_6 2018 \n",
"2 LOSSON MICHEL 5140 ii1 2018 \n",
"3 LOSSON MICHEL 5140 ii10 2018 \n",
"4 LOSSON MICHEL 5140 ii3 2018 "
" currency country recipient_id \n",
"0 EUR BE BE-2021-7996 \n",
"1 EUR BE BE-2021-7996 \n",
"2 EUR BE BE-2021-7996 \n",
"3 EUR BE BE-2021-7996 \n",
"4 EUR BE BE-2021-7996 "
]
},
"execution_count": 29,
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"\n",
"def parse_data(data):\n",
" for amount_key in data['amount'].keys():\n",
" if amount_key.endswith(('_total', '_feaga')):\n",
" if amount_key.endswith(('_total', '_feaga', '_feader')):\n",
" # ignore total and total of feaga (sub feaga amounts are present)\n",
" continue\n",
" scheme = amount_key.replace('field_mnt_', '')\n",
@@ -224,13 +259,69 @@
" for x in json.load(f):\n",
" yield from parse_data(x)\n",
" \n",
"df = pd.DataFrame(get_data(2018))\n",
"df = pd.DataFrame(get_data(YEAR))\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ii4 33446\n",
"ii1 33438\n",
"ii10 28732\n",
"ii3 13186\n",
"iva15 11530\n",
"ii7 9050\n",
"iva18 5366\n",
"iva17 2998\n",
"iva16 2177\n",
"ii6 1820\n",
"iva4 1781\n",
"iii3 1252\n",
"i1 1050\n",
"iva6 438\n",
"vb1_6 340\n",
"iva24 187\n",
"iva9 121\n",
"vb1_2 91\n",
"iva7 87\n",
"iva21 82\n",
"iva1 30\n",
"iva12 25\n",
"vb2_4 20\n",
"iva25 20\n",
"iva2 16\n",
"iva10 11\n",
"iii4 9\n",
"i4 6\n",
"iii2 5\n",
"iii7 4\n",
"iva14 4\n",
"vb3_1 4\n",
"iii10 3\n",
"i7 2\n",
"via_1 1\n",
"iva8 1\n",
"Name: scheme, dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['scheme'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
@@ -240,7 +331,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -254,9 +345,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}
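
The notebook changes above map each year to the `budget_year` value taken from the site's dropdown and skip aggregate amount keys when deriving scheme names. Both steps can be illustrated in isolation with the sketch below. The function names `page_url` and `scheme_amounts` are hypothetical, and the flat `key -> value` shape of the amounts dictionary is an assumption, since the full `parse_data` body is not visible in this diff.

```python
from urllib.parse import urlencode

# Values copied from the BUDGET_YEARS mapping in the diff above.
BUDGET_YEARS = {2018: 54, 2019: 289, 2020: 291, 2021: 292}
ENDPOINT = 'https://www.belpa.be/wsExportDataTable'


def page_url(year, offset, limit=1000):
    """Build the URL for one page of results (hypothetical helper)."""
    params = {
        'limit': limit,
        'offset': offset,
        'lg': 'fr',
        'budget_year': BUDGET_YEARS[year],  # KeyError for an unmapped year
        'sort': 'none',
        'sortType': 'ASC',
    }
    return ENDPOINT + '?' + urlencode(params)


def scheme_amounts(amounts):
    """Yield (scheme, value) pairs, skipping aggregate keys.

    Assumes `amounts` is a flat dict whose keys look like 'field_mnt_<scheme>'.
    """
    for key, value in amounts.items():
        if key.endswith(('_total', '_feaga', '_feader')):
            continue  # totals would double-count the per-scheme amounts
        yield key.replace('field_mnt_', ''), value
```

Looking the year up in a dictionary means a year without a known dropdown value fails immediately with a `KeyError` rather than silently requesting the wrong budget year.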