
Commit d758ddf

Only wikimedia tsv file (#82)

* create only one output file, not 4

1 parent 7e04a45 commit d758ddf

File tree

3 files changed: +23 −64 lines changed


.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -6,7 +6,7 @@ jobs:
   build:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4
       - name: Install PostgreSQL
         run: |
           sudo apt-get update -qq
```

README.md

Lines changed: 6 additions & 9 deletions
````diff
@@ -21,15 +21,12 @@ in the results match the search terms).
 
 Wikipedia publishes [dumps](https://meta.wikimedia.org/wiki/Data_dumps) of their databases once per month.
 
-To run one build you need 150GB of disc space (of which 90GB Postgresql database). The scripts process
-39 languages and output 4 files. Runtime is approximately 9 hours on a 4 core, 4GB RAM machine with SSD
+To run one build you need 150GB of disc space (of which 90GB is Postgresql database). The scripts process
+39 languages and output one file. Runtime is approximately 9 hours on a 4 core, 4GB RAM machine with SSD
 discs.
 
 ```
-334M wikimedia_importance.csv.gz   # the primary file
-303M wikipedia_importance.sql.gz
-216M wikipedia_article.csv.gz
- 88M wikipedia_redirect.csv.gz
+334M wikimedia_importance.tsv.gz
 ```
 
 
@@ -51,7 +48,7 @@ retries (wikidata API being unreliable) was added.
 
 ## Output data
 
-`wikimedia_importance.csv.gz` contains about 17 million rows. Number of lines grew 2% between 2022 and 2023.
+`wikimedia_importance.tsv.gz` contains about 17 million rows. Number of lines grew 2% between 2022 and 2023.
 The file tab delimited, not quoted, is sorted and contains a header row.
 
 | Column | Type |
@@ -84,7 +81,7 @@ Currently 39 languages, English has by far the largest share.
 | ... | |
 | bg (Bulgarian) | 88,993 |
 
-Examples of `wikimedia_importance.csv.gz` rows:
+Examples of `wikimedia_importance.tsv.gz` rows:
 
 * Wikipedia contains redirects, so a single wikidata object can have multiple titles even though. Each title has the same importance score. Redirects to non-existing articles are removed.
 
@@ -311,7 +308,7 @@ uncommon for an export starting Jan/1st to only be full ready Jan/10th or later.
 
 9. output (0:15h)
 
-   Uses `pg_dump` tool to create SQL files. Uses SQL `COPY` command to create CSV files.
+   Uses `pg_dump` tool to create SQL files. Uses SQL `COPY` command to create TSV file.
 
 
 License
````
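The README hunk above documents the output layout: one tab-delimited file, unquoted, sorted, with a header row. A minimal sketch of consuming that layout, using hypothetical sample rows (the column names match the README's description, the values are made up):

```shell
# Three hypothetical rows in the documented layout: tab separated, not quoted,
# with a header row. awk splits on tabs and prints the title column,
# skipping the header line (NR > 1).
printf 'language\ttitle\timportance\nen\tBerlin\t0.9\nde\tBerlin\t0.8\n' \
  | awk -F'\t' 'NR > 1 { print $2 }'
```

On the real file the same pipeline would start with `gunzip -c wikimedia_importance.tsv.gz` instead of `printf`.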

steps/output.sh

Lines changed: 16 additions & 54 deletions
```diff
@@ -88,39 +88,12 @@ echo "WITH from_redirects AS (
 
 
 
-# "====================================================================="
-echo "Create indexes"
-# "====================================================================="
-
-echo "CREATE INDEX wikipedia_article_title_language_idx
-        ON wikipedia_article
-        (title, language)
-      ;" | psqlcmd
-echo "CREATE INDEX wikipedia_article_wd_page_title_idx
-        ON wikipedia_article
-        (wd_page_title)
-      ;" | psqlcmd
-echo "CREATE INDEX wikipedia_redirect_language_from_title_idx
-        ON wikipedia_redirect
-        (language, from_title)
-      ;" | psqlcmd
-
-
 
 
 # "====================================================================="
-echo "Dump tables"
+echo "Dump table"
 # "====================================================================="
 
-echo "* wikipedia_importance.sql.gz"
-
-pg_dump -d $DATABASE_NAME --no-owner -t wikipedia_article -t wikipedia_redirect | \
-  grep -v '^SET ' | \
-  grep -v 'SELECT ' | \
-  grep -v '\-\- ' | \
-  sed 's/public\.//' | \
-  pigz -9 > "$OUTPUT_PATH/wikipedia_importance.sql.gz"
-
 
 # Temporary table for sorting the output by most popular language. Nominatim assigns
 # the wikipedia extra tag to the first language it finds during import and English (en)
@@ -147,34 +120,23 @@ echo "CREATE TABLE top_languages AS
 
 
 
-for TABLE in wikipedia_article wikipedia_redirect wikimedia_importance
-do
-  echo "* $TABLE.csv.gz"
-
-  SORTCOL="title"
-  if [[ "$TABLE" == "wikipedia_redirect" ]]; then
-    SORTCOL="from_title"
-  fi
+echo "* wikimedia_importance.tsv.gz"
 
-  {
-    echo "COPY (SELECT * FROM $TABLE LIMIT 0) TO STDOUT WITH DELIMITER E'\t' CSV HEADER" | \
-      psqlcmd
-    echo "COPY (
-            SELECT w.*
-            FROM $TABLE w
-            JOIN top_languages tl ON w.language = tl.language
-            ORDER BY tl.size DESC, w.$SORTCOL
-          ) TO STDOUT" | \
-      psqlcmd
-  } | pigz -9 > "$OUTPUT_PATH/$TABLE.csv.gz"
+{
+  echo "COPY (SELECT * FROM wikimedia_importance LIMIT 0) TO STDOUT WITH DELIMITER E'\t' CSV HEADER" | \
+    psqlcmd
+  echo "COPY (
+          SELECT w.*
+          FROM wikimedia_importance w
+          JOIN top_languages tl ON w.language = tl.language
+          ORDER BY tl.size DESC, w.title
+        ) TO STDOUT" | \
+    psqlcmd
+} | pigz -9 > "$OUTPUT_PATH/wikimedia_importance.tsv.gz"
 
-# default is 600
-chmod 644 "$OUTPUT_PATH/$TABLE.csv.gz"
-done
+# default is 600
+chmod 644 "$OUTPUT_PATH/wikimedia_importance.tsv.gz"
 
 
 du -h $OUTPUT_PATH/*
-# 220M wikipedia_article.csv.gz
-#  87M wikipedia_redirect.csv.gz
-# 305M wikipedia_importance.sql.gz
-# 265M wikimedia_importance.csv.gz
+# 265M wikimedia_importance.tsv.gz
```
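The dump step in `steps/output.sh` builds the output in two pieces: a `COPY ... LIMIT 0 ... CSV HEADER` that emits only the header line, then a second `COPY` that streams the sorted rows, with both concatenated into a single compressed stream. A database-free sketch of that pattern, where `printf` and `sort` stand in for the two `COPY ... TO STDOUT` calls piped through `psqlcmd` (sample rows and the output file name are hypothetical):

```shell
# Emit the header line first, then the sorted data rows, and compress the
# combined stream once -- the same shape as the brace group in output.sh.
{
  printf 'language\ttitle\timportance\n'              # stands in for COPY ... LIMIT 0 ... CSV HEADER
  printf 'en\tParis\t0.9\nde\tParis\t0.8\n' | sort    # stands in for COPY ... ORDER BY ... TO STDOUT
} | gzip -9 > sample.tsv.gz

# The header row stays at the top of the decompressed file.
gunzip -c sample.tsv.gz | head -n1
```

Grouping the two commands with `{ ...; }` means only one `gzip` (or `pigz`) process runs and the file is written in a single pass, so the header can never end up sorted into the body.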
