README.md (6 additions & 9 deletions)
@@ -21,15 +21,12 @@ in the results match the search terms).

 Wikipedia publishes [dumps](https://meta.wikimedia.org/wiki/Data_dumps) of their databases once per month.

-To run one build you need 150GB of disc space (of which 90GB Postgresql database). The scripts process
-39 languages and output 4 files. Runtime is approximately 9 hours on a 4 core, 4GB RAM machine with SSD
+To run one build you need 150GB of disc space (of which 90GB is Postgresql database). The scripts process
+39 languages and output one file. Runtime is approximately 9 hours on a 4 core, 4GB RAM machine with SSD
 discs.

 ```
-334M wikimedia_importance.csv.gz # the primary file
-303M wikipedia_importance.sql.gz
-216M wikipedia_article.csv.gz
-88M wikipedia_redirect.csv.gz
+334M wikimedia_importance.tsv.gz
 ```
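
The disc space requirement quoted in this hunk can be checked before starting a build. The following is a minimal sketch, not part of the project's scripts: the function name `has_enough_space` is hypothetical, and reading "150GB" as decimal gigabytes is an assumption.

```python
import shutil

# Figure quoted in the README text; interpreting it as decimal gigabytes is an assumption.
REQUIRED_BYTES = 150 * 10**9


def has_enough_space(path: str, required_bytes: int = REQUIRED_BYTES) -> bool:
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    # Check the current working directory; a real build may use a different location.
    print(has_enough_space("."))
```

Since about 90GB of the total is attributed to the Postgresql database, the database directory may sit on a different filesystem and would then need its own check.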
@@ -51,7 +48,7 @@ retries (wikidata API being unreliable) was added.

 ## Output data

-`wikimedia_importance.csv.gz` contains about 17 million rows. Number of lines grew 2% between 2022 and 2023.
+`wikimedia_importance.tsv.gz` contains about 17 million rows. Number of lines grew 2% between 2022 and 2023.
 The file is tab delimited, not quoted, sorted, and contains a header row.

 | Column | Type |
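
A file with the properties described in this hunk (gzip-compressed, tab delimited, not quoted, with a header row) can be read using only Python's standard library. The sketch below is illustrative rather than the project's own tooling: `read_importance_rows` is a hypothetical name, UTF-8 encoding is an assumption, and column names are taken from the header row instead of being hard-coded, because the column table is cut off in this diff.

```python
import csv
import gzip
from typing import Dict, Iterator


def read_importance_rows(path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per data row, keyed by the names found in the header row."""
    # newline="" lets the csv module handle line endings itself.
    # UTF-8 is an assumption; the README excerpt does not state the encoding.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: the file is described as not quoted, so any quote
        # characters are treated as ordinary data.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        yield from reader
```

With roughly 17 million rows, iterating lazily as above avoids loading the whole file into memory. For example, `itertools.islice(read_importance_rows("wikimedia_importance.tsv.gz"), 5)` would yield only the first five rows.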
@@ -84,7 +81,7 @@ Currently 39 languages, English has by far the largest share.
 | ... ||
 | bg (Bulgarian) | 88,993 |

-Examples of `wikimedia_importance.csv.gz` rows:
+Examples of `wikimedia_importance.tsv.gz` rows:

 * Wikipedia contains redirects, so a single wikidata object can have multiple titles. Each title has the same importance score. Redirects to non-existing articles are removed.
@@ -311,7 +308,7 @@ uncommon for an export starting Jan/1st to only be full ready Jan/10th or later.

 9. output (0:15h)

-Uses `pg_dump` tool to create SQL files. Uses SQL `COPY` command to create CSV files.
+Uses `pg_dump` tool to create SQL files. Uses SQL `COPY` command to create TSV file.