429 Too many requests #733

uleodolter · 2022-06-08T10:35:30Z

Hi,

I have configured https://github.com/dbpedia/marvin-config to extract the German Wikipedia. A first run worked for the 20220401 dump.

Today I ran it again to extract the 20220601 dump, but the extraction framework only partly worked: after some time, https://de.wikipedia.org/w/api.php returned nothing but HTTP 429.

Exception; de; Main Extraction at 00:00.957s for 62 datasets; Main Extraction failed for instance http://de.dbpedia.org/resource/Liste_von_Autoren/J: Server returned HTTP response code: 429 for URL: https://de.wikipedia.org/w/api.php
java.io.IOException: Server returned HTTP response code: 429 for URL: https://de.wikipedia.org/w/api.php
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1902)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
    at org.dbpedia.extraction.util.MediaWikiConnector$$anonfun$retrievePage$1.apply$mcVI$sp(MediaWikiConnector.scala:97)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:166)
    ...

I used the following settings in extractionConfiguration/extraction.de.properties:

mwc-apiUrl=https://{{LANG}}.wikipedia.org/w/api.php
mwc-maxRetries=5
mwc-connectMs=4000
mwc-readMs=30000
mwc-sleepFactor=2000

It seems the extraction-framework does not handle this HTTP error properly. It would be great if the Retry-After HTTP header were used to handle such errors. Any suggestions on which properties to adjust for this problem?
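To illustrate the kind of handling requested here, the following is a minimal sketch of a retry loop that honours Retry-After on HTTP 429 (and 503) and falls back to capped exponential backoff when the header is missing. It is written in Python for brevity and is not the extraction framework's code; the function names `retry_after_seconds`, `backoff_delay`, and `fetch_with_retry`, as well as the default base delay and cap, are assumptions made for this example and do not correspond to any `mwc-*` property.

```python
import email.utils
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Optional


def retry_after_seconds(header_value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After value (delta-seconds or HTTP-date) to seconds.

    Returns None when the header is absent or cannot be parsed.
    """
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = email.utils.parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def backoff_delay(attempt: int, retry_after: Optional[float],
                  base_seconds: float = 2.0,
                  cap_seconds: float = 300.0) -> float:
    """Prefer the server's hint; otherwise use capped exponential backoff."""
    if retry_after is not None:
        return min(retry_after, cap_seconds)
    return min(base_seconds * (2 ** attempt), cap_seconds)


def fetch_with_retry(url: str, max_retries: int = 5, sleep=time.sleep) -> bytes:
    """Fetch a URL, waiting and retrying when the server answers 429 or 503."""
    for attempt in range(max_retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Give up on other status codes or once the retries are used up.
            if err.code not in (429, 503) or attempt == max_retries:
                raise
            hint = retry_after_seconds(err.headers.get("Retry-After"))
            sleep(backoff_delay(attempt, hint))
    raise RuntimeError("unreachable")
```

The two helper functions are pure, so the delay logic can be checked without any network access; only `fetch_with_retry` actually contacts the server.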


jlareck · 2022-06-22T08:45:48Z

Hi, we are currently reworking the abstract extraction.

uleodolter · 2022-10-20T18:41:14Z

Any updates on this, or a workaround? The extraction of the German Wikipedia has worked only once, in April 2022.

jlareck · 2022-10-29T18:26:49Z

Hi, yes, we have some updates on text extraction. This summer we had a Google Summer of Code project in which a student upgraded the text extraction, and it improved: the number of 429 errors went down, although the text extraction process still sometimes freezes partway through. All of the related work is in this branch: https://github.com/dbpedia/extraction-framework/tree/celian-gsoc

During this GSoC project, two new MediaWiki connectors were implemented, both based on the previous one:

https://github.com/dbpedia/extraction-framework/blob/celian-gsoc/core/src/main/scala/org/dbpedia/extraction/util/MediawikiConnectorConfigured.scala - this connector uses the current MediaWiki API that we have always used, with some new configuration options added; as a result, the number of HTTP 429 errors was reduced. However, the extraction sometimes does not complete: once roughly 70-95% of the pages from the dump have been extracted, the process simply freezes. (I am not completely sure about these numbers, but when we tested it and compared the results with the datasets from previous releases, the number of extracted pages looked almost the same.) I recommend running the extraction for only one language per process (list just one language in the extraction.text.properties file).

https://github.com/dbpedia/extraction-framework/blob/celian-gsoc/core/src/main/scala/org/dbpedia/extraction/util/MediaWikiConnectorRest.scala - this connector uses the new MediaWiki REST API. It still has the same problem: the process freezes during extraction.

jlareck added the status: fix-required PR related to issue is needed label Jun 21, 2022
