Currently, the export functionality is limited to one page of results. Ideally, it should be possible to download entire result sets, e.g. for "og", which currently yields ~300k results on the Clarin Korp instance.
Test of current query API endpoint
Querying for “og” in all available corpora with the paging set to 500K results in a very sluggish download:
https://alf.hum.ku.dk/korp/backend/query?default_context=1%20sentence&show=sentence,pos,msd,lemma,ref,prefix,suffix&show_struct=text_title&start=0&end=500000&corpus=LSPCONSTRUCTIONEB1,LSPCONSTRUCTIONEB2,LSPCONSTRUCTIONMURO,LSPCONSTRUCTIONSBI,LSPAGRICULTUREJORDBRUGSFORSKNING,LSPCLIMATEAKTUELNATURVIDENSKAB,LSPCLIMATEDMU,LSPCLIMATEHOVEDLAND,LSPCLIMATEOEKRAAD,LSPHEALTH1AKTUELNATURVIDENSKAB,LSPHEALTH1LIBRISSUNDHED,LSPHEALTH1NETPATIENT,LSPHEALTH1REGIONH,LSPHEALTH1SOEFARTSSTYRELSEN,LSPHEALTH1SST,LSPHEALTH2SUNDHEDDK1,LSPHEALTH2SUNDHEDDK2,LSPHEALTH2SUNDHEDDK3,LSPHEALTH2SUNDHEDDK5,LSPNANONANO1,LSPNANONANO2,LSPNANONANO3,LSPNANONANO4,LSPNANOAKTUELNATURVIDENSKAB&cqp=[word%20=%20%22og%22]&query_data=&context=&incremental=true&default_within=sentence&within=
For any query, the results are cached under the key query_data. So in theory the second attempt at downloading the same result set should be fast. We have tried with smaller queries (10K), albeit on a slow connection, and caching does indeed seem to kick in.
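The empty query_data parameter in the test URL above suggests the token can also be passed back in. A minimal sketch of testing that round-trip, assuming the backend returns a query_data value in its JSON response:

```python
import requests

KORP_QUERY = "https://alf.hum.ku.dk/korp/backend/query"

params = {
    "cqp": '[word = "og"]',
    "corpus": "LSPCONSTRUCTIONEB1,LSPCONSTRUCTIONEB2",  # shortened corpus list for the example
    "default_context": "1 sentence",
    "show": "sentence,pos,msd,lemma",
    "start": 0,
    "end": 9999,
}

# First call: the backend runs the CQP query and (assumption) hands back a
# query_data token referencing the cached result set.
first = requests.get(KORP_QUERY, params=params).json()

# Second call: sending the token back should be served from the cache and
# therefore return much faster.
params["query_data"] = first.get("query_data", "")
second = requests.get(KORP_QUERY, params=params).json()
```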
Hypothetical solution
We want to create a download(s) endpoint which proxies the query endpoint in order to produce output in different formats/configurations. In our case we only need a CSV encoding, but the endpoint should be designed so that other formats can be added later.
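For instance, the conversion could be a pluggable encoder selected by a format parameter, so CSV is just the first entry in a registry. A minimal sketch; the shape of the KWIC rows (tokens, match.start/end) is an assumption about the query endpoint's JSON:

```python
import csv
import io

def encode_csv(kwic_rows):
    """Flatten KWIC rows to CSV. Column layout and row fields are
    illustrative guesses at the query response's structure."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["corpus", "left_context", "match", "right_context"])
    for row in kwic_rows:
        tokens = [t.get("word", "") for t in row.get("tokens", [])]
        start, end = row["match"]["start"], row["match"]["end"]
        writer.writerow([
            row.get("corpus", ""),
            " ".join(tokens[:start]),
            " ".join(tokens[start:end]),
            " ".join(tokens[end:]),
        ])
    return buf.getvalue()

# Supporting another output format should only mean adding a registry entry.
ENCODERS = {"csv": encode_csv}
```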
Chunked download or the entire result set at once?
One consideration is whether we should keep a buffer of the query results and create the download by combining chunks. Ideally, we would keep such a buffer and associate the long query URLs with partial result files on disk.
However, we could start out by simply serving a single file (CSV) and see how far that takes us. The CSV output should be a fraction of the size of the KWIC JSON representation used by the frontend.
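Should the single file turn out to be too heavy to build in memory, a middle ground is a streaming response that pages through the query endpoint and emits CSV rows as each chunk arrives; the client still receives one file. A sketch, assuming the start/end paging seen in the test URL above and a kwic list in the response (the page size is a tuning assumption):

```python
import requests

KORP_QUERY = "https://alf.hum.ku.dk/korp/backend/query"
PAGE_SIZE = 1000  # chunk size; a guess, to be tuned

def stream_csv(params, total_hits):
    """Yield CSV lines page by page so the full result set is never
    held in memory on our side."""
    yield "corpus,left_context,match,right_context\r\n"
    for start in range(0, total_hits, PAGE_SIZE):
        page = dict(params, start=start, end=min(start + PAGE_SIZE, total_hits) - 1)
        data = requests.get(KORP_QUERY, params=page).json()
        for row in data.get("kwic", []):
            tokens = [t.get("word", "") for t in row.get("tokens", [])]
            s, e = row["match"]["start"], row["match"]["end"]
            # Naive quoting for brevity; a csv.writer would be safer.
            yield '"{}","{}","{}","{}"\r\n'.format(
                row.get("corpus", ""),
                " ".join(tokens[:s]),
                " ".join(tokens[s:e]),
                " ".join(tokens[e:]),
            )

# Usage inside a Flask view:
# return flask.Response(stream_csv(params, hits), mimetype="text/csv")
```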
Plan of action
The first proof-of-concept should be a simple Flask service that runs locally and queries the (unprotected) backend endpoint across the network (see the sketch after this list of steps).
The second step should be putting this onto our Clarin server, perhaps even in the Docker configuration (or locally).
The third step would be forking the korp-backend project and integrating our solution. This forked backend would then have to be the one used in our Docker setups from then on.
The fourth step is probably merging our solution into the upstream korp-backend repository; however, this requires significant coordination with Språkbanken, not to mention updating our version of Korp to match the one used by Språkbanken ahead of any kind of PR.
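A sketch of that first proof-of-concept step, building on the encoder registry above; the endpoint name, port, and error handling are all our own assumptions:

```python
import requests
from flask import Flask, Response, request

app = Flask(__name__)

# The unprotected Korp backend endpoint from the test above.
KORP_QUERY = "https://alf.hum.ku.dk/korp/backend/query"

@app.route("/download")
def download():
    # Forward the caller's query string verbatim, so this endpoint accepts
    # exactly the same parameters as the backend query endpoint.
    backend = requests.get(KORP_QUERY, params=request.args)
    backend.raise_for_status()

    # Pick an encoder from the registry sketched above; CSV is the default.
    encode = ENCODERS[request.args.get("format", "csv")]
    return Response(
        encode(backend.json().get("kwic", [])),
        mimetype="text/csv",
        headers={"Content-Disposition": "attachment; filename=korp-results.csv"},
    )

if __name__ == "__main__":
    app.run(port=5001)
```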
Other comments
The URL/path of the download endpoint should be 1:1 compatible with the regular Korp search page URL, i.e. the frontend should be able to generate the download URL by simple string concatenation, keeping the required Korp frontend changes minimal.
Philip wants the entire breadth of search options to be available, which in effect means that the call to the backend query endpoint must be copied in its entirety, i.e. the API interface must be identical.
Unfortunately, this means that we can't just copy the URL from the address bar to construct links. Instead, the source of the call to the query endpoint must be tracked down and similar code copy-pasted to construct the path of the download endpoint on that specific search result page.
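If the two interfaces really are identical, deriving the download URL from an existing query-endpoint call reduces to swapping the path segment; a sketch (the /download path is our own naming, not an existing Korp route):

```python
from urllib.parse import urlsplit, urlunsplit

def download_url_from_query_url(query_url):
    """Turn a backend /query URL into the corresponding /download URL,
    keeping the query string untouched."""
    parts = urlsplit(query_url)
    return urlunsplit(parts._replace(path=parts.path.replace("/query", "/download")))
```

Applied to the long test URL above, this would yield .../korp/backend/download with the identical query string.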