@@ -186,7 +186,7 @@ A search provider can be any website that offers some kind of search functionali
Ideally, you should also look at common prefixes of the URLs of the search results pages (e.g., `/search` for Google).
Narrowing down URL prefixes helps to not crawl too many captures that do not contain search results.
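
As an illustration of why a URL prefix narrows the crawl, the following minimal Python sketch keeps only capture URLs whose host and path match a provider's domain and prefix. The function name `has_url_prefix` and the sample URLs are hypothetical and not part of the `aql` tool; the actual matching logic may differ.

```python
from urllib.parse import urlsplit


def has_url_prefix(url: str, domain: str, path_prefix: str) -> bool:
    """Return True if the URL belongs to the domain and its path starts with the prefix."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Accept the bare domain as well as any subdomain (e.g., www.google.com).
    same_site = host == domain or host.endswith("." + domain)
    return same_site and parts.path.startswith(path_prefix)


# Hypothetical capture URLs, for illustration only.
captures = [
    "https://www.google.com/search?q=web+archive",
    "https://www.google.com/maps/place/Berlin",
    "https://google.com/search/howsearchworks/",
]

# Only URLs under the `/search` prefix remain as SERP candidates.
serp_candidates = [u for u in captures if has_url_prefix(u, "google.com", "/search")]
```

Note that the last URL still passes the prefix check although it is not a results page, which is why later steps parse the query from the URL as well.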

- Refer to the [import instructions below](#imports) to import providers from the AQL-22 YAML file format.
+ Refer to the [import instructions below](#import) to import providers from the AQL-22 YAML file format.

#### Build source pairs

@@ -214,7 +214,7 @@ aql captures fetch

Again, running the command again after adding more source pairs automatically fetches the missing captures.

- #### Parse URLs
+ #### Parse SERP URLs

Not every capture necessarily points to a search engine result page (SERP).
But usually, SERPs contain the user query in the URL, so we can filter out non-SERP captures by parsing the URLs.
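
The idea of filtering by a query in the URL can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper `parse_query` and the default parameter name `q` are hypothetical, and the parsers configured for each provider in `aql` may work differently.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def parse_query(url: str, parameter: str = "q") -> Optional[str]:
    """Return the user query from the given URL parameter, or None if absent or empty."""
    # parse_qs decodes percent-escapes and `+`, and drops blank values by default.
    values = parse_qs(urlsplit(url).query).get(parameter)
    if not values:
        return None
    query = values[0].strip()
    return query or None


# A capture with a query is treated as a SERP candidate ...
serp_query = parse_query("https://www.google.com/search?q=web+archive&hl=en")
# ... while a capture without a query can be filtered out.
non_serp_query = parse_query("https://www.google.com/preferences?hl=en")
```

Captures for which such a parser returns `None` would be treated as non-SERP captures in this sketch.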
@@ -234,7 +234,7 @@ aql serps parse url-offset
All the above commands can be run in parallel, and they can be run multiple times to update the SERP index.
Already parsed SERPs will be skipped.

- #### Download WARCs
+ #### Download SERP WARCs

Up to this point, we have only fetched the metadata of the captures, most prominently the URL.
However, the snippets of the SERPs are not contained in the metadata, but only on the web page.
@@ -245,14 +245,40 @@ aql serps download warc
```

This command will download the contents of each SERP to a WARC file that is stored in the configured S3 bucket.
- A pointer to the WARC file is stored in the SERP index so that we can quickly access a specific SERPs contents later.
+ A pointer to the WARC file is stored in the SERP index so that we can quickly access a specific SERP's contents later.

- #### Parsing WARCs
+ #### Parsing SERP WARCs

- <!-- TODO: Add instructions on how to parse the SERPs' contents from the WARC files. -->
+ From the WARC, we can again parse the query as it appears on the SERP.

+ ```shell
+ aql serps parse serp-query
+ ```
+
+ More importantly, we can parse the snippets of the SERP.
+
+ ```shell
+ aql serps parse serp-snippets
+ ```
+
+ Parsing the snippets from the SERP's WARC contents will also add the SERP's results to a new index.
+
+ #### Download result WARCs
+
+ To get the full text of each referenced result from the SERP, we need to download a capture of the result from
+ the web archive. Intuitively, we would like to download a capture of the result at the exact same time as the SERP
+ was captured. But often, web archives crawl the results later or not at all. We therefore search for
+ the nearest captures before and after the SERP's timestamp and download these two captures for each result,
+ if any could be found.
+
+ ```shell
+ aql results download warc
+ ```
+
+ This will again download the result's contents to a WARC file that is stored in the configured S3 bucket.
+ A pointer to the WARC file is stored in the result index for random access to a specific result's contents.
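
The nearest-capture selection described above can be sketched in Python with a binary search over the capture timestamps. This is a minimal illustration, not the tool's actual implementation: the function `nearest_captures` is hypothetical, and it assumes that a capture taken exactly at the SERP's timestamp counts as the "before" capture.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple


def nearest_captures(
    capture_times: List[datetime], serp_time: datetime
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Return the closest capture at or before and the closest capture after serp_time."""
    times = sorted(capture_times)
    # Index of the first capture strictly later than the SERP's timestamp.
    i = bisect_right(times, serp_time)
    before = times[i - 1] if i > 0 else None
    after = times[i] if i < len(times) else None
    return before, after


# Hypothetical capture timestamps of one result URL, for illustration only.
result_captures = [
    datetime(2020, 6, 1),
    datetime(2020, 1, 1),
    datetime(2020, 3, 1),
]
before, after = nearest_captures(result_captures, datetime(2020, 4, 15))
```

If the result was never captured, both values are `None`; if it was only crawled later, only `after` is set, which corresponds to the "if any could be found" case.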

- ### Imports
+ ### Import

We support automatically importing providers and parsers from the AQL-22 YAML-file format
(see [`data/selected-services.yaml`](data/selected-services.yaml)).