Commit a9dcc2e

Update documentation
1 parent 219684b commit a9dcc2e

1 file changed

+33
-7
lines changed

README.md

Lines changed: 33 additions & 7 deletions
Original file line number | Diff line number | Diff line change
@@ -186,7 +186,7 @@ A search provider can be any website that offers some kind of search functionali
186186
Ideally, you should also look at common prefixes of the URLs of the search results pages (e.g., `/search` for Google).
187187
Narrowing down URL prefixes helps avoid crawling too many captures that do not contain search results.
188188

189-
Refer to the [import instructions below](#imports) to import providers from the AQL-22 YAML file format.
189+
Refer to the [import instructions below](#import) to import providers from the AQL-22 YAML file format.
190190

191191
#### Build source pairs
192192

@@ -214,7 +214,7 @@ aql captures fetch
214214

215215
As before, running the command again after adding more source pairs automatically fetches the missing captures.
216216

217-
#### Parse URLs
217+
#### Parse SERP URLs
218218

219219
Not every capture necessarily points to a search engine result page (SERP).
220220
But usually, SERPs contain the user query in the URL, so we can filter out non-SERP captures by parsing the URLs.
@@ -234,7 +234,7 @@ aql serps parse url-offset
234234
All the above commands can be run in parallel, and they can be run multiple times to update the SERP index.
235235
Already parsed SERPs will be skipped.
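
To illustrate the idea behind URL-based filtering, here is a minimal sketch in Python. It is not the implementation behind `aql serps parse url-query`. The function name `parse_query`, the parameter name `q`, and the example URLs are assumptions chosen for illustration only.

```python
from urllib.parse import parse_qs, urlsplit


def parse_query(url, parameter="q"):
    """Return the user query from the given URL parameter, or None if absent.

    Hypothetical helper: real providers may use other parameter names
    or encode the query in the URL path instead.
    """
    values = parse_qs(urlsplit(url).query).get(parameter)
    if not values or not values[0].strip():
        return None
    return values[0].strip()


# Example capture URLs (made up for illustration).
captures = [
    "https://www.google.com/search?q=web+archive",
    "https://www.google.com/maps",
]

# Keep only captures whose URL contains a query, i.e., likely SERPs.
serps = [url for url in captures if parse_query(url) is not None]
```

Captures whose URL yields no query are treated as non-SERPs and dropped, which mirrors the filtering described above.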
236236

237-
#### Download WARCs
237+
#### Download SERP WARCs
238238

239239
Up to this point, we have only fetched the metadata of the captures, most prominently the URL.
240240
However, the snippets of the SERPs are not contained in the metadata, but only on the web page.
@@ -245,14 +245,40 @@ aql serps download warc
245245
```
246246

247247
This command will download the contents of each SERP to a WARC file that is stored in the configured S3 bucket.
248-
A pointer to the WARC file is stored in the SERP index so that we can quickly access a specific SERPs contents later.
248+
A pointer to the WARC file is stored in the SERP index so that we can quickly access a specific SERP's contents later.
249249

250-
#### Parsing WARCs
250+
#### Parse SERP WARCs
251251

252-
<!-- TODO: Add instructions on how to parse the SERPs' contents from the WARC files. -->
252+
From the WARC, we can again parse the query as it appears on the SERP.
253253

254+
```shell
255+
aql serps parse serp-query
256+
```
257+
258+
More importantly, we can parse the snippets of the SERP.
259+
260+
```shell
261+
aql serps parse serp-snippets
262+
```
263+
264+
Parsing the snippets from the SERP's WARC contents will also add the SERP's results to a new index.
265+
266+
#### Download result WARCs
267+
268+
To get the full text of each referenced result from the SERP, we need to download a capture of the result from
269+
the web archive. Intuitively, we would like to download a capture of the result at the exact same time as the SERP
270+
was captured. But often, web archives crawl the results later or not at all. We therefore search for
271+
the nearest captures before and after the SERP's timestamp and download these two captures for each result,
272+
if any could be found.
273+
274+
```shell
275+
aql results download warc
276+
```
277+
278+
This will again download the result's contents to a WARC file that is stored in the configured S3 bucket.
279+
A pointer to the WARC file is stored in the result index for random access to a specific result's contents.
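
The search for the nearest captures before and after a SERP's timestamp, as described in this section, can be sketched as follows. This is only an illustration under stated assumptions, not the code behind `aql results download warc`: the function name `nearest_captures` is hypothetical, the capture timestamps are assumed to be sorted in ascending order, and a capture taken at exactly the SERP's timestamp is (arbitrarily) counted as the "after" capture.

```python
from bisect import bisect_left
from datetime import datetime


def nearest_captures(capture_times, serp_time):
    """Return (before, after) for a sorted list of capture timestamps.

    before: latest capture strictly before serp_time, or None.
    after: earliest capture at or after serp_time, or None.
    Hypothetical helper; assumes capture_times is sorted ascending.
    """
    # Index of the first capture that is not earlier than the SERP.
    index = bisect_left(capture_times, serp_time)
    before = capture_times[index - 1] if index > 0 else None
    after = capture_times[index] if index < len(capture_times) else None
    return before, after


# Example timestamps (made up for illustration).
times = [datetime(2020, 1, 1), datetime(2020, 3, 1), datetime(2020, 6, 1)]
before, after = nearest_captures(times, datetime(2020, 4, 1))
```

Either element of the returned pair can be `None`, which corresponds to the case where the web archive holds no capture of the result on that side of the SERP's timestamp.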
254280

255-
### Imports
281+
### Import
256282

257283
We support automatically importing providers and parsers from the AQL-22 YAML file format
258284
(see [`data/selected-services.yaml`](data/selected-services.yaml)).
