[Bug]: it is not possible to reference a wacz file back to where it comes from e.g. using the GUI crawl_id #2051

tuehlarsen · 2024-08-27T13:28:37Z

Browsertrix Version

v1.11.3-12f994b

What did you expect to happen? What happened instead?

When you download wacz files using the API you get wacz filenames like "20230225142507561-manual-20230225141525-7c09730b-c08.wacz", but the filename has no reference back to where it comes from, e.g. by using the crawl_id, and it is not possible to search for the wacz filename or part of it in the Browsertrix GUI.

Reproduction instructions

see above

Screenshots / Video

No response

Environment

No response

Additional details

No response

tw4l · 2024-08-27T13:49:22Z

Hi @tuehlarsen, the manual-20230225141525-7c09730b-c08 part of the WACZ filename should be the crawl id in Browsertrix! You can check the crawl id field in the crawl's Overview tab to verify. A timestamp is added to the beginning in order to provide unique filenames when there are multiple WACZ files per crawl.

It's true that there's currently no way to search in the Archived Items table by this crawl id - that's an oversight that we should likely fix!

tw4l · 2024-08-27T13:51:13Z

It's worth noting that the same crawl id is part of the naming convention for the WARC files within the WACZ as well, but the WARC filenames have additional prefixes such as the first seed URL that the WACZ files don't have (in part I think to keep filenames reasonably small for portability, but we could reassess that).

tuehlarsen · 2024-08-27T14:54:59Z

Ok, if you search for crawl name "dmi.dk" in browsertrix you find the crawl_id: manual-20240718145854-611eb86b-1c5 but all the wacz files are named this way:
20240718154355358-611eb86b-1c5-0.wacz
20240718154755196-611eb86b-1c5-1.wacz
You need the full crawl_id in the file name to use e.g. crawl_id in the API.
Where can I find the missing part manual-20240718145854- in the wacz files?
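
For illustration, the fragment that does survive in these filenames can still be matched against a list of known crawl ids. This is a minimal sketch, assuming the filename has the shape <timestamp>-<tail of crawl id>-<index>.wacz as in the two examples above; the function names are hypothetical and not part of Browsertrix:

```python
import re
from typing import List, Optional

# Assumed shape, based only on the two example filenames above:
#   <digits timestamp>-<tail of crawl id>-<numeric index>.wacz
WACZ_NAME = re.compile(r"^(?P<ts>\d+)-(?P<fragment>.+)-(?P<index>\d+)\.wacz$")


def crawl_id_fragment(wacz_filename: str) -> Optional[str]:
    """Return the crawl id fragment embedded in a WACZ filename, or None."""
    match = WACZ_NAME.match(wacz_filename)
    return match.group("fragment") if match else None


def find_crawl_ids(wacz_filename: str, known_crawl_ids: List[str]) -> List[str]:
    """Return every known crawl id that ends with the filename's fragment."""
    fragment = crawl_id_fragment(wacz_filename)
    if fragment is None:
        return []
    return [cid for cid in known_crawl_ids if cid.endswith(fragment)]


if __name__ == "__main__":
    known = ["manual-20240718145854-611eb86b-1c5"]
    print(find_crawl_ids("20240718154355358-611eb86b-1c5-0.wacz", known))
```

Because the fragment is only the tail of the crawl id, a match is not guaranteed to be unique, so the function returns a list rather than a single id.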

tw4l · 2024-08-27T15:11:34Z

Ah @tuehlarsen, I forgot that this is actually configurable in the Helm chart! Which explains why what I was seeing on our dev server differed. In chart/values.yaml, take a look at the following setting:

# default template for generate wacz files
# supports following interpolated vars:
# @ts - current timestamp
# @hostname - full hostname
# @hostsuffix - last 14-characters of hostname
# @id - full crawl id
default_crawl_filename_template: "@ts-@hostsuffix.wacz"

The default only includes the timestamp and the hostname suffix, but you can use the @id variable to include the full crawl id in the filenames :)
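
To make the interpolation concrete, here is a rough sketch of how the variables listed above could expand. It is not the actual Browsertrix implementation: the function name, the example hostname, and the template "@ts-@hostsuffix.wacz" (built from the documented variables) are assumptions for illustration only.

```python
import re

# Matches the four documented variables in a single pass.
_TEMPLATE_VAR = re.compile(r"@(hostsuffix|hostname|ts|id)")


def render_wacz_filename(template: str, ts: str, hostname: str, crawl_id: str) -> str:
    """Expand @ts, @hostname, @hostsuffix and @id in a filename template."""
    values = {
        "ts": ts,
        "hostname": hostname,
        "hostsuffix": hostname[-14:],  # last 14 characters of the hostname
        "id": crawl_id,
    }
    return _TEMPLATE_VAR.sub(lambda m: values[m.group(1)], template)


if __name__ == "__main__":
    # Hypothetical hostname; real crawler hostnames may look different.
    host = "crawl-manual-20240718145854-611eb86b-1c5-0"
    crawl = "manual-20240718145854-611eb86b-1c5"
    print(render_wacz_filename("@ts-@hostsuffix.wacz", "20240718154355358", host, crawl))
    print(render_wacz_filename("@ts-@id.wacz", "20240718154355358", host, crawl))
```

With a template that uses @id, the full crawl id ends up in the filename, which is what is needed to go from a downloaded WACZ back to an API call.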

tuehlarsen · 2024-08-27T17:33:27Z

crawl_id is always part of the included warc.gz filenames. If you split the filename on '-', it is always in positions 4-7. We can now figure out how to use the API calls with the crawl_id. The only thing missing now is a way to search for the crawl_id in the GUI.
e.g.
GUI crawl_id: manual-20240718145854-611eb86b-1c5
20240718154355358-611eb86b-1c5-0]$ ls archive
kb-dmi-dk-manual-20240718145854-611eb86b-1c5-20240718145909636-0.warc.gz
...
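
The split described above can be sketched as follows. This is a minimal example, assuming the WARC filename prefix always has exactly three hyphen-separated parts (as in kb-dmi-dk); a prefix with a different number of hyphens would shift the positions. The function name is hypothetical.

```python
def crawl_id_from_warc(warc_filename: str) -> str:
    """Join the 4th to 7th hyphen-separated fields of a WARC filename."""
    parts = warc_filename.split("-")
    if len(parts) < 7:
        raise ValueError(f"unexpected WARC filename: {warc_filename!r}")
    # 1-based positions 4-7 correspond to 0-based indices 3..6.
    return "-".join(parts[3:7])


if __name__ == "__main__":
    name = "kb-dmi-dk-manual-20240718145854-611eb86b-1c5-20240718145909636-0.warc.gz"
    print(crawl_id_from_warc(name))
```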

tuehlarsen added the "bug" (Something isn't working) label Aug 27, 2024

webrecorder deleted a comment Aug 27, 2024
