[Bug]: it is not possible to reference a wacz file back to where it comes from e.g. using the GUI crawl_id #2051

Open
tuehlarsen opened this issue Aug 27, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@tuehlarsen

Browsertrix Version

v1.11.3-12f994b

What did you expect to happen? What happened instead?

When you download WACZ files using the API, you get filenames like "20230225142507561-manual-20230225141525-7c09730b-c08.wacz", but the filename has no reference back to where it comes from (e.g. the crawl_id), and it is not possible to search for the WACZ filename, or part of it, in the Browsertrix GUI.

Reproduction instructions

see above

Screenshots / Video

No response

Environment

No response

Additional details

No response

@tuehlarsen tuehlarsen added the bug Something isn't working label Aug 27, 2024
@tw4l
Contributor

tw4l commented Aug 27, 2024

Hi @tuehlarsen, the manual-20230225141525-7c09730b-c08 part of the WACZ filename should be the crawl id in Browsertrix! You can check the crawl id field in the crawl's Overview tab to verify. A timestamp is added to the beginning in order to provide unique filenames when there are multiple WACZ files per crawl.
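As a rough illustration of that naming scheme, the sketch below splits a WACZ filename into its leading timestamp and the remainder. It assumes the deployment's filename template puts the timestamp first and the full crawl id after it, which may not hold on every instance; `split_wacz_name` is a hypothetical helper, not part of Browsertrix.

```python
from pathlib import PurePosixPath


def split_wacz_name(filename: str) -> tuple[str, str]:
    """Split '<timestamp>-<rest>.wacz' into (timestamp, rest).

    Assumption: the timestamp is the first hyphen-separated field and
    everything after it is the crawl id. Hypothetical helper.
    """
    stem = PurePosixPath(filename).name
    if stem.endswith(".wacz"):
        stem = stem[: -len(".wacz")]
    # partition splits at the first '-' only, so hyphens in the id survive
    timestamp, _, rest = stem.partition("-")
    return timestamp, rest


timestamp, crawl_id = split_wacz_name(
    "20230225142507561-manual-20230225141525-7c09730b-c08.wacz"
)
print(timestamp)  # 20230225142507561
print(crawl_id)   # manual-20230225141525-7c09730b-c08
```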

It's true that there's currently no way to search in the Archived Items table by this crawl id - that's an oversight that we should likely fix!

@tw4l
Contributor

tw4l commented Aug 27, 2024

It's worth noting that the same crawl id is part of the naming convention for the WARC files within the WACZ as well, but the WARC filenames have additional prefixes such as the first seed URL that the WACZ files don't have (in part I think to keep filenames reasonably small for portability, but we could reassess that).

@tuehlarsen
Author

tuehlarsen commented Aug 27, 2024

OK, if you search for the crawl name "dmi.dk" in Browsertrix, you find the crawl_id manual-20240718145854-611eb86b-1c5, but all the WACZ files are named this way:
20240718154355358-611eb86b-1c5-0.wacz
20240718154755196-611eb86b-1c5-1.wacz
You need the full crawl_id in the file name to use e.g. crawl_id in the API.
Where can I find the missing part manual-20240718145854- in the wacz files?

@webrecorder webrecorder deleted a comment Aug 27, 2024
@tw4l
Contributor

tw4l commented Aug 27, 2024

Ah @tuehlarsen, I forgot that this is actually configurable in the Helm chart, which explains why what I was seeing on our dev server differed! In chart/values.yaml, take a look at the following setting:

# default template for generate wacz files
# supports following interpolated vars:
# @ts - current timestamp
# @hostname - full hostname
# @hostsuffix - last 14-characters of hostname
# @id - full crawl id
default_crawl_filename_template: "@ts-@hostsuffix.wacz"

The default only includes the timestamp and the hostname suffix, but you can use the @id variable to include the full crawl id in the filenames :)
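To make the effect of the interpolated variables concrete, here is a minimal sketch of how such a template could be expanded. This is not Browsertrix's actual implementation; `render_template` and the hostname used below are hypothetical, chosen only so that the result lines up with the filenames reported in this thread.

```python
import re

# Matches the four variables listed in the values.yaml comment
_VARS = re.compile(r"@(?:hostsuffix|hostname|ts|id)")


def render_template(template: str, ts: str, hostname: str, crawl_id: str) -> str:
    """Expand @ts, @hostname, @hostsuffix and @id in a filename template.

    Hypothetical helper for illustration only.
    """
    values = {
        "@ts": ts,
        "@hostname": hostname,
        "@hostsuffix": hostname[-14:],  # last 14 characters of the hostname
        "@id": crawl_id,
    }
    return _VARS.sub(lambda m: values[m.group(0)], template)


# Assumed values: the hostname is made up for this example
ts = "20240718154355358"
hostname = "crawl-manual-20240718145854-611eb86b-1c5-0"
crawl_id = "manual-20240718145854-611eb86b-1c5"

print(render_template("@ts-@hostsuffix.wacz", ts, hostname, crawl_id))
# 20240718154355358-611eb86b-1c5-0.wacz
print(render_template("@ts-@id.wacz", ts, hostname, crawl_id))
# 20240718154355358-manual-20240718145854-611eb86b-1c5.wacz
```

Under these assumptions, a template containing @id yields filenames that carry the full crawl id, while one using only @ts and @hostsuffix yields the shorter names seen above.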

@webrecorder webrecorder deleted a comment Aug 27, 2024
@tuehlarsen
Author

The crawl_id is always part of the included warc.gz filenames. If you split the filename on '-', it is always fields 4-7. We can now figure out how to use the API calls with the crawl_id. The only thing missing now is a way to search for the crawl_id in the GUI.
e.g.
GUI crawl_id: manual-20240718145854-611eb86b-1c5
20240718154355358-611eb86b-1c5-0]$ ls archive
kb-dmi-dk-manual-20240718145854-611eb86b-1c5-20240718145909636-0.warc.gz
...
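The field-based lookup described above can be sketched as follows. It relies on two assumptions taken from this example: the WARC files sit under archive/ inside the WACZ (which is a ZIP file), and the prefix before the crawl id has exactly three hyphen-separated fields (kb-dmi-dk here). A prefix with a different number of hyphens would shift the positions. Both function names are hypothetical.

```python
import zipfile


def crawl_id_from_warc_name(name: str, first_field: int = 4, last_field: int = 7) -> str:
    """Return the hyphen-separated fields first_field..last_field (1-indexed).

    Assumption: the crawl id occupies fields 4-7, as in the example above.
    """
    base = name.rsplit("/", 1)[-1]  # drop any directory part such as archive/
    fields = base.split("-")
    return "-".join(fields[first_field - 1:last_field])


def crawl_ids_in_wacz(wacz) -> set[str]:
    """Collect crawl ids from the WARC names inside a WACZ.

    `wacz` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(wacz) as zf:
        return {
            crawl_id_from_warc_name(n)
            for n in zf.namelist()
            if n.startswith("archive/") and n.endswith(".warc.gz")
        }


name = "kb-dmi-dk-manual-20240718145854-611eb86b-1c5-20240718145909636-0.warc.gz"
print(crawl_id_from_warc_name(name))  # manual-20240718145854-611eb86b-1c5
```

Calling `crawl_ids_in_wacz("20240718154355358-611eb86b-1c5-0.wacz")` on a downloaded file would then return the set of crawl ids found in its WARC names, which can be passed to the API.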

Projects
Status: Triage
Development

No branches or pull requests

3 participants
@tw4l @tuehlarsen and others