[Bug]: it is not possible to reference a wacz file back to where it comes from e.g. using the GUI crawl_id #2051
Comments
Hi @tuehlarsen, it's true that there's currently no way to search the Archived Items table by this crawl id. That's an oversight that we should likely fix!
It's worth noting that the same crawl id is part of the naming convention for the WARC files within the WACZ as well, but the WARC filenames have additional prefixes, such as the first seed URL, that the WACZ files don't have (in part, I think, to keep filenames reasonably small for portability, but we could reassess that).
Ok, if you search for the crawl name "dmi.dk" in Browsertrix you find the crawl_id manual-20240718145854-611eb86b-1c5, but all the WACZ files are named this way:
Ah @tuehlarsen, I forgot that this is actually configurable in the Helm chart! Which explains why what I was seeing on our dev server differed. In
The default only includes the timestamp, but you can use the
The crawl_id is always part of the included warc.gz filenames. If you split a filename on '-', it is always at positions 4-7. We can now work out how to use the API calls with the crawl_id. The only thing missing now is a way to search for the crawl_id in the GUI.
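For anyone scripting this, here is a minimal sketch of pulling a crawl id out of a filename. It matches on the shape of the id rather than on fixed '-' positions, since the positions depend on whatever prefix the filename has. The pattern is an assumption inferred only from the two ids quoted in this issue (a lowercase word, a 14-digit timestamp, 8 hex characters, 3 hex characters); `extract_crawl_id` is a hypothetical helper, not part of Browsertrix.

```python
import re
from typing import Optional

# Assumed crawl id shape, inferred only from the ids quoted in this issue:
# <lowercase word>-<14-digit timestamp>-<8 hex chars>-<3 hex chars>
CRAWL_ID_RE = re.compile(r"[a-z]+-\d{14}-[0-9a-f]{8}-[0-9a-f]{3}")


def extract_crawl_id(filename: str) -> Optional[str]:
    """Return the first crawl-id-shaped segment in filename, or None."""
    match = CRAWL_ID_RE.search(filename)
    return match.group(0) if match else None


# Filename quoted in the issue description.
print(extract_crawl_id("20230225142507561-manual-20230225141525-7c09730b-c08.wacz"))
# → manual-20230225141525-7c09730b-c08
```

A prefix that happens to contain a similarly shaped segment (for example, one derived from a seed URL) could produce a false match, so the result is worth checking against the crawl id shown in Browsertrix.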
Browsertrix Version
v1.11.3-12f994b
What did you expect to happen? What happened instead?
When you download WACZ files using the API you get WACZ filenames like "20230225142507561-manual-20230225141525-7c09730b-c08.wacz", but the filename has no reference back to where it comes from, e.g. via the crawl_id, and it is not possible to search for the WACZ filename or part of it in the Browsertrix GUI.
Reproduction instructions
see above
Screenshots / Video
No response
Environment
No response
Additional details
No response