Document source of various metadata files #1506

Open
steven-sheehy opened this issue Dec 10, 2024 · 4 comments

Comments

@steven-sheehy
Contributor

The cht, dat, and metadat folders contain files whose source of truth is unclear. Without knowing the source, it's hard to know whether the data needs to be updated. Also, if the source were known, automated processes could be developed to keep the files up to date, or alternative sources could be suggested. For example, it might be interesting to replace most of the metadat files with code that scrapes a game database like screenscraper.fr or thegamesdb.net.

This documentation could be in the form of a simple README inside each specific sub-folder or a comment with a URL added to the header of the dat file.

@OctopusButtons
Contributor

OctopusButtons commented Dec 30, 2024

files whose source of truth is unclear. Without knowing the source, it's hard to know if the data needs to be updated

I thought the source list in the readme looks good. Is that new from after you posted the issue, or what kind of source do you mean?

But yes I was looking for clearer documentation about exactly which parts make sense for github contribution versus which parts are imported periodically from Redump etc (which would over-write in-house contributions). I'm looking through Histories to try to determine, in order to update Readme.

Case Study Questions
The questions below came up when I tried to find out where RetroArch gets the name "Virtua Fighter 4 - Evolution [Greatest Hits] (USA)", which then doesn't match any thumbnails.

  • dat documentation says it's "Customized DAT files, maintained by the libretro team"
    • How does it function alongside all the other dat files?
      • Answer: It has precedence because it's earlier in the list (and possibly earlier in the build script, if that is based on folder order)
    • Is the reason so that we can work around / fix issues or gaps from the Redump etc groups, with our own addenda? If yes, that's great, and I'll PR added documentation, but I'm not sure.
    • What is the purpose or function for example of the SNES dat, is it because No-Intro excludes Virtual Console variants?
      • Answer: I'm surprised to find out that No-Intro doesn't catalog Virtual Console SNES variants (as of 2025) as SNES games (but as Wii digital games, etc) but without stating that policy anywhere. I had assumed the libretro file was made because No-Intro just hadn't logged VC variants yet at the time. Yet I see No-Intro logs VC variants of a GBA game, but not VC SNES examples.
    • Conflicts? If there's conflicting info between dat and metadat (e.g. imported from Redump) which one wins?
      • Answer: earlier items in the dat list take precedence. Hence the in-house dat can be used to over-ride problems from No-Intro etc.
  • metadat/developer connects, for example, Virtua Fighter 4 - Evolution [Greatest Hits] (USA) to serial SLUS-20616 while the redump file instead uses SLUS-20616GH
    • Shouldn't that be covered by the redump metadata (below)? In the case of Virtua Fighter 4 - Evolution [Greatest Hits], the developer dat info is only that name and a serial (which apparently conflicts with the redump "GH" serial). The metadat/developer dat doesn't always have checksum hashes to connect file to info.
  • metadat/redump (link) doesn't have any [Greatest Hits] subtitle/tag for any Virtua Fighter 4. Yet Redump's own website lists the "Greatest Hits" version tag in an "Edition" field, not in the name. And the metadat/redump file connects "Virtua Fighter 4 - Evolution (USA)" (no [Greatest Hits] title tag) to serial "SLUS-20616GH".
  • How can a person directly view/search the compiled rdb files, to avoid manually looking at every component .dat that went into it?
    • Answer: the .rdbs can be opened with a hex editor, and have no connection with Redis .RDB (oops).
    • How does the final rdb assign the [Greatest Hits] title tag to my Virtua Fighter 4 Evolution file, when the redump database doesn't have the [Greatest Hits] title tag, and when the metadat/developer database has [Greatest Hits] name info but connects it to SLUS-20616 not SLUS-20616GH?
      • Answer: the file's serial (also see) (not to be confused with metadata) is used as the key field for matching, and/or the earlier dat takes precedence in compilation.
  • Pending. Will trace using my file checksum to see which/where dat sequence hypothetically assigns the final RA name and serial, then update comment, and maybe update documentation accordingly if I learn. Since I've now learned so much about the databases, I may not even bother tracing the VF4 issue, I trust that the in-file serial is keying to database entries and I'm not too interested in the GH/non-GH edition aspect.
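
The two answers above (matching on a key field such as the serial, and earlier dats taking precedence) can be sketched in a few lines of Python. This is only an illustration of the idea as understood in this thread, not the actual libretro build code: the function name `merge_by_serial`, the dict-based entry format, and the `DEMO-00001` entries are all hypothetical. The two Virtua Fighter 4 entries reuse the names and serials quoted above.

```python
from typing import Dict, Iterable, List

Entry = Dict[str, str]


def merge_by_serial(dats: Iterable[List[Entry]]) -> Dict[str, Entry]:
    """Merge dat entries keyed by exact serial; earlier dats win per field.

    Assumption: the serial is the only match key and precedence is
    simply the order in which the dats are passed in.
    """
    merged: Dict[str, Entry] = {}
    for dat in dats:  # dats are iterated in precedence order
        for entry in dat:
            serial = entry.get("serial")
            if serial is None:
                continue  # nothing to key on in this sketch
            target = merged.setdefault(serial, {})
            for field, value in entry.items():
                # setdefault keeps any value already set by an earlier dat
                target.setdefault(field, value)
    return merged


# Hypothetical in-house override and upstream import for the same serial.
in_house_dat = [{"serial": "DEMO-00001", "name": "Example Game (In-House Name)"}]
upstream_dat = [
    {"serial": "DEMO-00001", "name": "Example Game (Upstream Name)", "developer": "Example Dev"}
]

# Hand-written stand-ins for the two entries discussed in this comment.
developer_dat = [
    {"serial": "SLUS-20616", "name": "Virtua Fighter 4 - Evolution [Greatest Hits] (USA)"}
]
redump_dat = [{"serial": "SLUS-20616GH", "name": "Virtua Fighter 4 - Evolution (USA)"}]

merged = merge_by_serial([in_house_dat, upstream_dat, developer_dat, redump_dat])
for serial, fields in merged.items():
    print(serial, fields)
```

In this sketch the in-house name wins for `DEMO-00001` while the upstream `developer` field is still filled in, and `SLUS-20616` and `SLUS-20616GH` remain two separate records because the keys differ. It does not explain how the real rdb ends up with the [Greatest Hits] name; that still needs the trace described above.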

Big Question: which databases should people contribute to via libretro github databases? Versus which changes should only go through those groups rather than libretro's github dat addenda? It's obvious in cases like metadat/no-intro, metadat/tosec, and metadat/redump, but many others are less straightforward. I'm looking through the Histories and will update the ReadMe with specifics about each dat.

  • Answers:
      1. I'm adding some per-dat documentation that clarifies which cases are fit for github contributions
      2. smaller squib databases are usually "ad hoc" and made by contributions
      3. a person can study the github History to see which dats are manual contributions
      4. many large ones are imported, so any "fix" would get overwritten
      5. the modular nature of the dat system means it's easy to add an additional ad hoc dat (if the build script gets updated accordingly to hit the new folder/files)
      6. confusion arises with deprecated dats that were created with a large import and are now somewhat abandoned, so that they could probably become ad hoc / contribution edits without risk of bulk over-writes.

I'll update the documentation, after I understand it well enough. Working on massive readme update with everything my research has turned up.

@steven-sheehy
Contributor Author

steven-sheehy commented Dec 30, 2024

I thought the source list in the readme looks good. Is that new from after you posted the issue, or what kind of source do you mean?

Sorry, I should've been more clear. The README mostly documents the source of the dats, metadat/no-intro, metadat/redump, metadat/tosec, and metadat/mame* folders. But not the cht or the rest of the metadat folders. Their source of truth is still unclear to me.

Big Question: Should people contribute to libretro github databases where appropriate? Or do the Redump databases get copied periodically from Redump etc, meaning changes should only go through Redump etc?

I'm not an expert, but I believe Rob periodically syncs the metadat/no-intro, metadat/redump, metadat/tosec, and metadat/mame* folders from the upstream dats using his libretro-dats. See his most recent PR for an example. So I don't think you would want to manually update those files or the generated RDB, but the rest are fair game.

@OctopusButtons
Contributor

OctopusButtons commented Dec 31, 2024

Oh I see what you mean. Though cht specifically I figure is any source / anywhere / any contribution, since cheat codes are a low-stakes side perk in the app. But yeah: I'm going through and researching each sub-database in order to update the per dat bullet list with

  • A) description. (I'm now an expert on the history of "Ukie" because of the elspa dat...)
  • B) sourcing
  • C) clear flag on user-contribution-relevant or not.

If anyone can give any casual info, I'll revise/combine/format additions to Readme.

@OctopusButtons
Contributor

OctopusButtons commented Jan 21, 2025

The cht, dat, and metadat folders contain files whose source of truth is unclear.

documentation could be in the form of a simple README inside each specific sub-folder or a comment with a URL added to the header of the dat file.

#1514 has two points intended to help address the same concerns:

  • Added a spec: a dat's clrmamepro header description should clearly document A) the source of the data, in addition to B) the contributor/user/builder/scraper source (either in the description or in other fields). The often-seen, uninformative "Manufacturer - Console" description is not a requirement and is not used by RetroArch, so the field should instead be used for clear source documentation and supporting info.
  • Added some per-folder/per-dat description for a few cases where my research gained understanding. But usually I could only document the apparent meaning, not the source of truth. It's still mostly anomalous even looking at Histories: in some cases a person added bulk data years ago, it's not clear from where, and the person isn't active. I tagged a few cases that appear to now be user-contributable (rather than subject to routine bulk overwrites). But I don't think there's a verification standard for manual data contributions...which personally I'm OK with, since it can be revised / reverted / dealt with if a problem is found. And we probably don't need per-item/entry documentation.
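
To make the header spec above easier to check, here is a minimal Python sketch that reads the `clrmamepro ( ... )` header block of a dat and flags descriptions that just repeat the name. It is a rough illustration, not part of #1514: the helper names `parse_header` and `describes_source`, the "description differs from name" heuristic, and both sample headers (including the example.org URL) are assumptions made up for this example.

```python
import re
from typing import Dict

# Header block: "clrmamepro (" up to a ")" that starts a line.
HEADER_RE = re.compile(r"clrmamepro\s*\((.*?)^\)", re.DOTALL | re.MULTILINE)
# Header fields of the form: key "value"
FIELD_RE = re.compile(r'^\s*(\w+)\s+"([^"]*)"', re.MULTILINE)


def parse_header(dat_text: str) -> Dict[str, str]:
    """Return the quoted key/value fields of the clrmamepro header, if any."""
    match = HEADER_RE.search(dat_text)
    if match is None:
        return {}
    return dict(FIELD_RE.findall(match.group(1)))


def describes_source(header: Dict[str, str]) -> bool:
    """Heuristic (an assumption): a description that is empty or merely
    repeats the name says nothing about where the data came from."""
    name = header.get("name", "").strip()
    description = header.get("description", "").strip()
    return bool(description) and description != name


# Hypothetical sample headers for illustration only.
uninformative = '''clrmamepro (
    name "Manufacturer - Console"
    description "Manufacturer - Console"
)
'''

documented = '''clrmamepro (
    name "Manufacturer - Console"
    description "Developer names; hand-edited via GitHub, first imported from https://example.org"
    homepage "https://example.org"
)
'''

for label, text in (("uninformative", uninformative), ("documented", documented)):
    header = parse_header(text)
    print(label, "->", "ok" if describes_source(header) else "needs source info")
```

A check like this could only report the problem; it would not stop a builder or scraper from regenerating the old header later.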

The problem with retroactively adding/editing a clarifying header description or comment is that the contributing builders/scrapers/scripts are probably coded to add the old uninformative header again. So if the .dat headers were manually changed, they'd get over-written in the future, until everyone follows the header description (or comment etc) spec in the proposed readme update.
