Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tab-delimited data file format for tabular data (using Experience.txt as an example) #6399

Merged
merged 8 commits into from
Aug 24, 2023

Conversation

ephphatha
Copy link
Contributor

Intended to support #211, this documents a loose Tab-delimited format similar to IANA TSV and Linear TSV with the following key differences:

  • No escape sequences are defined. This is expected to be handled by the consuming area (e.g. we might decide to move credits to an external file and when loading that data we transform \ t byte sequences to a tab character). Existing Diablo 2 data files use paths with backslash characters that frequently precede t, n, r characters like "act5\nihlathakmusic.wav" which would conflict with using this type of extension across the board.
  • Empty fields are permitted with no special marker value required. There are existing sparse Diablo 2 data files which have empty fields and the \N sequence used by Linear TSV for this purpose commonly appears in paths like "Act5\Nihlathak\nih_act5_q3_after.wav".
  • At least one record is expected which should be treated as a header line. In practice this can consist of a single empty field (it's really up to the consuming area to interpret records as headers and document/enforce restrictions).
  • Records are free to contain any number of fields (mainly so we don't have to keep track of what the expected field count is. When parsing we could just default-initialise any remaining fields if we run into a record separator early).
  • Trailing newlines are permitted. LibreOffice Calc leaves a trailing newline when saving a "Text CSV" file, Google Sheets doesn't when downloading a .tsv file. Diablo 2 datafiles appear to end with a newline. Short of requiring records to contain at least one value I think it's easier to just expect consumers to discard the empty record following a trailing newline in this case.

This allows effectively streaming through files with generic code similar to what DakkJaniels posted in discord: Try it online!

or with more specific parsing like: Try it online!

@ephphatha
Copy link
Contributor Author

Got it far enough to actually load from a file alongside assets, but it needs to be written with the correct includes etc to actually compile on non-msvc compilers...

I'm also not sure if burying files under the assets directory is desirable or we'd rather have another asset include directory specified in the Cmake files so text files can be easier to discover. Something to think about later :)

@AJenbo
Copy link
Member

AJenbo commented Jul 23, 2023

it needs to be written with the correct includes etc to actually compile on non-msvc compilers...

Don't beat your self up like that, it also compiled successfully on GCC 13.1... specifically

@AJenbo
Copy link
Member

AJenbo commented Jul 23, 2023

I'm also not sure if burying files under the assets directory is desirable or we'd rather have another asset include directory specified in the Cmake files so text files can be easier to discover. Something to think about later :)

I have been thinking that maybe it would be wroth it to move the whole assets folder out now that there is a lot more going on in it and things are getting more data driven in general.

@ephphatha
Copy link
Contributor Author

So, uh, can we move to C++17? 😄 The ParseInt helper glebm has in #6175 would be usable but the way from_chars returns a pointer indicating the end of parsing even in out_of_range failure cases is useful. Ultimately it doesn't matter if I scan over fields twice since it's only loaded once when the app starts.

Copy link
Collaborator

@glebm glebm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use the .tsv file extension so that files are displayed as tables on GitHub.

Packaging/resources/assets/txtdata/Experience.txt Outdated Show resolved Hide resolved
Source/data/common.hpp Outdated Show resolved Hide resolved
Source/data/common.hpp Outdated Show resolved Hide resolved
Source/utils/stdcompat/charconv.hpp Outdated Show resolved Hide resolved
Source/data/common.hpp Outdated Show resolved Hide resolved
Source/data/common.hpp Outdated Show resolved Hide resolved
Source/playerdat.cpp Outdated Show resolved Hide resolved
Source/playerdat.cpp Outdated Show resolved Hide resolved
Source/playerdat.cpp Outdated Show resolved Hide resolved
@glebm
Copy link
Collaborator

glebm commented Aug 19, 2023

Ah I see, yeah, let's use from_chars directly then but limit the range to uint8_t.
We can indeed use C++17 now but you first have to rebase on top of the development branch and retarget the PR there

@ephphatha
Copy link
Contributor Author

Let's use the .tsv file extension so that files are displayed as tables on GitHub.

Was meaning to comment on this earlier as well, I originally used the exact naming convention used by D2 text files (TitleCase.txt) thinking that was gonna be less controversial 😄 . Using .tsv for nicer formatting on GitHub is a good reason to go with that extension (even though the files don't actually comply with text/tsv).

@AJenbo AJenbo changed the base branch from master to development August 19, 2023 09:38
@ephphatha ephphatha force-pushed the external-data branch 3 times, most recently from c44c9df to d30a1cf Compare August 19, 2023 10:58
@ephphatha
Copy link
Contributor Author

Ah I see, yeah, let's use from_chars directly then but limit the range to uint8_t.

Forgot to comment on this one too. uint8_t level; std::from_chars(begin, end, level) calls std::from_chars<uint8_t>(const char *, const char *, uint8_t &, base = 10), it returns a result with ec::out_of_range if a sequence of digits is encountered that ends up larger than UINT8_T_MAX while also advancing to the first non-digit character. That behaviour is desirable in this instance, level will keep the value 0 set when it was declared (inside the outer loop over the whole record)

@ephphatha ephphatha marked this pull request as draft August 20, 2023 04:47
Source/engine/assets.hpp Outdated Show resolved Hide resolved
Source/playerdat.cpp Outdated Show resolved Hide resolved
@ephphatha ephphatha marked this pull request as ready for review August 20, 2023 07:48
@ephphatha ephphatha force-pushed the external-data branch 2 times, most recently from cb9ed79 to 80e4206 Compare August 20, 2023 09:46
glebm
glebm previously approved these changes Aug 20, 2023
ephphatha and others added 8 commits August 21, 2023 18:52
No real need to persist this value
Also added an iterator based API, though it's not useful for this use-case. Might be nice in the future?

The field/record iterators is single-pass input iterators with shared state.
To avoid rescanning fields unnecessarily parseInt currently can only be called once, it would be possible to make these iterators bidirectional with a bit of extra state (holding onto the start pointer)

Co-authored-by: Gleb Mazovetskiy <[email protected]>
@AJenbo AJenbo merged commit 32c3316 into diasurgical:development Aug 24, 2023
16 of 17 checks passed
@ephphatha ephphatha deleted the external-data branch August 24, 2023 04:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants