datasets cannot handle nested json if features is given. #7116

Closed
ljw20180420 opened this issue Aug 20, 2024 · 3 comments

@ljw20180420

Describe the bug

I have a JSON file named temp.json with the following content:

{"ref1": "ABC", "ref2": "DEF", "cuts":[{"cut1": 3, "cut2": 5}]}

I want to load it.

import datasets

ds = datasets.load_dataset('json', data_files="./temp.json", features=datasets.Features({
    'ref1': datasets.Value('string'),
    'ref2': datasets.Value('string'),
    # 'cuts' is declared as a Sequence wrapping a dict of features
    'cuts': datasets.Sequence({
        "cut1": datasets.Value("uint16"),
        "cut2": datasets.Value("uint16")
    })
}))

The code above does not work. However, the file loads when I do not pass features:

ds = datasets.load_dataset('json', data_files="./temp.json")

Is it possible to load integers as uint16 to save some memory?
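
For a rough sense of what is at stake, here is a minimal sketch that compares per-element storage for 16-bit and 64-bit integers using Python's standard array module. It assumes the integers would otherwise be stored as 64-bit values, which is an assumption about the default and not something stated in this thread. The sample values are made up, and the numbers only illustrate raw element size, not the actual footprint of a loaded dataset.

```python
from array import array

# Hypothetical sample: one million cut positions that all fit in 16 bits.
values = [i % 65536 for i in range(1_000_000)]

# "H" is an unsigned 16-bit integer on common platforms, "q" a signed 64-bit one.
as_uint16 = array("H", values)
as_int64 = array("q", values)

# Raw payload size in bytes: element size times element count.
bytes_uint16 = as_uint16.itemsize * len(as_uint16)
bytes_int64 = as_int64.itemsize * len(as_int64)

print(f"uint16: {bytes_uint16} bytes, int64: {bytes_int64} bytes")
```

Under that assumption, the narrower type needs a quarter of the space for the integer columns.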

Steps to reproduce the bug

As in the bug description.
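
For anyone who wants to recreate the input file, the following is a small sketch that writes the record from the description as one JSON object per line. The helper name write_repro_file is hypothetical, and the file is written to a temporary directory here so nothing in the working directory is overwritten; point it at "./temp.json" to match the issue.

```python
import json
import os
import tempfile

# The single record shown in the bug description.
record = {"ref1": "ABC", "ref2": "DEF", "cuts": [{"cut1": 3, "cut2": 5}]}


def write_repro_file(path):
    """Write the record as one JSON object on a single line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


# Use "./temp.json" instead of a temporary path to match the issue exactly.
repro_path = os.path.join(tempfile.mkdtemp(), "temp.json")
write_repro_file(repro_path)
```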

Expected behavior

The data is loaded and the integers are stored as uint16.

Environment info


  • datasets version: 2.21.0
  • Platform: Linux-5.15.0-118-generic-x86_64-with-glibc2.35
  • Python version: 3.11.9
  • huggingface_hub version: 0.24.5
  • PyArrow version: 17.0.0
  • Pandas version: 2.2.2
  • fsspec version: 2024.5.0
@lhoestq
Member

lhoestq commented Aug 22, 2024

Hi! Sequence behaves unexpectedly for dictionaries (a behavior that comes from tensorflow-datasets). Use a regular list instead:

import datasets

ds = datasets.load_dataset('json', data_files="./temp.json", features=datasets.Features({
    'ref1': datasets.Value('string'),
    'ref2': datasets.Value('string'),
    # A plain Python list around the dict keeps 'cuts' as a list of dicts
    'cuts': [{
        "cut1": datasets.Value("uint16"),
        "cut2": datasets.Value("uint16")
    }]
}))
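
As background on the "weird behavior": as I understand the datasets documentation, a Sequence wrapping a dictionary of features is treated as a dictionary of lists and not as a list of dictionaries, while the JSON file stores cuts as a list of dictionaries. The sketch below uses only plain Python to show the two layouts for the record in temp.json; the helper to_dict_of_lists is hypothetical and is not part of the datasets API.

```python
import json

# The single record from temp.json, as shown in the issue.
record = json.loads('{"ref1": "ABC", "ref2": "DEF", "cuts":[{"cut1": 3, "cut2": 5}]}')


def to_dict_of_lists(rows):
    """Transpose a list of dicts into a dict of lists (hypothetical helper)."""
    keys = rows[0].keys() if rows else []
    return {key: [row[key] for row in rows] for key in keys}


# Layout in the JSON file, matching the [{...}] feature spec.
list_of_dicts = record["cuts"]

# Layout implied by Sequence({...}), per the documented conversion.
dict_of_lists = to_dict_of_lists(record["cuts"])

print("list of dicts:", list_of_dicts)
print("dict of lists:", dict_of_lists)
```

The two layouts hold the same values but have different shapes, which is why declaring the feature as a regular list matches the file as written.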

@ljw20180420
Author

Thank you!

@ljw20180420
Author

It works.
