Releases: tensorflow/datasets
v4.8.3
Added
Changed
Deprecated
- Python 3.7 support: this version and future versions use Python 3.8.
Removed
Fixed
- The `ignore_verifications` flag from Hugging Face's `datasets.load_dataset` is
  deprecated, and used to cause errors in `tfds.load(huggingface:foo)`.
Security
v4.8.2
Deprecated
- Python 3.7 support: this is the last version of TFDS supporting Python 3.7.
Future versions will use Python 3.8.
Fixed
- `tfds new` and `tfds build` better support the new recommended datasets
  organization, where individual datasets have their own package under
  `datasets/`, the builder class is called `Builder` and is defined within the
  module `${dsname}_dataset_builder.py`.
Security
v4.8.1
Changed
- Added file `valid_tags.txt` to not break builds.
- TFDS no longer relies on TensorFlow DTypes. We chose NumPy DTypes to keep the
  typing expressiveness, while dropping the heavy dependency on TensorFlow. We
  migrated all our internal datasets. Please migrate accordingly:
  - `tf.bool`: `np.bool_`
  - `tf.string`: `np.str_`
  - `tf.int64`, `tf.int32`, etc: `np.int64`, `np.int32`, etc
  - `tf.float64`, `tf.float32`, etc: `np.float64`, `np.float32`, etc
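
One way to approach the migration is a mechanical rewrite of dtype names. The sketch below is a hypothetical helper (`migrate_dtypes` is not part of TFDS) that applies the mapping listed above to a source string. It assumes dtypes are written with the `tf.` prefix and that the rewritten file imports NumPy as `np`.

```python
import re

# Renamed dtypes, taken from the mapping in the release note above.
TF_TO_NP = {
    "tf.bool": "np.bool_",
    "tf.string": "np.str_",
}

# Numeric dtypes keep their name: tf.int64 -> np.int64, tf.float32 -> np.float32.
_NUMERIC = re.compile(r"\btf\.((?:u?int|float)\d+)\b")
_NAMED = re.compile(r"\btf\.(?:bool|string)\b")


def migrate_dtypes(source: str) -> str:
    """Rewrite TensorFlow dtype names in `source` to their NumPy equivalents."""
    source = _NAMED.sub(lambda m: TF_TO_NP[m.group(0)], source)
    return _NUMERIC.sub(r"np.\1", source)


old = "features = {'label': tf.int64, 'text': tf.string, 'x': tf.float32}"
print(migrate_dtypes(old))
```

A text rewrite like this is only a starting point; reviewing the resulting diff by hand remains advisable.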
v4.8.0
Added
- [API] `DatasetBuilder`'s description and citations can be specified in
  dedicated `README.md` and `CITATIONS.bib` files, within the dataset package
  (see https://www.tensorflow.org/datasets/add_dataset).
- Tags can be associated to Datasets, in the `TAGS.txt` file. For now, they are
  only used in the generated documentation.
- [API][Experimental] New `ViewBuilder` to define datasets as transformations
  of existing datasets. Also adds `tfds.transform` with functionality to apply
  transformations.
- Loggers are also called on `tfds.as_numpy(...)`, base `Logger` class has a
  new corresponding method.
- `tfds.core.DatasetBuilder` can have a default limit for the number of
  simultaneous downloads. `tfds.download.DownloadConfig` can override it.
- `tfds.features.Audio` supports storing raw audio data for lazy decoding.
- The number of shards can be overridden when preparing a dataset:
  `builder.download_and_prepare(download_config=tfds.download.DownloadConfig(num_shards=42))`.
  Alternatively, you can configure the min and max shard size if you want TFDS
  to compute the number of shards for you, but want to have control over the
  shard sizes.
Changed
Deprecated
Removed
Fixed
Security
v4.7.0
Added
- [API] Added TfDataBuilder that is handy for storing experimental ad hoc TFDS datasets in notebook-like environments such that they can be versioned, described, and easily shared with teammates.
- [API] Added options to create format-specific dataset builders. The new API now includes a number of NLP-specific builders, such as:
- [API] Added `tfds.beam.inc_counter` to reduce `beam.metrics.Metrics.counter` boilerplate
- [API] Added options to group together existing TFDS datasets into dataset collections and to perform simple operations over them.
- [Documentation] update, specifically:
- [TFDS CLI] Supports custom config through JSON (e.g. `tfds build my_dataset --config='{"name": "my_custom_config", "description": "Abc"}'`)
- New datasets:
  - conll2003
  - universal_dependency 2.10
  - bucc
  - i_naturalist2021
  - mtnt Machine Translation of Noisy Text.
  - placesfull
  - tatoeba
  - user_libri_audio
  - user_libri_text
  - xtreme_pos
  - yahoo_ltrc
- Updated datasets:
  - C4 was updated to version 3.1.
  - common_voice was updated to a more recent snapshot.
  - wikipedia was updated with the `20220620` snapshot.
- New dataset collections, such as xtreme and LongT5
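
The JSON passed to `tfds build --config` in the TFDS CLI item above has to survive shell quoting, which is easy to get wrong by hand. A minimal sketch of assembling that command line from Python follows. The helper name `build_config_flag` is hypothetical, and only the `name` and `description` keys shown in the release note are used.

```python
import json
import shlex


def build_config_flag(name: str, description: str) -> str:
    """Return a shell-safe --config=... flag carrying the config as JSON."""
    payload = json.dumps({"name": name, "description": description})
    # shlex.quote wraps the JSON in single quotes so the shell passes it verbatim.
    return "--config=" + shlex.quote(payload)


command = "tfds build my_dataset " + build_config_flag("my_custom_config", "Abc")
print(command)
```

Parsing the result with `shlex.split` and `json.loads` is a quick way to confirm the flag round-trips before running the build.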
Changed
- The base `Logger` class expects more information to be passed to the `as_dataset` method. This should only be relevant to people who have implemented and registered custom `Logger` class(es).
- You can set `DEFAULT_BUILDER_CONFIG_NAME` in a `DatasetBuilder` to change the default config if it shouldn't be the first builder config defined in `BUILDER_CONFIGS`.
Deprecated
Removed
Fixed
- Various datasets
- In Linux, when loading a dataset from a directory that is not your home (`~`) directory, a new `~` directory is not created in the current directory (fixes #4117).
Security
v4.6.0
Added
- Support for community datasets on GCS.
- [API] `tfds.builder_from_directory` and `tfds.builder_from_directories`, see
  https://www.tensorflow.org/datasets/external_tfrecord#directly_from_folder.
- [API] Dash ("-") support in split names.
- [API] `file_format` argument to `download_and_prepare` method, allowing user
  to specify an alternative file format to store prepared data (e.g. "riegeli").
- [API] `file_format` to `DatasetInfo` string representation.
- [API] Expose the return value of Beam pipelines. This allows for users to
  read the Beam metrics.
- [API] Expose Feature `tf_example_spec` to public.
- [API] `doc` kwarg on `Feature`s, to describe a feature.
- [Documentation] Features description is shown on TFDS Catalog.
- [Documentation] More metadata about HuggingFace datasets in TFDS catalog.
- [Performance] Parallel load of metadata files.
- [Testing] TFDS tests are now run using GitHub actions - misc improvements such
  as caching and sharding.
- [Testing] Improvements to MockFs.
- New datasets.
Changed
- [API] `num_shards` is now optional in the shard name.
Removed
- TFDS pathlib API, migrated to a self-contained `etils.epath` (see
  https://github.com/google/etils).
Fixed
- Various datasets.
- Dataset builders that are defined adhoc (e.g. in Colab).
- Better `DatasetNotFoundError` messages.
- Don't set `deterministic` on a global level but locally in interleave, so it
  only applies to interleave and not all transformations.
- Google drive downloader.
As always, thank you to all contributors!
v4.5.2
v4.5.1
v4.5.0
This is the last version of TFDS supporting Python 3.6. Future versions will use Python 3.7.
- Better split API:
  - Splits can be selected using shards: `split='train[3shard]'`
  - Underscore supported in numbers for better readability: `split='train[:500_000]'`
  - Select the union of all splits with `split='all'`
  - `tfds.even_splits` is more precise and flexible:
    - Return splits exactly of the same size when passed `tfds.even_splits('train', n=3, drop_remainder=True)`
    - Works on subsplits `tfds.even_splits('train[:75%]', n=3)` or even nested
    - Can be composed with other splits: `tfds.even_splits('train', n=3)[0] + 'test'`
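
Because the split selectors above are plain strings, they can be assembled programmatically. The sketch below uses two hypothetical helpers (`slice_split` and `shard_split` are not TFDS functions) that only build the strings. Passing the result as the `split=` argument is left to the caller.

```python
def _fmt(number):
    """Format an optional integer bound with underscore digit grouping."""
    return "" if number is None else f"{number:_}"


def slice_split(name, start=None, stop=None):
    """Build a `name[start:stop]` selector, e.g. train[:500_000]."""
    return f"{name}[{_fmt(start)}:{_fmt(stop)}]"


def shard_split(name, shard_index):
    """Build a shard selector in the `name[3shard]` form shown above."""
    return f"{name}[{shard_index}shard]"


print(slice_split("train", stop=500_000))  # train[:500_000]
print(shard_split("train", 3))             # train[3shard]
```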
- FeatureConnectors:
  - Faster dataset generation (using tfrecords)
  - Features now have `serialize_example` / `deserialize_example` methods to encode/decode example to proto: `example_bytes = features.serialize_example(example_data)`
  - `Audio` now supports `encoding='zlib'` for better compression
  - Features specs exposed in proto for better compatibility with other languages
- Better testing:
  - Mock dataset now supports nested datasets
  - Customize the number of sub examples
- Documentation update:
  - Community datasets: https://www.tensorflow.org/datasets/community_catalog/overview
  - New guide on TFDS and determinism
- RLDS:
  - Nested datasets features are supported
  - New datasets: Robomimic, D4RL Ant Maze, RLU Real World RL, and RLU Atari with ordered episodes
- Misc:
  - Create beam pipeline using TFDS as input with `tfds.beam.ReadFromTFDS`
  - Support setting the file formats in `tfds build --file_format=tfrecord`
  - Typing annotations exposed in `tfds.typing`
  - `tfds.ReadConfig` has a new `assert_cardinality=False` to disable cardinality
  - Add a `tfds.display_progress_bar(True)` for functional control
  - Support for huge number of shards (>99999)
  - DatasetInfo exposes `.release_notes`
And of course, new datasets, bug fixes,...
Thank you to all our contributors for improving TFDS!
v4.4.0
API:
- Add `PartialDecoding` support, to decode only a subset of the features (for performance)
- Catalog now exposes links to KnowYourData visualisations
- `tfds.as_numpy` supports datasets with `None`
- Datasets generated with `disable_shuffling=True` are now read in generation order.
- Loading datasets from files now supports custom `tfds.features.FeatureConnector`
- `tfds.testing.mock_data` now supports
  - non-scalar tensors with dtype `tf.string`
  - `builder_from_files` and path-based community datasets
- File format automatically restored (for datasets generated with `tfds.builder(..., file_format=)`).
- Many new reinforcement learning datasets
- Various bug fixes and internal improvements like:
  - Dynamically set number of worker threads during extraction
  - Update progress bar during download even if downloads are cached
Dataset creation:
- Add `tfds.features.LabeledImage` for semantic segmentation (like image but with additional `info.features['image_label'].name` label metadata)
- Add float32 support for `tfds.features.Image` (e.g. for depth map)
- All FeatureConnector can now have a `None` dimension anywhere (previously restricted to the first position).
- `tfds.features.Tensor()` can have arbitrary number of dynamic dimension (`Tensor(..., shape=(None, None, 3, None))`)
- `tfds.features.Tensor` can now be serialised as bytes, instead of float/int values (to allow better compression): `Tensor(..., encoding='zlib')`
- Add script to add TFDS metadata files to existing TF-record (see doc).
- New guide on common implementation gotchas
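
To see why storing a tensor as compressed bytes can help, the sketch below packs a small float array into raw bytes and compresses it with the standard-library `zlib` module. It only illustrates the general idea behind `encoding='zlib'`; it is not TFDS's internal serialization, and the sample values are made up for the demonstration.

```python
import array
import zlib

# Made-up, highly repetitive data, similar in spirit to a sparse depth map.
values = array.array("f", [0.0] * 900 + [1.5] * 100)

raw = values.tobytes()       # fixed-width float bytes
packed = zlib.compress(raw)  # compressed byte string

# Decompressing restores the exact same values.
restored = array.array("f")
restored.frombytes(zlib.decompress(packed))

print(len(raw), len(packed))
```

The gain depends on the data: repetitive tensors shrink substantially, while noisy floating-point values may barely compress.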
Thank you all for your support and contribution!