tf.data.Dataset vs. tf.keras.Sequence #13

Open
b-nils opened this issue Dec 20, 2021 · 6 comments
Labels: enhancement (New feature or request), question (Further information is requested)

Comments

@b-nils

b-nils commented Dec 20, 2021

Great hack to make use of the tf.data.Dataset object:

def make_gen_callable(_gen):

I am curious whether you have noticed any performance loss or gain (in terms of training duration) in comparison to using TF1, multiprocessing, and tf.keras.Sequence?
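The `make_gen_callable` snippet quoted above is cut off in this thread. As a minimal sketch of how such a wrapper usually works (the body below is an assumption, not necessarily the repository's code): `tf.data.Dataset.from_generator` expects a zero-argument callable that returns a fresh iterator, so an existing iterable gets wrapped in a function. The `batches` list is a hypothetical stand-in for a Sequence-like data source.

```python
def make_gen_callable(_gen):
    """Wrap an iterable in a zero-argument callable that yields its items.

    Assumed body: the original snippet in this thread is truncated.
    """
    def gen():
        # Each call to gen() starts a new pass over the wrapped iterable.
        for x, y in _gen:
            yield x, y
    return gen


# Hypothetical stand-in for a Sequence-like source of (input, target) pairs.
batches = [([1, 2], 0), ([3, 4], 1)]

gen_fn = make_gen_callable(batches)
first_pass = list(gen_fn())
second_pass = list(gen_fn())  # re-iterable, because gen_fn is a callable
```

The returned `gen_fn` is what would be handed to `tf.data.Dataset.from_generator`, together with a description of the output types and shapes.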

@VeeranjaneyuluToka

VeeranjaneyuluToka commented Dec 20, 2021

@b-nils I have been using this repository to train COCO weights (without success so far), and I found that training with the tf.data.Dataset API is faster than with tf.keras.utils.Sequence. In my runs, the benefit in training time was more than 100%. However, I have not tried the combination you mentioned (TF1, multiprocessing, and tf.keras.Sequence). Instead, I installed TF 2.4.3 and measured the difference in training time between the tf.data.Dataset and tf.keras.utils.Sequence APIs.
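For anyone who wants to reproduce this kind of comparison, one option is to time the input pipeline on its own before timing full training runs. The sketch below is only an illustration: `time_batches` and `slow_source` are hypothetical names, and it measures how long it takes to pull batches from any iterable (a Sequence, a tf.data.Dataset, or a plain generator), not the training time reported above.

```python
import time


def time_batches(batch_iterable, n_batches):
    """Pull up to n_batches items and return (items_pulled, seconds_elapsed)."""
    iterator = iter(batch_iterable)
    pulled = 0
    start = time.perf_counter()
    for _ in range(n_batches):
        try:
            next(iterator)
        except StopIteration:
            break  # the source ran out before n_batches were pulled
        pulled += 1
    elapsed = time.perf_counter() - start
    return pulled, elapsed


def slow_source(n, delay=0.001):
    """Hypothetical data source that simulates per-batch loading cost."""
    for i in range(n):
        time.sleep(delay)
        yield i


pulled, elapsed = time_batches(slow_source(20), n_batches=10)
print(f"{pulled} batches in {elapsed:.4f} s")
```

Running the same function over both pipelines with the same batch size gives a rough idea of whether data loading is the bottleneck at all.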

@b-nils
Author

b-nils commented Dec 21, 2021

@VeeranjaneyuluToka thanks for the insights! Do you think the performance might be further improved by applying preprocessing after tf.data.Dataset has been created and/or calling tf.data.Dataset.prefetch()?

@VeeranjaneyuluToka

@b-nils It might. However, I ran into an issue when I tried to implement a data pipeline with the tf.data.Dataset API myself, before I found this repository, and I solved it in the same way. If you look at the implementation carefully, the author calls repeat() while passing the dataset to model.fit() as a parameter. I tried calling it before fit() instead, but that raised an error (not sure why). So I doubt that calling prefetch() will work straight away, but I think it is worth experimenting to check whether it improves training time further. Let us know if your experiment succeeds.

alexander-pv added the question label Dec 29, 2021
@alexander-pv
Owner

Hi, @b-nils, thanks,

I agree with @VeeranjaneyuluToka that data processing is faster with tf.data.Dataset. However, with Sequence you can simply increase the queue size for the same purpose. I added a prefetch option in config.py for testing. You can simply write dataset.repeat().prefetch() to make it work.

Actually, it seems like a good option to implement data processing with pure tf.data.Dataset, without any additional queues. In that case it is probably worth adding a simple generator that only reads images and moving all further processing into tf.data.Dataset. I will label the issue as a possible enhancement.
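A minimal sketch of that idea, assuming TF 2.4 or newer: a plain generator only produces raw images, and everything else (preprocessing, batching, `repeat()`, `prefetch()`) lives in the tf.data pipeline. `read_images` and `preprocess` are hypothetical placeholders, not functions from this repository, and the toy arrays stand in for real image loading. Note that `prefetch()` takes a buffer size; `tf.data.AUTOTUNE` lets TensorFlow choose it.

```python
import numpy as np
import tensorflow as tf


def read_images():
    """Hypothetical reader: yields raw uint8 'images' and integer labels."""
    for label in range(4):
        image = np.full((8, 8, 3), label * 10, dtype=np.uint8)
        yield image, np.int32(label)


def preprocess(image, label):
    # Preprocessing written with TF ops so it runs inside the tf.data pipeline.
    image = tf.cast(image, tf.float32) / 255.0
    return image, label


dataset = (
    tf.data.Dataset.from_generator(
        read_images,  # a callable, not a generator object
        output_signature=(
            tf.TensorSpec(shape=(8, 8, 3), dtype=tf.uint8),
            tf.TensorSpec(shape=(), dtype=tf.int32),
        ),
    )
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(2)
    .repeat()                    # restart the generator when it is exhausted
    .prefetch(tf.data.AUTOTUNE)  # overlap data preparation with training
)

first_images, first_labels = next(iter(dataset))
labels_seen = [labels.numpy().tolist() for _, labels in dataset.take(3)]
```

Because `repeat()` makes the dataset infinite, a call such as `model.fit(dataset, steps_per_epoch=...)` needs an explicit `steps_per_epoch` so that Keras knows where an epoch ends.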

alexander-pv added the enhancement label Dec 29, 2021
@VeeranjaneyuluToka

Hi, @alexander-pv ,

Just to understand a bit more: why did you define prefetch() as an option in the config? Why can't it be the default behaviour?

@alexander-pv
Owner

Hi, @VeeranjaneyuluToka,

For now, it seems to me that these settings are worth tuning for each specific task: both the Sequence queue size and the buffer size in prefetch.
