
Random seed doesn't seem to work well #779

Open
kisel4363 opened this issue Sep 23, 2022 · 2 comments

@kisel4363

I'm new to petastorm and I'm facing some issues.
I need to iterate over a dataset, getting three identical batches so I can transform two of them to extract some info.
The dataset consists of users' ratings of movies (like the MovieLens dataset). I need three batches with the same ratings (rows) so I can extract each user (a user can appear several times in the ratings) and each rated movie. I wrote this code.

Creating fake dataset and spark converter:

from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Assumes a local Spark session; on Databricks `spark` already exists.
spark = SparkSession.builder.master('local[2]').getOrCreate()

ratings_l = [
    {'uid_dec': 0, 'mid_dec': 6, 'eval': 2.18},
    {'uid_dec': 0, 'mid_dec': 7, 'eval': 3.83},
    {'uid_dec': 0, 'mid_dec': 8, 'eval': 3.94},
    {'uid_dec': 0, 'mid_dec': 9, 'eval': 4.31},
    {'uid_dec': 0, 'mid_dec': 10, 'eval': 4.48},
    {'uid_dec': 0, 'mid_dec': 11, 'eval': 3.74},
    {'uid_dec': 1, 'mid_dec': 6, 'eval': 3.21},
    {'uid_dec': 1, 'mid_dec': 7, 'eval': 2.05},
    {'uid_dec': 1, 'mid_dec': 8, 'eval': 2.24},
    {'uid_dec': 1, 'mid_dec': 9, 'eval': 2.08},
    {'uid_dec': 1, 'mid_dec': 10, 'eval': 4.94},
    {'uid_dec': 1, 'mid_dec': 11, 'eval': 4.22},
    {'uid_dec': 2, 'mid_dec': 6, 'eval': 3.52},
    {'uid_dec': 2, 'mid_dec': 7, 'eval': 2.67},
    {'uid_dec': 2, 'mid_dec': 8, 'eval': 2.69},
    {'uid_dec': 2, 'mid_dec': 9, 'eval': 2.75},
    {'uid_dec': 2, 'mid_dec': 10, 'eval': 4.93},
    {'uid_dec': 2, 'mid_dec': 11, 'eval': 2.9},
    {'uid_dec': 3, 'mid_dec': 6, 'eval': 2.0},
    {'uid_dec': 3, 'mid_dec': 7, 'eval': 2.9},
    {'uid_dec': 3, 'mid_dec': 8, 'eval': 4.74},
    {'uid_dec': 3, 'mid_dec': 9, 'eval': 2.5},
    {'uid_dec': 3, 'mid_dec': 10, 'eval': 2.18},
    {'uid_dec': 3, 'mid_dec': 11, 'eval': 4.93},
    {'uid_dec': 4, 'mid_dec': 6, 'eval': 4.46},
    {'uid_dec': 4, 'mid_dec': 7, 'eval': 2.23},
    {'uid_dec': 4, 'mid_dec': 8, 'eval': 4.42},
    {'uid_dec': 4, 'mid_dec': 9, 'eval': 4.67},
    {'uid_dec': 4, 'mid_dec': 10, 'eval': 2.65},
    {'uid_dec': 4, 'mid_dec': 11, 'eval': 2.11},
    {'uid_dec': 5, 'mid_dec': 6, 'eval': 2.31},
    {'uid_dec': 5, 'mid_dec': 7, 'eval': 2.69},
    {'uid_dec': 5, 'mid_dec': 8, 'eval': 2.41},
    {'uid_dec': 5, 'mid_dec': 9, 'eval': 4.62},
    {'uid_dec': 5, 'mid_dec': 10, 'eval': 3.96},
    {'uid_dec': 5, 'mid_dec': 11, 'eval': 2.23}
]

train_ds = spark.createDataFrame(ratings_l)

# make_spark_converter requires a parent cache directory to be configured.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'file:///tmp/petastorm_cache')

conv_train = make_spark_converter(train_ds)

Get three datasets from the same converter (hoping the batches are identical):

epochs = 4
batch_size = 6
batches_per_epoch = 36 // batch_size  # the fake dataset has 36 rows

with conv_train.make_tf_dataset(batch_size=batch_size, num_epochs=epochs, seed=1) as train, \
     conv_train.make_tf_dataset(batch_size=batch_size, num_epochs=epochs, seed=1) as train1, \
     conv_train.make_tf_dataset(batch_size=batch_size, num_epochs=epochs, seed=1) as train2:
    for i, (b, b1, b2) in enumerate(zip(train, train1, train2)):
        if i % batches_per_epoch == 0:
            print('==========Epoch==========: {0}'.format(i // batches_per_epoch))
        print('==========Group of Batches {}:'.format(i % batches_per_epoch))
        print(b[0].numpy())
        print(b1[0].numpy())
        print(b2[0].numpy())

This is the output:

==========Epoch==========: 0
==========Group of Batches 0:
[2.   2.9  4.74 2.5  2.18 4.93]
[2.18 3.83 3.94 4.31 4.48 3.74]
[2.   2.9  4.74 2.5  2.18 4.93]
==========Group of Batches 1:
[4.46 2.23 4.42 4.67 2.65 2.11]
[3.21 2.05 2.24 2.08 4.94 4.22]
[4.46 2.23 4.42 4.67 2.65 2.11]
==========Group of Batches 2:
[2.31 2.69 2.41 4.62 3.96 2.23]
[3.52 2.67 2.69 2.75 4.93 2.9 ]
[2.31 2.69 2.41 4.62 3.96 2.23]
==========Group of Batches 3:
[2.18 3.83 3.94 4.31 4.48 3.74]
[2.18 3.83 3.94 4.31 4.48 3.74]
[2.18 3.83 3.94 4.31 4.48 3.74]
==========Group of Batches 4:
[3.21 2.05 2.24 2.08 4.94 4.22]
[3.21 2.05 2.24 2.08 4.94 4.22]
[3.21 2.05 2.24 2.08 4.94 4.22]
==========Group of Batches 5:
[3.52 2.67 2.69 2.75 4.93 2.9 ]
[3.52 2.67 2.69 2.75 4.93 2.9 ]
[3.52 2.67 2.69 2.75 4.93 2.9 ]
==========Epoch==========: 1
==========Group of Batches 0:
[2.18 3.83 3.94 4.31 4.48 3.74]
[2.   2.9  4.74 2.5  2.18 4.93]
[2.18 3.83 3.94 4.31 4.48 3.74]
==========Group of Batches 1:
[3.21 2.05 2.24 2.08 4.94 4.22]
[4.46 2.23 4.42 4.67 2.65 2.11]
[3.21 2.05 2.24 2.08 4.94 4.22]
==========Group of Batches 2:
[3.52 2.67 2.69 2.75 4.93 2.9 ]
[2.31 2.69 2.41 4.62 3.96 2.23]
[3.52 2.67 2.69 2.75 4.93 2.9 ]
==========Group of Batches 3:
[2.   2.9  4.74 2.5  2.18 4.93]
[2.   2.9  4.74 2.5  2.18 4.93]
[2.   2.9  4.74 2.5  2.18 4.93]
==========Group of Batches 4:
[4.46 2.23 4.42 4.67 2.65 2.11]
[4.46 2.23 4.42 4.67 2.65 2.11]
[4.46 2.23 4.42 4.67 2.65 2.11]
==========Group of Batches 5:
[2.31 2.69 2.41 4.62 3.96 2.23]
[2.31 2.69 2.41 4.62 3.96 2.23]
[2.31 2.69 2.41 4.62 3.96 2.23]

The question is: why are the batches different in some groups, for example in Epoch 1, Group of Batches 2?
The expected behavior is that all three batches are always identical, as in Epoch 1, Groups of Batches 3, 4 and 5.

@selitvin
Collaborator

It's likely due to a race between workers. We currently don't have a reordering buffer after the reader (which launches multiple threads or processes to parallelize the work). To test this hypothesis, please pass workers_count=1 to the make_tf_dataset call.

Obviously, workers_count=1 will cause a degradation in reading speed.
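
For reference, a minimal sketch of that experiment, reusing conv_train, epochs and batch_size from the snippet above:

with conv_train.make_tf_dataset(batch_size=batch_size, num_epochs=epochs,
                                seed=1, workers_count=1) as train, \
     conv_train.make_tf_dataset(batch_size=batch_size, num_epochs=epochs,
                                seed=1, workers_count=1) as train1, \
     conv_train.make_tf_dataset(batch_size=batch_size, num_epochs=epochs,
                                seed=1, workers_count=1) as train2:
    for b, b1, b2 in zip(train, train1, train2):
        # If the race hypothesis is right, these three lines should now
        # print identical values for every group of batches.
        print(b[0].numpy())
        print(b1[0].numpy())
        print(b2[0].numpy())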

I am not sure I understand, though: why do you use three different dataset instances to read different columns? I.e., what prevents you from doing something like this?

with conv_train.make_tf_dataset(batch_size=batch_size, num_epochs=epochs, seed=1) as train:
    for i, b in enumerate(train):
        # b should have all three fields: uid_dec, mid_dec and eval...
        pass

@kisel4363
Author

I'm not trying to read different columns. What I want is to apply different transformations to each batch, but I need to apply these transformations to the same data; that's why I'm trying to get three identical batches, and in fact why I'm calling make_tf_dataset three times on the same dataset: each make_tf_dataset call needs a different TransformSpec object, roughly as in the sketch below.
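
A minimal sketch of what I mean (the three transform functions are hypothetical, schema-preserving placeholders for my real transformations):

from petastorm import TransformSpec

# Placeholder transformations; the real ones extract per-user and
# per-movie information from the same batch of ratings.
def keep_ratings(pdf):
    return pdf

def extract_users(pdf):
    pdf = pdf.copy()
    pdf['eval'] = pdf['uid_dec'].astype('float64')  # placeholder user extraction
    return pdf

def extract_movies(pdf):
    pdf = pdf.copy()
    pdf['eval'] = pdf['mid_dec'].astype('float64')  # placeholder movie extraction
    return pdf

with conv_train.make_tf_dataset(batch_size=batch_size, num_epochs=epochs, seed=1,
                                transform_spec=TransformSpec(keep_ratings)) as train, \
     conv_train.make_tf_dataset(batch_size=batch_size, num_epochs=epochs, seed=1,
                                transform_spec=TransformSpec(extract_users)) as train1, \
     conv_train.make_tf_dataset(batch_size=batch_size, num_epochs=epochs, seed=1,
                                transform_spec=TransformSpec(extract_movies)) as train2:
    for b, b1, b2 in zip(train, train1, train2):
        ...  # the three batches need to cover exactly the same rows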
Could I achieve the desired result in another way?
Thanks for your attention
