
Random seed doesn't seem to work well #779

Open
kisel4363 opened this issue Sep 23, 2022 · 2 comments

@kisel4363

I'm new to petastorm and I'm facing some issues.
I need to iterate over a dataset, getting three identical batches so I can transform two of them to extract some info.
The dataset consists of users' ratings of movies (like the MovieLens dataset). I need three batches with the same ratings (rows) so I can extract each user (a user can appear several times in the ratings) and each rated movie. I wrote this code.

Creating fake dataset and spark converter:

from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Assumes a local Spark session; on Databricks `spark` already exists.
spark = SparkSession.builder.master('local[2]').getOrCreate()

ratings_l = [
    {'uid_dec': 0, 'mid_dec': 6, 'eval': 2.18},
    {'uid_dec': 0, 'mid_dec': 7, 'eval': 3.83},
    {'uid_dec': 0, 'mid_dec': 8, 'eval': 3.94},
    {'uid_dec': 0, 'mid_dec': 9, 'eval': 4.31},
    {'uid_dec': 0, 'mid_dec': 10, 'eval': 4.48},
    {'uid_dec': 0, 'mid_dec': 11, 'eval': 3.74},
    {'uid_dec': 1, 'mid_dec': 6, 'eval': 3.21},
    {'uid_dec': 1, 'mid_dec': 7, 'eval': 2.05},
    {'uid_dec': 1, 'mid_dec': 8, 'eval': 2.24},
    {'uid_dec': 1, 'mid_dec': 9, 'eval': 2.08},
    {'uid_dec': 1, 'mid_dec': 10, 'eval': 4.94},
    {'uid_dec': 1, 'mid_dec': 11, 'eval': 4.22},
    {'uid_dec': 2, 'mid_dec': 6, 'eval': 3.52},
    {'uid_dec': 2, 'mid_dec': 7, 'eval': 2.67},
    {'uid_dec': 2, 'mid_dec': 8, 'eval': 2.69},
    {'uid_dec': 2, 'mid_dec': 9, 'eval': 2.75},
    {'uid_dec': 2, 'mid_dec': 10, 'eval': 4.93},
    {'uid_dec': 2, 'mid_dec': 11, 'eval': 2.9},
    {'uid_dec': 3, 'mid_dec': 6, 'eval': 2.0},
    {'uid_dec': 3, 'mid_dec': 7, 'eval': 2.9},
    {'uid_dec': 3, 'mid_dec': 8, 'eval': 4.74},
    {'uid_dec': 3, 'mid_dec': 9, 'eval': 2.5},
    {'uid_dec': 3, 'mid_dec': 10, 'eval': 2.18},
    {'uid_dec': 3, 'mid_dec': 11, 'eval': 4.93},
    {'uid_dec': 4, 'mid_dec': 6, 'eval': 4.46},
    {'uid_dec': 4, 'mid_dec': 7, 'eval': 2.23},
    {'uid_dec': 4, 'mid_dec': 8, 'eval': 4.42},
    {'uid_dec': 4, 'mid_dec': 9, 'eval': 4.67},
    {'uid_dec': 4, 'mid_dec': 10, 'eval': 2.65},
    {'uid_dec': 4, 'mid_dec': 11, 'eval': 2.11},
    {'uid_dec': 5, 'mid_dec': 6, 'eval': 2.31},
    {'uid_dec': 5, 'mid_dec': 7, 'eval': 2.69},
    {'uid_dec': 5, 'mid_dec': 8, 'eval': 2.41},
    {'uid_dec': 5, 'mid_dec': 9, 'eval': 4.62},
    {'uid_dec': 5, 'mid_dec': 10, 'eval': 3.96},
    {'uid_dec': 5, 'mid_dec': 11, 'eval': 2.23}
]

train_ds = spark.createDataFrame(ratings_l)

# make_spark_converter requires a parent cache directory to be configured.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'file:///tmp/petastorm_cache')

conv_train = make_spark_converter(train_ds)

Get three datasets from the same converter (hoping the batches are identical):

epochs = 4
batch_size = 6
batches_per_epoch = 36 // batch_size  # the fake dataset has 36 rows

with conv_train.make_tf_dataset(batch_size=batch_size, num_epochs=epochs, seed=1) as train, \
     conv_train.make_tf_dataset(batch_size=batch_size, num_epochs=epochs, seed=1) as train1, \
     conv_train.make_tf_dataset(batch_size=batch_size, num_epochs=epochs, seed=1) as train2:
    for i, (b, b1, b2) in enumerate(zip(train, train1, train2)):
        if i % batches_per_epoch == 0:
            print('==========Epoch==========: {0}'.format(i // batches_per_epoch))
        print('==========Group of Batches {}:'.format(i % batches_per_epoch))
        print(b[0].numpy())
        print(b1[0].numpy())
        print(b2[0].numpy())

This is the output:

==========Epoch==========: 0
==========Group of Batches 0:
[2.   2.9  4.74 2.5  2.18 4.93]
[2.18 3.83 3.94 4.31 4.48 3.74]
[2.   2.9  4.74 2.5  2.18 4.93]
==========Group of Batches 1:
[4.46 2.23 4.42 4.67 2.65 2.11]
[3.21 2.05 2.24 2.08 4.94 4.22]
[4.46 2.23 4.42 4.67 2.65 2.11]
==========Group of Batches 2:
[2.31 2.69 2.41 4.62 3.96 2.23]
[3.52 2.67 2.69 2.75 4.93 2.9 ]
[2.31 2.69 2.41 4.62 3.96 2.23]
==========Group of Batches 3:
[2.18 3.83 3.94 4.31 4.48 3.74]
[2.18 3.83 3.94 4.31 4.48 3.74]
[2.18 3.83 3.94 4.31 4.48 3.74]
==========Group of Batches 4:
[3.21 2.05 2.24 2.08 4.94 4.22]
[3.21 2.05 2.24 2.08 4.94 4.22]
[3.21 2.05 2.24 2.08 4.94 4.22]
==========Group of Batches 5:
[3.52 2.67 2.69 2.75 4.93 2.9 ]
[3.52 2.67 2.69 2.75 4.93 2.9 ]
[3.52 2.67 2.69 2.75 4.93 2.9 ]
==========Epoch==========: 1
==========Group of Batches 0:
[2.18 3.83 3.94 4.31 4.48 3.74]
[2.   2.9  4.74 2.5  2.18 4.93]
[2.18 3.83 3.94 4.31 4.48 3.74]
==========Group of Batches 1:
[3.21 2.05 2.24 2.08 4.94 4.22]
[4.46 2.23 4.42 4.67 2.65 2.11]
[3.21 2.05 2.24 2.08 4.94 4.22]
==========Group of Batches 2:
[3.52 2.67 2.69 2.75 4.93 2.9 ]
[2.31 2.69 2.41 4.62 3.96 2.23]
[3.52 2.67 2.69 2.75 4.93 2.9 ]
==========Group of Batches 3:
[2.   2.9  4.74 2.5  2.18 4.93]
[2.   2.9  4.74 2.5  2.18 4.93]
[2.   2.9  4.74 2.5  2.18 4.93]
==========Group of Batches 4:
[4.46 2.23 4.42 4.67 2.65 2.11]
[4.46 2.23 4.42 4.67 2.65 2.11]
[4.46 2.23 4.42 4.67 2.65 2.11]
==========Group of Batches 5:
[2.31 2.69 2.41 4.62 3.96 2.23]
[2.31 2.69 2.41 4.62 3.96 2.23]
[2.31 2.69 2.41 4.62 3.96 2.23]

The question is: why are the batches different in some groups, for example in Epoch 1, Group of Batches 2?
The expected behavior is that all three batches are always identical, as in Epoch 1, Groups of Batches 3, 4 and 5.

@selitvin
Collaborator

It's likely due to a race between workers. We currently don't have a reordering buffer after the reader (which launches multiple threads or processes to parallelize the work). To test this hypothesis, please pass workers_count=1 to the make_tf_dataset call.

Obviously, workers_count=1 will cause a degradation in reading speed.
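
For reference, a minimal sketch of that experiment, reusing conv_train, epochs and batch_size from the snippet above:

with conv_train.make_tf_dataset(batch_size=batch_size, num_epochs=epochs,
                                seed=1, workers_count=1) as train, \
     conv_train.make_tf_dataset(batch_size=batch_size, num_epochs=epochs,
                                seed=1, workers_count=1) as train1, \
     conv_train.make_tf_dataset(batch_size=batch_size, num_epochs=epochs,
                                seed=1, workers_count=1) as train2:
    for b, b1, b2 in zip(train, train1, train2):
        # If the race hypothesis is right, these three lines should now
        # print identical values for every group of batches.
        print(b[0].numpy())
        print(b1[0].numpy())
        print(b2[0].numpy())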

I am not sure I understand, though: why do you use three different dataset instances to read different columns? I.e., what prevents you from doing something like this?

with conv_train.make_tf_dataset(batch_size=batch_size, num_epochs=epochs, seed=1) as train:
    for i, b in enumerate(train):
        # b should have all three fields: uid_dec, mid_dec and eval...
        pass

@kisel4363
Author

I'm not trying to read different columns. What I want is to apply different transformations to each batch, but I need to apply these transformations to the same data; that's why I'm trying to get three identical batches, and in fact why I'm calling make_tf_dataset three times on the same dataset: each make_tf_dataset call needs a different TransformSpec object, roughly as in the sketch below.
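
A minimal sketch of what I mean (the three transform functions are hypothetical, schema-preserving placeholders for my real transformations):

from petastorm import TransformSpec

# Placeholder transformations; the real ones extract per-user and
# per-movie information from the same batch of ratings.
def keep_ratings(pdf):
    return pdf

def extract_users(pdf):
    pdf = pdf.copy()
    pdf['eval'] = pdf['uid_dec'].astype('float64')  # placeholder user extraction
    return pdf

def extract_movies(pdf):
    pdf = pdf.copy()
    pdf['eval'] = pdf['mid_dec'].astype('float64')  # placeholder movie extraction
    return pdf

with conv_train.make_tf_dataset(batch_size=batch_size, num_epochs=epochs, seed=1,
                                transform_spec=TransformSpec(keep_ratings)) as train, \
     conv_train.make_tf_dataset(batch_size=batch_size, num_epochs=epochs, seed=1,
                                transform_spec=TransformSpec(extract_users)) as train1, \
     conv_train.make_tf_dataset(batch_size=batch_size, num_epochs=epochs, seed=1,
                                transform_spec=TransformSpec(extract_movies)) as train2:
    for b, b1, b2 in zip(train, train1, train2):
        ...  # the three batches need to cover exactly the same rows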
Could I achieve the desired result in another way?
Thanks for your attention
