
serializable does not remove data implicitly stored in partial functions, leading to scaling with data size and potential privacy breaches #750

Open
ablaom opened this issue Mar 25, 2022 · 6 comments


ablaom commented Mar 25, 2022

Edit: the new issue title more accurately describes the problem, which was not immediately diagnosed in the original comment below.


This issue is not related to the master or dev branches but to the breaking release branch for-a-0-point-20-release.

After merging #733 (targeting for-a-0-point-20-release), which was passing CI, and bringing for-a-0-point-20-release up to date with dev (via a regular merge), I'm getting a new error in tests: the file size of serialised objects has become data-dependent. The following is adapted from the failing test:

model = Stack(
    metalearner = FooBarRegressor(lambda=1.),
    model_1 = DeterministicConstantRegressor(),
    model_2 = ConstantRegressor())
DeterministicStack(
    resampling = CV(
            nfolds = 6,
            shuffle = false,
            rng = Random._GLOBAL_RNG()),
    metalearner = FooBarRegressor(
            lambda = 1.0),
    model_1 = DeterministicConstantRegressor(),
    model_2 = ConstantRegressor())

filesizes = []
for n in [100, 500, 1000]
    filename = "serialized_temp_$n.jls"
    X, y = make_regression(n, 1)
    mach = machine(model, X, y)
    fit!(mach, verbosity=0)
    MLJBase.save(filename, mach)
    push!(filesizes, filesize(filename))
    rm(filename)
end

julia> filesizes
3-element Vector{Any}:
 28744
 45144
 65144

@olivierlabayle Are you able to reproduce? Any idea how this could have arisen?


ablaom commented Mar 25, 2022

I've created a test PR, referenced above, of for-a-0-point-20-release onto dev, which hopefully will reproduce the failure.


ablaom commented Mar 26, 2022

Perhaps the resampling update to Stack is responsible.

olivierlabayle (Collaborator) commented:

> Perhaps the resampling update to Stack is responsible.

@ablaom Indeed, I've had a quick look. The problem seems to come from a source node operation (unfortunately this is an anonymous function) which stores an attribute r, a vector of integers, probably the rows used. It is probably related to the fact that I have switched to using train_test_pairs as you suggested. Do you have any clue that could speed up my search?

olivierlabayle (Collaborator) commented:

I think this is coming from the function selectrows, which has to store the data because it is a partial function (a closure). In that sense I don't think anything is wrong with the stack; rather, this is an edge case, partial functions, that is not dealt with by the current logic in serializable. I am a bit unsure what to do about it.
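The mechanism can be sketched with nothing beyond the standard-library Serialization module (the helpers make_selector and serialized_size below are hypothetical, not MLJBase API): a closure that captures a vector of row indices serializes to a size that grows with that vector, exactly the data-dependent scaling reported above.

```julia
using Serialization

# Number of bytes a value occupies when serialized.
serialized_size(x) = (io = IOBuffer(); serialize(io, x); length(take!(io)))

# A "partial function" in the sense above: the returned closure
# implicitly captures the `rows` vector in one of its fields.
make_selector(rows::Vector{Int}) = X -> X[rows]

small = make_selector(collect(1:10))
large = make_selector(collect(1:10_000))

# Serializing the closure serializes the captured rows too,
# so the size scales with the length of `rows`.
serialized_size(small) < serialized_size(large)
```

The captured data is visible as a field of the closure's type (e.g. via fieldnames(typeof(small))), which is why a generic "strip the data" pass in serializable would need to know about closure internals to remove it.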


ablaom commented Mar 28, 2022

Ah-haaa! Yes that makes sense.

@olivierlabayle Thanks for taking the time to diagnose this problem. This is an interesting case I had never thought of: data being stored in partial functions.

I think addressing this issue would be pretty difficult in the current design. We may need to simply document the problem for now, but I'll leave this open. I'll update the issue title and opening comment.

To resolve the current failure, I suggest modifying the test with an explicit declaration of resampling in the form of train-test pairs that are iterators rather than vectors. Presumably the size of an iterator will not scale with the size of the data, but we must explicitly generate the iterators for each data length. We should cross-reference this issue in the modified test to explain the acrobatics.
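A quick sketch of why iterator-style pairs should stay small (again plain Serialization; pairs_lazy and pairs_eager are made-up helpers, not MLJ API): a UnitRange serializes as just its two endpoints regardless of n, while collected index vectors serialize element by element and so scale linearly.

```julia
using Serialization

# Number of bytes a value occupies when serialized.
serialized_size(x) = (io = IOBuffer(); serialize(io, x); length(take!(io)))

# One train-test split of rows 1:n, as lazy ranges vs materialized vectors:
pairs_lazy(n)  = [(1:div(n, 2), div(n, 2)+1:n)]                     # two endpoints each
pairs_eager(n) = [(collect(1:div(n, 2)), collect(div(n, 2)+1:n))]   # n indices total

serialized_size(pairs_lazy(100)) == serialized_size(pairs_lazy(100_000))  # flat in n
serialized_size(pairs_eager(100)) < serialized_size(pairs_eager(100_000)) # scales with n
```

So a test that passes ranges as the resampling declaration can keep the serialized machine's size independent of the data length, which is the invariant the failing test wants to check.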

@olivierlabayle What do you think?

ablaom changed the title from "Issue with serialization" to "serializable does not remove data implicitly stored in partial functions, leading to scaling with data size and potential privacy breaches" on Mar 28, 2022
olivierlabayle (Collaborator) commented:

I am coming back from holiday tomorrow and can have a look, yes! Another solution would be to just perform the test with a simple model, i.e. not a stack; that should work.
