serializable does not remove data implicitly stored in partial functions, leading to scaling with data size and potential privacy breaches #750

Open
@ablaom

Description

edit: The new issue title more accurately describes the problem, which was not immediately diagnosed in the original comment below.


This issue is not related to the master or dev branches but to the breaking release branch for-a-0-point-20-release.

After merging #733 (with target for-a-0-point-20-release), which was passing CI, and bringing for-a-0-point-20-release up to date with dev (with a regular merge), I'm getting a new test failure: the file size of serialised objects has become data-dependent. The following is adapted from the failing test:

julia> model = Stack(
           metalearner = FooBarRegressor(lambda=1.0),
           model_1 = DeterministicConstantRegressor(),
           model_2 = ConstantRegressor())
DeterministicStack(
    resampling = CV(
            nfolds = 6,
            shuffle = false,
            rng = Random._GLOBAL_RNG()),
    metalearner = FooBarRegressor(
            lambda = 1.0),
    model_1 = DeterministicConstantRegressor(),
    model_2 = ConstantRegressor())

filesizes = []
for n in [100, 500, 1000]
    filename = "serialized_temp_$n.jls"
    X, y = make_regression(n, 1)
    mach = machine(model, X, y)
    fit!(mach, verbosity=0)
    MLJBase.save(filename, mach)
    push!(filesizes, filesize(filename))
    rm(filename)
end

julia> filesizes
3-element Vector{Any}:
 28744
 45144
 65144
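
For illustration only, here is a minimal sketch of the mechanism named in the issue title, using nothing but the Serialization standard library. It is not MLJBase code and does not reproduce the Stack internals: `make_predictor`, `make_lean_predictor` and `serialized_size` are hypothetical helpers invented for this example. The assumption is that a closure (partial function) which captures the training data carries that data into the serialised object, so the byte count grows with `n`, whereas a closure that captures only a derived scalar does not.

```julia
using Serialization

# Hypothetical helper: the returned closure captures the whole `data` vector.
make_predictor(data) = x -> x + sum(data) / length(data)

# Hypothetical helper: the returned closure captures only the precomputed mean.
function make_lean_predictor(data)
    m = sum(data) / length(data)
    return x -> x + m
end

# Number of bytes produced by serializing `obj` into an in-memory buffer.
function serialized_size(obj)
    io = IOBuffer()
    serialize(io, obj)
    return length(take!(io))
end

ns = (100, 500, 1000)

# Each extra Float64 in the captured vector adds at least 8 bytes.
sizes = [serialized_size(make_predictor(rand(n))) for n in ns]

# Only a single Float64 is captured here, whatever the value of n.
lean_sizes = [serialized_size(make_lean_predictor(rand(n))) for n in ns]

println("closure capturing data:     ", sizes)
println("closure capturing mean only: ", lean_sizes)
```

If the same mechanism is at work in the failing test, the fix would be to make sure the serialisable form of the machine drops or replaces any such closures, rather than only clearing the explicitly stored data fields.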

@olivierlabayle Are you able to reproduce? Any idea how this could have arisen?
