Setting limit on number of checkpoints per notebook #55

Open
amangarg96 opened this issue Feb 7, 2019 · 3 comments

Comments

@amangarg96

Is there a way to set a limit on the number of checkpoints created per notebook?

The checkpoints do not store the diff but rather the whole notebook as a blob, so a large file could eat up the storage.

@ssanderson
Contributor

There isn't a feature to do this at the moment. The closest thing is a feature to set the maximum allowed file size.

@amangarg96
Author

What changes would be required to implement this enhancement?

Correct me if my understanding is wrong: a 'limit' argument could be passed to the PostgresCheckpoints class through the configuration file, and create_notebook_checkpoint() could be changed so that, whenever the limit would be exceeded, it deletes the oldest checkpoint before creating the new one.
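
Roughly, I'm imagining something like the following. The trait name, default, and import path are just placeholders for illustration, not existing pgcontents options:

from traitlets import Integer
from pgcontents.checkpoints import PostgresCheckpoints

class LimitedPostgresCheckpoints(PostgresCheckpoints):
    # Hypothetical option; not an existing pgcontents trait.
    max_checkpoints_per_notebook = Integer(
        0,
        config=True,
        help="Maximum checkpoints to retain per notebook (0 means unlimited).",
    )
    # create_notebook_checkpoint() would check this value and, when it is
    # nonzero, delete the notebook's oldest checkpoint(s) before inserting
    # the new one.

Users would then set something like c.PostgresCheckpoints.max_checkpoints_per_notebook = 5 in their Jupyter config file.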

@ssanderson
Contributor

That sounds roughly correct, though some of the details here might be tricky.

Some specific forms of trickiness that are worth worrying about:

  1. Do the update and delete(s) go in the same transaction, or in separate transactions? If separate, what's the correct ordering?
  2. If the deletion logic gets invoked and there's more than 1 extra checkpoint (e.g., because someone was previously running pgcontents without the limit, and then they turned the limit on), what should happen?

Regarding (1), in general, one needs to be careful about performing multiple updates in separate statements within the same transaction, with different kinds of care being needed depending on your transaction isolation level. In this case, because we only ever write checkpoints once, I think it should be safe to do the delete and the insert in the same transaction (i.e., within the same with engine.begin() block), but I'd want to think carefully about it for a bit before adding a feature that could delete user data if I'm wrong.
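
To make (1) concrete, here's roughly the shape I'd have in mind, with the delete and the insert sharing one transaction. This is only a sketch: the function name, the column names (user_id, path, content), and the INSERT are placeholders that may not match the actual schema.

from sqlalchemy import text

def create_checkpoint_with_limit(engine, user_id, path, blob, max_checkpoints):
    # One engine.begin() block == one transaction: the delete and the
    # insert either both commit or both roll back.
    with engine.begin() as conn:
        # Trim this notebook's oldest checkpoints so that at most
        # (max_checkpoints - 1) remain before the new one is written.
        conn.execute(
            text("""
                DELETE FROM remote_checkpoints
                WHERE id IN (
                    SELECT id FROM remote_checkpoints
                    WHERE user_id = :user_id AND path = :path
                    ORDER BY last_modified DESC
                    OFFSET :keep
                )
            """),
            {"user_id": user_id, "path": path, "keep": max_checkpoints - 1},
        )
        # Write the new checkpoint in the same transaction.
        conn.execute(
            text("""
                INSERT INTO remote_checkpoints (user_id, path, content, last_modified)
                VALUES (:user_id, :path, :content, now())
            """),
            {"user_id": user_id, "path": path, "content": blob},
        )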

Regarding (2), the options here are probably "only delete one checkpoint", or "delete until we're back under the limit". My intuition is that "delete until under the limit" is what most people would expect, though I'm open to arguments otherwise. Assuming that's what we'd want here, I think the right SQL to emit would be something like:

DELETE FROM remote_checkpoints
WHERE id IN (
    SELECT id FROM remote_checkpoints
    ORDER BY last_modified DESC
    OFFSET {max_checkpoints}
);
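
(In the real query, {max_checkpoints} would presumably become a bound parameter rather than a string interpolation, and the inner SELECT would also need to be restricted to the checkpoints of the notebook being saved; otherwise the limit would apply across all notebooks combined.)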
