RFC for a Better Global Expiry API #66
LevelDB can also expire/delete data at either the object level or the whole-file level, depending on the settings.

The [current LevelDB expiry implementation and bucket type proposal](https://github.com/basho/leveldb/wiki/mv-bucket-expiry2#three-properties-three-names-each) uses different values and periods for the TTL on the object, and the TTL for the bucket property. The object TTL uses a `uint64` that represents **milliseconds**, and the bucket property TTL uses a `string` that represents a shorthand time string. There is another intermediary setting for the `leveldb::ExpiryModuleOS` that uses minutes for its TTL setting, the aptly named `expiry_minutes`.
- There is no presentation here of leveldb's per object expiry, i.e. specific expiry date set on an individual object. Internally this is millisecond uint64.
- A discussion once occurred about having TS use an object's timestamp to set the "write time" portion of the TTL type code.
- There is support for X-Riak-Meta-Expiry-Base-Seconds property in KV's enterprise edition. Notes are here: https://github.com/basho/leveldb/wiki/mv-bucket-expiry
### Proposal

Although the two Expiry strategies have the same end goal, they reach it by different processes. They overlap in the KV + LevelDB + EE configuration space, but elsewhere they are independent of each other.
Is it appropriate to say that they also overlap in the clients space? Or does "configuration space" imply clients already?
We do have different mechanisms to enforce TTL configuration, but we should strive to hide this complexity from the end user. Exposing internal details via the configuration file is not the best idea, so I prefer Option 2. Having the same units will simplify the configuration. For consistency, I think we should reuse the usual way Riak takes time-interval configuration, e.g. 30m, 300s, 1h, etc.
Agree, we should probably expose the Cuttlefish "duration" style to both bucket types/bucket properties and riak.conf. Behind the scenes it would get converted to seconds when it gets placed on an object or needs to be computed. The Cuttlefish duration module already exists, so it shouldn't be a terrible amount of work.
- The `expiration_mode` property can override the default value in the `riak.conf` file.
- Setting `expiration_mode` to `whole_file` will let LevelDB remove entire files of expired records without compaction.
- Setting `expiration_mode` to `per_item` will require LevelDB to do a compaction on the data file to remove expired records.
above settings originated from this document: https://docs.google.com/document/d/1byxSyUFasGYWpUG2eZzqep6QfL9huCAOJxMk2ENzRTo/edit
The [TTL property](https://github.com/basho/riak_pb/blob/develop/src/riak_kv.proto#L237) on each object is an unsigned 32-bit integer that represents **seconds**, and its value can range from **0** (immediate expiry) to **2^32 - 1**.

##### 2\. LevelDB "Bucket" Expiry (KV EE and TS EE)
Does the shipping global expiry need to be mentioned? All config is within riak.conf.
>A duration string consists of a series of one or more number/suffix combinations. Example: "2d7h32m" is two days, 7 hours, and 32 minutes. The code converts that example string to 3,332 minutes. The number must be a whole number, no decimal fractions. The valid suffixes are "f" (fortnight), "w" (week), "d" (day), "h" (hour), and "m" (minute).

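To make the quoted duration format concrete, here is a minimal Python sketch of a parser that follows only the rules stated in the quote (whole numbers, suffixes `f`, `w`, `d`, `h`, `m`, result in minutes). The function name `duration_to_minutes` and the error handling are assumptions made for illustration; this is not the LevelDB or Cuttlefish implementation.

```python
import re

# Minutes per suffix, taken from the quoted description.
MINUTES_PER_UNIT = {
    "f": 14 * 24 * 60,  # fortnight
    "w": 7 * 24 * 60,   # week
    "d": 24 * 60,       # day
    "h": 60,            # hour
    "m": 1,             # minute
}

# One number/suffix pair: a whole number followed by a single suffix letter.
_TOKEN = re.compile(r"(\d+)([fwdhm])")


def duration_to_minutes(text: str) -> int:
    """Convert a duration string such as '2d7h32m' to whole minutes.

    Hypothetical helper: rejects empty strings, decimal fractions,
    and unknown suffixes by raising ValueError.
    """
    pos = 0
    total = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration at offset {pos}: {text!r}")
        total += int(match.group(1)) * MINUTES_PER_UNIT[match.group(2)]
        pos = match.end()
    if pos == 0:
        raise ValueError("empty duration")
    return total


print(duration_to_minutes("2d7h32m"))  # 2*1440 + 7*60 + 32 = 3332
```

The same shape would work for a seconds-based variant if the unit table were changed, which is roughly what converting a user-facing string to an internal integer would involve.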
Notes:
- The `expiration` property can only be changed through a `riak.conf` file change, and it is a read-only bucket property.
I am pretty sure expiration can be set to "on/off" at TS table / KV bucket type level. Why do you think it is read-only?
The expiration flag at the global level is the master on/off switch. If it is off, no expiry is operational in any situation. This lets the user easily disable all expiry in an emergency. The expiration flag at the bucket / table level is read/write and provides the same "switch" control at the bucket level, but "on" at the bucket level is overruled by "off" at the global level.
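The precedence described in this comment can be sketched as a small truth table. The function and parameter names below are hypothetical and only illustrate the rule that the global flag is a master switch; they are not Riak or LevelDB identifiers.

```python
def expiry_active(global_expiration: bool, bucket_expiration: bool) -> bool:
    """Return True only when expiry is on at both levels.

    Hypothetical helper: "on" at the bucket level is overruled by
    "off" at the global level, so the result is a logical AND.
    """
    return global_expiration and bucket_expiration


# Print every combination of the two switches.
for global_flag in (True, False):
    for bucket_flag in (True, False):
        state = "active" if expiry_active(global_flag, bucket_flag) else "inactive"
        print(f"global={global_flag!s:5} bucket={bucket_flag!s:5} -> {state}")
```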
- Setting `expiration_mode` to `per_item` will require LevelDB to do a compaction on the data file to remove expired records.

While Sweeper operates using scheduled folds across the data, TS's Expiry uses LevelDB to do expiry. Tombstoning happens whenever someone requests expired data via Get or Query operations, and deletion occurs during regular compactions.
I think tombstones are written when a DEL operation is performed. However, I thought that if expiration is performed by the backend, it does not involve writing tombstones, but rather deletes the objects directly. Am I wrong?
If LevelDB comes across an expired object during a get or put, it will write a LevelDB tombstone, which gets removed during compaction. Sweeper uses Riak tombstones, which can be seen by MDC/etc. I'll update this in the next revision.
If LevelDB comes across an expired object during a Get or Iterate (fold), the existing object becomes a tombstone. No rewrites. The same expired object quietly disappears per the rules for tombstones in compaction ... because after the expiry date, it is a tombstone.
@matthewvon Does LevelDB just see that expired object as a tombstone based on its expired TTL (check TTL every time)? Or does it swap in an actual LevelDB tombstone?
Sees the expired object as a tombstone. No swap.
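The "no swap" behaviour described in this thread can be illustrated with a small Python model: the read path checks the expiry time on every access and treats an expired object as a tombstone without rewriting it. The `StoredObject` type, the `expires_at_ms` field, and the `get` function are hypothetical names for this sketch, not LevelDB's internal API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class StoredObject:
    value: bytes
    # Hypothetical field: absolute expiry in milliseconds, None means no expiry.
    expires_at_ms: Optional[int]


def is_tombstone(obj: StoredObject, now_ms: int) -> bool:
    """An object past its expiry time is treated as a tombstone on read."""
    return obj.expires_at_ms is not None and now_ms >= obj.expires_at_ms


def get(store: Dict[str, StoredObject], key: str, now_ms: int) -> Optional[bytes]:
    """Return the value, or None if the key is missing or expired.

    The store is never modified here: the expired object stays in place
    until a later compaction step (not modelled) drops it.
    """
    obj = store.get(key)
    if obj is None or is_tombstone(obj, now_ms):
        return None
    return obj.value


store = {"k": StoredObject(b"v", expires_at_ms=1_000)}
print(get(store, "k", now_ms=999))    # still live
print(get(store, "k", now_ms=1_000))  # treated as a tombstone
print("k" in store)                   # the object itself was not rewritten
```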
###### Option 2 - Combine
Another (harder) option would be to combine the APIs, but this would necessitate rework on both ends. A combined bucket properties API would then look like:
I think we confuse TTL per object with TTL per table / bucket type. I do not see the need for TTL per bucket.
Sweeper works at the object level, LevelDB at the bucket type or global level, and Bitcask at the global level only.
I believe that having TTL per object does not require expiration mode to be 'on' at either the global or bucket type level.
Currently, TTL per object does require expiration mode to be "on" at global and bucket type. TTL by bucket type is designed for people that have different types of data segregated into buckets.
I've added a V2 revision here: 1b8692b#diff-a9cbdc847687d390e707fa7a0b119212
| LevelDB | `expiration_mode` | `expiration_mode` |

Notes:
- LevelDB's TTL time period setting changes from a string to an integer, which would change that setting in the other locations for LevelDB (see [LevelDB Expiry API](https://github.com/basho/leveldb/wiki/mv-bucket-expiry2#three-properties-three-names-each)). A value of `0` could be the new `unlimited`.
The value should be a string at the customer-facing level. We can of course implement it as an integer internally if that helps.
A value of zero was originally a synonym for unlimited. Then zero became the flag for TTL is off, but explicit expiry date is still active. 'unlimited' became MAX_UINT-2. Erik bitched about that, so now unlimited is an independent flag.
I've added a V2 revision here: 1b8692b#diff-a9cbdc847687d390e707fa7a0b119212
Notes:
- LevelDB's TTL time period setting changes from a string to an integer, which would change that setting in the other locations for LevelDB (see [LevelDB Expiry API](https://github.com/basho/leveldb/wiki/mv-bucket-expiry2#three-properties-three-names-each)). A value of `0` could be the new `unlimited`.
- We would need a new `sweeper` value for the existing LevelDB `expiration_mode` enum for the overlap case. Sweeper would then need to check for this and for the `expiration` boolean before sweeping.
Sweeper sweeps for more than expiration. Actually, if Sweeper "sees" that a bucket type uses LevelDB and has expiration set to on, it should not perform expiration. It can still sweep the content, for example to create the cached list of existing buckets or to run other background tasks.
To be clear, if the bucket is stored in a LevelDB backend and has LevelDB expiry turned on, Sweeper should (perhaps) not expire the object... this again argues for making sure we understand the different use cases and making sure that an end user can use one, or both, of the expiration options depending on their particular needs. I could imagine using Sweeper expiry in cases where MDC was used, but also using LevelDB expiry as a "fallback" mechanism (or to eliminate the need for the tombstone reaper part of Sweeper - simply write the tombstone with a LevelDB expiry of the tombstone grace period and let LevelDB compact it away when it's supposed to go away, for example).
@angrycub What do you see as a proper use case when the two overlap?
I've added a V2 revision here: 1b8692b#diff-a9cbdc847687d390e707fa7a0b119212
Related: basho/riak_kv#1642
##### 1\. Riak KV Sweeper Expiry

With the upcoming Sweeper changes to Riak KV 2.5, a user can set a per-object TTL through object or bucket properties. Whenever sweeps are done, an expiry module for Sweeper can check and expire objects.
One thing possibly worth mentioning here: if a client tries to get an expired object, Riak will notice that the object is expired and will return `not_found`, even if the object has not been swept away yet. The sweep is what actually deletes the expired objects, but to the user it will appear that objects expire exactly when they are set to expire, even if the sweeper hasn't gotten around to actually deleting them yet.
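A rough Python model of the behaviour described in this comment: reads hide expired objects immediately, while a separate sweep pass removes them later. The names `get`, `sweep`, and `NOT_FOUND` are hypothetical, and the sketch simply drops entries from a dictionary; elsewhere in this thread it is noted that Sweeper actually uses Riak tombstones, which this model does not attempt to reproduce.

```python
from typing import Dict, Optional, Tuple

NOT_FOUND = "not_found"

# key -> (value, absolute expiry in seconds, or None for no expiry)
Objects = Dict[str, Tuple[bytes, Optional[int]]]


def _expired(expires_at_s: Optional[int], now_s: int) -> bool:
    return expires_at_s is not None and now_s >= expires_at_s


def get(objects: Objects, key: str, now_s: int):
    """Return the value, or NOT_FOUND if the key is missing or already expired."""
    entry = objects.get(key)
    if entry is None:
        return NOT_FOUND
    value, expires_at_s = entry
    return NOT_FOUND if _expired(expires_at_s, now_s) else value


def sweep(objects: Objects, now_s: int) -> int:
    """Remove every expired entry and return how many were removed."""
    expired_keys = [k for k, (_, exp) in objects.items() if _expired(exp, now_s)]
    for key in expired_keys:
        del objects[key]
    return len(expired_keys)


objects: Objects = {"session": (b"data", 100), "config": (b"keep", None)}
print(get(objects, "session", now_s=150))  # hidden from the client before any sweep
print(len(objects))                        # the expired entry is still stored
print(sweep(objects, now_s=150))           # the sweep is what removes it
print(len(objects))
```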