VRL function to normalize size units like MiB, KiB, etc. to bytes #49

Closed
jblang opened this issue Jun 28, 2022 · 5 comments · Fixed by #1198

Labels
type: feature A value-adding code addition that introduce new functionality. vrl: stdlib Changes to the standard library

Comments

@jblang

jblang commented Jun 28, 2022

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

I frequently come across log messages containing sizes using human-readable units such as MB, KB, GB, or MiB, KiB, GiB. Normalizing these to bytes is a prerequisite for comparison and statistical analysis.

Attempted Solutions

I can do this using an if statement, but it adds a lot of repetitive code to every transformation where I need it.

Proposal

Add a normalize_bytes function to VRL that takes a string specifying a size in human-readable capacity units and returns an integer number of bytes. The function should:

  • Accept a floating point or integer number for the numeric portion
  • Accept and convert all of the common abbreviations for a particular capacity unit:
    • K, KB, KiB - kilobytes
    • M, MB, MiB - megabytes
    • G, GB, GiB - gigabytes
    • T, TB, TiB - terabytes
    • And so on...
  • Round any fractional bytes obtained from the conversion up to the nearest integer
  • Assume binary units by default (multiples of 1024 instead of 1000)
  • Have an optional parameter (binary) that when set to false uses multiples of 1000 instead of 1024
  • Accept optional whitespace between the number and the unit
  • Treat the units case-insensitively
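
To make the requirements above concrete, here is a minimal sketch of the proposed behaviour, written in Python rather than VRL so it can be run standalone. It is not the eventual VRL implementation. The regex, the error handling, and the decision to stop at terabytes are assumptions made for illustration; only the name normalize_bytes and the binary parameter come from the proposal. Exact rational arithmetic is used so that rounding up is not thrown off by floating-point error.

```python
import math
import re
from fractions import Fraction

# Exponent of the multiplier for each unit prefix listed in the proposal.
# Larger units ("and so on...") are left out of this sketch.
_EXPONENTS = {"k": 1, "m": 2, "g": 3, "t": 4}

# number, optional whitespace, then an optional unit such as K, KB, KiB, B or bytes
_SIZE_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(?:([kmgt])i?b?|bytes|b)?\s*$",
    re.IGNORECASE,
)


def normalize_bytes(text: str, binary: bool = True) -> int:
    """Convert a human-readable size such as '1.5 MiB' to a whole number of bytes."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, prefix = match.groups()
    # Binary units (multiples of 1024) are the default; binary=False uses 1000.
    base = 1024 if binary else 1000
    exponent = _EXPONENTS[prefix.lower()] if prefix else 0
    # Fraction parses the decimal string exactly; ceil rounds fractional bytes up.
    return math.ceil(Fraction(number) * base**exponent)
```

With these assumptions, normalize_bytes("1.5 MB") treats MB as a binary unit, and normalize_bytes("1.5 MB", binary=False) treats it as a decimal one.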

References

No response

Version

0.22.2

@jblang jblang added the type: feature A value-adding code addition that introduce new functionality. label Jun 28, 2022
@jszwedko jszwedko added the vrl: stdlib Changes to the standard library label Jun 28, 2022
@hhromic
Contributor

hhromic commented Jun 29, 2022

This is a very nice idea! However I would make sure to use the correct unit definitions and symbols.

From https://en.wikipedia.org/wiki/Byte#Multiple-byte_units

More than one system exists to define larger units based on the byte. Some systems are based on powers of 10; other systems are based on powers of 2. Nomenclature for these systems has been the subject of confusion. Systems based on powers of 10 reliably use standard SI prefixes (kilo, mega, giga, ...) and their corresponding symbols (k, M, G, ...). Systems based on powers of 2, however, might use binary prefixes (kibi, mebi, gibi, ...) and their corresponding symbols (Ki, Mi, Gi, ...) or they might use the prefixes K, M, and G, creating ambiguity.

The proposed binary parameter should then be named more precisely, e.g. base=[2,10] or units=[binary,si], and should apply only to the ambiguous unit names. For example, MiB should always be converted using powers of 2, while MB should be converted according to the given base/units parameter.
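
The rule suggested here can be sketched as follows, again in Python for illustration only. The function name size_to_bytes, the regex, and the default of base=2 (taken from the original proposal) are assumptions, not part of any existing VRL API. The point of the sketch is the branch: a unit spelled with an "i" is always binary, and only the ambiguous spellings consult the base parameter.

```python
import math
import re
from fractions import Fraction

_EXPONENTS = {"k": 1, "m": 2, "g": 3, "t": 4}

# Group 2 is the prefix letter, group 3 is "i" for the unambiguous binary spellings.
_SIZE_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(?:([kmgt])(i?)b?|b)?\s*$",
    re.IGNORECASE,
)


def size_to_bytes(text: str, base: int = 2) -> int:
    """Convert a size string to bytes; `base` only affects ambiguous units."""
    if base not in (2, 10):
        raise ValueError("base must be 2 or 10")
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, prefix, infix = match.groups()
    if prefix is None:
        # Plain bytes: no multiplier to apply.
        return math.ceil(Fraction(number))
    # KiB, MiB, ... are always powers of 2; K, KB, M, MB, ... follow `base`.
    step = 1024 if (infix or base == 2) else 1000
    return math.ceil(Fraction(number) * step ** _EXPONENTS[prefix.lower()])
```

Under this rule, size_to_bytes("1 MiB", base=10) still uses 1024 as the step, while size_to_bytes("1 MB", base=10) uses 1000.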

@jblang
Author

jblang commented Jun 29, 2022

It's much more important to me that the feature exist at all so I don't have strong opinions about what to call the parameter. I do still think the base should always default to 2, because that's what most people, other than hard drive marketeers, mean whether they include an i in the abbreviation or not. Also it should be overridable regardless. Even though it might make one's inner pedant cringe to use base 10 for a unit that is specifically defined as binary, the added flexibility is not a bad thing.

@jblang
Author

jblang commented Jul 29, 2022

I figured out a workaround for this until a function is added to VRL. For any fields I want to be normalized, I put them in a nested object called .normbytes. For example:

transforms:
  output_large_partitions:
    type: remap
    inputs:
      - route.large_partition
    source: |-
      . |= parse_regex!(.message, r'Writing large partition (?P<keyspace>[^/]+)/(?P<table>[^:]+):(?P<partition_key>.+?) \((?P<size>[^) ]+)\)?( bytes)? to sstable (?P<sstable>[^)]*)\)?')
      .event_type = "large_partition"
      .normbytes.size = del(.size)

Then at the end of my pipeline I have this transform, so that any events with a .normbytes field will get the subfields normalized and then attached to the top-level event:

  normalize_bytes:
    type: remap
    inputs:
      - output_*
      - route*._unmatched
    source: |-
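      if exists(.normbytes) {
        # Detach the staging object so it is not emitted with the event
        normbytes = object!(del(.normbytes))
        for_each(normbytes) -> |key, value| {
          # Split e.g. "1.5 GiB" into the number and the unit prefix (K, M, G or T)
          p = parse_regex!(value, r'(?P<value>[\d.]+) *((?P<unit>[KMGT])iB|bytes)?')
          value = to_float!(p.value)
          # Binary multiplier; a missing unit or "bytes" falls through to 1
          factor =
            if p.unit == "K" {
              1024
            } else if p.unit == "M" {
              1048576
            } else if p.unit == "G" {
              1073741824
            } else if p.unit == "T" {
              1099511627776
            } else {
              1
            }
          # Round fractional bytes up and attach the result to the top-level event
          value = ceil(value * factor)
          . = set!(., [key], value)
        }
      }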

I think this technique is general enough to allow for creating any user-defined functions needed.
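
For readers who want to try the staging pattern outside Vector, the same logic can be mirrored in Python. This is a rough sketch only: the function name hoist_normbytes and the sample values in the usage note are hypothetical, and the event is modelled as a plain dict. The regex and the multipliers are copied from the transform above.

```python
import math
import re

# Same pattern as the VRL transform: a number, optional spaces, then KiB/MiB/GiB/TiB or "bytes".
_PATTERN = re.compile(r"(?P<value>[\d.]+) *((?P<unit>[KMGT])iB|bytes)?")
_FACTORS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def hoist_normbytes(event: dict) -> dict:
    """Normalize every field staged under 'normbytes' and move it to the top level."""
    staged = event.pop("normbytes", None)
    if staged is None:
        return event
    for key, raw in staged.items():
        parsed = _PATTERN.search(raw)
        if parsed is None:
            raise ValueError(f"unrecognized size: {raw!r}")
        # The unit group is None when there is no unit or the unit is "bytes".
        factor = _FACTORS.get(parsed["unit"], 1)
        event[key] = math.ceil(float(parsed["value"]) * factor)
    return event
```

For example, an event carrying {"normbytes": {"size": "1.5 GiB"}} comes back with a top-level size field in bytes and no normbytes field.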

@fuchsnj fuchsnj transferred this issue from vectordotdev/vector Mar 28, 2023
@PetrHeinz

I would propose calling it parse_bytes, similar to the existing parse_duration (vectordotdev/vector#4186).

I'd also like to add that I've seen byte amounts formatted as, e.g., 123.45 Mi.

@titaneric
Copy link
Contributor

I am interested in implementing this feature!
