VRL function to normalize size units like MiB, KiB, etc. to bytes #49

jblang · 2022-06-28T19:19:40Z

A note for the community

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

I frequently come across log messages containing sizes using human-readable units such as MB, KB, GB, or MiB, KiB, GiB. Normalizing these to bytes is a prerequisite for comparison and statistical analysis.

Attempted Solutions

I can do this using an if statement, but it adds a lot of repetitive code to every transformation where I need it.

Proposal

Add a normalize_bytes function to VRL that takes a string specifying a size in human-readable capacity units and returns an integer number of bytes. The function should:

- Accept a floating point or integer number for the numeric portion
- Accept and convert all of the common abbreviations for a particular capacity unit:
  - K, KB, KiB - kilobytes
  - M, MB, MiB - megabytes
  - G, GB, GiB - gigabytes
  - T, TB, TiB - terabytes
  - And so on...
- Round any fractional bytes obtained from the conversion up to the nearest integer
- Assume binary units by default (multiples of 1024 instead of 1000)
- Have an optional parameter (binary) that when set to false uses multiples of 1000 instead of 1024
- Accept optional whitespace between the number and the unit
- Treat the units case-insensitively
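
As a rough illustration of the behaviour listed above, here is a minimal reference sketch in Python (not VRL, and not an existing implementation). The name normalize_bytes and the binary parameter come from the proposal. The P and E prefixes standing in for "And so on...", the acceptance of a bare B unit, and the use of Fraction to keep the round-up exact are assumptions made for this sketch.

```python
import math
import re
from fractions import Fraction

# Exponent applied to the base (1024 or 1000) for each prefix.
# P and E are an assumed reading of "And so on..." in the proposal.
_EXPONENTS = {"": 0, "K": 1, "M": 2, "G": 3, "T": 4, "P": 5, "E": 6}

# Number, optional whitespace, then an optional unit: K / KB / KiB style, or a bare B.
_SIZE_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(?:([KMGTPE])(?:i?B)?|B)?\s*$",
    re.IGNORECASE,
)


def normalize_bytes(text, binary=True):
    """Convert a human-readable size string to an integer number of bytes."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number = match.group(1)
    prefix = (match.group(2) or "").upper()
    base = 1024 if binary else 1000
    factor = base ** _EXPONENTS[prefix]
    # Fraction parses the decimal string exactly, so the ceiling is not
    # thrown off by binary floating-point error (e.g. 1.1 * 1000).
    return math.ceil(Fraction(number) * factor)
```

With this sketch, normalize_bytes("1.5 KiB") gives 1536 and normalize_bytes("1.1 kb", binary=False) gives 1100. Note that the binary flag here applies to every unit, including KiB, which is the behaviour the proposal describes.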

References

No response

Version

0.22.2


hhromic · 2022-06-29T09:38:15Z

This is a very nice idea! However, I would make sure to use the correct unit definitions and symbols.

From https://en.wikipedia.org/wiki/Byte#Multiple-byte_units

More than one system exists to define larger units based on the byte. Some systems are based on powers of 10; other systems are based on powers of 2. Nomenclature for these systems has been the subject of confusion. Systems based on powers of 10 reliably use standard SI prefixes (kilo, mega, giga, ...) and their corresponding symbols (k, M, G, ...). Systems based on powers of 2, however, might use binary prefixes (kibi, mebi, gibi, ...) and their corresponding symbols (Ki, Mi, Gi, ...) or they might use the prefixes K, M, and G, creating ambiguity.

The proposed binary parameter should then be more precisely named base=[2,10] or units=[binary,si] and used only for the ambiguous unit names. That is, for example, MiB should always be converted using powers of 2, and MB should be converted depending on the given base/units parameter.
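
The rule described here can be sketched as a small lookup. This is a Python illustration under stated assumptions, not VRL's API: unit_multiplier is a hypothetical helper name, the base parameter mirrors the base=[2,10] suggestion, and matching is case-insensitive to follow the original proposal.

```python
# Exponent applied to the step (1024 or 1000) for each prefix.
_EXPONENTS = {"": 0, "K": 1, "M": 2, "G": 3, "T": 4}


def unit_multiplier(unit, base=2):
    """Return the byte multiplier for a unit symbol such as "MB", "MiB" or "Mi".

    Units with an explicit binary prefix (Ki, Mi, Gi, Ti) always use 1024.
    Ambiguous units (K, KB, M, MB, ...) follow the base argument.
    """
    if base not in (2, 10):
        raise ValueError("base must be 2 or 10")
    symbol = unit.strip().upper()
    if symbol.endswith("B"):
        symbol = symbol[:-1]          # drop the trailing "B" for bytes
    explicit_binary = symbol.endswith("I")
    if explicit_binary:
        symbol = symbol[:-1]          # drop the "i" of Ki/Mi/Gi/Ti
    if symbol not in _EXPONENTS or (explicit_binary and symbol == ""):
        raise ValueError(f"unrecognized unit: {unit!r}")
    step = 1024 if explicit_binary or base == 2 else 1000
    return step ** _EXPONENTS[symbol]
```

Under this sketch, unit_multiplier("MiB", base=10) is still 1048576, while unit_multiplier("MB", base=10) is 1000000 and unit_multiplier("MB") is 1048576.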

jblang · 2022-06-29T12:52:19Z

It's much more important to me that the feature exist at all, so I don't have strong opinions about what to call the parameter. I do still think the base should always default to 2, because that's what most people, other than hard drive marketeers, mean whether they include an i in the abbreviation or not. Also, it should be overridable regardless. Even though it might make one's inner pedant cringe to use base 10 for a unit that is specifically defined as binary, the added flexibility is not a bad thing.

jblang · 2022-07-29T17:38:52Z

I figured out a workaround for this until a function is added to VRL. For any fields I want to be normalized, I put them in a nested object called .normbytes. For example:

transforms:
  output_large_partitions:
    type: remap
    inputs:
      - route.large_partition
    source: |-
      . |= parse_regex!(.message, r'Writing large partition (?P<keyspace>[^/]+)/(?P<table>[^:]+):(?P<partition_key>.+?) \((?P<size>[^) ]+)\)?( bytes)? to sstable (?P<sstable>[^)]*)\)?')
      .event_type = "large_partition"
      .normbytes.size = del(.size)

Then at the end of my pipeline I have this transform, so that any events with a .normbytes field will get the subfields normalized and then attached to the top-level event:

  normalize_bytes:
    type: remap
    inputs:
      - output_*
      - route*._unmatched
    source: |-
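      if exists(.normbytes) {
        normbytes = object!(del(.normbytes))
        for_each(normbytes) -> |key, value| {
          # Split the number from an optional KiB/MiB/GiB/TiB or "bytes" suffix
          p = parse_regex!(value, r'(?P<value>[\d.]+) *((?P<unit>[KMGT])iB|bytes)?')
          value = to_float!(p.value)
          # Binary multiplier for the matched prefix; 1 when no prefix matched
          factor =
            if p.unit == "K" {
              1024
            } else if p.unit == "M" {
              1048576
            } else if p.unit == "G" {
              1073741824
            } else if p.unit == "T" {
              1099511627776
            } else {
              1
            }
          # Round fractional bytes up and attach the result to the top-level event
          value = ceil(value * factor)
          . = set!(., [key], value)
        }
      }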

I think this technique is general enough to allow for creating any user-defined functions needed.
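
To see which inputs the normalization transform above handles, here is a Python port of its regex and factor table. It is a sketch only: workaround_bytes is a hypothetical name, and it assumes Python's re module and VRL's regex engine agree on this pattern and that an unmatched unit group falls through to the factor of 1.

```python
import math
import re

# Same pattern as the transform: unanchored, with an optional KiB/MiB/GiB/TiB
# or "bytes" suffix after the number.
PATTERN = re.compile(r"(?P<value>[\d.]+) *((?P<unit>[KMGT])iB|bytes)?")

FACTORS = {"K": 1024, "M": 1048576, "G": 1073741824, "T": 1099511627776}


def workaround_bytes(text):
    """Mirror the transform: parse the size, scale it, round up."""
    match = PATTERN.search(text)
    if match is None:
        raise ValueError(f"no size found in {text!r}")
    # group("unit") is None when no binary suffix matched, giving a factor of 1.
    factor = FACTORS.get(match.group("unit"), 1)
    return math.ceil(float(match.group("value")) * factor)
```

In this port, "1.5KiB" becomes 1536 and "117 bytes" stays 117. A value such as "12.5MB" is not rejected: the optional suffix does not match, so it is treated as 12.5 bytes and rounded up to 13. That is worth keeping in mind if logs mix MB and MiB.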

PetrHeinz · 2023-07-24T15:19:59Z

I would propose to call it parse_bytes, similar to existing parse_duration (vectordotdev/vector#4186).

Also, I'd like to add that I've seen byte amounts formatted as, e.g., 123.45 Mi.

titaneric · 2024-12-28T05:41:58Z

I am interested in implementing this feature!

jblang added the type: feature A value-adding code addition that introduce new functionality. label Jun 28, 2022

jszwedko added the vrl: stdlib Changes to the standard library label Jun 28, 2022

fuchsnj transferred this issue from vectordotdev/vector Mar 28, 2023

jszwedko mentioned this issue Jul 24, 2023

New parse_bytes remap function vectordotdev/vector#18073

Closed

titaneric mentioned this issue Dec 28, 2024

feat(stdlib) Add new parse_bytes function #1198

Merged

11 tasks

pront closed this as completed in #1198 Jan 6, 2025
