Unnest values in CSV files.
Note: this repository is a stub, a proposal, an idea. Perhaps someday the idea will be implemented.
input.csv file:
country,cities
United States,"New York,Atlanta,Los Angeles"
Germany,"Berlin"
Russia,"Moscow,Novosibirsk"
Command:
csvunnest --column cities --delimiter ',' < input.csv > output.csv
which writes the following to the output.csv file:
country,cities
United States,New York
United States,Atlanta
United States,Los Angeles
Germany,Berlin
Russia,Moscow
Russia,Novosibirsk
CSV files very often contain multiple values in a single cell, separated by a comma (as shown above), a space, or a pipe (|). Unnesting means duplicating the row so that each copy contains exactly one of the values that were previously nested.
This makes it possible to analyse the data with other CSV-capable tools.
- Clean up incoming CSV data for better normalization
- Prepare reports by converting one CSV schema to another
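The unnesting step described above can be sketched in a few lines of Rust. This is only an illustration of the idea, not the tool's implementation: the function name unnest_row and its signature are hypothetical, and the row is assumed to be already parsed into fields, so quoting and escaping are out of scope here.

```rust
/// Hypothetical helper: split the cell at index `column` on `delimiter`
/// and return one copy of the row per nested value.
fn unnest_row(row: &[String], column: usize, delimiter: char) -> Vec<Vec<String>> {
    // Assumption: a row too short to contain the column is passed through unchanged.
    let Some(cell) = row.get(column) else {
        return vec![row.to_vec()];
    };

    cell.split(delimiter)
        .map(|value| {
            // Duplicate the whole row, then overwrite the nested cell
            // with a single value.
            let mut copy = row.to_vec();
            copy[column] = value.to_string();
            copy
        })
        .collect()
}
```

Applied to the row United States / "New York,Atlanta,Los Angeles" with column index 1 and delimiter ',', this returns three rows, one per city. A cell holding a single value, such as Berlin, yields the row unchanged.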
This section is transient and should materialize as a series of concrete tasks.
- Should be written in the Rust programming language.
- Must read the source file line by line, without loading the whole file into RAM.
- For performance, reading and writing should happen in separate threads communicating via a buffered queue.
- Possibly useful libraries:
  - the csv crate to read and write data,
  - the clap crate for command line arguments.
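One possible shape for the reader/writer split is sketched below, using only the standard library. It is a sketch under assumptions rather than a design decision: run_pipeline, capacity, and transform are hypothetical names, and the input is treated as plain text lines. A real implementation would more likely pass parsed records from the csv crate, because a quoted CSV field may itself contain delimiters or line breaks.

```rust
use std::io::{self, BufRead, Write};
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical pipeline: a reader thread streams lines into a bounded
/// channel, and the calling thread transforms and writes them.
fn run_pipeline<R, W, F>(
    input: R,
    mut output: W,
    capacity: usize,
    mut transform: F,
) -> io::Result<()>
where
    R: BufRead + Send + 'static,
    W: Write,
    F: FnMut(&str) -> Vec<String>,
{
    // The bounded channel is the "queue with buffer": at most `capacity`
    // lines are held in memory between the two threads.
    let (tx, rx) = sync_channel::<io::Result<String>>(capacity);

    let reader = thread::spawn(move || {
        for line in input.lines() {
            // `send` blocks while the buffer is full and fails only
            // when the receiving side has gone away.
            if tx.send(line).is_err() {
                break;
            }
        }
    });

    // The loop ends once the reader thread finishes and drops `tx`.
    for line in rx {
        for out_line in transform(&line?) {
            writeln!(output, "{out_line}")?;
        }
    }

    reader.join().expect("reader thread panicked");
    output.flush()
}
```

With transform set to a function that unnests one line into several, memory use stays bounded by capacity regardless of the input size, and a slow writer simply makes the reader wait.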