Whole CSV binary documents can be decoded with decode/1,2.
decode/1 assumes default RFC4180-style options, that is:
- Fields are separated by commas.
- Fields are optionally enclosed in double quotes.
- Double quotes in enclosed fields are quoted by another double quote.
decode/2 allows using custom options:
#{separator => Separator, % any byte except $\r or $\n (default $,)
  enclosure => Enclosure, % 'undefined' or any byte except $\r or $\n (default $")
  quote => Quote}         % 'undefined', 'enclosure', or any byte except $\r or $\n (default 'enclosure')
Restrictions for option combinations:
- If Enclosure is undefined (i.e., no enclosing), Quote must be either enclosure or undefined.
- If Enclosure is not undefined, Quote must also not be undefined.
- If Enclosure is not undefined, it must not be the same as Separator.
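The restrictions above can be written down as a small check. The following is only a sketch: decode_opts_check and valid/1 are hypothetical names that are not part of hnc_csv, the function expects all three options to be present in the map, and it covers only the rules listed here, not any further validation the library may perform.

```erlang
-module(decode_opts_check).
-export([valid/1]).

%% Hypothetical helper, not part of hnc_csv: returns true if the given
%% decode options satisfy the restrictions listed in the text.
valid(#{separator := Sep, enclosure := Enc, quote := Quote}) ->
    valid_byte(Sep)
        andalso valid_enclosure(Enc, Sep)
        andalso valid_quote(Enc, Quote).

%% Any byte except $\r or $\n.
valid_byte(B) ->
    is_integer(B) andalso B >= 0 andalso B =< 255
        andalso B =/= $\r andalso B =/= $\n.

%% 'undefined', or a valid byte that differs from the separator.
valid_enclosure(undefined, _Sep) -> true;
valid_enclosure(Enc, Sep) -> valid_byte(Enc) andalso Enc =/= Sep.

%% Without an enclosure, the quote must be 'enclosure' or 'undefined'.
valid_quote(undefined, Quote) ->
    Quote =:= enclosure orelse Quote =:= undefined;
%% With an enclosure, the quote must not be 'undefined'.
valid_quote(_Enc, undefined) -> false;
valid_quote(_Enc, enclosure) -> true;
valid_quote(_Enc, Quote) -> valid_byte(Quote).
```

For example, the default options pass the check, while an enclosure that equals the separator does not.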
Lines are separated by \r, \n or \r\n. Empty lines are ignored by the decoder.
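To make the line handling concrete, the following sketch feeds decode/1 a binary that mixes all three line endings and contains one empty line. The module and function names are made up for illustration. Going by the description above, the result should be the four lines a,b / c,d / e,f / g,h with the empty line dropped, but this has not been verified against the library.

```erlang
-module(line_endings_example).
-export([decode_mixed/0]).

%% Hypothetical example: the input uses \r\n, \n and \r as line endings.
%% The two consecutive \n bytes enclose an empty line.
decode_mixed() ->
    hnc_csv:decode(<<"a,b\r\nc,d\n\ne,f\rg,h\r\n">>).
```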
The result of decoding is a list of CSV lines, which are lists of CSV fields, which are in turn binaries representing the field values on the respective line.
Assume the following CSV data:
a,b,c
"d,d","e""e","f
f"
In an Erlang binary, this will look like:
1> CsvBinary = <<"a,b,c\r\n\"d,d\",\"e\"\"e\",\"f\r\nf\"\r\n">>.
<<"a,b,c\r\n\"d,d\",\"e\"\"e\",\"f\r\nf\"\r\n">>
Decoded with decode/1, this will become:
2> hnc_csv:decode(CsvBinary).
[[<<"a">>,<<"b">>,<<"c">>],
[<<"d,d">>,<<"e\"e">>,<<"f\r\nf">>]]
hnc_csv provides the functions decode_fold/3,4, decode_filter/2,3, decode_map/2,3, decode_filtermap/2,3 and decode_foreach/2,3, which allow decoding and processing decoded lines in one operation, much like the lists functions foldl/3, filter/2, map/2, filtermap/2 and foreach/2.
In fact, decode/1,2 is implemented via decode_fold/3,4.
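As an illustration of that relationship, decode/1 could be re-created on top of decode_fold/3 roughly as follows. This is a sketch and not the library's actual source. The module name decode_via_fold is hypothetical, and the argument order of decode_fold/3 (data, function, initial accumulator) is taken from the usage example in this document.

```erlang
-module(decode_via_fold).
-export([decode/1]).

%% Hypothetical re-creation of decode/1 on top of decode_fold/3.
%% Each decoded line is prepended to the accumulator, so the
%% accumulated list is reversed at the end to restore document order.
decode(BinOrProvider) ->
    Lines = hnc_csv:decode_fold(BinOrProvider,
                                fun(Line, Acc) -> [Line | Acc] end,
                                []),
    lists:reverse(Lines).
```

The other functions of the family can be thought of in the same way, with the fold function filtering or transforming each line before it is accumulated.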
The decode family of functions accepts either a raw binary or a Provider that delivers chunks of raw binary. When given a raw binary, the functions convert it into a binary provider for further processing.
A provider is a 0-arity function which, when called, returns either a tuple where the first element is a chunk of binary data and the second is a new provider function for the next chunk of data, or the atom end_of_data to indicate that the provider has delivered all data.
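The provider contract described above can be exercised without any decoding at all. The following sketch collects every chunk a provider delivers. provider_util and drain/1 are hypothetical names introduced for illustration and are not part of hnc_csv.

```erlang
-module(provider_util).
-export([drain/1]).

%% Hypothetical helper: calls a provider repeatedly until it returns
%% end_of_data and returns all delivered chunks in order.
drain(Provider) when is_function(Provider, 0) ->
    drain(Provider, []).

drain(Provider, Acc) ->
    case Provider() of
        end_of_data ->
            lists:reverse(Acc);
        {Chunk, NextProvider} when is_binary(Chunk) ->
            drain(NextProvider, [Chunk | Acc])
    end.
```

A helper like this can be handy for checking a custom provider in isolation before handing it to one of the decoding functions.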
Providers can be implemented stateless or stateful, usually depending on the characteristics of the underlying data source.
A stateless provider does not change and is not susceptible to external changes to the state of the underlying data source.
A stateful provider, on the other hand, may change, be susceptible to changes to the state of the underlying data source, or both. It is recommended not to use or reuse a stateful provider or its underlying data source before, during or after its use in decoding functions, except for any necessary setup beforehand or cleanup afterwards.
hnc_csv comes with two convenience functions, get_binary_provider/1,2 (stateless) and get_file_provider/1,2 (stateful), which return providers for binaries or files, respectively.
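To see why a file-backed provider is stateful, consider the following sketch built on the standard file module. It is explicitly not the implementation of get_file_provider/1,2. The module name, the function get_provider/2 and its ChunkSize parameter are assumptions made for illustration.

```erlang
-module(file_provider_sketch).
-export([get_provider/2]).

%% Hypothetical file-backed provider; NOT the implementation of
%% hnc_csv:get_file_provider/1,2. The file handle is the mutable state:
%% every call advances its read position.
get_provider(Path, ChunkSize) when is_integer(ChunkSize), ChunkSize > 0 ->
    {ok, Fd} = file:open(Path, [read, binary, raw]),
    fun() -> read_chunk(Fd, ChunkSize) end.

read_chunk(Fd, ChunkSize) ->
    case file:read(Fd, ChunkSize) of
        {ok, Data} ->
            %% The next provider shares the same handle.
            {Data, fun() -> read_chunk(Fd, ChunkSize) end};
        eof ->
            %% The handle is only closed once the file is exhausted.
            ok = file:close(Fd),
            end_of_data
    end.
```

Because every provider function returned here shares one file handle, calling the same provider twice yields different chunks, and reading from the file elsewhere while decoding would disturb the result. Note also that this sketch leaves the file open if decoding stops before the end of the file is reached.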
The following is an implementation of a (stateless) custom provider which delivers data taken from a given list of binaries:
-module(example_provider).
-export([get_list_provider/1]).

get_list_provider(L) ->
    fun() -> list_provider(L) end.

list_provider([]) ->
    end_of_data;
list_provider([Bin|More]) when is_binary(Bin) ->
    {Bin, fun() -> list_provider(More) end}.
get_list_provider/1 creates the initial provider, which is a call to list_provider/1 wrapped in a 0-arity function. list_provider/1 is the actual implementation of the provider. It returns end_of_data when the list given as argument is exhausted. Otherwise, it returns a tuple whose first element is the head of the list and whose second element is a call to itself with the tail of the list, wrapped in a 0-arity function.
This provider can then be used as follows, for example to count the lines and fields in the CSV data which the provider delivers:
1> Provider = example_provider:get_list_provider([<<"a,b">>, <<",c\r">>,
<<"\nd,">>, <<"e,f">>,
<<"\r\n">>]).
#Fun<example_provider.0.64990923>
2> hnc_csv:decode_fold(Provider,
fun(Line, {LCnt, FCnt}) -> {LCnt+1, FCnt+length(Line)} end,
{0, 0}).
{2,6}
For more complex scenarios than what the built-in functions provide for, the functions decode_init/0,1,2, decode_next_line/1 and decode_flush/1 can be used together to decode and process CSV documents incrementally.
- decode_init/0,1,2 creates a decoder state to be used in the other functions listed above.
- decode_next_line/1 decodes and returns the next line, together with an updated state. If the data in the provider backing the state is exhausted, the atom end_of_data is returned instead of a line.
- decode_flush/1 returns all lines in the given state that have not been read yet.
In fact, decode_fold/4 is implemented using those functions.
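One scenario where incremental decoding helps is reading only the first few lines of a document. The sketch below rests on two assumptions that the text above does not spell out: that decode_init/1 accepts a binary or provider, and that decode_next_line/1 returns a two-element tuple of either a line or end_of_data, followed by the updated state. The module name incremental_example and the function take_lines/2 are hypothetical. Check the function specs before relying on it.

```erlang
-module(incremental_example).
-export([take_lines/2]).

%% Reads at most N lines from the given binary or provider.
%% ASSUMPTIONS (not confirmed by the text): decode_init/1 accepts a
%% binary or provider, and decode_next_line/1 returns
%% {Line, NewState} or {end_of_data, NewState}.
take_lines(BinOrProvider, N) when is_integer(N), N >= 0 ->
    State = hnc_csv:decode_init(BinOrProvider),
    take_lines(State, N, []).

%% Stop as soon as enough lines have been collected.
take_lines(_State, 0, Acc) ->
    lists:reverse(Acc);
take_lines(State0, N, Acc) ->
    case hnc_csv:decode_next_line(State0) of
        {end_of_data, _State1} ->
            %% Fewer than N lines were available.
            lists:reverse(Acc);
        {Line, State1} ->
            take_lines(State1, N - 1, [Line | Acc])
    end.
```

Unlike the fold-style functions, this stops asking the provider for data once the requested number of lines has been decoded.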
CSV documents can be encoded with encode/1,2.
encode/1 assumes default RFC4180-style options, that is:
- Fields are separated by commas
- Fields are optionally enclosed in double quotes
- Double quotes in enclosed fields are quoted by another double quote
- Lines are separated by \r\n
encode/2 allows using custom options:
#{separator => Separator,   % any byte except $\r or $\n (default $,)
  enclosure => Enclosure,   % 'undefined' or any byte except $\r or $\n (default $")
  quote => Quote,           % 'undefined', 'enclosure', or any byte except $\r or $\n (default 'enclosure')
  enclose => Enclose,       % 'optional' (default), 'never' or 'always'
  end_of_line => EndOfLine} % <<"\r\n">> (default), <<"\n">> or <<"\r">>
Restrictions for option combinations:
- If Enclose is never (i.e., no enclosing), Enclosure must be undefined and Quote must be undefined or enclosure.
- If Enclose is optional or always, Enclosure and Quote must not be undefined.
- If Enclosure is not undefined, it must not be the same as Separator.
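A set of options that satisfies these restrictions might look like the following sketch, which produces semicolon-separated output with every field enclosed and \n as line ending. The argument order of encode/2 (lines first, options second) is an assumption, as are the module and function names. All five options are given explicitly so that the sketch does not depend on how omitted options are defaulted.

```erlang
-module(encode_opts_example).
-export([encode_semicolon/1]).

%% ASSUMPTION: encode/2 takes the lines first and the options map second.
%% Enclose is 'always', so Enclosure and Quote are both defined, and the
%% enclosure differs from the separator, as the restrictions require.
encode_semicolon(Lines) ->
    Opts = #{separator => $;,
             enclosure => $",
             quote => enclosure,
             enclose => always,
             end_of_line => <<"\n">>},
    hnc_csv:encode(Lines, Opts).
```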
The input for encoding is a list of CSV lines, which are in turn lists of CSV fields, which are in turn binaries representing the field values.
The result is a single binary containing the CSV document, in which each given line is encoded from its fields and terminated by the line separator.
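Since the input of encoding has the same shape as the result of decoding, the two can be combined into a round-trip check. The sketch below uses only encode/1 and decode/1 with their default options. roundtrip_example and roundtrip/1 are hypothetical names. For the example data used in this document, the documented outputs of decode/1 and encode/1 are inverses of each other, so the check is expected to return true there. Whether it holds for other inputs, such as lines without any fields, is not stated.

```erlang
-module(roundtrip_example).
-export([roundtrip/1]).

%% Encodes the given lines with the default options and decodes the
%% result again, returning true if the original structure is recovered.
roundtrip(Csv) ->
    Csv =:= hnc_csv:decode(hnc_csv:encode(Csv)).
```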
Assume the following CSV structure:
1> Csv = [[<<"a">>,<<"b">>,<<"c">>],
[<<"d,d">>,<<"e\"e">>,<<"f\r\nf">>]].
Encoded with encode/1, this will become:
2> hnc_csv:encode(Csv).
<<"a,b,c\r\n"
"\"d,d\",\"e\"\"e\",\"f\r\nf\"\r\n">>
- Maria Scott (Maria-12648430)
- Jan Uhlig (juhlig)