Support for NULL values #38
Comments
A brutish way to solve this is to permit the user to specify a regex that determines what values constitute A different way of thinking about this problem is: instead of teaching different parts of
I really like the idea of a new sub-command that would normalise One problem I often face when importing CSV files into MySQL is that the CSV files have different
But then it becomes a question of what should be the canonical
I'm open to use friendly heuristics.
Perhaps one option is to define the array of missing values for your CSV field in a schema format we're developing that Support for reading a CSV schema might also have carry-over effects for the other inference-related issues #22 and #28
I'm not too keen on going down the schema route. That seems like a lot of added complexity. The point is that there's a certain point at which
Merging in a related request: the ability to replace null values with something else. Basically, I want to be able to run xsv impute and have it replace blank values (or values with an out-of-bounds marker you can specify on the command line, such as "NA"). Obvious choices to fill in with would be a fixed value specified on the command line, or the same value as the previous cell in a row or column. Value: It's hard to do this using sed or such, when a "blank value" could be the strings ,$ ,""$ ,, ,"", and so on. I'm not sure how far one wants to go with replacement, since it would be easy to feature creep: "well why don't we add the option to fill in blanks with an average", "well why don't we add interpolation", "well why don't we let you put in an arbitrary equation", and so on. That is obviously not something xsv should care about. Notably though, for this use case just being able to specify a "null" to be a fixed string should be sufficient. If you have multiple different null values you can just have a pipeline of xsv calls that replaces each one with a fixed value. So if the "null" in your dataset could be an empty string, That way you don't need a "canonical" null character, you can translate any dataset to use whatever you consider canonical.
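A minimal sketch of the fixed-value replacement described above, written in Python rather than as an xsv command. The function name replace_nulls and its parameters are hypothetical, and nothing here reflects an existing xsv sub-command. It relies on a real CSV parser, so the different spellings of a blank cell (a trailing comma, an empty quoted string) all arrive as the same empty string before the replacement runs.

```python
import csv
import io


def replace_nulls(rows, null_markers, fill):
    """Yield rows with every cell that equals a null marker replaced by `fill`.

    `null_markers` and `fill` are hypothetical parameters for illustration.
    """
    markers = set(null_markers)
    for row in rows:
        yield [fill if cell in markers else cell for cell in row]


# A blank cell appears here as a trailing comma, as "", and as the marker NA.
data = 'id,score\n1,\n2,""\n3,NA\n4,7\n'
reader = csv.reader(io.StringIO(data))
header = next(reader)
cleaned = list(replace_nulls(reader, null_markers=["", "NA"], fill="NULL"))

for row in [header] + cleaned:
    print(",".join(row))
```

Passing several markers at once plays the same role as the pipeline of calls suggested above: each dataset can be translated to whatever null representation its consumer treats as canonical.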
@icefoxen Does this capture what you're after?
(What I'm trying to do is understand whether the idea I had previously is enough to satisfy your use case. If it is, great. If not, then that means I need to think a bit more about this!)
@BurntSushi Yeah, that sounds about right. The only thing missing from that would be the ability to say "replace value X with the cell above/to the left of it", which might be a more specific task than you want xsv to handle. Sorry for the repetition, obviously time for more coffee. :-)
@icefoxen Haha no worries, just wanted to be super clear. The idea of replacing a value is interesting. There is something close to that in
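To make the "cell above/to the left" request concrete, here is a rough Python sketch. The name fill_from_neighbor and the direction parameter are assumptions made for illustration; this is not something xsv provides. A null cell with no neighbor to copy from (first row, or first column) is left unchanged.

```python
def fill_from_neighbor(rows, null_markers, direction="above"):
    """Replace null cells with the value above or to the left of them.

    `direction` is a hypothetical option: "above" copies from the previous
    (already filled) row, "left" copies from the previous cell in the same row.
    """
    markers = set(null_markers)
    previous = None
    filled = []
    for row in rows:
        new_row = list(row)
        for i, cell in enumerate(new_row):
            if cell not in markers:
                continue
            if direction == "above" and previous is not None and i < len(previous):
                new_row[i] = previous[i]
            elif direction == "left" and i > 0:
                new_row[i] = new_row[i - 1]
        filled.append(new_row)
        previous = new_row
    return filled


rows = [["a", "1"], ["", "NA"], ["c", ""]]
print(fill_from_neighbor(rows, ["", "NA"], direction="above"))
print(fill_from_neighbor(rows, ["", "NA"], direction="left"))
```

Because each row is filled before it becomes the "previous" row, runs of consecutive nulls in a column all inherit the last real value seen above them.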
First, thanks for the great tool!
I was wondering if it would be possible to add some rudimentary support for NULL values?
I want to use xsv to get the column types for a CSV file before loading it into a database. This way I would know the correct column types when creating the database table. While xsv is very fast at calculating file column statistics, my files often have int / float columns with NULL values (e.g. "-", ".'", "\N", "NA"), and xsv reports those as Unicode. It would be nice to have an option like --na-values, so that xsv could skip NULL values when calculating statistics, and could report whether a column contains at least one NULL value.