Compact printing in cut #381

Open
bkamins opened this issue Feb 25, 2022 · 10 comments

@bkamins
Member

bkamins commented Feb 25, 2022

Currently we have:

julia> cut(1:10, 3)
10-element CategoricalArray{String,1,UInt32}:
 "Q1: [1.0, 4.0)"
 "Q1: [1.0, 4.0)"
 "Q1: [1.0, 4.0)"
 "Q2: [4.0, 6.999999999999999)"
 "Q2: [4.0, 6.999999999999999)"
 "Q2: [4.0, 6.999999999999999)"
 "Q3: [6.999999999999999, 10.0]"
 "Q3: [6.999999999999999, 10.0]"
 "Q3: [6.999999999999999, 10.0]"
 "Q3: [6.999999999999999, 10.0]"

which is not nice. It would be better to use compact printing.
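
For illustration, a minimal sketch in Base Julia only (no `cut` call). It assumes "compact printing" means showing a value with the `:compact => true` IO context property; the `compact` helper name is made up here and is not part of CategoricalArrays:

```julia
# `compact` is a hypothetical helper: show a value with the :compact IO property set.
compact(x) = sprint(show, x; context = :compact => true)

# Breaks taken from the output above.
breaks = [1.0, 4.0, 6.999999999999999, 10.0]
println(join(compact.(breaks), ", "))

# Caveat: compact printing is lossy, so two distinct but very close values
# can end up with the same text.
a, b = 1.0, 1.0000000000033333
println(compact(a) == compact(b))
```

So compact output is shorter, but labels built from it are not guaranteed to stay distinct.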

Though we should make sure to handle this case correctly:

julia> cut(1:10^-12:1+10^-11, 3)
11-element CategoricalArray{String,1,UInt32}:
 "Q1: [1.0, 1.0000000000033333)"
 "Q1: [1.0, 1.0000000000033333)"
 "Q1: [1.0, 1.0000000000033333)"
 "Q1: [1.0, 1.0000000000033333)"
 "Q2: [1.0000000000033333, 1.0000000000066667)"
 "Q2: [1.0000000000033333, 1.0000000000066667)"
 "Q2: [1.0000000000033333, 1.0000000000066667)"
 "Q3: [1.0000000000066667, 1.00000000001]"
 "Q3: [1.0000000000066667, 1.00000000001]"
 "Q3: [1.0000000000066667, 1.00000000001]"
 "Q3: [1.0000000000066667, 1.00000000001]"
@andreasnoack
Member

I just hit a similar case where the many digits made plot labels look ugly. What about supporting rounding in the cut(array, ngroups) method instead of just changing the printing? I guess it might be confusing if the strings here don't reflect the actual cuts.

@nalimilan
Member

This is tricky. Rounding could easily change radically the size of the groups if values are very close. AFAIK other packages don't do that. What R's cut does is that it tries with 3 digits by default, and if some breaks end up represented the same it increases the number of digits up to 12. This can be tweaked via the dig.lab argument. This sounds reasonable to me at least for quantiles, as you probably don't care about the exact value in that case. OTOH it could be more surprising when specifying breaks manually.

At the very least we could add that dig.lab argument.
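
A rough sketch of that strategy in Base Julia, to make the idea concrete. This is not R's actual implementation (R formats with `formatC`), and `break_labels`, `diglab` and `maxdigits` are hypothetical names, not an existing CategoricalArrays API. It rounds each break to `d` significant digits and increases `d` until the resulting strings are all different or the cap is reached:

```julia
# Hypothetical helper: format breaks with the fewest significant digits
# (starting at `diglab`, capped at `maxdigits`) that keeps them distinct.
function break_labels(breaks::AbstractVector{<:Real};
                      diglab::Integer = 3, maxdigits::Integer = 12)
    labels = string.(breaks)  # placeholder, replaced on the first iteration
    for d in diglab:max(diglab, maxdigits)
        labels = [string(round(b; sigdigits = d)) for b in breaks]
        allunique(labels) && break
    end
    return labels  # may still contain duplicates if the cap was too low
end

# Breaks from the first example in this issue: 3 digits are enough.
println(break_labels([1.0, 4.0, 6.999999999999999, 10.0]))

# Breaks from the second example: they differ only around the 12th
# significant digit, so a cap of 12 is not enough here.
hard = [1.0, 1.0000000000033333, 1.0000000000066667, 1.00000000001]
println(allunique(break_labels(hard)))
println(allunique(break_labels(hard; maxdigits = 13)))
```

Note that with the second example the fixed cap of 12 still leaves duplicate labels, so the cap would probably need to be configurable or the fallback would have to be full precision.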

@andreasnoack
Member

> Rounding could easily change radically the size of the groups if values are very close.

I don't see the issue if it's just an option. Then the user can decide if the rounding is acceptable.

@nalimilan
Member

If you want quantiles and rounding ends up creating classes with different sizes, you no longer get quantiles. ;-) So the printing "Q1: ..." would be misleading. Rounding only the display seems less dramatic. Though of course we could support both as options.
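
A small made-up example of the effect, in Base Julia only. The data and the "exact" breaks are picked by hand for illustration (they are not computed by `cut`), and `group_sizes` is a hypothetical helper that counts values per interval, left-closed with the last interval also closed on the right:

```julia
# Hypothetical helper: number of values in each interval
# [breaks[i], breaks[i+1]), with the last interval closed on the right.
# Assumes every value lies within [first(breaks), last(breaks)].
function group_sizes(x, breaks)
    n = length(breaks) - 1
    counts = zeros(Int, n)
    for v in x
        i = min(searchsortedlast(breaks, v), n)
        counts[i] += 1
    end
    return counts
end

x = [1.0, 1.04, 1.06, 1.14, 1.16, 1.2]
exact = [1.0, 1.06, 1.16, 1.2]        # hand-picked breaks giving equal groups
rounded = round.(exact; digits = 1)   # breaks rounded to one decimal

println(group_sizes(x, exact))
println(group_sizes(x, rounded))
```

With the exact breaks the three groups have equal sizes; once the breaks themselves are rounded, two of them coincide and the groups are no longer balanced. Rounding only the labels leaves the groups untouched.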

@andreasnoack
Member

The Qx label doesn't specify the probability of the quantile anyway. Actually, I'd prefer a label without the Qx part. I'm mostly interested in the interval information. It is easy to incorrectly think that Q1 is the first quartile.

@nalimilan
Member

I added Qx because when allowempty=true some intervals may be identical (in the presence of many duplicates), but we can't have levels with equal names. But that's a corner case so we could stop doing that by default. Still, returning quantiles that are not real quantiles due to rounding would be confusing IMO.

@nalimilan
Member

This is related to a discussion we had in 2020 at #245. @bkamins Do you have an opinion? I see several options:

  • Keep adding Qx.
  • Provide a formatter that adds Qx and advise using it when we throw an error because of duplicate intervals.
  • Support an argument doing that.
  • Automatically use that formatter by default when allowempty=true is passed.
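
To make the formatter option concrete, here is a sketch of what two predefined formatters could look like. The names `interval_label` and `qx_label` are hypothetical. The `(from, to, i; leftclosed, rightclosed)` signature is what I understand the function form of `labels` to expect, but that, and whether `from`/`to` arrive as numbers, should be checked against the installed CategoricalArrays version:

```julia
# Hypothetical formatter: plain interval, no prefix.
function interval_label(from, to, i; leftclosed, rightclosed)
    string(leftclosed ? "[" : "(", from, ", ", to, rightclosed ? "]" : ")")
end

# Hypothetical formatter: same interval, prefixed with the group index,
# so that identical intervals still get distinct labels.
function qx_label(from, to, i; leftclosed, rightclosed)
    string("Q", i, ": ",
           interval_label(from, to, i; leftclosed = leftclosed,
                          rightclosed = rightclosed))
end

# Two identical intervals, as can happen with many duplicate values.
plain = [interval_label(1.0, 1.0, i; leftclosed = true, rightclosed = false) for i in 1:2]
prefixed = [qx_label(1.0, 1.0, i; leftclosed = true, rightclosed = false) for i in 1:2]

println(allunique(plain))
println(allunique(prefixed))

# Assumed usage, not run here: cut(x, 3; labels = qx_label)
```

The plain formatter produces duplicate labels for identical intervals, which is the case where an error message could point users to the prefixed one.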

@bkamins
Member Author

bkamins commented Dec 30, 2024

Given the discussion - I think adding an argument is most flexible.

@nalimilan
Member

Can you elaborate why you prefer an argument over a formatter? I find it hard to decide which is best.

@bkamins
Member Author

bkamins commented Dec 31, 2024

Ah - now I understand the issue. I understand that you want to predefine a function that the user would pass to labels, and it would provide a different way of handling this. I think this is OK. For me the key thing is to give the user something predefined (i.e. not just say that users could do this themselves by writing a proper formatter function).
