fit!(estimator, X, y, test...) method fails into infinite loops without test... arguments #59

Open
aviatesk opened this issue Jun 10, 2020 · 6 comments

Comments

@aviatesk

MRE:

using DataFrames, LightGBM

n = 3  # sample size; the original report did not define n

X = DataFrame(A = rand(n), B = categorical(rand(1:10, n)))
y = categorical(rand(0:2, n))

lgbm_model = LGBMClassification(num_class = length(y.pool))
LightGBM.fit!(lgbm_model, Array(X), collect(y))

so I guess you maybe need to introduce some type piracy here

@aviatesk aviatesk changed the title fit!(estimator, X, y, test...) method fails into infinite loops fit!(estimator, X, y, test...) method fails into infinite loops without test... arguments Jun 10, 2020
@yalwan-iqvia
Collaborator

I'll need to review our tests for this, surprised it didn't show itself

@yalwan-iqvia
Collaborator

P.S. Thanks for reporting so quickly!

@yalwan-iqvia
Collaborator

yalwan-iqvia commented Jun 10, 2020

Ok so, I have a comment, which is related (but the bug itself is still valid)
It looks like between [email protected] and [email protected]
the return type of collect(y) changed from an array of ints to a categorical array.

I can (and will) fix the hang (it's not an infinite loop) by strongly typing the signature, but you still won't be able to pass collect(y) through directly, because it is not a supported type (and, as far as I can tell, was not previously).
I appreciate this is probably not what you'd hoped for, but I'm not sure collect should do anything for this, because

julia> typeof(y)
CategoricalArray{Int64,1,UInt32,Int64,CategoricalValue{Int64,UInt32},Union{}}

julia> typeof(collect(y))
CategoricalArray{Int64,1,UInt32,Int64,CategoricalValue{Int64,UInt32},Union{}}

julia> 

That seems off to me -- maybe it's by design by the DataFrames authors, but you'd need to check with them.

I found this because we do have some tests that don't pass test... args, which leads me to believe it has to do with the implementation of the iterator protocol or something similar. Something specific to CategoricalArray is allowing it to look like a Tuple{Matrix{TX}, Vector{Ty}}, which is probably a bug of some sort, but where, I don't know.

Anyway, long story short: I will ship an emergency fix for the hang by strongly typing the passthrough signatures, but there are some other, more subtle issues at play as well. Stand by for the 0.3.1 release.
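
As a rough illustration of what "strongly typing the signature" means here, the sketch below uses hypothetical stand-ins (ToyEstimator, toy_fit!); these are assumptions for the example, not LightGBM.jl's actual definitions. With only a typed method defined, a label vector that is an AbstractVector but not a Vector (a UnitRange plays the role of the CategoricalArray here) has no matching method, so a call would fail up front with a MethodError instead of falling through to a loosely typed passthrough.

```julia
# Hypothetical stand-ins for illustration; not LightGBM.jl's real types or methods.
struct ToyEstimator end

# Strongly typed signature: plain Matrix/Vector of reals, and each optional
# test set must be a (Matrix, Vector) tuple with the same element types.
function toy_fit!(est::ToyEstimator, X::Matrix{TX}, y::Vector{Ty},
                  test::Tuple{Matrix{TX},Vector{Ty}}...) where {TX<:Real,Ty<:Real}
    # Return something inspectable instead of training a model.
    return (nrows = size(X, 1), ntest = length(test))
end

X = rand(3, 2)
y_vec = [0, 1, 2]   # Vector{Int}: matches the typed signature
y_rng = 0:2         # UnitRange{Int}: an AbstractVector, but not a Vector

result_plain = toy_fit!(ToyEstimator(), X, y_vec)
result_test  = toy_fit!(ToyEstimator(), X, y_vec, (X, y_vec))

# No method matches the non-Vector labels, so calling toy_fit! with them
# would raise a MethodError immediately rather than re-dispatching.
accepts_range = applicable(toy_fit!, ToyEstimator(), X, y_rng)
```

In this toy setup, the caller has to hand over a plain Vector of reals explicitly; anything else is rejected at dispatch time.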

@yalwan-iqvia
Collaborator

I'm putting a small investigative summary here for future reference (will not close this when fix is released)

# ] add [email protected] [email protected]

using DataFrames, LightGBM

n = 3

X = DataFrame(A = rand(n), B = categorical(rand(1:10, n)))
y = categorical(rand(0:2, n))

lgbm_model = LGBMClassification(num_class = length(y.pool))

XX = Array(X)
yy = Int.(collect(y).refs)

LightGBM.fit!(lgbm_model, XX, yy, (XX, yy)) # test vectors with non-categorical ys, LightGBM FFI vom, that's fine

LightGBM.fit!(lgbm_model, XX, y, (XX, yy)) # test vectors with categorical y, StackOverflowError, not fine

LightGBM.fit!(lgbm_model, XX, collect(y), (XX, yy)) # test vectors with categorical y, StackOverflowError, not fine

LightGBM.fit!(lgbm_model, XX, yy) # no test vectors, LightGBM FFI vom, that's fine

LightGBM.fit!(lgbm_model, X, yy) # ... throws an error but can't even finish printing it: "ERROR: " and nothing else

LightGBM.fit!(lgbm_model, Array(X), collect(y); verbosity=-1) # original reproducing case

@yalwan-iqvia
Collaborator

0.3.1 has been released: JuliaRegistries/General#16152

@aviatesk
Author

@yalwan-iqvia thanks for the quick update!
I actually don't have a strong opinion on how the data is passed or on its format.
I just posted because, judging from the README and docstring, fit!(estimator, X, y) seems like it should work.
I know DataFrames.jl's typing is kinda scary, and I'm personally okay with any solution you want to live with.
