Accessing elements in data structure produced by CountTransformer is quite slow
#29
Comments
The adjoint is just the conjugate transpose as a view, so applying it twice returns the original, unwrapped matrix (in this case sparse). So, what if you take the adjoint of …
Ah thanks, that's a good idea! I've tried it and it is indeed faster. Instead of 245 sec. the … What is the reason for producing an adjoint in …
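The suggestion above relies on the adjoint of an adjoint unwrapping to the parent matrix. A minimal sketch of that behaviour, using a small made-up matrix `A` as a stand-in for the real data (the names `A`, `X1` and `P` are assumptions for illustration only):

```julia
using LinearAlgebra, SparseArrays

# Hypothetical stand-in for the parent matrix (words x documents).
A = sparse([1, 2, 3, 1], [1, 1, 2, 3], [2, 1, 4, 1], 3, 3)

X1 = A'    # lazy adjoint (documents x words); no data is copied
P  = X1'   # adjoint of an adjoint returns the wrapped parent

@show typeof(X1)
@show typeof(P)
@show P === A   # the very same object, not a copy
```

Because `P` is a plain `SparseMatrixCSC`, code that works on it sees the documents as columns rather than rows, so row and column indices have to be swapped accordingly.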
The reason for the adjoint is that we need to observe MLJ's convention that observations are rows, and the adjoint is lazy. Given that the adjoint is lazy, I admit to being puzzled as to why you're still seeing such a slowdown, and I agree it would be good to understand why.
Well, being lazy is perhaps a big part of the explanation.
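One plausible contributor, offered as a sketch rather than a diagnosis: every `X[i, j]` on the lazy wrapper is forwarded to `parent(X)[j, i]`, and each such lookup searches within one stored column of the sparse parent. An aggregation that visits only the stored entries of the parent avoids those per-element lookups. The two functions below are hypothetical stand-ins (the actual `word_count` from the issue is not shown in the thread); they compute per-word totals in both ways on a small made-up matrix.

```julia
using LinearAlgebra, SparseArrays

# Naive version: touches every element of X through getindex.
function column_sums_naive(X::AbstractMatrix)
    totals = zeros(eltype(X), size(X, 2))
    for j in axes(X, 2), i in axes(X, 1)
        totals[j] += X[i, j]
    end
    return totals
end

# Visits only the stored entries of the parent (words x documents).
function column_sums_via_parent(X::Adjoint{<:Any,<:SparseMatrixCSC})
    P = parent(X)
    rows = rowvals(P)      # word index of each stored entry
    vals = nonzeros(P)     # stored values
    totals = zeros(eltype(P), size(X, 2))
    for doc in 1:size(P, 2)
        for k in nzrange(P, doc)
            # X[doc, word] == conj(P[word, doc])
            totals[rows[k]] += conj(vals[k])
        end
    end
    return totals
end

# Hypothetical small example data.
A = sparse([1, 2, 3, 1], [1, 1, 2, 3], [2, 1, 4, 1], 3, 3)
X1 = A'

@show column_sums_naive(X1) == column_sums_via_parent(X1)
```

The second function does work proportional to the number of stored entries rather than to the full matrix size, which matters for a matrix of size (33716, 159093).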
I've used the `CountTransformer` to produce a word frequency matrix as follows:

Then a function `word_count` has been applied to `X1` (it aggregates the numbers in `X1` for doing Naive Bayes; i.e. each element of `X1` is accessed once).

This takes about 245 seconds (on an M1 iMac); the size of `X1` is (33716, 159093).

If I produce the word frequency matrix using `TextAnalysis` directly as follows:

... then `word_count` runs in about 16.7 sec on matrix `X2`. So accessing the elements of `X1` is almost 15 times slower than accessing those of `X2`.

The difference between the two is that `X2` is a "pure" `SparseMatrix`, whereas `X1` is of type `LinearAlgebra.Adjoint{Int64, SparseMatrixCSC{Int64, Int64}}`. I didn't find any information on how this data structure is represented in Julia.

Therefore I have a few questions:
- How can access to the elements of `X1` be made faster (or rather: why is it so slow)?
- I tried to get a `SparseMatrix` from `X1` using `X3 = X1[1:end, 1:end]`. But this takes almost 364 sec. Is there a faster way to get it?

With these findings, it is of course not advisable to use `CountTransformer` for this purpose ... or did I miss something?
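On the second question, a minimal sketch of one option, assuming the goal is simply a plain sparse matrix with the same contents as `X1`: calling `copy` on the lazy adjoint of a `SparseMatrixCSC` materializes it as a new `SparseMatrixCSC`, without going through generic indexing. The small matrix below is made up for illustration; timings on the real data are not claimed here.

```julia
using LinearAlgebra, SparseArrays

# Hypothetical small stand-in for the parent of X1 (words x documents).
A = sparse([1, 2, 3, 1], [1, 1, 2, 3], [2, 1, 4, 1], 3, 3)
X1 = A'            # lazy adjoint, as in the issue

# Materialize the lazy wrapper into a plain sparse matrix.
X3 = copy(X1)

@show typeof(X3)
@show X3 == X1
```

This costs one extra copy of the data in memory, in exchange for a matrix whose columns are stored contiguously in the orientation that `X1` presents.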