
Accessing elements in data structure produced by CountTransformer is quite slow #29

Open
roland-KA opened this issue Apr 27, 2024 · 4 comments


@roland-KA

I've used the CountTransformer to produce a word frequency matrix as follows:

using MLJ  # provides @load, machine, fit!, transform

CountTransformer = @load CountTransformer pkg=MLJText
trans_machine = machine(CountTransformer(), doc_list)
fit!(trans_machine)
X1 = transform(trans_machine, doc_list)

Then a function word_count was applied to X1; it aggregates the counts in X1 for Naive Bayes, i.e. each element of X1 is accessed exactly once.

This takes about 245 seconds (on an M1 iMac); the size of X1 is (33716, 159093).

If I produce the word frequency matrix using TextAnalysis directly as follows:

using TextAnalysis

crps = Corpus(TokenDocument.(doc_list))
update_lexicon!(crps)
m = DocumentTermMatrix(crps)
X2 = dtm(m)

... then word_count runs in about 16.7 sec on matrix X2. So accessing the elements of X1 is almost 15 times slower than accessing those of X2.

The difference between the two is that X2 is a "pure" SparseMatrixCSC, whereas X1 is of type LinearAlgebra.Adjoint{Int64, SparseMatrixCSC{Int64, Int64}}. I didn't find any documentation on how this data structure is represented in Julia.

Therefore I have a few questions:

  • Is there a way to access the elements of X1 faster (or rather: why is access so slow)?
  • I've tried to extract the "pure" SparseMatrix from X1 using X3 = X1[1:end, 1:end]. But this takes almost 364 sec. Is there a faster way to get it?

With these findings it is of course not advisable to use CountTransformer for this purpose ... or did I miss something?

@ablaom
Member

ablaom commented Apr 28, 2024

I've tried to extract the "pure" SparseMatrix from X1 using X3 = X1[1:end, 1:end]. But this takes almost 364 sec. Is there a faster way to get it?

The adjoint is just the conjugate transpose, as a lazy view, so applying it twice returns the original, unwrapped matrix (in this case sparse). What if you take the adjoint of X1 (adjoint(X1) or X1') and access it with row and column indices reversed?
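A minimal sketch of that suggestion, using a small stand-in matrix (synthetic data; the real X1 would be the CountTransformer output):

```julia
using SparseArrays, LinearAlgebra

S  = sparse([1, 2, 2], [1, 1, 2], [5, 6, 7])  # stand-in for the parent sparse matrix
X1 = S'                                       # what transform returns: a lazy Adjoint

A = X1'   # adjoint of an adjoint unwraps back to the parent SparseMatrixCSC
@assert A === S               # no copy was made
@assert X1[1, 2] == A[2, 1]   # same element, row/column indices swapped
```

Since X1' is just the parent matrix, every lookup on it hits the sparse CSC storage in its native orientation.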

@roland-KA
Author

Ah thanks, that's a good idea! I've tried it, and it is indeed faster: instead of 245 sec, word_count now finishes within 43 sec. But it is still slower than with the SparseMatrix produced by TextAnalysis (16.7 sec).

What is the reason for producing an adjoint in CountTransformer? Was it easier to implement, or are there applications that prefer this structure for further processing?

@ablaom
Member

ablaom commented Jun 9, 2024

The reason for the adjoint is that it is lazy, and we need to observe MLJ's convention that observations are rows. Given that the adjoint is lazy, I admit to being puzzled as to why you're still seeing such a slowdown, and agree it would be good to understand why.

@roland-KA
Author

Well, being lazy is perhaps a big part of the explanation. word_count accesses all elements of the matrix, so this is the worst case for lazy evaluation.
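That explanation fits the CSC storage layout: scalar indexing through the Adjoint hits the parent matrix row-wise, which is the slow direction for a SparseMatrixCSC. A sketch of a sweep that avoids scalar indexing entirely by iterating only the stored entries (col_sums here is a hypothetical aggregator, not the actual word_count):

```julia
using SparseArrays

# Sum each column of a SparseMatrixCSC by walking its stored entries,
# instead of indexing every (i, j) pair through a lazy wrapper.
function col_sums(S::SparseMatrixCSC)
    sums = zeros(eltype(S), size(S, 2))
    vals = nonzeros(S)             # values of the stored entries
    for j in 1:size(S, 2)
        for k in nzrange(S, j)     # index range of column j's stored entries
            sums[j] += vals[k]
        end
    end
    return sums
end
```

Applied to parent(X1) (the underlying sparse matrix), such a sweep visits each stored count once in storage order, regardless of which orientation the Adjoint presents.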

@github-project-automation github-project-automation bot moved this to priority low / involved in General Aug 30, 2024