
Accessing elements in data structure produced by CountTransformer is quite slow #29

Open
roland-KA opened this issue Apr 27, 2024 · 4 comments


@roland-KA

I've used the CountTransformer to produce a word frequency matrix as follows:

using MLJ  # provides @load, machine, fit!, transform

CountTransformer = @load CountTransformer pkg=MLJText
trans_machine = machine(CountTransformer(), doc_list)
fit!(trans_machine)
X1 = transform(trans_machine, doc_list)

Then a function word_count was applied to X1; it aggregates the counts in X1 for Naive Bayes, i.e. each element of X1 is accessed exactly once.

This takes about 245 seconds (on an M1 iMac); the size of X1 is (33716, 159093).

If I produce the word frequency matrix using TextAnalysis directly as follows:

using TextAnalysis

crps = Corpus(TokenDocument.(doc_list))
update_lexicon!(crps)
m = DocumentTermMatrix(crps)
X2 = dtm(m)

... then word_count runs in about 16.7 sec on matrix X2. So accessing the elements of X1 is almost 15 times slower than accessing those of X2.

The difference between the two is that X2 is a "pure" SparseMatrixCSC, whereas X1 is of type LinearAlgebra.Adjoint{Int64, SparseMatrixCSC{Int64, Int64}}. I didn't find any documentation on how this data structure is represented in Julia.

Therefore I have a few questions:

  • Is there a way to access the elements of X1 faster (or rather: why is access so slow)?
  • I've tried to extract the "pure" SparseMatrix from X1 using X3 = X1[1:end, 1:end]. But this takes almost 364 sec. Is there a faster way to get it?

With these findings it is of course not advisable to use CountTransformer for this purpose ... or did I miss something?

@ablaom
Member

ablaom commented Apr 28, 2024

I've tried to extract the "pure" SparseMatrix from X1 using X3 = X1[1:end, 1:end]. But this takes almost 364 sec. Is there a faster way to get it?

The adjoint is just the conjugate transpose, as a lazy view, so applying it twice returns the original, unwrapped matrix (in this case sparse). What if you take the adjoint of X1 (adjoint(X1) or X1') and access it with row and column indices reversed?
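A minimal sketch of that suggestion, using a small stand-in matrix (synthetic data; the real X1 would be the CountTransformer output):

```julia
using SparseArrays, LinearAlgebra

S  = sparse([1, 2, 2], [1, 1, 2], [5, 6, 7])  # stand-in for the parent sparse matrix
X1 = S'                                       # what transform returns: a lazy Adjoint

A = X1'   # adjoint of an adjoint unwraps back to the parent SparseMatrixCSC
@assert A === S               # no copy was made
@assert X1[1, 2] == A[2, 1]   # same element, row/column indices swapped
```

Since X1' is just the parent matrix, every lookup on it hits the sparse CSC storage in its native orientation.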

@roland-KA
Author

Ah thanks, that's a good idea! I've tried it, and it is indeed faster: instead of 245 sec, word_count now finishes within 43 sec. But it is still slower than with the SparseMatrix produced by TextAnalysis (16.7 sec).

What is the reason for producing an adjoint in CountTransformer? Was it easier to implement, or are there applications that prefer this structure for further processing?

@ablaom
Member

ablaom commented Jun 9, 2024

The reason for the adjoint is that it is lazy, and we need to observe MLJ's convention that observations are rows. Given that the adjoint is lazy, I admit to being puzzled as to why you're still seeing such a slowdown, and agree it would be good to understand why.

@roland-KA
Author

Well, being lazy is perhaps a big part of the explanation. word_count accesses all elements of the matrix, so this is the worst case for lazy evaluation.
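That explanation fits the CSC storage layout: scalar indexing through the Adjoint hits the parent matrix row-wise, which is the slow direction for a SparseMatrixCSC. A sketch of a sweep that avoids scalar indexing entirely by iterating only the stored entries (col_sums here is a hypothetical aggregator, not the actual word_count):

```julia
using SparseArrays

# Sum each column of a SparseMatrixCSC by walking its stored entries,
# instead of indexing every (i, j) pair through a lazy wrapper.
function col_sums(S::SparseMatrixCSC)
    sums = zeros(eltype(S), size(S, 2))
    vals = nonzeros(S)             # values of the stored entries
    for j in 1:size(S, 2)
        for k in nzrange(S, j)     # index range of column j's stored entries
            sums[j] += vals[k]
        end
    end
    return sums
end
```

Applied to parent(X1) (the underlying sparse matrix), such a sweep visits each stored count once in storage order, regardless of which orientation the Adjoint presents.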

@github-project-automation github-project-automation bot moved this to priority low / involved in General Aug 30, 2024