Training Omikuji from scipy.sparse.csr_matrix #55

Open
wants to merge 2 commits into
base: master

CarloNicolini

I've adapted an alternative method to train the omikuji model by bypassing disk write for the Python wrapper.

The main work is a new method in the lib.rs file called load_omikuji_data_set_from_features_labels.

It is designed to take in the three main numpy arrays that define the underlying structure of a scipy.sparse.csr_matrix.
In other words, I map the scipy.sparse.csr_matrix.{indices, indptr, data} arrays into Rust vectors, and then I recreate the feature matrix together with the label sets, in a way similar to the train_on_data method.
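
For readers less familiar with the CSR layout, here is a minimal pure-Python sketch of how the three arrays encode the rows. It is only an illustration, not the Rust code in this PR: `csr_to_rows` is a hypothetical helper name, and the plain lists stand in for the `indptr`, `indices` and `data` attributes of a scipy.sparse.csr_matrix.

```python
def csr_to_rows(indptr, indices, data):
    """Rebuild per-row sparse vectors from the three CSR arrays.

    Row i owns the slice indptr[i]:indptr[i + 1] of both `indices`
    (column numbers) and `data` (the matching non-zero values).
    Hypothetical helper for illustration only.
    """
    if len(indices) != len(data):
        raise ValueError("indices and data must have the same length")
    if len(indptr) == 0 or indptr[0] != 0 or indptr[-1] != len(data):
        raise ValueError("indptr must start at 0 and end at len(data)")

    rows = []
    for start, end in zip(indptr, indptr[1:]):
        if start > end:
            raise ValueError("indptr must be non-decreasing")
        # One (column, value) pair per stored non-zero of this row.
        rows.append([(int(indices[k]), float(data[k])) for k in range(start, end)])
    return rows


# Dense view of the example matrix:
# [[1.0, 0.0, 2.0],
#  [0.0, 0.0, 0.0],
#  [0.0, 3.0, 0.0]]
example_rows = csr_to_rows([0, 2, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0])
```

The empty second row shows why `indptr` is needed in addition to `indices` and `data`: two equal consecutive offsets mean the row stores no values.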

@juhoinkinen
Contributor

Just a fellow Omikuji fan dropping by to ask how big a speed-up you think can be achieved with this. Could you give some measured numbers?

@CarloNicolini
Author

@juhoinkinen the speed-up depends on whether you have to fit many times over a large dataset, as in a GridSearchCV. In that case you don't incur the disk I/O costs.
I'm going to measure the actual difference between the two cases and post it here.

@CarloNicolini
Author

@juhoinkinen
Training on the EurLEX-4k train set from the omikuji repository itself dropped from an average of 10.4 seconds to an average of 7.1 seconds. The dataset has about 15k rows and 5k features, and I ran it on a MacBook Pro M1.
The advantage shows when running a large grid search with cross-validation and joblib with multiple parallel jobs, where one can avoid too much I/O pressure.

I am new to Rust but have a lot of C++ background. I've managed to make this work only with float32 data types for features and uint32 for labels. It would be nice to make it a bit more generic, though.
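
Until the binding is made generic, one possible workaround is to coerce the inputs on the Python side before handing them over. The sketch below assumes that approach and uses only the standard library; `coerce_for_binding` is a hypothetical name and not part of omikuji.

```python
from array import array

UINT32_MAX = 2**32 - 1


def coerce_for_binding(data, label_sets):
    """Coerce feature values to float32 and check labels fit in uint32.

    Hypothetical helper for illustration; not part of the omikuji API.
    """
    # Typecode "f" stores C floats (single precision), so wider inputs
    # are rounded to the nearest float32 value.
    values = array("f", (float(v) for v in data))

    checked = []
    for labels in label_sets:
        for label in labels:
            if not isinstance(label, int) or not 0 <= label <= UINT32_MAX:
                raise ValueError(f"label {label!r} does not fit in uint32")
        checked.append(list(labels))
    return values, checked


values, labels = coerce_for_binding([1, 2.5], [[3, 1], []])
```

With numpy arrays the same idea would usually be expressed with `astype`, at the cost of one extra copy of the data.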

@tomtung
Owner

tomtung commented Feb 24, 2024

Thanks for the contribution! I've been traveling and don't have my computer with me. I'll try to take a look in early March. In the meantime, it would be great if you can make all the tests pass :D


3 participants