Clarification on Threshold Setting and Using Side Information for Binarized Data #157

Open
ankit-singh973 opened this issue Jan 10, 2025 · 10 comments

Comments

@ankit-singh973

I am using SMURFF for matrix factorization with already binarized data (values are either 0 or 1). I noticed in one of your notebooks that data is binarized during the training process using a threshold (pIC50 > 6.0). Since my data is already binarized, I am unsure about the correct threshold to set. Could you clarify the following points?

Threshold Setting: Should I still set a threshold in ProbitNoise when my data is already binary (0 or 1)? If yes, what value should it be? If no, how do I handle this?

Side Information: My dataset includes side information for both rows and columns. How should I incorporate this dual side information effectively in my setup? For instance:

Should I use direct = False for both row and column side information?
Is there anything specific I need to modify in my current pipeline?

@tvandera
Collaborator

tvandera commented Jan 10, 2025

Hi Ankit,

  • If the data is binary (0 and 1), you can use e.g. 0.5 as the threshold. The threshold is used to divide the values into two partitions.
  • Let me try to put together an example with side-info on both sides.
  • The direct method uses QR decomposition. It is much faster but requires more memory when your side-info is sparse. direct = False uses a solver and is much slower. I would recommend always using direct = True, unless you have sparse side-info and run out of memory.
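
As a small illustration of the first point, here is a sketch in plain Python of how a threshold splits 0/1 values into two partitions. The `partition` helper is hypothetical and not part of SMURFF, and the strict `>` comparison is an assumption; for 0/1 data, any threshold strictly between 0 and 1 gives the same split either way.

```python
def partition(values, threshold):
    """Split values into (negatives, positives) around the threshold."""
    negatives = [v for v in values if v <= threshold]
    positives = [v for v in values if v > threshold]
    return negatives, positives


values = [0, 1, 1, 0, 1, 0, 0]

# Any threshold strictly between 0 and 1 separates the zeros from the ones.
for threshold in (0.1, 0.5, 0.9):
    negatives, positives = partition(values, threshold)
    print(threshold, len(negatives), len(positives))
```

So 0.5 is simply a convenient midpoint, not a tuned value.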

@tvandera
Collaborator

> Let me try to make you an example with side-info on both sides.

Here's an example with side-info on both sides:

def test_macau(self):
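
The example above is cut off in this copy of the thread. As a rough sketch only, and not the collaborator's example, the following shows how side information for both rows and columns might be attached using the same calls that appear in the code posted later in this thread (`TrainSession`, `addTrainAndTest`, `addSideInfo`, `ProbitNoise`). The function name `build_session` and its arguments are hypothetical. The sketch assumes that a `macau` prior is set for each mode that has side information, that mode 0 refers to rows and mode 1 to columns, and that each side-info matrix has one row per row (or column) of the training matrix.

```python
def build_session(train, test, row_side, col_side, threshold=0.5):
    """Sketch: SMURFF session with side information on rows and columns."""
    import smurff  # imported lazily; SMURFF must be installed when this is called

    session = smurff.TrainSession(
        priors=["macau", "macau"],  # assumption: one macau prior per mode with side info
        num_latent=10,
        burnin=10,
        nsamples=20,
        threshold=threshold,
    )
    session.addTrainAndTest(train, test, smurff.ProbitNoise(threshold))
    session.addSideInfo(0, row_side, direct=True)  # mode 0: row side information
    session.addSideInfo(1, col_side, direct=True)  # mode 1: column side information
    return session
```

Calling `build_session(train, test, row_side, col_side).run()` would then start sampling, as in the single-side-info case.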

@ankit-singh973
Author

When I set threshold=0.5, I got output like this:

Result: {
  Test data: 1439 [4933 x 3029] (0.01%)
  Binary classification threshold: 0.50
  100.00% positives in test data

The test data is also a sparse matrix in scipy.sparse._coo.coo_matrix format, which does not store zeros. Because the test data is reported as 100% positives, the AUC cannot be calculated. Do I need to provide the test data in another format, such as numpy.ndarray?
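
To see why a test set with a single class blocks the AUC, here is a minimal pure-Python sketch of the pairwise definition of AUC: the fraction of (positive, negative) pairs in which the positive receives the higher score. `pairwise_auc` is a hypothetical helper and not SMURFF's `calc_auc`; it returns `None` when either class is missing because there are no pairs to compare.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label]
    negatives = [s for label, s in zip(labels, scores) if not label]
    if not positives or not negatives:
        return None  # undefined: one of the two classes is absent
    # A tie between a positive and a negative counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


print(pairwise_auc([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3]))  # both classes present
print(pairwise_auc([1, 1, 1], [0.9, 0.8, 0.7]))          # positives only
```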

@tvandera
Collaborator

I think your Train/Test matrix is sparse, not scarce. Have a look at the difference here.
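
The distinction can be illustrated without SMURFF. The sketch below, in plain Python with hypothetical helper names, stores only the non-zero cells of a 0/1 matrix, as a coordinate format does, and then computes the share of positives under the two readings: unstored cells are unknown ("scarce"), or unstored cells are zeros ("sparse").

```python
def to_triplets(dense):
    """Keep only non-zero cells as (row, col, value), like coordinate storage."""
    return [
        (i, j, v)
        for i, row in enumerate(dense)
        for j, v in enumerate(row)
        if v != 0
    ]


def positive_fraction(triplets, shape, threshold, missing_is_zero):
    """Share of positives among the cells that count as observations."""
    n_positive = sum(1 for _, _, v in triplets if v > threshold)
    n_rows, n_cols = shape
    # Scarce reading: only stored cells are observations.
    # Sparse reading: every cell is an observation; unstored cells are zeros.
    n_observed = n_rows * n_cols if missing_is_zero else len(triplets)
    return n_positive / n_observed


dense = [[0, 1, 0], [0, 0, 0], [1, 0, 0], [0, 0, 0]]
triplets = to_triplets(dense)
shape = (len(dense), len(dense[0]))

print(positive_fraction(triplets, shape, 0.5, missing_is_zero=False))  # scarce
print(positive_fraction(triplets, shape, 0.5, missing_is_zero=True))   # sparse
```

Under the scarce reading every stored value is a 1, which is the same pattern as the "100.00% positives in test data" line in the log.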

@ankit-singh973
Author

ankit-singh973 commented Jan 10, 2025

This is my code; all the data is in scipy.sparse._coo.coo_matrix format. Can you please tell me what I need to change to make it run?

`import smurff
import logging
import pandas as pd
from scipy.sparse import load_npz

logging.basicConfig(level = logging.INFO)

train = load_npz("train_matrix.npz")
test = load_npz("test_matrix.npz")
g_side = load_npz("gene_onto.npz")

c_threshold = 0.5
trainSession = smurff.TrainSession(
    priors = ['macau', 'normal'],
    num_latent = 10,
    burnin = 10,
    nsamples = 20,
    threshold = c_threshold)

trainSession.addTrainAndTest(train, test, smurff.ProbitNoise(c_threshold))
trainSession.addSideInfo(0, g_side, direct = True)
predictions = trainSession.run()
print("AUC = %.2f" % smurff.calc_auc(predictions, c_threshold))`

Even after converting the input data to numpy.ndarray, it still reports the same "100% positives in the test data".

@tvandera
Collaborator

Can you plot me a histogram of the values in train and test?

@tvandera
Collaborator

tvandera commented Jan 10, 2025

Can you try:

trainSession.addTrainAndTest(train, test, smurff.ProbitNoise(c_threshold), is_scarce = True)

@ankit-singh973
Author

ankit-singh973 commented Jan 10, 2025

> Can you plot me a histogram of the values in train and test?

[Image: histogram of the values in train and test]

Please note that the scale is logarithmic and that the train and test matrices are numpy.ndarray.

For test :

Number of 0's: 14940618
Number of 1's: 1439

For train :

Number of 0's: 14936299
Number of 1's: 5758
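
These counts can be checked against the log output earlier in the thread with a few lines of arithmetic in Python (the variable names are only illustrative):

```python
n_rows, n_cols = 4933, 3029  # matrix shape from the log output
test_zeros, test_ones = 14_940_618, 1_439
train_zeros, train_ones = 14_936_299, 5_758

n_cells = n_rows * n_cols

# Both dense matrices account for every cell of the 4933 x 3029 grid.
assert test_zeros + test_ones == n_cells
assert train_zeros + train_ones == n_cells

test_positive_rate = test_ones / n_cells
train_positive_rate = train_ones / n_cells
print(f"test:  {test_positive_rate:.2%} positives")   # test:  0.01% positives
print(f"train: {train_positive_rate:.2%} positives")  # train: 0.04% positives
```

The 0.01% agrees with the density printed next to "Test data: 1439" in the log, which suggests that only the 1439 stored ones are being counted as test points.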

@ankit-singh973
Author

> Can you try:
>
> trainSession.addTrainAndTest(train, test, smurff.ProbitNoise(c_threshold), is_scarce = True)

It's not working

@tvandera
Collaborator

> > Can you try:
> >
> > trainSession.addTrainAndTest(train, test, smurff.ProbitNoise(c_threshold), is_scarce = True)
>
> It's not working

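Of course not, it should be is_scarce=False. With is_scarce=False, entries that are not stored in the sparse matrix are treated as zeros instead of unknown values.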
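
Putting the thread's conclusion together, a sketch of the adjusted setup could look like the following. It reuses the calls from the code posted above and only adds `is_scarce=False`. The wrapper function `run_binary_session` and its argument names are hypothetical, and the behaviour noted in the comment is an assumption about the `is_scarce` flag that has not been verified here.

```python
def run_binary_session(train, test, side_info, threshold=0.5):
    """Sketch: probit session on 0/1 data where unstored entries count as zeros."""
    import smurff  # imported lazily; SMURFF must be installed when this is called

    session = smurff.TrainSession(
        priors=["macau", "normal"],
        num_latent=10,
        burnin=10,
        nsamples=20,
        threshold=threshold,
    )
    # Assumption: is_scarce=False makes unstored entries zeros rather than unknowns,
    # so the test data contains negatives as well as positives.
    session.addTrainAndTest(
        train, test, smurff.ProbitNoise(threshold), is_scarce=False
    )
    session.addSideInfo(0, side_info, direct=True)
    predictions = session.run()
    return smurff.calc_auc(predictions, threshold)
```

With the file names from the posted code, this would be called as `run_binary_session(train, test, g_side)`.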
