Clarification on Threshold Setting and Using Side Information for Binarized Data #157

ankit-singh973 · 2025-01-10T08:44:38Z

I am using SMURFF for matrix factorization with already binarized data (values are either 0 or 1). I noticed in one of your notebooks that data is binarized during the training process using a threshold (pIC50 > 6.0). Since my data is already binarized, I am unsure about the correct threshold to set. Could you clarify the following points?

Threshold Setting: Should I still set a threshold in ProbitNoise when my data is already binary (0 or 1)? If yes, what value should it be? If no, how do I handle this?

Side Information: My dataset includes side information for both rows and columns. How should I incorporate this dual side information effectively in my setup? For instance:

Should I use direct = False for both row and column side information?
Is there anything specific I need to modify in my current pipeline?


tvandera · 2025-01-10T10:30:02Z

Hi Ankit,

If the data is binary (0 and 1), you can use, for example, 0.5 as the threshold. The threshold is used to divide the values into two partitions.
Let me try to make you an example with side-info on both sides.
The direct method uses QR decomposition. It is much faster but requires more memory when your side-info is sparse. direct = False uses a solver and is much slower. I would recommend always using direct = True, unless you have sparse side info and run out of memory.
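To illustrate the point about the threshold, here is a minimal pure-Python sketch of how a cut-off splits already-binarized values into two partitions. The helper `partition` is hypothetical and not part of SMURFF, and the strict `>` comparison is an assumption that mirrors the `pIC50 > 6.0` rule mentioned above.

```python
def partition(values, threshold):
    """Split values into (negatives, positives) around a threshold.

    Hypothetical helper for illustration; assumes a value counts as
    positive when it is strictly greater than the threshold.
    """
    negatives = [v for v in values if v <= threshold]
    positives = [v for v in values if v > threshold]
    return negatives, positives


binary_values = [0, 1, 1, 0, 1]

# With 0/1 data, any threshold strictly between 0 and 1 gives the same split.
split_at_half = partition(binary_values, 0.5)
split_at_tenth = partition(binary_values, 0.1)
print(split_at_half)
```

Under these assumptions the exact value matters little for 0/1 data, so 0.5 is simply a convenient midpoint.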

tvandera · 2025-01-10T10:44:59Z

> Let me try to make you an example with side-info on both sides.

Here's an example with sideinfo on both sides:

smurff/python/test/test_macau.py

Line 17 in c5c5d50

def test_macau(self):

ankit-singh973 · 2025-01-10T11:17:03Z

When I put threshold=0.5 I got output like this:
Result: { Test data: 1439 [4933 x 3029] (0.01%) Binary classification threshold: 0.50 100.00% positives in test data

The test data is also a sparse matrix in scipy.sparse._coo.coo_matrix format, which does not store zeros. Because the test data is reported as 100% positives, the AUC cannot be calculated. Do I need to provide the test data in another format, such as numpy.ndarray?

tvandera · 2025-01-10T11:21:24Z

I think your Train/Test matrix is sparse, not scarce. Have a look at the difference here.
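A small pure-Python sketch of the distinction, assuming the usual reading that "sparse" means unstored cells are zeros while "scarce" means unstored cells are unknown. The dictionary layout and the helper `positive_fraction` are hypothetical and only mimic how a coordinate-format matrix keeps its non-zero entries.

```python
def positive_fraction(stored, shape, threshold, is_scarce):
    """Fraction of evaluated cells that exceed the threshold.

    stored    -- dict mapping (row, col) to a stored value
    is_scarce -- True: only stored cells are evaluated (the rest are unknown)
                 False: unstored cells are evaluated as zeros
    """
    if is_scarce:
        values = list(stored.values())
    else:
        rows, cols = shape
        values = [stored.get((r, c), 0) for r in range(rows) for c in range(cols)]
    positives = sum(1 for v in values if v > threshold)
    return positives / len(values)


# A 2 x 3 binary matrix in coordinate form: only the ones are stored.
stored = {(0, 1): 1, (1, 2): 1}
shape = (2, 3)

scarce_fraction = positive_fraction(stored, shape, 0.5, is_scarce=True)
sparse_fraction = positive_fraction(stored, shape, 0.5, is_scarce=False)
print(scarce_fraction, sparse_fraction)
```

Read as scarce, every evaluated cell is a stored one, so the matrix looks 100% positive. Read as sparse, the zeros come back and both classes are present.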

ankit-singh973 · 2025-01-10T12:27:22Z

This is my code; all the data is in scipy.sparse._coo.coo_matrix format. Can you please tell me where I need to make changes to get it to run?

```python
import smurff
import logging
import pandas as pd
from scipy.sparse import load_npz

logging.basicConfig(level = logging.INFO)

train = load_npz("train_matrix.npz")
test = load_npz("test_matrix.npz")
g_side = load_npz("gene_onto.npz")

c_threshold = 0.5
trainSession = smurff.TrainSession(
    priors = ['macau', 'normal'],
    num_latent=10,
    burnin=10,
    nsamples=20, threshold = c_threshold)

trainSession.addTrainAndTest(train, test, smurff.ProbitNoise(c_threshold))
trainSession.addSideInfo(0, g_side, direct = True)
predictions = trainSession.run()
print("AUC = %.2f" % smurff.calc_auc(predictions, c_threshold))
```

Even after converting the input data to numpy.ndarray, it still shows the same "100% positives in the test data".

tvandera · 2025-01-10T13:49:41Z

Can you plot me a histogram on the values in train and test?

tvandera · 2025-01-10T13:55:26Z

Can you try:

trainSession.addTrainAndTest(train, test, smurff.ProbitNoise(c_threshold), is_scarce = True)

ankit-singh973 · 2025-01-10T14:22:16Z

> Can you plot me a histogram on the values in train and test?

Please note that the scale is logarithmic and that the train and test matrices are numpy.ndarray.

For test :

Number of 0's: 14940618
Number of 1's: 1439

For train :

Number of 0's: 14936299
Number of 1's: 5758
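As a quick sanity check, the counts above can be compared with the `[4933 x 3029]` shape printed in the SMURFF result line. This sketch only does the arithmetic; it assumes the train matrix has the same shape as the test matrix, which the counts support.

```python
rows, cols = 4933, 3029          # shape reported in the result line
test_zeros, test_ones = 14940618, 1439
train_zeros, train_ones = 14936299, 5758

total_cells = rows * cols

# Both matrices should account for every cell once zeros are included.
test_total = test_zeros + test_ones
train_total = train_zeros + train_ones

test_positive_pct = 100 * test_ones / total_cells
train_positive_pct = 100 * train_ones / total_cells

print(f"total cells:      {total_cells}")
print(f"test positives:   {test_positive_pct:.4f}%")
print(f"train positives:  {train_positive_pct:.4f}%")
```

The test positives come to roughly 0.01% of all cells, which lines up with the `(0.01%)` fill shown in the result line. The 1439 stored test entries are therefore exactly the ones, with every zero left unstored.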

ankit-singh973 · 2025-01-10T14:24:29Z

> Can you try:
>
> trainSession.addTrainAndTest(train, test, smurff.ProbitNoise(c_threshold), is_scarce = True)

It's not working

tvandera · 2025-01-10T14:30:12Z

> Can you try:
>
> trainSession.addTrainAndTest(train, test, smurff.ProbitNoise(c_threshold), is_scarce = True)
>
> It's not working

Of course not, it should be is_scarce=False
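Putting the correction together with the code posted earlier, the pipeline might look like the sketch below. All SMURFF calls are taken from this thread (`TrainSession`, `addTrainAndTest` with `is_scarce`, `addSideInfo`, `ProbitNoise`, `calc_auc`). The wrapper `run_probit_session` and its optional `col_side` argument are hypothetical, and using a `macau` prior with `addSideInfo(1, ...)` for column side information is an assumption that has not been verified here.

```python
def run_probit_session(train, test, row_side, col_side=None, threshold=0.5):
    """Hypothetical wrapper around the session from this thread.

    Differs from the posted code in passing is_scarce=False, so cells that
    are not stored in the sparse matrices are treated as zeros.
    """
    # Imported lazily so this definition loads even without SMURFF installed.
    import smurff

    # Assumption: a 'macau' prior is needed on a mode that has side info.
    col_prior = "macau" if col_side is not None else "normal"

    session = smurff.TrainSession(
        priors=["macau", col_prior],
        num_latent=10,
        burnin=10,
        nsamples=20,
        threshold=threshold,
    )
    session.addTrainAndTest(
        train, test, smurff.ProbitNoise(threshold), is_scarce=False
    )
    session.addSideInfo(0, row_side, direct=True)      # row side info
    if col_side is not None:
        session.addSideInfo(1, col_side, direct=True)  # column side info (assumed)

    predictions = session.run()
    return smurff.calc_auc(predictions, threshold)
```

It would be called as `run_probit_session(train, test, g_side)` with the matrices loaded via `load_npz` as in the earlier post.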
