[Main & MIEB] potential issues for multi-label classification #1835

Open · gowitheflow-1998 (Contributor) opened this issue Jan 19, 2025 · 1 comment
@isaac-chung and I were debugging the weird scores for VOC2007 in MIEB (#1792) and found the following potential issues affecting both mieb and main.

  1. lrap computation. LRAP is supposed to operate on continuous scores rather than on discrete predicted labels, which I fixed for MIEB in #1834 ([mieb] fixing lrap computation for multi-label classification). The change largely smooths out the large performance range across models when samples_per_label is small; a minimal sketch of the difference is right below.
     I tried to apply the same fix to main but hit a chain of failing tests. Note one difference: main uses a bare KNeighborsClassifier() without a MultiOutputClassifier, which I am not sure behaves as expected either; mieb instead uses MultiOutputClassifier(estimator=LogisticRegression()), which treats each label as a separate binary classification problem. It would be great if someone could confirm the same for main! Also, why KNeighbors?
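Sketch of why the distinction matters (toy data from make_multilabel_classification standing in for the actual task embeddings; this is not the evaluator code, just the MultiOutputClassifier(LogisticRegression()) setup mieb uses):

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import label_ranking_average_precision_score
from sklearn.multioutput import MultiOutputClassifier

# Toy stand-in for the task's embeddings and multi-hot labels.
X, Y = make_multilabel_classification(n_samples=200, n_classes=5, random_state=0)
X_train, X_test, Y_train, Y_test = X[:150], X[150:], Y[:150], Y[150:]

clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X_train, Y_train)

# Discrete 0/1 predictions: ties everywhere, so the label ranking is degenerate.
lrap_hard = label_ranking_average_precision_score(Y_test, clf.predict(X_test))

# Continuous scores: P(label = 1) from each per-label binary estimator.
Y_score = np.stack([p[:, 1] for p in clf.predict_proba(X_test)], axis=1)
lrap_soft = label_ranking_average_precision_score(Y_test, Y_score)

print(f"LRAP on hard predictions:  {lrap_hard:.3f}")
print(f"LRAP on continuous scores: {lrap_soft:.3f}")
```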

  2. Under-sampling logic for samples_per_label. Both main and mieb undersample with a per-single-label condition rather than a per-label-set condition, which I am not sure is what we want either, i.e.:

```python
if any((label_counter[label] < samples_per_label) for label in y[i]):
    sample_indices.append(i)
```

With samples_per_label=8 we can end up with counts like the following, where some labels get far more examples than others. E.g., for an example with label set (18, 14): if label 18 has fewer than 8 examples so far, the example is still sampled into the training set even though label 14 already has 40+.

```
defaultdict(int,
            {14: 49,
             13: 8,
             4: 12,
             10: 8,
             5: 9,
             2: 8,
             11: 8,
             6: 11,
             7: 9,
             18: 8,
             12: 8,
             19: 9,
             17: 9,
             8: 8,
             0: 8,
             3: 8,
             15: 8,
             9: 8,
             16: 8,
             1: 8})
```

I think this is less of a problem when using MultiOutputClassifier(), because it only means certain binary classifiers are trained on more examples while the others are unaffected. The sketch below reproduces the overshoot.
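A self-contained sketch of the effect, on hypothetical label sets (not the actual sampler; the names label_counter, sample_indices, and samples_per_label mirror the snippet above):

```python
import random
from collections import defaultdict

random.seed(0)
# Hypothetical data: label 0 co-occurs with each of the rarer labels 1..9.
y = [(0, rare) for rare in range(1, 10) for _ in range(10)]
random.shuffle(y)

samples_per_label = 8
label_counter = defaultdict(int)
sample_indices = []
for i, labels in enumerate(y):
    # Current per-single-label condition: keep the example if ANY of its
    # labels is still under the cap, even when the others are far over it.
    if any(label_counter[label] < samples_per_label for label in labels):
        sample_indices.append(i)
        for label in labels:
            label_counter[label] += 1

# Label 0 ends up with ~72 examples while each rare label stops near 8.
print(dict(sorted(label_counter.items())))
```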

  3. The number of unique predicted label sets is much smaller than the number of unique ground-truth label sets, which can push accuracy far down because it is currently assessed with a strict exact-set-match logic: e.g., predicting [0, 0, 0, 1, 0] scores 0 against a ground truth of [0, 0, 0, 1, 1], which is probably not optimal.
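For illustration, exact-set matching vs per-label alternatives on that example (naming sklearn metrics here as possible alternatives; not claiming these are what the evaluator should adopt):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

y_true = np.array([[0, 0, 0, 1, 1]])
y_pred = np.array([[0, 0, 0, 1, 0]])  # 4 of 5 labels correct

# For 2-D multilabel input, accuracy_score computes subset (exact-match) accuracy.
print(accuracy_score(y_true, y_pred))             # 0.0 -- all-or-nothing
# Per-label alternatives that credit the partial match:
print(1 - hamming_loss(y_true, y_pred))           # 0.8
print(f1_score(y_true, y_pred, average="micro"))  # ~0.667
```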

