Vanishing gradients on "sparse" relations #21

Open
HashakGik opened this issue Jul 3, 2024 · 0 comments

Hi, I am trying to perform neuro-symbolic learning in a very sparse setting (i.e., my target relation is logically false for the large majority of input tuples), and I am experiencing, unsurprisingly, vanishing gradients.

The setup is the traditional one in NeSy: some neural networks predict input probabilities, and a Scallop program then computes the probability of a target relation. To give context (although the question is more general), I am in a decision-making setting: I need to use the inputs to first determine the state of my system, and then use that state to choose an action to perform. Most actions can be performed only in a small subset of states, hence the sparsity; moreover, since the action probability depends on the state probability, the two are chained and quickly drop to zero.
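For concreteness, here is a stripped-down sketch of the kind of pipeline I mean. The relation names, arities, and the rule are placeholders (the structure is borrowed from the MNIST sum_2 example), not my actual program:

```python
import torch
import scallopy

# Placeholder program: my real one first derives a state from the inputs and
# then an action from that state; only the overall structure matters here.
ctx = scallopy.ScallopContext(provenance="difftopkproofs", k=3)
ctx.add_relation("state", int, input_mapping=list(range(10)))
ctx.add_relation("cond", int, input_mapping=list(range(10)))
ctx.add_rule("action(s + c) :- state(s), cond(c)")
action_fn = ctx.forward_function("action", output_mapping=list(range(19)))

# The neural networks produce the input distributions (random stand-ins here).
state_distr = torch.softmax(torch.randn(16, 10), dim=1)
cond_distr = torch.softmax(torch.randn(16, 10), dim=1)
action_prob = action_fn(state=state_distr, cond=cond_distr)  # shape (16, 19)
```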

I have manually checked the symbolic program for correctness: when the inputs are confident enough, the results are the expected ones. With a strong prior on the neural component (i.e., the networks are pre-trained to output high-confidence probabilities, or are jointly trained with a strong supervision signal), Scallop's probabilistic inference works as expected, and training further benefits from backpropagating through the symbolic component (slightly better test-set generalization at the end, although there is surprisingly high variance in the gradients during training compared to a purely neural baseline).

However, my objective is to avoid direct supervision and perform experiments in a distant-supervision setting (e.g., providing a signal only on the correct action, or state). In this setting, probabilities quickly drop to zero: at the beginning of training the neural component is not confident enough to make any "good" prediction, so the output relation is almost always false (the output of the Scallop module is an exactly-zero tensor, not merely a tensor with small values), and therefore the gradients go to zero as well. This does not happen if I provide supervision before the Scallop module, which then receives confident probabilities for meaningful values and produces a non-zero output tensor.

Most of my tests used difftopkproofs with different values of k (k < 10, as I do not have a large computational budget), as well as different learning rates. So far I have had no success.

Are there any provenances or parameters that can make training easier in a sparse setting?
Thank you.

PS: it may also be possible that the gradient is vanishing because I misunderstood Scallop's outputs and I am using the wrong loss. This might also explain the high variance I am observing during training.
To be sure: what is the output of Scallopy modules? Are they probabilities, log probabilities, or unnormalized logits? For non-scalar relations (e.g., output_mapping={"p": range(10)}), do the outputs sum to 1 (as if they were softmaxed)?
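For reference, this is roughly how I am computing the loss at the moment (the names are just for illustration), under the assumption that the output tensor contains probabilities in [0, 1], which is exactly the assumption I would like to confirm:

```python
import torch
import torch.nn.functional as F

def loss_fn(action_prob: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # action_prob: output of the Scallop module, shape (batch, num_actions),
    # assumed to be probabilities in [0, 1] (not logits, not necessarily
    # summing to 1 across actions).
    # target: ground-truth action indices, shape (batch,).
    onehot = F.one_hot(target, num_classes=action_prob.shape[1]).float()
    # Each output atom is treated as an independent Bernoulli probability.
    return F.binary_cross_entropy(action_prob, onehot)
```

If the outputs were instead log probabilities or unnormalized logits, this loss would be wrong, which could explain both the high gradient variance and part of the vanishing.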

Are different outputs (e.g., output_mapping={"p": range(10), "q": (range(5), range(3))}) computed as a single query (i.e., what in Prolog would be "?- input, p(X), q(Y,Z)."), or are they independent of each other and depend only on the input (i.e., "?- input, p(X)." followed by "?- input, q(Y,Z).")?
