Vanishing gradients on "sparse" relations #21

Open
HashakGik opened this issue Jul 3, 2024 · 0 comments

Hi, I am trying to perform neuro-symbolic learning in a very sparse setting (i.e., my target relation is logically false for the large majority of input tuples), and I am experiencing, unsurprisingly, vanishing gradients.

The setup is the traditional one in NeSy: some neural networks predict input probabilities, and a Scallop program then computes the probability of a target relation. To give context (although the question is more general), I am in a decision-making setting: I need to use the inputs to first determine the state of my system, and then use that state to choose an action to perform. Most actions can be performed only in a small subset of states, hence the sparsity; moreover, since the action probability depends on the state probability, the two are chained and quickly drop to zero.
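For concreteness, here is a stripped-down sketch of the kind of pipeline I mean. The relation names, arities, and the rule are placeholders (the structure is borrowed from the MNIST sum_2 example), not my actual program:

```python
import torch
import scallopy

# Placeholder program: my real one first derives a state from the inputs and
# then an action from that state; only the overall structure matters here.
ctx = scallopy.ScallopContext(provenance="difftopkproofs", k=3)
ctx.add_relation("state", int, input_mapping=list(range(10)))
ctx.add_relation("cond", int, input_mapping=list(range(10)))
ctx.add_rule("action(s + c) :- state(s), cond(c)")
action_fn = ctx.forward_function("action", output_mapping=list(range(19)))

# The neural networks produce the input distributions (random stand-ins here).
state_distr = torch.softmax(torch.randn(16, 10), dim=1)
cond_distr = torch.softmax(torch.randn(16, 10), dim=1)
action_prob = action_fn(state=state_distr, cond=cond_distr)  # shape (16, 19)
```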

I have manually checked the symbolic program for correctness: when the inputs are confident enough, the results are the expected ones. With a strong prior on the neural component (i.e., the networks are pre-trained to output high-confidence probabilities, or are jointly trained with a strong supervision signal), Scallop's probabilistic inference works as expected, and training further benefits from backpropagating through the symbolic component (slightly better test-set generalization at the end, although there is surprisingly high variance in the gradients during training compared to a purely neural baseline).

However, my objective is to avoid direct supervision and perform experiments in a distant-supervision setting (e.g., providing a signal only on the correct action, or state). In this setting, probabilities quickly drop to zero: at the beginning of training the neural component is not confident enough to make any "good" prediction, so the output relation is almost always false (the output of the Scallop module is an exactly-zero tensor, not merely a tensor with small values), and therefore the gradients go to zero as well. This does not happen if I provide supervision before the Scallop module, which then receives confident probabilities for meaningful values and produces a non-zero output tensor.

Most of my tests used difftopkproofs with different values of k (k < 10, as I do not have a large computational budget), as well as different learning rates. So far I have had no success.

Are there any provenances or parameters that can make training easier in a sparse setting?
Thank you.

PS: it may also be possible that the gradient is vanishing because I misunderstood Scallop's outputs and I am using the wrong loss. This might also explain the high variance I am observing during training.
To be sure: what is the output of Scallopy modules? Are they probabilities, log probabilities, or unnormalized logits? For non-scalar relations (e.g., output_mapping={"p": range(10)}), do the outputs sum to 1 (as if they were softmaxed)?
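For reference, this is roughly how I am computing the loss at the moment (the names are just for illustration), under the assumption that the output tensor contains probabilities in [0, 1], which is exactly the assumption I would like to confirm:

```python
import torch
import torch.nn.functional as F

def loss_fn(action_prob: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # action_prob: output of the Scallop module, shape (batch, num_actions),
    # assumed to be probabilities in [0, 1] (not logits, not necessarily
    # summing to 1 across actions).
    # target: ground-truth action indices, shape (batch,).
    onehot = F.one_hot(target, num_classes=action_prob.shape[1]).float()
    # Each output atom is treated as an independent Bernoulli probability.
    return F.binary_cross_entropy(action_prob, onehot)
```

If the outputs were instead log probabilities or unnormalized logits, this loss would be wrong, which could explain both the high gradient variance and part of the vanishing.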

Are different outputs (e.g., output_mapping={"p": range(10), "q": (range(5), range(3))}) computed as a single query (i.e., what in Prolog would be "?- input, p(X), q(Y,Z)."), or are they independent of each other and depend only on the input (i.e., "?- input, p(X)." followed by "?- input, q(Y,Z).")?
