Optimizing evidence representation #998
Open
This PR implements two optimizations to the representation of Evidences that significantly decrease memory usage when manipulating large sets of INDRA Statements. The bulk of the memory used by INDRA Statements is attributable to the Evidence objects (including evidence text) attached to them.

One approach to decreasing memory usage is to define the `__slots__` attribute of Evidence so that the set of attributes it can have is pre-defined (rather than variable via a `__dict__` attribute). This seemed to make only a minor difference in memory usage.

Much larger memory savings can be achieved if the list of Evidences attached to a Statement is stored in a serialized, compressed form, and only decompressed and deserialized when it is accessed. Based on some experiments, a Statement with 100 pieces of Evidence uses 75% less memory with this PR. On some large assembled corpora that I tried, whose Statements carry varying numbers of Evidences, 80% lower memory usage is typical.

Not much of this affects the way INDRA Statements are used, but there is one important difference: when accessing a Statement's evidence (i.e., `stmt.evidence`), one gets a view of the list of Evidences rather than a reference to it. So directly manipulating `stmt.evidence` will not result in persistent changes to the Statement. Rather, one has to do something like:

to make changes to a Statement's list of Evidences. Some specialized code dealing with Evidence manipulation, as well as some tests, needed to be updated. I am still ambivalent about whether this change will cause confusion later, and am therefore not yet sure whether this PR should be merged.