Dataset | Structure | Size |
---|---|---|
LP50 | doc1 doc2 avg | 50 docs |
The algorithm takes two documents doc1 and doc2 as its input and calculates their similarity as follows:
- For each document, the related set of entities is retrieved. The outputs of this step are the sets E1 and E2, respectively.
- For each pair of entities (i.e. for the cross product of the sets), the similarity score is computed.
- Only the maximum values are kept for computing the document similarity: for each entity in E1, the maximum similarity to any entity in E2 is kept, and vice versa.
- The similarity score between the two documents is the average of all these maximum similarities.
The similarity_function can be customized by the user.
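
The steps above can be sketched in Python as follows. This is a minimal illustration, not the framework's actual implementation: the names `document_similarity`, `cosine_similarity`, and the `vectors` mapping (entity name to embedding vector) are hypothetical, cosine similarity is only an assumed default for `similarity_function`, and the average is assumed to be taken over the maxima from both directions together.

```python
import math


def cosine_similarity(u, v):
    """Assumed default similarity; any function of two vectors can be passed instead."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def document_similarity(e1, e2, vectors, similarity_function=cosine_similarity):
    """Score two documents given their entity lists e1 and e2 (hypothetical helper)."""
    if not e1 or not e2:
        return 0.0
    # Similarity for every pair in the cross product of E1 and E2.
    scores = {
        (a, b): similarity_function(vectors[a], vectors[b])
        for a in e1
        for b in e2
    }
    # For each entity in E1, keep its best match in E2, and vice versa.
    max_e1 = [max(scores[(a, b)] for b in e2) for a in e1]
    max_e2 = [max(scores[(a, b)] for a in e1) for b in e2]
    # Average all kept maxima (assumption: both directions are pooled).
    all_max = max_e1 + max_e2
    return sum(all_max) / len(all_max)
```

For example, with `vectors = {"a": [1, 0], "b": [0, 1], "c": [1, 1]}`, calling `document_similarity(["a"], ["b", "c"], vectors)` averages three maxima: one for `a` and one each for `b` and `c`.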
The Document Similarity task ignores missing entities and computes the similarity only on entities that occur both in the gold standard dataset and in the input file.
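
One way to picture this filtering step is shown below. It is only a sketch under assumptions: `filter_known_entities` is a hypothetical helper, and `vectors` stands for the entities found in the input file, keyed by entity name.

```python
def filter_known_entities(entities, vectors):
    """Keep only the gold standard entities that also appear in the input file."""
    return [entity for entity in entities if entity in vectors]


# Entity "x" has no vector in the input file, so it is dropped.
gold_entities = ["a", "x", "b"]
input_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
usable = filter_known_entities(gold_entities, input_vectors)
```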
Metric | Range | Interpretation |
---|---|---|
Pearson correlation coefficient (P_cor) | [-1,1] | Values close to -1 or 1: strong correlation; values close to 0: no correlation |
Spearman correlation coefficient (S_cor) | [-1,1] | Values close to -1 or 1: strong correlation; values close to 0: no correlation |
Harmonic mean of P_cor and S_cor | [-1,1] | Values close to -1 or 1: strong correlation; values close to 0: no correlation |
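
The three metrics can be illustrated with a small pure-Python sketch. The function names are hypothetical and this is not the framework's own code. Two points are assumptions: ties in the Spearman ranking receive average ranks, and the harmonic mean is computed as 2·P_cor·S_cor / (P_cor + S_cor), returning 0 when the denominator is 0.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank sequences."""
    return pearson(average_ranks(x), average_ranks(y))


def harmonic_mean(p_cor, s_cor):
    """Harmonic mean of the two coefficients (assumed 0 if they cancel out)."""
    if p_cor + s_cor == 0:
        return 0.0
    return 2 * p_cor * s_cor / (p_cor + s_cor)
```

With gold scores `[1, 2, 3, 4, 5]` and predicted scores `[1, 4, 9, 16, 25]`, the Spearman coefficient is 1 because the ordering is identical, while the Pearson coefficient is slightly below 1 because the relationship is not linear.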