
PaNOSC Search Scoring V2.x #29

Draft: wants to merge 19 commits into base: master
Conversation

@nitrosx (Collaborator) commented Feb 20, 2023

This PR implements incremental weight computation, which should scale better to a higher number of datasets.
It still needs a lot of testing and in-depth review.
I created it so that other people can work on it, as I do not have time to work on this at the moment.

@VKTB commented Feb 22, 2023

Hi @nitrosx, posting my findings here as they will hopefully be useful to anyone that may be testing/reviewing/working on this PR.

From your last email, my understanding is that the /compute endpoint no longer has to be called for the weights to be computed. If I understood you correctly, this PR changes the logic so that when a new item is inserted or an existing one is updated, the database automatically updates all the components of the weights affected by the change. Likewise, when a query is sent, the database computes, on the fly, the weights of the terms extracted from the query that are present in the items, and returns the relevant ones.
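
As a rough illustration of that incremental scheme (an in-memory sketch with made-up structures, not the PR's actual MongoDB implementation), insertion only touches the tf/df entries for the new item's terms, and scoring reads the stored counts at query time:

```python
import math
from collections import defaultdict

# In-memory stand-ins for the tf/idf collections (hypothetical structures;
# the PR itself keeps these in MongoDB).
tf = defaultdict(dict)   # term -> {item_id: raw count in that item}
df = defaultdict(int)    # term -> number of items containing the term
items = set()

def insert_item(item_id, terms):
    """On insert, touch only the tf/df entries for this item's terms."""
    items.add(item_id)
    for term in set(terms):
        df[term] += 1
    for term in terms:
        tf[term][item_id] = tf[term].get(item_id, 0) + 1

def score(query_terms):
    """At query time, compute tf-idf scores on the fly from stored counts."""
    n = len(items)
    scores = defaultdict(float)
    for term in set(query_terms):
        if term not in df:
            continue
        idf = math.log(n / df[term])
        for item_id, count in tf[term].items():
            scores[item_id] += count * idf
    return dict(scores)

insert_item("pid:123", ["propos", "part"])
insert_item("pid:456", ["part"])
print(score(["propos"]))  # only pid:123 matches, with weight 1 * log(2/1)
```

The point of the design is that an insert or update costs work proportional to the terms of the touched item, instead of recomputing every weight as the old /compute endpoint did.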

Listed below are the things I did and my findings:

  • I deleted the whole database and started from scratch.
  • I updated the configuration file to include incrementalWeightsComputation, which I set to True.
  • I populated the search scoring component with 100 documents by sending a POST request to the /items endpoint with the documents in the body.
  • I know that one of the documents has the id pid:123 and a summary that starts with "This proposal is part of a ...", so I sent a POST request to the /score endpoint with the following JSON: {"query": "This proposal is part of a"}.
  • I was expecting to get back the item (along with a score) that has the id pid:123 and the summary "This proposal is part of a ...", however, as shown below, I did not get any items back.
{
    "request": {
        "query": "This proposal is part of a",
        "itemIds": [],
        "group": "",
        "limit": -1
    },
    "query": {
        "query": "This proposal is part of a",
        "terms": [
            "propos",
            "part"
        ]
    },
    "scores": [],
    "dimension": 0,
    "computeInProgress": false,
    "started": "2023-02-22T16:12:22.575992",
    "ended": "2023-02-22T16:12:22.581431"
}
  • I also tested it by specifying all the parameters in the JSON ({ "query": "This proposal is part of a", "group": "Documents", "limit": 1000, "itemIds": ["pid:123"] }), but I did not get any items back.
  • I also did not get any items back after modifying the values of some fields of the document that has the id pid:123 and the summary "This proposal is part of a ...". It is also worth mentioning that the PATCH /items/<id> endpoint does not behave as expected: it replaces the entire item rather than updating only the fields supplied in the request.
  • I was constantly checking the database while populating it with items and sending score requests, and I could only ever see the items collection in it, so no collections for weights etc.
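
One thing the empty response above does show is the query analysis: "This proposal is part of a" is reduced to the terms "propos" and "part", i.e. stopwords are dropped and the remaining words are stemmed. The following toy sketch (hypothetical stopword list and suffix rule, not the service's actual stemmer, which is presumably a proper Porter-style one) reproduces that reduction:

```python
# Toy stand-ins: a tiny stopword list and a single suffix rule that happens
# to map "proposal" -> "propos", mimicking the terms seen in the response.
STOPWORDS = {"this", "is", "of", "a", "the", "and"}
SUFFIXES = ("al",)

def extract_terms(query):
    terms = []
    for word in query.lower().split():
        if word in STOPWORDS:
            continue
        for suffix in SUFFIXES:
            # Only strip when a reasonably long stem remains.
            if word.endswith(suffix) and len(word) > len(suffix) + 3:
                word = word[: -len(suffix)]
                break
        terms.append(word)
    return terms

print(extract_terms("This proposal is part of a"))  # ['propos', 'part']
```

This matters for debugging: a /score request can only ever match on the stemmed terms, so an empty scores list with non-empty terms points at the weights, not at the query parsing.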

@nitrosx (Collaborator, Author) commented Feb 22, 2023

@VKTB thank you so much for testing the new version and the details.
Would you be able to do the following on your testing environment:

  • Make a GET on /items and see if you get all your items back
  • Make a GET on /terms/count and check how many terms have been extracted
  • Make a GET on /terms and check the output.

You could always connect to the database directly and see if there is any entry in the tf and idf collections.

Let me know

@VKTB commented Feb 22, 2023

@nitrosx Thank you for your reply.

  • Make a GET on /items and see if you get all your items back

Yes, I can see all the items that I posted to the search scoring component.

  • Make a GET on /terms/count and check how many terms have been extracted

I get a 500 - Internal Server Error

  • Make a GET on /terms and check the output.

I get an empty list back ([]), presumably because no terms were created when I inserted the items or modified the item with the id pid:123?

You could always connect to the database directly and see if there is any entry in the tf and idf collections

As I said in my previous comment, I can only see the items collection in the database, so there are no collections for weights, tf, idf, etc., and I am not sure why this is the case.

@VKTB commented Jul 4, 2023

Hi @nitrosx, I thought I would post my findings here as well in case they are useful to anyone that may be testing/reviewing/working on this PR.

I pulled the latest changes from the v2.x branch and modified the docker-compose.yml file to build from the Dockerfile to ensure that the Docker image uses the latest code changes. I then tried testing the changes but I am getting the following error when I post an item of group Documents to the /items endpoint: 400 Bad Request – An exception of type TypeError occurred. Arguments:\n('string indices must be integers',).
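
That particular TypeError usually means some Python code indexed a string where it expected a dict, for example by iterating over a single JSON object where a list of objects was expected. This is a guess at the failure mode, not a confirmed diagnosis:

```python
# Hypothetical payload shaped like a single item from this thread.
item = {"id": "pid:123", "group": "Documents"}

errors = []
# If a handler expects a LIST of items but receives one dict, iterating
# yields the key strings, and indexing a string with a string raises the
# exact TypeError reported above:
for entry in item:        # entry is "id", then "group" (both strings)
    try:
        entry["group"]
    except TypeError as exc:
        errors.append(str(exc))

print(errors[0])  # "string indices must be integers" (with a suffix on 3.11+)
```

If that is the cause here, wrapping the single item in a list before POSTing (or fixing the handler to accept both shapes) would be the thing to try.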

I can see that the item gets added to the items collection in the database, but judging from the entry (see below) in the status collection, the computation seems to be stuck: it has not changed for the past 30 minutes.

[
  {
    _id: ObjectId("64a3ed5180fa4d2d2668250e"),
    inProgress: true,
    incrementalWeightsComputation: true,
    progressDescription: 'Computing weights TF',
    progressPercent: 0.2
  }
]
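
A small helper like the following (hypothetical; the field names are taken from the status document above) can distinguish a genuinely stuck computation from a merely slow one by comparing two snapshots taken some time apart:

```python
def seems_stuck(earlier, later):
    """Compare two snapshots of the status document taken some time apart:
    'stuck' means still in progress with no visible change in progress."""
    return bool(
        earlier.get("inProgress")
        and later.get("inProgress")
        and earlier.get("progressPercent") == later.get("progressPercent")
        and earlier.get("progressDescription") == later.get("progressDescription")
    )

# Field names mirror the status entry above.
snap1 = {"inProgress": True, "progressDescription": "Computing weights TF",
         "progressPercent": 0.2}
snap2 = {"inProgress": True, "progressDescription": "Computing weights TF",
         "progressPercent": 0.2}
print(seems_stuck(snap1, snap2))  # True: no progress between snapshots
```

In this case the snapshot was identical after 30 minutes, which points at a hung computation rather than slow progress.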
