
Extract content URLs #4604

Merged

Conversation


@akolson akolson (Member) commented Jul 11, 2024

Summary

Description of the change(s) you made

This PR implements the logic responsible for extracting the content URLs for content nodes. It also performs code clean-up and optimization tasks.

Manual verification steps performed

N/A

Does this introduce any tech-debt items?

No

References

Closes #4455

Comments


Contributor's Checklist

PR process:

  • If this is an important user-facing change, the CHANGELOG label has been added to this PR or the related issue. Note: items with this label will be added to the CHANGELOG at a later time
  • If this includes an internal dependency change, a link to the diff is provided
  • The docs label has been added if this introduces a change that needs to be updated in the user docs
  • If any Python requirements have changed, the updated requirements.txt files are also included in this PR
  • Opportunities for using Google Analytics here are noted
  • Migrations are safe for a large db

Studio-specific:

  • All user-facing strings are translated properly
  • The notranslate class has been added to elements that shouldn't be translated by Google Chrome's automatic translation feature (e.g. icons, user-generated text)
  • All UI components are LTR and RTL compliant
  • Views are organized into pages, components, and layouts directories as described in the docs
  • Users' storage used is recalculated properly on any changes to main tree files
  • If there are new ways this uses user data that need to be factored into our Privacy Policy, it has been noted

Testing:

  • Code is clean and well-commented
  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Any new interactions have been added to the QA Sheet
  • Critical and brittle code paths are covered by unit tests

Reviewer's Checklist

This section is for reviewers to fill out.

  • Automated test coverage is satisfactory
  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

@akolson akolson marked this pull request as ready for review July 15, 2024 18:56
@akolson akolson requested a review from bjester July 15, 2024 18:56
nodes = self._extract_data(response)
cache = [
    RecommendationsCache(
        request=request,
Member:

Duplicating the request for each node returned isn't a scalable way to handle the cache. We should distill the request into something smaller that still represents the uniqueness of each request.

Member Author:

We now generate a unique hash for each request from the params and JSON body and store that instead.
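
For illustration, a minimal sketch of such a hashing approach; the method name and the request attributes (params, json) used here are assumptions, not necessarily Studio's exact implementation:

import hashlib
import json

def _generate_request_hash(self, request):
    # Serialize the params and JSON body deterministically so that
    # identical requests always produce the same hash
    unique_attributes = json.dumps({
        'params': request.params,
        'json': request.json,
    }, sort_keys=True).encode('utf-8')
    return hashlib.md5(unique_attributes).hexdigest()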

@akolson akolson requested a review from bjester July 24, 2024 20:45
@bjester bjester (Member) left a comment:

Looking improved! I left some comments about the override_threshold and its implications on all-the-things.

@akolson akolson requested a review from bjester July 30, 2024 13:59
@bjester bjester (Member) left a comment:

Looks great. Just some small things regarding the override_threshold handling and the cache.

request_hash = self._generate_request_hash(request)
data = list(
    RecommendationsCache.objects.filter(request_hash=request_hash)
    .order_by('rank')
Member:

Suggested change:
-    .order_by('rank')
+    .order_by('override_threshold', 'rank')

return self.backend.make_request(embed_topics_request)

request_hash = self._generate_request_hash(request)
data = list(
    RecommendationsCache.objects.filter(request_hash=request_hash)
Member:

Suggested change:
-    RecommendationsCache.objects.filter(request_hash=request_hash)
+    RecommendationsCache.objects.filter(request_hash=request_hash, override_threshold=request.params.get('override_threshold', False))

nodes = self._extract_data(response)
if len(nodes) > 0:
    node_ids = [node['contentnode_id'] for node in nodes]
    recommendations = list(ContentNode.objects.filter(id__in=node_ids))
Member:

We may not want to preemptively cast this to a list to get all the objects. Returning a queryset is very flexible.

On the frontend, we'll need access to the data, either directly from the API or via the public API, as long as we have the data points required here: https://github.com/learningequality/studio/blob/unstable/contentcuration/contentcuration/frontend/channelEdit/vuex/contentNode/actions.js#L52C55-L52C82
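
As an aside, a small illustration of why returning the queryset is preferable: querysets are lazy, so callers can keep composing before any SQL runs (names here are illustrative):

recommendations = ContentNode.objects.filter(id__in=node_ids)  # no query yet
page = recommendations.order_by('id')[:25]                     # still lazy
results = list(page)                                           # single query runs here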

@akolson akolson requested a review from bjester August 1, 2024 13:55
@bjester bjester (Member) left a comment:

Some defensiveness when writing to the cache would be worthwhile.

I'm also concerned about the code that builds the recommendations queryset.

null=True,
blank=True,
related_name='recommendations',
on_delete=models.SET_NULL,
Member:

What's the reasoning for setting this to null on deletion of the node?

Member Author:

In light of your comment here, I guess it defeats the purpose to set it to null and keep the record. I think a better approach would be to cascade the deletion?

Member:

Yeah I think cascade deletion should be fine.
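
For reference, a minimal sketch of what the cascade approach could look like on the cache model; the field and model wiring here is assumed from this discussion, not the final code:

from django.db import models

class RecommendationsCache(models.Model):
    contentnode = models.ForeignKey(
        'contentcuration.ContentNode',
        null=True,
        blank=True,
        related_name='recommendations',
        on_delete=models.CASCADE,  # deleting the node also deletes its cache rows
    )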

Comment on lines 213 to 226
# Get the channel_id from PublicContentNode based on matching node_id from ContentNode
channel_id_subquery = PublicContentNode.objects.filter(
    self._normalize_uuid(F('id')) == self._normalize_uuid(OuterRef('node_id'))
).values('channel_id')[:1]

# Get main_tree_id from Channel based on channel_id obtained from channel_id_subquery
main_tree_id_subquery = Channel.objects.filter(
    self._normalize_uuid(F('id')) == self._normalize_uuid(Subquery(channel_id_subquery))
).values('main_tree_id')[:1]

# Annotate main_tree_id onto ContentNode
recommendations = ContentNode.objects.filter(id__in=node_ids).annotate(
    main_tree_id=Subquery(main_tree_id_subquery)
).values('id', 'node_id', 'main_tree_id', 'parent_id')
@bjester bjester (Member) commented Aug 1, 2024:

This does not look correct to me. Are there tests covering this code?

channel_id_subquery has an OuterRef to node_id but is used in a Channel queryset subquery, which means the OuterRef should apply to the Channel model, which doesn't have a node_id.

Reiterating that PublicContentNode.id is the same as ContentNode.node_id, which seems like the intent in the first query.
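
To illustrate the scoping issue: OuterRef resolves against the queryset the subquery is embedded in, not the queryset where the subquery is defined. A hedged sketch:

from django.db.models import OuterRef, Subquery

# Embedded in a Channel queryset, OuterRef('node_id') would resolve against
# Channel, which has no node_id; it is only valid inside a ContentNode queryset.
channel_ids = PublicContentNode.objects.filter(
    id=OuterRef('node_id')  # PublicContentNode.id matches ContentNode.node_id
).values('channel_id')[:1]

ContentNode.objects.annotate(channel_id=Subquery(channel_ids))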

Member Author:

Made some changes to the query and added tests.

        override_threshold=override_threshold,
    ) for node in nodes if node['contentnode_id'] not in existing_cache
]
RecommendationsCache.objects.bulk_create(new_cache, ignore_conflicts=True)
@bjester bjester (Member) commented Aug 1, 2024:

Theoretically, having ignore_conflicts should mean you can omit checking existing_cache, because:

On databases that support it (all but Oracle), setting the ignore_conflicts parameter to True tells the database to ignore failure to insert any rows that fail constraints such as duplicate unique values. (ref)

So with the unique_together expression on request_hash and contentnode_id, it should ignore failing to create those on the overridden request.
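
A sketch of the simplified write path this allows, with field names assumed from the snippet above:

new_cache = [
    RecommendationsCache(
        request_hash=request_hash,
        contentnode_id=node['contentnode_id'],
        rank=node['rank'],
        override_threshold=override_threshold,
    )
    for node in nodes  # no existing_cache pre-check needed
]
# Duplicates on (request_hash, contentnode_id) are silently skipped by the database
RecommendationsCache.objects.bulk_create(new_cache, ignore_conflicts=True)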

Member:

It would be great to have a unit test to verify the previously described behavior.

Member Author:

I wrote tests for the cache that confirmed the above when ignore_conflicts is True or False. The only difference is that when ignore_conflicts=False, an exception is raised, bringing the creation to a halt. So I think we can safely remove the existing_cache check as long as ignore_conflicts=True.
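
A sketch of the kind of test described; the import path, model fields, and fixture helper here are assumptions:

from django.db.utils import IntegrityError
from django.test import TestCase

from automation.models import RecommendationsCache  # assumed import path

class RecommendationsCacheTestCase(TestCase):
    def setUp(self):
        self.node = create_contentnode()  # hypothetical fixture helper

    def test_ignore_conflicts_skips_duplicates(self):
        entry = dict(request_hash='abc123', contentnode=self.node, rank=1.0)
        RecommendationsCache.objects.create(**entry)

        # With ignore_conflicts=True the duplicate row is silently skipped
        RecommendationsCache.objects.bulk_create(
            [RecommendationsCache(**entry)], ignore_conflicts=True
        )
        self.assertEqual(RecommendationsCache.objects.count(), 1)

        # With ignore_conflicts=False the same insert raises an IntegrityError
        with self.assertRaises(IntegrityError):
            RecommendationsCache.objects.bulk_create([RecommendationsCache(**entry)])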

Member:

Nice! Sounds perfect.

@akolson akolson requested a review from bjester August 7, 2024 14:47
Comment on lines +215 to +241
channel_cte = With(
    Channel.objects.annotate(
        channel_id=self._cast_to_uuid(F('id'))
    ).filter(
        Exists(
            PublicContentNode.objects.filter(
                id__in=cast_node_ids,
                channel_id=OuterRef('channel_id')
            )
        )
    ).values(
        'main_tree_id',
        tree_id=F('main_tree__tree_id'),
    ).distinct()
)

recommendations = channel_cte.join(
    ContentNode.objects.filter(node_id__in=node_ids),
    tree_id=channel_cte.col.tree_id
).with_cte(channel_cte).annotate(
    main_tree_id=channel_cte.col.main_tree_id
).values(
    'id',
    'node_id',
    'main_tree_id',
    'parent_id',
)
Member Author:

@bjester, we should be good now I think?

@bjester bjester (Member) left a comment:

Nice work @akolson!

@akolson akolson merged commit 3d2f924 into learningequality:search-recommendations Aug 12, 2024
13 checks passed