
Fix import channel in Postgresql #12709

Open · wants to merge 4 commits into base: release-v0.17.x
Conversation

@jredrejo (Member) commented Oct 8, 2024

Summary

When importing channel data, the psycopg2 `execute_values` function was used. For better performance, this function encodes all strings to bytes before building the query.
However, a single UTF-8 character can encode to more than one byte, so a string that fits within a column's character limit can exceed it once encoded.
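
A quick illustration of the problem in plain Python (the string is just an example):

```python
>>> s = "químicas"
>>> len(s)                    # 8 characters
8
>>> len(s.encode("utf-8"))    # 9 bytes: "í" encodes to two bytes in UTF-8
9
```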

This PR:

  • Replaces the use of `execute_values` with `executemany` (see the sketch below)
  • Since SQLite does not enforce column character limits, ensures the limit is applied to the data before it is inserted into PostgreSQL
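
A minimal sketch of the substitution, assuming a hypothetical `tag` table and illustrative connection details; this is not the PR's exact code:

```python
import psycopg2
from psycopg2.extras import execute_values  # the call being replaced

conn = psycopg2.connect("dbname=kolibri")  # connection string is illustrative
rows = [("0c20e2eb254b4070a713da63380ff0a3", "velocidad de reacciones químicas")]

with conn.cursor() as cur:
    # Before: execute_values splices pre-encoded rows into a single statement,
    # which is fast, but (per the summary above) the byte encoding is what
    # trips the column character limit.
    # execute_values(cur, "INSERT INTO tag (id, tag_name) VALUES %s", rows)

    # After: executemany runs the parametrized statement once per row, passing
    # each string through as an ordinary unicode parameter.
    cur.executemany("INSERT INTO tag (id, tag_name) VALUES (%s, %s)", rows)

conn.commit()
```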

Note: I had to cherry-pick the commit from #12466 to fix the docs builds in GitHub.

References

Closes: #11780

Reviewer guidance

Do tests pass?
To verify the fix: this channel previously failed to import and can be used on a Kolibri installation running PostgreSQL:

```
kolibri manage importchannel network --baseurl=https://studio.learningequality.org 07cd1633691b4473b6fda08caf826253
```

Testing checklist

  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Critical and brittle code paths are covered by unit tests

PR process

  • PR has the correct target branch and milestone
  • PR has 'needs review' or 'work-in-progress' label
  • If PR is ready for review, a reviewer has been added. (Don't use 'Assignees')
  • If this is an important user-facing change, PR or related issue has a 'changelog' label
  • If this includes an internal dependency change, a link to the diff is provided

Reviewer checklist

  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

@jredrejo jredrejo added the TODO: needs review Waiting for review label Oct 8, 2024
@jredrejo jredrejo added this to the Kolibri 0.17: Planned Patch 2 milestone Oct 8, 2024
@jredrejo jredrejo requested a review from rtibbles October 8, 2024 17:52
@github-actions github-actions bot added the DEV: backend Python, databases, networking, filesystem... label Oct 8, 2024
…+ add required extension sphinxcontrib.jquery
@rtibbles (Member) left a comment

I am not sure that we are properly setting psycopg2 up with unicode handling as described here? https://www.psycopg.org/docs/usage.html#unicode-handling

This might point to a way to handle this in a way that doesn't cause a huge performance regression.
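
For reference, the registration that docs page describes looks like this (it is only needed on Python 2; on Python 3 psycopg2 returns text as unicode by default):

```python
import psycopg2.extensions

# Cast PostgreSQL text types to unicode strings globally,
# per the psycopg2 "unicode handling" docs linked above.
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
```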

Also, I think we should add a regression test for the specific case we are fixing here - importing a unicode string for a node title that is at the max length.

kolibri/core/content/utils/channel_import.py (outdated review thread, resolved)
@jredrejo (Member, Author) replied:

> Also, I think we should add a regression test for the specific case we are fixing here - importing a unicode string for a node title that is at the max length.

Done

@rtibbles (Member) left a comment

Just a couple of questions to make sure the tests are doing what they ought. If the answer to my second question is yes, then we might need two test cases: one with a UTF-8 string that is short enough in characters but would overflow if converted to bytes, and another where the tag is just too long regardless.

I will manually test this to compare speed before and after.
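
A sketch of what those two cases could look like (MAX_LEN = 30 comes from the model discussion below; the sample strings are illustrative, not from the PR):

```python
MAX_LEN = 30  # tag_name max_length in the model

# Case 1: within the character limit, but over it once encoded to UTF-8 bytes
short_but_wide = "á" * MAX_LEN            # 30 chars, 60 bytes in UTF-8
assert len(short_but_wide) == MAX_LEN
assert len(short_but_wide.encode("utf-8")) > MAX_LEN

# Case 2: too long in plain characters, regardless of encoding
plainly_too_long = "x" * 40               # 40-char tags were observed in SQLite
assert len(plainly_too_long) > MAX_LEN
```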

```
},
{
    "id": "0c20e2eb254b4070a713da63380ff0a3",
    "tag_name": "velocidad de reacciones químicas"
```
@rtibbles (Member) commented:

This looks like it's still ASCII rather than having any UTF-8 characters?

@jredrejo (Member, Author) replied:

Yep, this test just checks that 32 chars are cut to 30. I used a real case from channel 07cd1633691b4473b6fda08caf826253 for the pleasure of using real data.

```python
):
    max_length = column_obj.type.length
    if max_length is not None:
        value = value[:max_length] if value is not None else default
```
@rtibbles (Member) commented:

Is this to handle cases where, even with proper handling of UTF-8, the tag name is too long because of a lack of checks in the generation of the SQLite db?

@jredrejo (Member, Author) replied:

Yes, while testing this I've seen tag names 40 characters long in SQLite, while max_length is 30 in the model.
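
A self-contained sketch of the truncation logic in the quoted diff (the function wrapper and its name are assumptions; only the body mirrors the fragment above):

```python
def clamp_to_column_length(column_obj, value, default=None):
    """Clamp a value to the column's declared max length before insert.

    column_obj is assumed to be a SQLAlchemy Column; String types expose
    their declared limit as type.length (None when unbounded).
    """
    max_length = getattr(column_obj.type, "length", None)
    if max_length is not None:
        return value[:max_length] if value is not None else default
    return value
```

Applied to the 40-character tag names mentioned above, this cuts them down to the model's 30-character limit before the PostgreSQL insert.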

Labels
DEV: backend · TODO: needs review

2 participants