
Fix import channel in Postgresql #12709

Open · wants to merge 4 commits into base: release-v0.17.x
Conversation

@jredrejo (Member) commented Oct 8, 2024

Summary

When importing channel data, the psycopg2 `execute_values` function was used. For better performance, this function encodes all strings to bytes before building the query.
However, a single UTF-8 character can encode to more than one byte, so a string that fits within a column's character limit can exceed it once encoded.
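
A quick illustration of the problem in plain Python (the string is just an example):

```python
>>> s = "químicas"
>>> len(s)                    # 8 characters
8
>>> len(s.encode("utf-8"))    # 9 bytes: "í" encodes to two bytes in UTF-8
9
```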

This PR:

  • Replaces the use of `execute_values` with `executemany` (see the sketch below)
  • Since SQLite does not enforce column character limits, ensures the limit is applied to the data before it is inserted into PostgreSQL
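
A minimal sketch of the substitution, assuming a hypothetical `tag` table and illustrative connection details; this is not the PR's exact code:

```python
import psycopg2
from psycopg2.extras import execute_values  # the call being replaced

conn = psycopg2.connect("dbname=kolibri")  # connection string is illustrative
rows = [("0c20e2eb254b4070a713da63380ff0a3", "velocidad de reacciones químicas")]

with conn.cursor() as cur:
    # Before: execute_values splices pre-encoded rows into a single statement,
    # which is fast, but (per the summary above) the byte encoding is what
    # trips the column character limit.
    # execute_values(cur, "INSERT INTO tag (id, tag_name) VALUES %s", rows)

    # After: executemany runs the parametrized statement once per row, passing
    # each string through as an ordinary unicode parameter.
    cur.executemany("INSERT INTO tag (id, tag_name) VALUES (%s, %s)", rows)

conn.commit()
```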

Note: I had to cherry-pick the commit from #12466 to fix the docs builds in GitHub.

References

Closes: #11780

Reviewer guidance

Do tests pass?
To verify the fix: this channel previously failed to import and can be used on a Kolibri installation running PostgreSQL:

```
kolibri manage importchannel network --baseurl=https://studio.learningequality.org 07cd1633691b4473b6fda08caf826253
```

Testing checklist

  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Critical and brittle code paths are covered by unit tests

PR process

  • PR has the correct target branch and milestone
  • PR has 'needs review' or 'work-in-progress' label
  • If PR is ready for review, a reviewer has been added. (Don't use 'Assignees')
  • If this is an important user-facing change, PR or related issue has a 'changelog' label
  • If this includes an internal dependency change, a link to the diff is provided

Reviewer checklist

  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

@jredrejo jredrejo added the TODO: needs review Waiting for review label Oct 8, 2024
@jredrejo jredrejo added this to the Kolibri 0.17: Planned Patch 2 milestone Oct 8, 2024
@jredrejo jredrejo requested a review from rtibbles October 8, 2024 17:52
@github-actions github-actions bot added the DEV: backend Python, databases, networking, filesystem... label Oct 8, 2024
…+ add required extension sphinxcontrib.jquery
@rtibbles (Member) left a comment

I am not sure that we are properly setting psycopg2 up with unicode handling as described here? https://www.psycopg.org/docs/usage.html#unicode-handling

This might point to a way to handle this in a way that doesn't cause a huge performance regression.
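
For reference, the registration that docs page describes looks like this (it is only needed on Python 2; on Python 3 psycopg2 returns text as unicode by default):

```python
import psycopg2.extensions

# Cast PostgreSQL text types to unicode strings globally,
# per the psycopg2 "unicode handling" docs linked above.
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
```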

Also, I think we should add a regression test for the specific case we are fixing here - importing a unicode string for a node title that is at the max length.

kolibri/core/content/utils/channel_import.py (outdated review thread, resolved)
@jredrejo (Member, Author) replied:

> Also, I think we should add a regression test for the specific case we are fixing here - importing a unicode string for a node title that is at the max length.

Done

@rtibbles (Member) left a comment

Just a couple of questions to make sure the tests are doing what they ought. If the answer to my second question is yes, then we might need two test cases: one with a UTF-8 string that is short enough in characters but would overflow if converted to bytes, and another where the tag is just too long regardless.

I will manually test this to compare speed before and after.
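
A sketch of what those two cases could look like (MAX_LEN = 30 comes from the model discussion below; the sample strings are illustrative, not from the PR):

```python
MAX_LEN = 30  # tag_name max_length in the model

# Case 1: within the character limit, but over it once encoded to UTF-8 bytes
short_but_wide = "á" * MAX_LEN            # 30 chars, 60 bytes in UTF-8
assert len(short_but_wide) == MAX_LEN
assert len(short_but_wide.encode("utf-8")) > MAX_LEN

# Case 2: too long in plain characters, regardless of encoding
plainly_too_long = "x" * 40               # 40-char tags were observed in SQLite
assert len(plainly_too_long) > MAX_LEN
```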

```
},
{
    "id": "0c20e2eb254b4070a713da63380ff0a3",
    "tag_name": "velocidad de reacciones químicas"
```
@rtibbles (Member) commented:

This looks like it's still ASCII rather than having any UTF-8 characters?

@jredrejo (Member, Author) replied:

Yep, this test just checks that 32 chars are cut to 30. I used a real case from channel 07cd1633691b4473b6fda08caf826253 for the pleasure of using real data.

```python
):
    max_length = column_obj.type.length
    if max_length is not None:
        value = value[:max_length] if value is not None else default
```
@rtibbles (Member) commented:

Is this to handle cases where, even with proper handling of UTF-8, the tag name is too long because of a lack of checks in the generation of the SQLite db?

@jredrejo (Member, Author) replied:

Yes, while testing this I've seen tag names 40 characters long in SQLite, while max_length is 30 in the model.
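
A self-contained sketch of the truncation logic in the quoted diff (the function wrapper and its name are assumptions; only the body mirrors the fragment above):

```python
def clamp_to_column_length(column_obj, value, default=None):
    """Clamp a value to the column's declared max length before insert.

    column_obj is assumed to be a SQLAlchemy Column; String types expose
    their declared limit as type.length (None when unbounded).
    """
    max_length = getattr(column_obj.type, "length", None)
    if max_length is not None:
        return value[:max_length] if value is not None else default
    return value
```

Applied to the 40-character tag names mentioned above, this cuts them down to the model's 30-character limit before the PostgreSQL insert.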

Labels
DEV: backend · TODO: needs review

2 participants