Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/vapaaehtoistyö.fi importer #456

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

Cap-JaTu
Copy link
Contributor

Adding new importer for Vapaaehtoistyö.fi volunteering tasks.

Note: Setting-variable VAPAAEHTOISTYOFI_API_KEY is required for successful access to the resource.

dankiki and others added 6 commits April 13, 2021 12:20
The first one is based on memcached and regex. The columns of interest are collected in a separate memcached store which is being populated via cron job. The query string is then parsed and transformed into a regex which is applied to the cache and retrieves the event ids. Second search is based on the Postgres full-text search functionality and hence is language specific. Separate tsvector columns are created in the db for Finnish, Swedish, and English. They are populated and kept up-to-date on the db side, see migration 0080.
urls.py has a small correction of the imports, not related to the main topic of the commit.
KeywordMatcher takes as input the strings of words, mostly without language specification, and has to transform them into a list of Keywords. This commit suggests the following logic: first an exactly matching KeywordLabel is searched for irrespective of the language. If no exact matches found, Postgres full-text search is used to find a label matched by lexeme. The results are ranked with TrigramSimilarity as ft SearchRank is not suitable for ranking matched individual words. If no language is passed we cycle through all the options as specified in FULLTEXT_SEARCH_LANGUAGES and select the best match according to similarity. If no match is found the string is checked for the possibility that it could be split and search for matches is repeated.
@codecov-commenter
Copy link

codecov-commenter commented Apr 30, 2021

Codecov Report

Merging #456 (57a31f2) into master (307369c) will decrease coverage by 0.34%.
The diff coverage is 13.40%.

❗ Current head 57a31f2 differs from pull request most recent head 6bd3a28. Consider uploading reports for the commit 6bd3a28 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #456      +/-   ##
==========================================
- Coverage   22.48%   22.13%   -0.35%     
==========================================
  Files         169      177       +8     
  Lines       11986    12480     +494     
==========================================
+ Hits         2695     2763      +68     
- Misses       9291     9717     +426     
Impacted Files Coverage Δ
...rter/helper/importers/vapaaehtoistyofi/__init__.py 0.00% <0.00%> (ø)
...porter/helper/importers/vapaaehtoistyofi/reader.py 0.00% <0.00%> (ø)
...porter/helper/importers/vapaaehtoistyofi/record.py 0.00% <0.00%> (ø)
events/importer/vapaaehtoistyofi.py 0.00% <0.00%> (ø)
events/keywords.py 0.00% <0.00%> (ø)
.../management/commands/populate_local_event_cache.py 0.00% <0.00%> (ø)
linkedevents/urls.py 81.25% <0.00%> (-6.25%) ⬇️
events/tests/test_event_get.py 10.51% <8.00%> (-0.11%) ⬇️
events/api.py 51.62% <23.43%> (-1.92%) ⬇️
events/tests/conftest.py 66.66% <76.47%> (-0.59%) ⬇️
... and 14 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 307369c...6bd3a28. Read the comment docs.

Copy link

@nikomakela nikomakela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to request unit tests, because they would have already been for use while reviewing. They would explain the purpose, demo the usage and proof the working.

Some comments added already to #455 which seems to be included in this PR also.

Comment on lines 111 to 133
def to_dict(self):
newrecord = {
"id": self.id,
"organization_id": self.organization_id,
"organization_name": self.organization_name,
"title": self.title,
"address": self.address,
"address_coordinates": self.address_coordinates,

"no_time": self.no_time,
"timestamp_start": self.timestamp_start,
"timestamp_end": self.timestamp_end,

"description": self.description,
"contact_details": self.contact_details,
"themes": self.themes,
"timestamp_publish": self.timestamp_publish,

"status": self.status,
"creator_id": self.creator_id,
"timestamp_inserted": self.timestamp_inserted,
"timestamp_updated": self.timestamp_updated
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the only place this to_dict() -method is used?

    def _import_event(self, event_obj):
        event = event_obj.to_dict()

Doesn't __dict__ do the same?

In [50]: r1.__class__
Out[50]: events.importer.helper.importers.vapaaehtoistyofi.record.Record

In [51]: r1.__dict__
Out[51]:
{'id': 'id',
 'organization_id': 'organization_id',
 'organization_name': 'organization_name',
 'title': 'title',
 'address': 'address',
 'address_coordinates': 'address_coordinates',
 'tags': [],
 'no_time': None,
 'timestamp_start': None,
 'timestamp_end': None,
 'description': None,
 'contact_details': None,
 'themes': None,
 'timestamp_publish': None,
 'status': None,
 'creator_id': None,
 'timestamp_inserted': None,
 'timestamp_updated': None}

In [52]: len(r1.__dict__)
Out[52]: 18

In [53]: len(r1.to_dict())
Out[53]: 17

Tags is missing from __dict__, but is it missing in purpose and if it is, it would be better to just delete it from the new instance done with __dict__.

Comment on lines 85 to 107
def __copy__(self) -> object:
newrecord = Record(None)
newrecord.id = self.id
newrecord.organization_id = self.organization_id
newrecord.organization_name = self.organization_name
newrecord.title = self.title
newrecord.address = self.address
newrecord.address_coordinates = self.address_coordinates
newrecord.tags = self.tags

newrecord.no_time = self.no_time
newrecord.timestamp_start = self.timestamp_start
newrecord.timestamp_end = self.timestamp_end

newrecord.description = self.description
newrecord.contact_details = self.contact_details
newrecord.themes = self.themes
newrecord.timestamp_publish = self.timestamp_publish

newrecord.status = self.status
newrecord.creator_id = self.creator_id
newrecord.timestamp_inserted = self.timestamp_inserted
newrecord.timestamp_updated = self.timestamp_updated

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed? Doesn't self.__copy__() already do the same?

In [50]: r1.__class__
Out[50]: events.importer.helper.importers.vapaaehtoistyofi.record.Record

In [51]: r1.__dict__
Out[51]:
{'id': 'id',
 'organization_id': 'organization_id',
 'organization_name': 'organization_name',
 'title': 'title',
 'address': 'address',
 'address_coordinates': 'address_coordinates',
 'tags': [],
 'no_time': None,
 'timestamp_start': None,
 'timestamp_end': None,
 'description': None,
 'contact_details': None,
 'themes': None,
 'timestamp_publish': None,
 'status': None,
 'creator_id': None,
 'timestamp_inserted': None,
 'timestamp_updated': None}

In [57]: r2 = r1.__copy__()

In [58]: r2.__dict__ == r1.__dict__
Out[58]: True

I could have tried if there would be unit tests for this. ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My Python:

AttributeError: 'Record' object has no attribute '__copy__'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one solved by dropping __copy().

Comment on lines 16 to 17
def __init__(self, json_dict):
if json_dict:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it better to destruct the json before initializing a new class instance or at least use another constructor method like from_json(json_object). So I would map the json values to the class instance with a json serializer. Then the class instance creation would be as simple as Record(**record_dict).

@nikomakela nikomakela requested a review from dankiki April 30, 2021 13:55
@Cap-JaTu
Copy link
Contributor Author

Cap-JaTu commented May 3, 2021

I'll start working with the requested changes.

@quyenlq quyenlq self-requested a review May 7, 2021 11:32
Copy link

@quyenlq quyenlq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some changes in this PR that belongs to another feature so it's quite hard to understand the whole thing without looking at #443

return self.entries[id]

def _load_entry_api(self, id):
http_client = self._setup_client()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO It'd be better to add the requests.Session() client to the Reader class property, then set it up in the constructor function init(), that way you don't have to call the _setup_client() before making every HTTP call, just use self.client_session. Isn't it the goal of using Session 😉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 6bd3a28

http_client = self._setup_client()
page = 1
total_records = None
batch_size = 1
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd get rid of batch_size here, change the while loop to True, then break the loop if len(data['data']['records']) == 0. This could make this shorter and a bit easier to understand

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This suggestion is heavily opinionated.
I typically tend not to use while True -stucture as it can yield a forever-loop.

response.status_code)

data = response.json()
if 'status' not in data or data['status'] != "ok":
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part repeats quite many times, maybe consider moving it to a separated private function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added check-function in 6bd3a28

raise ValueError("Unknown locale '%s'!" % locale)
return self.TASK_URL_FORMAT % (locale, self.id)

def __copy__(self) -> object:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the use of this function in the PR?
For copying objects, there are a copy.copy() or copy.deepcopy() already available in python copy module

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__copy__() is already dropped in 65c526d.

@Cap-JaTu
Copy link
Contributor Author

There are some changes in this PR that belongs to another feature so it's quite hard to understand the whole thing without looking at #443

The reason #443 files have been copied here is to get database migration to work at all.

Exactly the same is happening with unit tests, they do not work. At all. Assumption is for manage.py test to run unit tests, but instead an error from module helevents is spat out. Running manage.py test events isn't much better:

psycopg2.errors.UndefinedColumn: column munigeo_administrativedivision.name_zh_hans does not exist
LINE 1: ..._sv", "munigeo_administrativedivision"."name_en", "munigeo_a...

My only conclusion is, that codebase isn't in a shape for unit testing to work for this branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants