last_id is silently not supported for Subject.where() #212

Open
mwalmsley opened this issue Apr 4, 2019 · 2 comments

@mwalmsley
Collaborator

mwalmsley commented Apr 4, 2019

Adding last_id={id} to Subject.where() appears to have no effect and raises no error.

Test Case

Executing:

`from panoptes_client import Subject

subjects = Subject.where(
    scope='project',
    project_id='5733'
)

for n in range(10):
    s = subjects.next()
    print(s)`

Gives the following result:

<Subject 30091684>
<Subject 30091682>
<Subject 30091673>
<Subject 30091670>
<Subject 30091664>
<Subject 30091662>
<Subject 30091656>
<Subject 30091654>
<Subject 30091645>
<Subject 30091641>

Adding last_id=30091682 gives the same result as above.

mwalmsley added the bug label Apr 4, 2019
@camallen
Contributor

camallen commented Apr 4, 2019

This is due to the subject API resource lacking the optimized last_id support. That support was added to speed up the classifications API, but it should be ported to each resource.

Paging through a resource's result sets via next / previous links is the standard mechanism for all resources, and the subject resource does work this way. Does that meet your use case here?
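As a rough illustration of the paging approach: assuming the iterator returned by Subject.where() follows the next links lazily as it is consumed, a bounded number of subjects can be pulled without any last_id support. The helper name first_n is hypothetical, and the client usage appears only in comments because it needs network access.

```python
from itertools import islice


def first_n(results, n):
    """Return up to n items from any iterator, consuming no more than n."""
    return list(islice(results, n))


# Assumed usage with the Python client (not run here):
#   from panoptes_client import Subject
#   subjects = Subject.where(scope='project', project_id='5733')
#   for subject in first_n(subjects, 10):
#       print(subject)
```

Because islice stops after n items, later pages should only be requested if the loop actually reaches them.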

@mwalmsley
Collaborator Author

mwalmsley commented Apr 9, 2019

I think that I didn't give enough thought to what I actually needed to accomplish here.

I realised that in order for iteratively downloaded (yay for last_id) classifications to be useful, I need the metadata from the subject to link those classifications back to the science catalog.

classification <-(links.subject, subject_id)-> subject (metadata.science_id, science_id) <- science_catalog
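The linkage above amounts to a two-step lookup, sketched here with plain dictionaries. The function name, the two lookup tables and the sample values are all hypothetical; only the links.subjects and metadata.science_id fields come from the discussion.

```python
def science_row_for(classification, subjects_by_id, catalog_by_science_id):
    """Follow classification -> subject -> science catalog row."""
    # Only valid for single-subject projects: take the first linked subject.
    subject_id = classification['links']['subjects'][0]
    subject = subjects_by_id[subject_id]
    science_id = subject['metadata']['science_id']
    return catalog_by_science_id[science_id]


# Made-up sample data, for illustration only.
subjects_by_id = {'30091684': {'metadata': {'science_id': 'example-1'}}}
catalog_by_science_id = {'example-1': {'science_id': 'example-1'}}
classification = {'links': {'subjects': ['30091684']}}
row = science_row_for(classification, subjects_by_id, catalog_by_science_id)
```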

My first thought was to download all new subjects with last_id - but of course, that's not how subjects work! Old subjects can get new classifications.

Paging would work to download all subjects, but doing that daily would be slow and would repeat calls for subjects that were already downloaded.

My current solution is to get the specific subject for each new classification:

subject_id = classification['links']['subjects'][0] # only works for single-subject projects

subject = get_subject(project_id, subject_id) # assume id is unique

classification['links']['subject'] = subject.raw

save_classification_to_file(classification, save_loc)

and decorate get_subject (which simply calls the Python client) with a large lru_cache, on the assumption that the same subjects tend to recur at similar times (i.e. those in the currently active subject set).

This saves me having to maintain an up-to-date duplicate database of all subjects, but is a bit slow vs. the optimised classification interface.
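A minimal sketch of that caching idea, not the actual code used here: the fetch function is passed in so the example runs without network access, and make_cached_subject_getter, attach_subject and the cache size are all assumptions. With the Python client, fetch could plausibly be something like lambda project_id, subject_id: Subject.find(subject_id).raw.

```python
from functools import lru_cache


def make_cached_subject_getter(fetch, maxsize=100_000):
    """Wrap fetch(project_id, subject_id) in an LRU cache."""
    @lru_cache(maxsize=maxsize)
    def get_subject(project_id, subject_id):
        return fetch(project_id, subject_id)
    return get_subject


def attach_subject(classification, get_subject, project_id):
    """Return a copy of the classification with the subject details attached."""
    # Only works for single-subject projects.
    subject_id = classification['links']['subjects'][0]
    enriched = dict(classification)
    enriched['links'] = dict(
        classification['links'],
        subject=get_subject(project_id, subject_id),
    )
    return enriched
```

Repeated classifications of the same subject then trigger a single fetch while the entry stays in the cache, and get_subject.cache_info() reports the hit rate, which gives a quick check on whether the "subjects recur at similar times" assumption holds.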

I would guess that wanting to get the subject details along with the classification details would be quite useful for others, though I'm not sure how best to implement this.
