Skip to content

batch saving genomes #212

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 30 commits into
base: master
Choose a base branch
from
Open

batch saving genomes #212

wants to merge 30 commits into from

Conversation

Xiangs18
Copy link
Contributor

No description provided.

Copy link

codecov bot commented Aug 16, 2024

Codecov Report

Attention: Patch coverage is 97.82609% with 2 lines in your changes missing coverage. Please review.

Project coverage is 80.30%. Comparing base (4819598) to head (a8bced6).

Files with missing lines Patch % Lines
lib/GenomeFileUtil/core/GenomeInterface.py 95.91% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #212      +/-   ##
==========================================
- Coverage   80.88%   80.30%   -0.59%     
==========================================
  Files          11       11              
  Lines        2998     3011      +13     
==========================================
- Hits         2425     2418       -7     
- Misses        573      593      +20     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Xiangs18
Copy link
Contributor Author

Xiangs18 commented Aug 16, 2024

self note:

a2fabf4 verifies that the refactored code works with the save_one_genome function.
330d6c2 verifies that the refactored code works with the save_genome_mass function.

@Xiangs18 Xiangs18 changed the title [WIP] add save_genomes add save_genomes Aug 17, 2024
@Xiangs18 Xiangs18 changed the title add save_genomes batch saving genomes Aug 17, 2024
@Xiangs18 Xiangs18 requested a review from MrCreosote August 17, 2024 06:22
Copy link
Member

@MrCreosote MrCreosote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't looked at tests yet, but this is already a lot of comments

EDIT: All comments addressed

@Xiangs18 Xiangs18 requested a review from MrCreosote September 10, 2024 23:50
ws_name_to_id_func: Callable[[str], int]
) -> Dict[str, Any]:
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""
f"""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I was wrong about this. From https://docs.python.org/3/reference/lexical_analysis.html#formatted-string-literals

Formatted string literals cannot be used as docstrings, even if they do not include expressions.

... which is a bummer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed formatted string literals

Returns:
Dict[str, Any]: A dictionary containing the workspace ID and the processed parameters. The dictionary
has keys '_WSID' and '_INPUTS', where '_WSID' is the workspace ID and '_INPUTS' is a list containing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
has keys '_WSID' and '_INPUTS', where '_WSID' is the workspace ID and '_INPUTS' is a list containing
has keys {_WSID} and {_INPUTS}, where {_WSID} is the workspace ID and {_INPUTS} is a list containing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed formatted string literals.

validate_params_func: Callable[[Dict[str, Any]], None]
) -> None:
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""
f"""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed formatted string literals.

Comment on lines 548 to 549
- _INPUTS: A list of parameter dictionaries, each of which must be validated by `validate_params_func`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- _WSID: A workspace ID, which must be present and valid.
- _INPUTS: A list of parameter dictionaries, each of which must be validated by `validate_params_func`.
- {_WSID}: A workspace ID, which must be present and valid.
- {_INPUTS}: A list of parameter dictionaries, each of which must be validated by `validate_params_func`.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more of these need to be done below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed formatted string literals.

if contains:
self.assertIn(error, str(context.exception))
else:
self.assertEqual(error, str(context.exception))

def check_save_one_genome_output(self, ret, genome_name):
def check_save_one_genome_output(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This checker barely does checks anything, but I realize it was that way when you got here

Copy link
Member

@MrCreosote MrCreosote Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could've sworn you spent a bunch of time adding rigorous tests to this module

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@MrCreosote MrCreosote Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is similar to the comment below - the tests use the check_save_one_genome_output method as a helper for testing mass saves, so the mass save method isn't actually getting tested if it relies on that function

Copy link
Contributor Author

@Xiangs18 Xiangs18 Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mass saves use the check_save_one_genome_output method as a helper for testing. Why do you think the mass isn't actually being tested? In my opinion, it conducted some testing, but it wasn't tested thoroughly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the question - check_save_one_genome_output is a test method. You wouldn't add tests for that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really confused at this point, because we definitely created a new issue about the checker function and also added it to our TODO list in the GFU parallelization tracking ticket. Basically, I'm asking: how is the crux of the issue mentioned in this comment different from that one - or are they essentially the same?

#212 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference is that while we do need to fix that function eventually, it was a poor test helper when we got here and fixing every problem this repo has is out of scope for the mass import project. That being said, any code changes we make or code we add needs to be tested rigorously, and any tests we add need to be rigorous. If we're relying on check_save_one_genome_output for any of those scenarios we're not testing rigorously.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, what needs to be done at this point? If you think we need to write more tests, what exactly do you want to check?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main issue I'm seeing here is that the new test_genomes and test_genomes_hidden tests use it, which means they're hardly testing anything regarding what actually happened during the upload.

If you already have rigorous tests elsewhere (which I think you do) that check the object contents, any associated files in the blobstore, provenance, and everything else that changes in the KBase data systems when genomes are uploaded, you should do the same thing here (and maybe make a general helper function based on those rigorous tests you can reuse here if you don't have one already). If you don't already have rigorous tests LMK and we can go from there

Comment on lines 233 to 245
def test_genomes_with_hidden(self):
self.start_test()
genome_name = 'test_genome_hidden'
inputs = [
{
'name': genome_name,
'data': self.test_genome_data,
'hidden': 1,
}
]
params = {'workspace_id': self.wsID, 'inputs': inputs}
ret = self.genome_interface.save_genome_mass(params)[0]
self.check_save_one_genome_output(ret, genome_name, warnings=[])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't actually test that the genome is hidden

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because there's nothing tin check_save_one_genome_output that checks the genome is hidden

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@MrCreosote MrCreosote Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how that comment applies here. The function under test is save_genome_mass, the test just uses check_save_one_genome as a helper

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I know where I can find information about whether it is hidden? The hidden information is not in the return value from save_one_genome.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little annoying; the easiest way, it seems, is to list that one object and see if it shows up. If not, and it does show up with showHidden enabled, it's hidden:

https://ci.kbase.us/services/ws/docs/Workspace.html#funcdefWorkspace.list_objects

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_hidden function added: bda89a0

Comment on lines +165 to +170
# check features
self.gi.check_dna_sequence_in_features(genome_obj.genome_data)

# validate genome
genome_obj.genome_data['warnings'] = self.gi.validate_genome(genome_obj.genome_data)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there tests for G2G that exercise these code paths?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we have genbank_upload_full_test.py.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And there are tests that cause errors to be thown from the check / validate methods?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check_dna_sequence_in_features function does not raise any errors.

validate_genome will raise an error if the genome size is too large, but I assume this is already covered in genome_size_test.py by other devs.

https://github.com/kbaseapps/GenomeFileUtil/blob/master/test/utility/genome_size_tests.py

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hold on, I'm confused - these two checks are being added to the import mass function, but genbank_upload_full_test isn't changing. How can it test that these functions are called correctly from the mass function?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of the confusion is probably because it's been 6 months since I looked at this...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you remember why we added these two checks here? They were originally from the GenomeInterface.py file. It's been eight months, and I don't even know why they ended up in GenbankToGenome.py.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vaguely remember this discussion, but I think they're required checks for genomes and either they were missing from the mass import function or they were moved from somewhere else

@Xiangs18 Xiangs18 requested a review from MrCreosote November 1, 2024 18:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants