-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CCA reingest script #58
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feedback and requests for changes.
transfers/reingest.py
Outdated
"""Return a pipeline from which to work with. The current implementation | ||
requires a Storage Service instance with just a single pipeline. If | ||
more pipelines are required we will need to pre-configure this. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring is not accurate here since this func doesn't return a pipeline but (at least in the successful case) returns a boolean indicating whether the pipeline with the provided pipeline_uuid
exists. Relatedly, I think that there should be a return False
after the warning so that we are not implicitly returning None
for the failure condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tim had an error running this in a multi-pipeline environment so I have reworked it to this:
def pipeline_exists(amclient, pipeline_uuid):
"""Test whether a pipeline can be found in the storage service."""
try:
pipelines = amclient.get_pipelines()["objects"]
return next((pipeline for pipeline in pipelines
if pipeline["uuid"] == pipeline_uuid), False)
except KeyError:
return False
transfers/reingest.py
Outdated
""" | ||
resp = amclient.get_processing_config(processing_name) | ||
if errors.error_lookup(resp) is resp: | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, for type consistency, I think False
should be returned here as the default case (instead of the implicit None
).
transfers/reingest.py
Outdated
path = aip["current_full_path"] | ||
compressed = find_compressed(path) | ||
if compressed is not None: | ||
if compressed is True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The two conditions and double nesting seem unnecessary here. Why not just a single short if compressed:
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworked into PR #57
transfers/reingest.py
Outdated
if compressed is True: | ||
compressed_aips.append(aip) | ||
compressed_uuid_list.append(aip["uuid"]) | ||
return compressed_aips, compressed_uuid_list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that returning a dict
from AIP UUIDs to AIPs would make more sense than returning a tuple of two separate lists here. If you do that, then you can get the UUID list as list(aips.keys())
and the AIP list as list(aips.values())
.
In fact, this entire function could be expressed as a single dict comprehension:
return {aip['uuid']: aip for aip in aips
if aip['status'] == 'UPLOADED' and
find_compressed(aip['current_full_path'])}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might also consider just adding a new method to AMClient
which does what your list_compressed_aips
function currently does, e.g., something like get_all_compressed_aips
. That way, it would be easier to use this functionality in other areas of the automation tools, if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
transfers/reingestconfig.json
Outdated
"output_mode":"python" | ||
}, | ||
"database": { | ||
"path": "/home/ross-spencer/git/artefactual/automation-tools/reingest.db" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the paths specific to your dev machine be generalized or at least anonymized?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, should the UUID of a specific pipeline ("3f4721fa-6754-43f3-a905-6a2a087839ee"
) be included here?
transfers/reingest.py
Outdated
transfer = None | ||
while not isinstance(transfer, dict): | ||
if latency is not None and isinstance(latency, float): | ||
time.sleep(latency) # ~latency between call and AM actioning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could simplify the condition here to just if latency:
. That way, a float
and an int
latency could both work, and it's clearer.
transfers/reingest.py
Outdated
|
||
LOGGER.info("Attempting to approve transfer following reingest init") | ||
|
||
if transfer["status"] == "USER_INPUT": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to anticipate a possible KeyError
here or else use if transfer.get('status') == ...
transfers/reingest.py
Outdated
LOGGER.info("Approving reingest automatically. Directory to " | ||
"approve %s", transfer_directory) | ||
message = amclient.approve_transfer(transfer_directory) | ||
if 'Approval successful.' in message['message']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a more robust test for a successful approval would be if message.get('error') is None:
because a successful call will always return a dict with no 'error'
key (if I'm reading https://github.com/artefactual/archivematica/blob/stable/1.7.x/src/dashboard/src/components/api/views.py#L364 right)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good spot!
{"message": "Approval successful.", "uuid": "8dac4935-a813-4345-bd67-633e837515f3"}
transfers/reingest.py
Outdated
A work in progress, with some improvements that can be made to long-running | ||
processes like this over time. | ||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need from __future__ import print_function
if you're going to use print()
and if we're targetting python 2 and 3 (which we are, I believe).
transfers/reingest.py
Outdated
# We need to make sure that the approval is synchronized with the request | ||
# to reingest. Occasionally a reingest will get stopped at approval when | ||
# it doesn't need to if we just wait a little longer or try again. | ||
for retry in range(approval_retries): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you're not using retry
, convention is to name it _
.
cb377cd
to
6e5efb0
Compare
transfers/transferargs.py
Outdated
@@ -60,6 +60,9 @@ def get_parser(doc): | |||
parser.add_argument('--hide', action='store_true', | |||
help='If set, hide the Transfers and SIPs in the ' | |||
'dashboard once they complete.') | |||
parser.add_argument('--delete', action='store_true', help='If set, delete ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jrwdunham this file shouldn't have been committed. If I have my git flight rules correct then once this set of reviews is complete, I can rebase and squish everything and then reset this file to its original state as part of that process. Alternatively, I can checkout the previous version and place it into source control that way, resetting an original mistake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you're saying that lines 63-65 here should not be part of this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup.
@jrwdunham this should be ready for further code review now. I've tried to separate it as logically as possible, but looking back over the commits it might be a little difficult to follow. Let me know if I can help improve its clarity if you need it. |
Hi @jrwdunham. Do you know when you might be able to look at the changes I have made following your code-review? It will be useful to get this work complete, but also to get an indication of how much i might have left to tackle following a second review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a handful of new comments, suggestions and questions, but I think this is about ready to merge. I haven't tested it against a dev instance to see that it works as it should, but I'm assuming that @ross-spencer and Tim have been doing that.
@ross-spencer I realize that there is an internal RM issue that provides context for this PR, but maybe you could create a GH issue also; one which succinctly describes the problem that motivates this work, and then link this PR to that issue.
transfers/reingest.py
Outdated
try: | ||
with open(listfile) as aip_list: | ||
userlist = json.loads(aip_list.read().replace("'", '"')) | ||
return userlist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems a bit odd to anticipate malformed JSON and fix it in this way. I understand that we are expecting the JSON file to just be an array of UUIDs. However, if this code was used to parse a file containing ['abc', "abc'def"]
it would raise a ValueError
. Maybe we should just add a comment above the json.loads
line that explains to future devs that we are allowing invalid JSON input here.
Maybe also change docstring to 'Load AIPs from a user provided file containing an array of AIP UUIDs.'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Postel's law: never heard of it, cool idea, good retort.
transfers/reingest.py
Outdated
|
||
def load_db(session, aiplist): | ||
"""If we have AIPs to reingest, load the database.""" | ||
if isinstance(aiplist, str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should use isinstance(aiplist, six.text_type)
here so that unicode
objects return False
also and so that it's Py2/3 compatible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Sorry, I did recognise that the six
library was used in parts of our code, I probably need to read up on it and be a bit more consistent with using it.
setattr(amclient, "processing_config", None) | ||
setattr(amclient, "sip_uuid", None) | ||
setattr(amclient, "transfer_directory", None) | ||
setattr(amclient, "transfer_uuid", None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make sure the docstring in this function is accurate? You write "We avoid having to use setattr(...)". Do you mean "We avoid having to use getattr(...)"? Small typo in "hereto". Also, why aren't you just using dot notation here, e.g.,
amclient.aip_uuid = None
amclient.package_uuid = None
etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can update the doc string. The reason we're not using dot notation here is what should be explained more clearly - if the attribute isn't set in the class via initialisation (or if they're not already member variables) then we get an attribute error. That's what I see when i test it that way. I thought it was clearer, but if there's some doubt it might be worth checking out the behaviour for yourself too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, I'm still a bit puzzled by this. As far as I know, setattr(amclient, "sip_uuid", None)
and amclient.sip_uuid = None
are directly equivalent. The only reason to use setattr
is when the name of the attribute is only known from a variable. Is that not true? If Postel would be interjecting here, feel free to ignore :)
transfers/transferargs.py
Outdated
@@ -60,6 +60,9 @@ def get_parser(doc): | |||
parser.add_argument('--hide', action='store_true', | |||
help='If set, hide the Transfers and SIPs in the ' | |||
'dashboard once they complete.') | |||
parser.add_argument('--delete', action='store_true', help='If set, delete ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you're saying that lines 63-65 here should not be part of this PR?
6e5efb0
to
236129d
Compare
@jrwdunham i don't know if you want to just check these final two commits - I made your changes with |
Hi @timothyryanwalsh once @jrwdunham has checked over the final Is there someone at CCA I should provide an update to as well? |
Hi @ross-spencer, that's great! I would let @stefanabreitwieser know - she's the interim Archivematica manager while the CCA searches for a new Digital Archivist. |
Noted -- thanks @timothyryanwalsh for the heads up and @ross-spencer and all for your work! |
@@ -567,39 +567,44 @@ Basic usage: `amclient.py <subcommand> [optional arguments] | |||
|
|||
* close-completed-transfers | |||
* purpose: close all completed transfers (those not failed or rejected) | |||
* positional argument: am_api_key - API key for Archivematica dashboard user | |||
* positional argument: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason that we are not backtick-enclosing these positional argument values like we are with the optional ones? Not that you need to fix this in this PR, just asking.
We should be able to auto-generate this portion (### Subcommands and arguments
) of the README from the code (amlientargs.py::SUBCOMMANDS
) itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's definitely something we should aim for in another Issue/PR. I just followed the style already adopted, but auto-generating this text will lead to better methods for playing with the formatting in future.
* optional arguments: | ||
* `--ss-user-name <username>` - Storage Service username (default: `test`) | ||
* `--ss-url <url>` - Storage Service URL (default: `http://127.0.0.1:8000`) | ||
|
||
* aips2dips | ||
* purpose: print all AIPs in the Storage Service along with their | ||
corresponding DIPs | ||
* positional argument: ss_api_key - Storage Service API key | ||
* positional argument: | ||
* ss_api_key - Storage Service API key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe positional argument:
and optional argument:
should be positional argument(s):
and optional argument(s):
. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do agree and reckon we can tackle this as part of Issue #83.
README.md
Outdated
* `--ss-url <url>` - Storage Service URL (default: `http://127.0.0.1:8000`) | ||
|
||
* get-transfer-status | ||
* purpose: print the status of a transfer if it exists in a transfer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reifying "transfer workflow" sounds odd to me. Why are "transfer" and "ingest" workflows discussed here and below? Why not just say "if it exists in a pipeline"? This is probably overly nitpicky so ignore at your leisure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good call. Changed to pipeline.
* optional arguments | ||
* processing-config | ||
* `--am-user-name <username>` - username for Archivematica dashboard user | ||
* `--am-url <url>` - Archivematica URL (default: `http://127.0.0.1`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, not your responsibility here and now, but seeing all this repetition makes me think a) that a lot of it could be removed and b) that we should allow for the specification of a config file so that all but the essential becomes optional. (Remembering now that you have introduced a config file for the reingest module, but it may still apply to amclient.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logged in #84
README.md
Outdated
|
||
The *transfers/reingest.py* script is a module and CLI that provides | ||
functionality for bulk reingest of compressed AIPs. Given a user formatted | ||
list of AIP UUIDs then it can also complete the bulk-reingest of any AIP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems odd to use "bulk reingest" and then "bulk-reingest" right afterwards. Is the distinction semantically warranted?
I would remove the "then" and the "also" making it "Given a user formatted
list of AIP UUIDs, it can complete the bulk-reingest of any AIP listed." Why is this "also"? Isn't it just elaborating on the first sentence?
setattr(amclient, "processing_config", None) | ||
setattr(amclient, "sip_uuid", None) | ||
setattr(amclient, "transfer_directory", None) | ||
setattr(amclient, "transfer_uuid", None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, I'm still a bit puzzled by this. As far as I know, setattr(amclient, "sip_uuid", None)
and amclient.sip_uuid = None
are directly equivalent. The only reason to use setattr
is when the name of the attribute is only known from a variable. Is that not true? If Postel would be interjecting here, feel free to ignore :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ross-spencer This looks good. I've made some more comments/questions/suggestions (and found one real typo), but I think you can merge this after deciding what, if anything, you want to do about those.
README.md
Outdated
``` | ||
|
||
*Reingest.py* is best used via the shell script provided in the *transfers/ | ||
examples/reingest* folder. As it is designed for bulk-reingest, it is also |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the linebreak in your path is breaking the markdown and also I think the dirpath should end with a forward slash.
Also, the second sentence doesn't read well. I think changing "an example is provided" to "an example of which is provided" would fix it.
README.md
Outdated
|
||
Configuration of reingest.py is done via *transfers/reingestconfig.json*. You | ||
will need to configure API parameters via `reingestconfig.json` as they will | ||
be used to make calles to the Archivematica Client module `amclient.py` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: calles
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@ross-spencer I think you'll want to squash some of these commits before merging |
Thanks @jrwdunham I will go through this in detail tomorrow evening. On merging, should I also delete these branches:
Per some of the code-review input above, I think we have most of content from both covered. What do you think? |
This is a script for the Canadian Center for Architects. The purpose is to bulk reingest AIPs in Archivematica from a single pipeline. The script can be run using the same mechanism as others in the automation tools repository, i.e. running a crontab task every (x) minutes. In support of this script, PR #57 was submitted to enhance the amclient.py script's capabilities. * Resolves #81 create a bulk-reingest mechanism * Resolves #72 write-mode when opening the PID file
a2b57ae
to
7864cfc
Compare
@ross-spencer the https://github.com/artefactual/automation-tools/commits/dev/reingest-automation branch is tied to the still-open PR #17. Before deleting that branch I think one of us needs to document on that PR 17 why we think closing the PR and deleting the dev branch is justified, given the work you've introduced here. I could do that, or you could, but I think there's a couple of hours of analysis work to do there before we delete that branch. I think it should be okay to delete the dev/issue-10180-reingest-aip branch. I'm having trouble remembering (re-discovering) the context for that work and why it was left in the state it's in. Anyways, I've pulled it to my local git, so I'm comfortable with us deleting it from origin. |
Creates a bulk reingest mechanism
This is a script for the Canadian Center for Architects. The purpose
is to bulk reingest AIPs in Archivematica from a single pipeline.
The script can be run using the same mechanism as others in the automation
tools repository, i.e. running a crontab task every (x) minutes. In support
of this script, PR #57 was submitted to enhance the amclient script's
capabilities.
This branch is based on the dev branch in #57 as the two sets of changes are related. The AM Client commits should be reviewed as part of #57 and this review should focus on commit 19bc9a0
cc. @timothyryanwalsh