CCA reingest script #58

Merged 1 commit into master on Jun 23, 2018
Conversation

ross-spencer
Contributor

Creates a bulk reingest mechanism

This is a script for the Canadian Center for Architects. The purpose
is to bulk reingest AIPs in Archivematica from a single pipeline.
The script can be run using the same mechanism as others in the automation
tools repository, i.e. running a crontab task every (x) minutes. In support
of this script, PR #57 was submitted to enhance the amclient script's
capabilities.

This branch is based on the dev branch in #57, as the two sets of changes are related. The AM Client commits should be reviewed as part of #57, and this review should focus on commit 19bc9a0.

cc. @timothyryanwalsh

Contributor

@jrwdunham jrwdunham left a comment

Feedback and requests for changes.

"""Return a pipeline from which to work with. The current implementation
requires a Storage Service instance with just a single pipeline. If
more pipelines are required we will need to pre-configure this.
"""
Contributor

@jrwdunham jrwdunham Apr 24, 2018

The docstring is not accurate here since this func doesn't return a pipeline but (at least in the successful case) returns a boolean indicating whether the pipeline with the provided pipeline_uuid exists. Relatedly, I think that there should be a return False after the warning so that we are not implicitly returning None for the failure condition.

Contributor Author

Tim had an error running this in a multi-pipeline environment, so I have reworked it to this:

def pipeline_exists(amclient, pipeline_uuid):
    """Test whether a pipeline can be found in the storage service."""
    try:
        pipelines = amclient.get_pipelines()["objects"]
        return next((pipeline for pipeline in pipelines
            if pipeline["uuid"] == pipeline_uuid), False)
    except KeyError:
        return False

"""
resp = amclient.get_processing_config(processing_name)
if errors.error_lookup(resp) is resp:
return True
Contributor

Again, for type consistency, I think False should be returned here as the default case (instead of the implicit None).
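
For illustration, a sketch of the shape being suggested, reusing the names from the quoted snippet (the function signature is an assumption, and `errors` and `amclient` are assumed to come from the surrounding module, as in the snippet above):

```
def processing_config_exists(amclient, processing_name):
    """Return True if the named processing config exists, else False."""
    resp = amclient.get_processing_config(processing_name)
    if errors.error_lookup(resp) is resp:
        return True
    # Explicit default so callers always get a boolean, never an implicit None.
    return False
```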

path = aip["current_full_path"]
compressed = find_compressed(path)
if compressed is not None:
    if compressed is True:
Contributor

The two conditions and double nesting seem unnecessary here. Why not just a single short if compressed:?

Contributor Author

Reworked into PR #57

if compressed is True:
    compressed_aips.append(aip)
    compressed_uuid_list.append(aip["uuid"])
return compressed_aips, compressed_uuid_list
Contributor

I think that returning a dict from AIP UUIDs to AIPs would make more sense than returning a tuple of two separate lists here. If you do that, then you can get the UUID list as list(aips.keys()) and the AIP list as list(aips.values()).

In fact, this entire function could be expressed as a single dict comprehension:

return {aip['uuid']: aip for aip in aips
        if aip['status'] == 'UPLOADED' and
        find_compressed(aip['current_full_path'])}

Contributor

You might also consider just adding a new method to AMClient which does what your list_compressed_aips function currently does, e.g., something like get_all_compressed_aips. That way, it would be easier to use this functionality in other areas of the automation tools, if needed.
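
For illustration, such a method might look roughly like this (the method name follows the suggestion; the `self.aips()` call and the reuse of the script's `find_compressed()` helper are assumptions, not AMClient's actual API):

```
# Hypothetical AMClient method - a sketch only, not the merged implementation.
def get_all_compressed_aips(self):
    """Return a dict mapping AIP UUID to AIP record for every uploaded,
    compressed AIP known to the Storage Service.
    """
    return {aip["uuid"]: aip for aip in self.aips()["objects"]
            if aip["status"] == "UPLOADED" and
            find_compressed(aip["current_full_path"])}
```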

Contributor Author

👍

"output_mode":"python"
},
"database": {
"path": "/home/ross-spencer/git/artefactual/automation-tools/reingest.db"
Contributor

Should the paths specific to your dev machine be generalized or at least anonymized?

Contributor

Similarly, should the UUID of a specific pipeline ("3f4721fa-6754-43f3-a905-6a2a087839ee") be included here?

transfer = None
while not isinstance(transfer, dict):
    if latency is not None and isinstance(latency, float):
        time.sleep(latency)  # ~latency between call and AM actioning.
Contributor

You could simplify the condition here to just if latency:. That way, a float and an int latency could both work, and it's clearer.
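
As a small illustration of the simplified check (the helper name is made up):

```
import time


def wait_before_poll(latency):
    """Sleep only when a truthy latency is given.

    `if latency:` accepts int or float values and skips the sleep for
    None, 0 and 0.0 alike.
    """
    if latency:
        time.sleep(latency)
```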


LOGGER.info("Attempting to approve transfer following reingest init")

if transfer["status"] == "USER_INPUT":
Contributor

I think you need to anticipate a possible KeyError here or else use if transfer.get('status') == ...

LOGGER.info("Approving reingest automatically. Directory to "
            "approve %s", transfer_directory)
message = amclient.approve_transfer(transfer_directory)
if 'Approval successful.' in message['message']:
Contributor

I think a more robust test for a successful approval would be if message.get('error') is None: because a successful call will always return a dict with no 'error' key (if I'm reading https://github.com/artefactual/archivematica/blob/stable/1.7.x/src/dashboard/src/components/api/views.py#L364 right)

Contributor Author

@ross-spencer ross-spencer May 14, 2018

Good spot!

{"message": "Approval successful.", "uuid": "8dac4935-a813-4345-bd67-633e837515f3"}

A work in progress, with some improvements that can be made to long-running
processes like this over time.
"""

Contributor

You need from __future__ import print_function if you're going to use print() and if we're targeting Python 2 and 3 (which we are, I believe).

# We need to make sure that the approval is synchronized with the request
# to reingest. Occasionally a reingest will get stopped at approval when
# it doesn't need to if we just wait a little longer or try again.
for retry in range(approval_retries):
Contributor

Since you're not using retry, convention is to name it _.
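
A sketch of that retry loop with the underscore convention applied (the helper and its parameters are illustrative, not the script's actual code):

```
import time


def approve_with_retries(try_approve, approval_retries, retry_delay=1.0):
    """Call a zero-argument approval callable a bounded number of times.

    The loop variable is unused, so it is named "_" by convention.
    """
    for _ in range(approval_retries):
        if try_approve():
            return True
        time.sleep(retry_delay)
    return False
```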

@qubot qubot force-pushed the dev/cca-11686-reingest-script branch 2 times, most recently from cb377cd to 6e5efb0 on May 21, 2018 21:57
@@ -60,6 +60,9 @@ def get_parser(doc):
parser.add_argument('--hide', action='store_true',
help='If set, hide the Transfers and SIPs in the '
'dashboard once they complete.')
parser.add_argument('--delete', action='store_true', help='If set, delete '
Contributor Author

@jrwdunham this file shouldn't have been committed. If I have my git flight rules correct then once this set of reviews is complete, I can rebase and squish everything and then reset this file to its original state as part of that process. Alternatively, I can checkout the previous version and place it into source control that way, resetting an original mistake.

Contributor

So you're saying that lines 63-65 here should not be part of this PR?

Contributor Author

Yup.

@ross-spencer
Contributor Author

@jrwdunham this should be ready for further code review now. I've tried to separate it as logically as possible, but looking back over the commits it might be a little difficult to follow. Let me know if I can do anything to improve its clarity.

@ross-spencer
Contributor Author

Hi @jrwdunham. Do you know when you might be able to look at the changes I have made following your code review? It would be useful to get this work completed, and also to get an indication of how much I might have left to tackle after a second review.

Contributor

@jrwdunham jrwdunham left a comment

I have a handful of new comments, suggestions and questions, but I think this is about ready to merge. I haven't tested it against a dev instance to see that it works as it should, but I'm assuming that @ross-spencer and Tim have been doing that.

@ross-spencer I realize that there is an internal RM issue that provides context for this PR, but maybe you could create a GH issue also; one which succinctly describes the problem that motivates this work, and then link this PR to that issue.

try:
    with open(listfile) as aip_list:
        userlist = json.loads(aip_list.read().replace("'", '"'))
        return userlist
Contributor

Seems a bit odd to anticipate malformed JSON and fix it in this way. I understand that we are expecting the JSON file to just be an array of UUIDs. However, if this code was used to parse a file containing ['abc', "abc'def"] it would raise a ValueError. Maybe we should just add a comment above the json.loads line that explains to future devs that we are allowing invalid JSON input here.

Maybe also change docstring to 'Load AIPs from a user provided file containing an array of AIP UUIDs.'

Contributor

Postel's law: never heard of it, cool idea, good retort.
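
A sketch of the loader with the suggested comment and docstring folded in (the function name is assumed; the body comes from the snippet quoted above):

```
import json


def load_aip_list(listfile):
    """Load AIPs from a user provided file containing an array of AIP UUIDs."""
    with open(listfile) as aip_list:
        # Be liberal in what we accept (Postel's law): the file should be a
        # JSON array of UUID strings, so single quotes are swapped for double
        # quotes rather than rejecting otherwise-invalid JSON outright.
        return json.loads(aip_list.read().replace("'", '"'))
```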


def load_db(session, aiplist):
    """If we have AIPs to reingest, load the database."""
    if isinstance(aiplist, str):
Contributor

I think you should use isinstance(aiplist, six.text_type) here so that unicode objects are handled too and so that it's Py2/3 compatible.

Contributor Author

Sounds good. Sorry, I did recognise that the six library was used in parts of our code; I probably need to read up on it and be a bit more consistent about using it.
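
A minimal illustration of the suggested check (simplified, not the merged load_db):

```
import six


def is_text(value):
    """True for str on Python 3 and unicode on Python 2 (six.text_type)."""
    return isinstance(value, six.text_type)
```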

setattr(amclient, "processing_config", None)
setattr(amclient, "sip_uuid", None)
setattr(amclient, "transfer_directory", None)
setattr(amclient, "transfer_uuid", None)
Contributor

Can you make sure the docstring in this function is accurate? You write "We avoid having to use setattr(...)". Do you mean "We avoid having to use getattr(...)"? Small typo in "hereto". Also, why aren't you just using dot notation here, e.g.,

amclient.aip_uuid = None
amclient.package_uuid = None

etc.

Contributor Author

I can update the docstring. The reason we're not using dot notation here is what should be explained more clearly: if the attribute isn't set in the class via initialisation (or isn't already a member variable) then we get an AttributeError. That's what I see when I test it that way. I thought it was clearer, but if there's some doubt it might be worth checking out the behaviour for yourself too.

Contributor

I'm sorry, I'm still a bit puzzled by this. As far as I know, setattr(amclient, "sip_uuid", None) and amclient.sip_uuid = None are directly equivalent. The only reason to use setattr is when the name of the attribute is only known from a variable. Is that not true? If Postel would be interjecting here, feel free to ignore :)
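
A quick self-contained check of the equivalence described here (AMClientStub is a stand-in; the real class could only behave differently if it defined __slots__ or a custom __setattr__):

```
class AMClientStub(object):
    pass


amclient = AMClientStub()
amclient.sip_uuid = None                  # plain assignment: no AttributeError
setattr(amclient, "transfer_uuid", None)  # equivalent; useful when the name is a variable
assert amclient.sip_uuid is None and amclient.transfer_uuid is None
```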

@qubot qubot force-pushed the dev/cca-11686-reingest-script branch from 6e5efb0 to 236129d on June 7, 2018 14:53
@ross-spencer
Contributor Author

@jrwdunham I don't know if you want to just check these final two commits - I made your changes with six and then updated the README.md as I realised it had become out of date. It is rebased on master, so it just needs squishing once it gets the thumbs-up.

@ross-spencer
Contributor Author

Hi @timothyryanwalsh, once @jrwdunham has checked over the final README.md changes we'll get this merged into the main branch.

Is there someone at CCA I should provide an update to as well?

@tw4l
Contributor

tw4l commented Jun 8, 2018

Hi @ross-spencer, that's great! I would let @stefanabreitwieser know - she's the interim Archivematica manager while the CCA searches for a new Digital Archivist.

@stefanabreitwieser

Noted -- thanks @timothyryanwalsh for the heads up and @ross-spencer and all for your work!

@@ -567,39 +567,44 @@ Basic usage: `amclient.py <subcommand> [optional arguments]

* close-completed-transfers
  * purpose: close all completed transfers (those not failed or rejected)
  * positional argument: am_api_key - API key for Archivematica dashboard user
  * positional argument:
Contributor

Is there a reason that we are not backtick-enclosing these positional argument values like we are with the optional ones? Not that you need to fix this in this PR, just asking.

We should be able to auto-generate this portion (### Subcommands and arguments) of the README from the code (amclientargs.py::SUBCOMMANDS) itself.

Contributor Author

I think that's definitely something we should aim for in another Issue/PR. I just followed the style already adopted, but auto-generating this text will make it easier to adjust the formatting in future.
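
As a follow-up thought on the auto-generation idea, a self-contained sketch (the data layout is invented for illustration and is not the real amclientargs.py::SUBCOMMANDS structure):

```
SUBCOMMANDS = {
    "close-completed-transfers": {
        "purpose": "close all completed transfers (those not failed or rejected)",
        "positional": ["am_api_key - API key for Archivematica dashboard user"],
    },
}


def render_subcommands(subcommands):
    """Render the '### Subcommands and arguments' section of the README."""
    lines = []
    for name, spec in sorted(subcommands.items()):
        lines.append("* {}".format(name))
        lines.append("  * purpose: {}".format(spec["purpose"]))
        for arg in spec["positional"]:
            lines.append("  * positional argument: `{}`".format(arg))
        lines.append("")
    return "\n".join(lines)


print(render_subcommands(SUBCOMMANDS))
```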

* optional arguments:
  * `--ss-user-name <username>` - Storage Service username (default: `test`)
  * `--ss-url <url>` - Storage Service URL (default: `http://127.0.0.1:8000`)

* aips2dips
  * purpose: print all AIPs in the Storage Service along with their
    corresponding DIPs
  * positional argument: ss_api_key - Storage Service API key
  * positional argument:
    * ss_api_key - Storage Service API key
Contributor

Maybe positional argument: and optional argument: should be positional argument(s): and optional argument(s):. What do you think?

Contributor Author

@ross-spencer ross-spencer Jun 23, 2018

I do agree and reckon we can tackle this as part of Issue #83.

README.md Outdated
* `--ss-url <url>` - Storage Service URL (default: `http://127.0.0.1:8000`)

* get-transfer-status
  * purpose: print the status of a transfer if it exists in a transfer
Contributor

@jrwdunham jrwdunham Jun 18, 2018

Reifying "transfer workflow" sounds odd to me. Why are "transfer" and "ingest" workflows discussed here and below? Why not just say "if it exists in a pipeline"? This is probably overly nitpicky so ignore at your leisure.

Contributor Author

@ross-spencer ross-spencer Jun 23, 2018

A good call. Changed to pipeline.

* optional arguments
  * processing-config
  * `--am-user-name <username>` - username for Archivematica dashboard user
  * `--am-url <url>` - Archivematica URL (default: `http://127.0.0.1`)
Contributor

@jrwdunham jrwdunham Jun 18, 2018

Again, not your responsibility here and now, but seeing all this repetition makes me think a) that a lot of it could be removed and b) that we should allow for the specification of a config file so that all but the essential becomes optional. (Remembering now that you have introduced a config file for the reingest module, but it may still apply to amclient.)

Contributor Author

Logged in #84
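
For what it's worth, a hypothetical sketch of the config-file idea for amclient (the file name, keys and choice of flags are assumptions layered on the documented defaults, not an existing feature):

```
import argparse
import json


def parse_args_with_config(argv=None, config_path="amclientconfig.json"):
    """Seed argparse defaults from an optional JSON config file so that only
    the essentials need to be passed on the command line.
    """
    parser = argparse.ArgumentParser(description="amclient with config-file defaults")
    parser.add_argument("--am-url", default="http://127.0.0.1")
    parser.add_argument("--ss-url", default="http://127.0.0.1:8000")
    try:
        with open(config_path) as config_file:
            # Keys in the JSON file use argparse dest names, e.g. "am_url".
            parser.set_defaults(**json.load(config_file))
    except IOError:
        pass  # No config file: fall back to the built-in defaults.
    return parser.parse_args(argv)
```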

README.md Outdated

The *transfers/reingest.py* script is a module and CLI that provides
functionality for bulk reingest of compressed AIPs. Given a user formatted
list of AIP UUIDs then it can also complete the bulk-reingest of any AIP
Contributor

Seems odd to use "bulk reingest" and then "bulk-reingest" right afterwards. Is the distinction semantically warranted?

I would remove the "then" and the "also" making it "Given a user formatted
list of AIP UUIDs, it can complete the bulk-reingest of any AIP listed." Why is this "also"? Isn't it just elaborating on the first sentence?

Contributor

@jrwdunham jrwdunham left a comment

@ross-spencer This looks good. I've made some more comments/questions/suggestions (and found one real typo), but I think you can merge this after deciding what, if anything, you want to do about those.

README.md Outdated
```

*Reingest.py* is best used via the shell script provided in the *transfers/
examples/reingest* folder. As it is designed for bulk-reingest, it is also
Contributor

I think the linebreak in your path is breaking the markdown and also I think the dirpath should end with a forward slash.

Also, the second sentence doesn't read well. I think changing "an example is provided" to "an example of which is provided" would fix it.

README.md Outdated

Configuration of reingest.py is done via *transfers/reingestconfig.json*. You
will need to configure API parameters via `reingestconfig.json` as they will
be used to make calles to the Archivematica Client module `amclient.py`
Contributor

typo: calles.

Contributor Author

👍

@jrwdunham
Contributor

@ross-spencer I think you'll want to squash some of these commits before merging

@ross-spencer
Contributor Author

Thanks @jrwdunham, I will go through this in detail tomorrow evening. On merging, should I also delete these branches:

* dev/reingest-automation
* dev/issue-10180-reingest-aip

Per some of the code-review input above, I think we have most of the content from both covered. What do you think?

This is a script for the Canadian Center for Architects. The purpose
is to bulk reingest AIPs in Archivematica from a single pipeline.
The script can be run using the same mechanism as others in the
automation tools repository, i.e. running a crontab task every (x)
minutes. In support of this script, PR #57 was submitted to enhance
the amclient.py script's capabilities.

* Resolves #81 create a bulk-reingest mechanism
* Resolves #72 write-mode when opening the PID file
@qubot qubot force-pushed the dev/cca-11686-reingest-script branch from a2b57ae to 7864cfc on June 23, 2018 01:28
@qubot qubot merged commit 7864cfc into master Jun 23, 2018
@qubot qubot deleted the dev/cca-11686-reingest-script branch June 23, 2018 01:33
@jrwdunham
Copy link
Contributor

@ross-spencer the https://github.com/artefactual/automation-tools/commits/dev/reingest-automation branch is tied to the still-open PR #17. Before deleting that branch, I think one of us needs to document on PR #17 why we think closing the PR and deleting the dev branch is justified, given the work you've introduced here. I could do that, or you could, but I think there are a couple of hours of analysis work to do there before we delete that branch.

I think it should be okay to delete the dev/issue-10180-reingest-aip branch. I'm having trouble remembering (re-discovering) the context for that work and why it was left in the state it's in. Anyways, I've pulled it to my local git, so I'm comfortable with us deleting it from origin.
