CCA reingest script #58

Merged 1 commit into master on Jun 23, 2018
Conversation

ross-spencer
Contributor

Creates a bulk reingest mechanism

This is a script for the Canadian Center for Architects. The purpose
is to bulk reingest AIPs in Archivematica from a single pipeline.
The script can be run using the same mechanism as others in the automation
tools repository, i.e. running a crontab task every (x) minutes. In support
of this script, PR #57 was submitted to enhance the amclient script's
capabilities.

This branch is based on the dev branch in #57, as the two sets of changes are related. The AM Client commits should be reviewed as part of #57, and this review should focus on commit 19bc9a0.

cc. @timothyryanwalsh

Contributor

@jrwdunham jrwdunham left a comment

Feedback and requests for changes.

"""Return a pipeline from which to work with. The current implementation
requires a Storage Service instance with just a single pipeline. If
more pipelines are required we will need to pre-configure this.
"""
Contributor

@jrwdunham jrwdunham Apr 24, 2018

The docstring is not accurate here since this func doesn't return a pipeline but (at least in the successful case) returns a boolean indicating whether the pipeline with the provided pipeline_uuid exists. Relatedly, I think that there should be a return False after the warning so that we are not implicitly returning None for the failure condition.

Contributor Author

Tim had an error running this in a multi-pipeline environment, so I have reworked it to this:

def pipeline_exists(amclient, pipeline_uuid):
    """Test whether a pipeline can be found in the storage service."""
    try:
        pipelines = amclient.get_pipelines()["objects"]
        return next((pipeline for pipeline in pipelines
            if pipeline["uuid"] == pipeline_uuid), False)
    except KeyError:
        return False

"""
resp = amclient.get_processing_config(processing_name)
if errors.error_lookup(resp) is resp:
return True
Contributor

Again, for type consistency, I think False should be returned here as the default case (instead of the implicit None).
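
For illustration, a sketch of the shape being suggested, reusing the names from the quoted snippet (the function signature is an assumption, and `errors` and `amclient` are assumed to come from the surrounding module, as in the snippet above):

```
def processing_config_exists(amclient, processing_name):
    """Return True if the named processing config exists, else False."""
    resp = amclient.get_processing_config(processing_name)
    if errors.error_lookup(resp) is resp:
        return True
    # Explicit default so callers always get a boolean, never an implicit None.
    return False
```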

path = aip["current_full_path"]
compressed = find_compressed(path)
if compressed is not None:
    if compressed is True:
Contributor

The two conditions and double nesting seem unnecessary here. Why not just a single short if compressed:?

Contributor Author

Reworked into PR #57

if compressed is True:
    compressed_aips.append(aip)
    compressed_uuid_list.append(aip["uuid"])
return compressed_aips, compressed_uuid_list
Contributor

I think that returning a dict from AIP UUIDs to AIPs would make more sense than returning a tuple of two separate lists here. If you do that, then you can get the UUID list as list(aips.keys()) and the AIP list as list(aips.values()).

In fact, this entire function could be expressed as a single dict comprehension:

return {aip['uuid']: aip for aip in aips
        if aip['status'] == 'UPLOADED' and
        find_compressed(aip['current_full_path'])}

Contributor

You might also consider just adding a new method to AMClient which does what your list_compressed_aips function currently does, e.g., something like get_all_compressed_aips. That way, it would be easier to use this functionality in other areas of the automation tools, if needed.
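
For illustration, such a method might look roughly like this (the method name follows the suggestion; the `self.aips()` call and the reuse of the script's `find_compressed()` helper are assumptions, not AMClient's actual API):

```
# Hypothetical AMClient method - a sketch only, not the merged implementation.
def get_all_compressed_aips(self):
    """Return a dict mapping AIP UUID to AIP record for every uploaded,
    compressed AIP known to the Storage Service.
    """
    return {aip["uuid"]: aip for aip in self.aips()["objects"]
            if aip["status"] == "UPLOADED" and
            find_compressed(aip["current_full_path"])}
```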

Contributor Author

👍

"output_mode":"python"
},
"database": {
"path": "/home/ross-spencer/git/artefactual/automation-tools/reingest.db"
Contributor

Should the paths specific to your dev machine be generalized or at least anonymized?

Contributor

Similarly, should the UUID of a specific pipeline ("3f4721fa-6754-43f3-a905-6a2a087839ee") be included here?

transfer = None
while not isinstance(transfer, dict):
    if latency is not None and isinstance(latency, float):
        time.sleep(latency)  # ~latency between call and AM actioning.
Contributor

You could simplify the condition here to just if latency:. That way, a float and an int latency could both work, and it's clearer.
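
As a small illustration of the simplified check (the helper name is made up):

```
import time


def wait_before_poll(latency):
    """Sleep only when a truthy latency is given.

    `if latency:` accepts int or float values and skips the sleep for
    None, 0 and 0.0 alike.
    """
    if latency:
        time.sleep(latency)
```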


LOGGER.info("Attempting to approve transfer following reingest init")

if transfer["status"] == "USER_INPUT":
Contributor

I think you need to anticipate a possible KeyError here or else use if transfer.get('status') == ...

LOGGER.info("Approving reingest automatically. Directory to "
            "approve %s", transfer_directory)
message = amclient.approve_transfer(transfer_directory)
if 'Approval successful.' in message['message']:
Contributor

I think a more robust test for a successful approval would be if message.get('error') is None: because a successful call will always return a dict with no 'error' key (if I'm reading https://github.com/artefactual/archivematica/blob/stable/1.7.x/src/dashboard/src/components/api/views.py#L364 right)

Contributor Author

@ross-spencer ross-spencer May 14, 2018

Good spot!

{"message": "Approval successful.", "uuid": "8dac4935-a813-4345-bd67-633e837515f3"}

A work in progress, with some improvements that can be made to long-running
processes like this over time.
"""

Contributor

You need from __future__ import print_function if you're going to use print() and if we're targeting Python 2 and 3 (which we are, I believe).

# We need to make sure that the approval is synchronized with the request
# to reingest. Occasionally a reingest will get stopped at approval when
# it doesn't need to if we just wait a little longer or try again.
for retry in range(approval_retries):
Contributor

Since you're not using retry, convention is to name it _.
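
A sketch of that retry loop with the underscore convention applied (the helper and its parameters are illustrative, not the script's actual code):

```
import time


def approve_with_retries(try_approve, approval_retries, retry_delay=1.0):
    """Call a zero-argument approval callable a bounded number of times.

    The loop variable is unused, so it is named "_" by convention.
    """
    for _ in range(approval_retries):
        if try_approve():
            return True
        time.sleep(retry_delay)
    return False
```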

@qubot qubot force-pushed the dev/cca-11686-reingest-script branch 2 times, most recently from cb377cd to 6e5efb0 on May 21, 2018 21:57
@@ -60,6 +60,9 @@ def get_parser(doc):
parser.add_argument('--hide', action='store_true',
help='If set, hide the Transfers and SIPs in the '
'dashboard once they complete.')
parser.add_argument('--delete', action='store_true', help='If set, delete '
Contributor Author

@jrwdunham this file shouldn't have been committed. If I have my git flight rules correct then once this set of reviews is complete, I can rebase and squish everything and then reset this file to its original state as part of that process. Alternatively, I can checkout the previous version and place it into source control that way, resetting an original mistake.

Contributor

So you're saying that lines 63-65 here should not be part of this PR?

Contributor Author

Yup.

@ross-spencer
Contributor Author

@jrwdunham this should be ready for further code review now. I've tried to separate it as logically as possible, but looking back over the commits it might be a little difficult to follow. Let me know if I can do anything to improve its clarity.

@ross-spencer
Contributor Author

Hi @jrwdunham. Do you know when you might be able to look at the changes I have made following your code review? It would be useful to get this work completed, and also to get an indication of how much I might have left to tackle after a second review.

Contributor

@jrwdunham jrwdunham left a comment

I have a handful of new comments, suggestions and questions, but I think this is about ready to merge. I haven't tested it against a dev instance to see that it works as it should, but I'm assuming that @ross-spencer and Tim have been doing that.

@ross-spencer I realize that there is an internal RM issue that provides context for this PR, but maybe you could create a GH issue also; one which succinctly describes the problem that motivates this work, and then link this PR to that issue.

try:
    with open(listfile) as aip_list:
        userlist = json.loads(aip_list.read().replace("'", '"'))
        return userlist
Contributor

Seems a bit odd to anticipate malformed JSON and fix it in this way. I understand that we are expecting the JSON file to just be an array of UUIDs. However, if this code was used to parse a file containing ['abc', "abc'def"] it would raise a ValueError. Maybe we should just add a comment above the json.loads line that explains to future devs that we are allowing invalid JSON input here.

Maybe also change docstring to 'Load AIPs from a user provided file containing an array of AIP UUIDs.'

Contributor

Postel's law: never heard of it, cool idea, good retort.
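
A sketch of the loader with the suggested comment and docstring folded in (the function name is assumed; the body comes from the snippet quoted above):

```
import json


def load_aip_list(listfile):
    """Load AIPs from a user provided file containing an array of AIP UUIDs."""
    with open(listfile) as aip_list:
        # Be liberal in what we accept (Postel's law): the file should be a
        # JSON array of UUID strings, so single quotes are swapped for double
        # quotes rather than rejecting otherwise-invalid JSON outright.
        return json.loads(aip_list.read().replace("'", '"'))
```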


def load_db(session, aiplist):
    """If we have AIPs to reingest, load the database."""
    if isinstance(aiplist, str):
Contributor

I think you should use isinstance(aiplist, six.text_type) here so that unicode objects are handled too and so that it's Py2/3 compatible.

Contributor Author

Sounds good. Sorry, I did recognise that the six library was used in parts of our code; I probably need to read up on it and be a bit more consistent about using it.
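
A minimal illustration of the suggested check (simplified, not the merged load_db):

```
import six


def is_text(value):
    """True for str on Python 3 and unicode on Python 2 (six.text_type)."""
    return isinstance(value, six.text_type)
```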

setattr(amclient, "processing_config", None)
setattr(amclient, "sip_uuid", None)
setattr(amclient, "transfer_directory", None)
setattr(amclient, "transfer_uuid", None)
Contributor

Can you make sure the docstring in this function is accurate? You write "We avoid having to use setattr(...)". Do you mean "We avoid having to use getattr(...)"? Small typo in "hereto". Also, why aren't you just using dot notation here, e.g.,

amclient.aip_uuid = None
amclient.package_uuid = None

etc.

Contributor Author

I can update the docstring. The reason we're not using dot notation here is what should be explained more clearly: if the attribute isn't set in the class via initialisation (or isn't already a member variable) then we get an AttributeError. That's what I see when I test it that way. I thought it was clearer, but if there's some doubt it might be worth checking out the behaviour for yourself too.

Contributor

I'm sorry, I'm still a bit puzzled by this. As far as I know, setattr(amclient, "sip_uuid", None) and amclient.sip_uuid = None are directly equivalent. The only reason to use setattr is when the name of the attribute is only known from a variable. Is that not true? If Postel would be interjecting here, feel free to ignore :)
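
A quick self-contained check of the equivalence described here (AMClientStub is a stand-in; the real class could only behave differently if it defined __slots__ or a custom __setattr__):

```
class AMClientStub(object):
    pass


amclient = AMClientStub()
amclient.sip_uuid = None                  # plain assignment: no AttributeError
setattr(amclient, "transfer_uuid", None)  # equivalent; useful when the name is a variable
assert amclient.sip_uuid is None and amclient.transfer_uuid is None
```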

@qubot qubot force-pushed the dev/cca-11686-reingest-script branch from 6e5efb0 to 236129d on June 7, 2018 14:53
@ross-spencer
Contributor Author

@jrwdunham I don't know if you want to just check these final two commits - I made your changes with six and then updated the README.md as I realised it had become out of date. It is rebased on master, so it just needs squishing once it gets the thumbs-up.

@ross-spencer
Contributor Author

Hi @timothyryanwalsh, once @jrwdunham has checked over the final README.md changes we'll get this merged into the main branch.

Is there someone at CCA I should provide an update to as well?

@tw4l
Contributor

tw4l commented Jun 8, 2018

Hi @ross-spencer, that's great! I would let @stefanabreitwieser know - she's the interim Archivematica manager while the CCA searches for a new Digital Archivist.

@stefanabreitwieser

Noted -- thanks @timothyryanwalsh for the heads up and @ross-spencer and all for your work!

@@ -567,39 +567,44 @@ Basic usage: `amclient.py <subcommand> [optional arguments]

* close-completed-transfers
  * purpose: close all completed transfers (those not failed or rejected)
  * positional argument: am_api_key - API key for Archivematica dashboard user
  * positional argument:
Contributor

Is there a reason that we are not backtick-enclosing these positional argument values like we are with the optional ones? Not that you need to fix this in this PR, just asking.

We should be able to auto-generate this portion (### Subcommands and arguments) of the README from the code (amclientargs.py::SUBCOMMANDS) itself.

Contributor Author

I think that's definitely something we should aim for in another Issue/PR. I just followed the style already adopted, but auto-generating this text will make it easier to adjust the formatting in future.
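
As a follow-up thought on the auto-generation idea, a self-contained sketch (the data layout is invented for illustration and is not the real amclientargs.py::SUBCOMMANDS structure):

```
SUBCOMMANDS = {
    "close-completed-transfers": {
        "purpose": "close all completed transfers (those not failed or rejected)",
        "positional": ["am_api_key - API key for Archivematica dashboard user"],
    },
}


def render_subcommands(subcommands):
    """Render the '### Subcommands and arguments' section of the README."""
    lines = []
    for name, spec in sorted(subcommands.items()):
        lines.append("* {}".format(name))
        lines.append("  * purpose: {}".format(spec["purpose"]))
        for arg in spec["positional"]:
            lines.append("  * positional argument: `{}`".format(arg))
        lines.append("")
    return "\n".join(lines)


print(render_subcommands(SUBCOMMANDS))
```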

* optional arguments:
  * `--ss-user-name <username>` - Storage Service username (default: `test`)
  * `--ss-url <url>` - Storage Service URL (default: `http://127.0.0.1:8000`)

* aips2dips
  * purpose: print all AIPs in the Storage Service along with their
    corresponding DIPs
  * positional argument: ss_api_key - Storage Service API key
  * positional argument:
    * ss_api_key - Storage Service API key
Contributor

Maybe positional argument: and optional argument: should be positional argument(s): and optional argument(s):. What do you think?

Contributor Author

@ross-spencer ross-spencer Jun 23, 2018

I do agree and reckon we can tackle this as part of Issue #83.

README.md Outdated
* `--ss-url <url>` - Storage Service URL (default: `http://127.0.0.1:8000`)

* get-transfer-status
  * purpose: print the status of a transfer if it exists in a transfer
Contributor

@jrwdunham jrwdunham Jun 18, 2018

Reifying "transfer workflow" sounds odd to me. Why are "transfer" and "ingest" workflows discussed here and below? Why not just say "if it exists in a pipeline"? This is probably overly nitpicky so ignore at your leisure.

Contributor Author

@ross-spencer ross-spencer Jun 23, 2018

A good call. Changed to pipeline.

* optional arguments
  * processing-config
  * `--am-user-name <username>` - username for Archivematica dashboard user
  * `--am-url <url>` - Archivematica URL (default: `http://127.0.0.1`)
Contributor

@jrwdunham jrwdunham Jun 18, 2018

Again, not your responsibility here and now, but seeing all this repetition makes me think a) that a lot of it could be removed and b) that we should allow for the specification of a config file so that all but the essential becomes optional. (Remembering now that you have introduced a config file for the reingest module, but it may still apply to amclient.)

Contributor Author

Logged in #84
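
For what it's worth, a hypothetical sketch of the config-file idea for amclient (the file name, keys and choice of flags are assumptions layered on the documented defaults, not an existing feature):

```
import argparse
import json


def parse_args_with_config(argv=None, config_path="amclientconfig.json"):
    """Seed argparse defaults from an optional JSON config file so that only
    the essentials need to be passed on the command line.
    """
    parser = argparse.ArgumentParser(description="amclient with config-file defaults")
    parser.add_argument("--am-url", default="http://127.0.0.1")
    parser.add_argument("--ss-url", default="http://127.0.0.1:8000")
    try:
        with open(config_path) as config_file:
            # Keys in the JSON file use argparse dest names, e.g. "am_url".
            parser.set_defaults(**json.load(config_file))
    except IOError:
        pass  # No config file: fall back to the built-in defaults.
    return parser.parse_args(argv)
```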

README.md Outdated

The *transfers/reingest.py* script is a module and CLI that provides
functionality for bulk reingest of compressed AIPs. Given a user formatted
list of AIP UUIDs then it can also complete the bulk-reingest of any AIP
Contributor

Seems odd to use "bulk reingest" and then "bulk-reingest" right afterwards. Is the distinction semantically warranted?

I would remove the "then" and the "also" making it "Given a user formatted
list of AIP UUIDs, it can complete the bulk-reingest of any AIP listed." Why is this "also"? Isn't it just elaborating on the first sentence?

Contributor

@jrwdunham jrwdunham left a comment

@ross-spencer This looks good. I've made some more comments/questions/suggestions (and found one real typo), but I think you can merge this after deciding what, if anything, you want to do about those.

README.md Outdated
```

*Reingest.py* is best used via the shell script provided in the *transfers/
examples/reingest* folder. As it is designed for bulk-reingest, it is also
Contributor

I think the linebreak in your path is breaking the markdown and also I think the dirpath should end with a forward slash.

Also, the second sentence doesn't read well. I think changing "an example is provided" to "an example of which is provided" would fix it.

README.md Outdated

Configuration of reingest.py is done via *transfers/reingestconfig.json*. You
will need to configure API parameters via `reingestconfig.json` as they will
be used to make calles to the Archivematica Client module `amclient.py`
Contributor

typo: calles.

Contributor Author

👍

@jrwdunham
Contributor

@ross-spencer I think you'll want to squash some of these commits before merging

@ross-spencer
Contributor Author

Thanks @jrwdunham, I will go through this in detail tomorrow evening. On merging, should I also delete these branches:

* dev/reingest-automation
* dev/issue-10180-reingest-aip

Per some of the code-review input above, I think we have most of the content from both covered. What do you think?

This is a script for the Canadian Center for Architects. The purpose
is to bulk reingest AIPs in Archivematica from a single pipeline.
The script can be run using the same mechanism as others in the
automation tools repository, i.e. running a crontab task every (x)
minutes. In support of this script, PR #57 was submitted to enhance
the amclient.py script's capabilities.

* Resolves #81 create a bulk-reingest mechanism
* Resolves #72 write-mode when opening the PID file
@qubot qubot force-pushed the dev/cca-11686-reingest-script branch from a2b57ae to 7864cfc on June 23, 2018 01:28
@qubot qubot merged commit 7864cfc into master Jun 23, 2018
@qubot qubot deleted the dev/cca-11686-reingest-script branch June 23, 2018 01:33
@jrwdunham
Copy link
Contributor

@ross-spencer the https://github.com/artefactual/automation-tools/commits/dev/reingest-automation branch is tied to the still-open PR #17. Before deleting that branch, I think one of us needs to document on PR #17 why we think closing the PR and deleting the dev branch is justified, given the work you've introduced here. I could do that, or you could, but I think there are a couple of hours of analysis work to do there before we delete that branch.

I think it should be okay to delete the dev/issue-10180-reingest-aip branch. I'm having trouble remembering (re-discovering) the context for that work and why it was left in the state it's in. Anyways, I've pulled it to my local git, so I'm comfortable with us deleting it from origin.
