More docs, fillna for optional fields on image ingestion.
1 parent
04f0378
commit a93cc66
Showing 6 changed files with 156 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
Ingesting results of initial processing | ||
======================================= | ||
|
||
After image upload and initial processing are complete, it's time to ingest the results into the Deed Machine's Django component. | ||
|
||
Before ingestion, be sure to create a regular expression for filepath data extraction and any needed supplemental info as outlined in :ref:`starting-a-workflow`. | ||
|
||
1. Gather the results of document image uploads into the Django app. Optionally, add supplemental info, such as missing document numbers provided in a separate CSV. | ||
|
||
.. code-block:: bash | ||
python manage.py gather_deed_images --workflow "WI Milwaukee County" | ||
2. Gather the list of positive matches for racially restrictive language and join it to the deed image records in the Django app. | ||
|
||
.. code-block:: bash | ||
python manage.py gather_image_hits --workflow "WI Milwaukee County" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,86 @@ | ||
.. _starting-a-workflow: | ||
|
||
Starting a workflow | ||
=================== | ||
|
||
In the Deed Machine, each county or other jurisdiction that provides sets of records to be analyzed is represented as a ZooniverseWorkflow, or workflow for short. | ||
|
||
1. For each new workflow, start by adding an entry to the ``ZOONIVERSE_QUESTION_LOOKUP`` config dictionary in ``local_settings.py``. Because ``local_settings.py`` is ignored by git, create it in the ``racial_covenants_processor/settings/`` folder if you have not already done so. This file is imported at the end of the main settings file, ``common.py`` (which end users should generally not edit), so settings placed in ``local_settings.py`` override those in ``common.py``. | ||
|
||
.. code-block:: python | ||
ZOONIVERSE_QUESTION_LOOKUP = { | ||
'WI Milwaukee County': { | ||
}, | ||
'MN Olmsted County': { | ||
... | ||
} | ||
} | ||
2. The folder structure and filenames of the records provided by records custodians can carry both required and bonus information about each record. For example, folders and filenames can include the document date (``doc_date``), document number (``doc_num``), and book and page (``book_id`` and ``page_num``). For each county, you will need to write a regular expression to parse the folder names and filenames after they have been uploaded to S3 during the initial processing phase. It is not strictly necessary to write this regular expression before file upload, but it is good practice to check beforehand that the folder structure and filenames as delivered can be generalized into a regular expression. Doing so avoids both exceptionally complex regular expressions and costly re-uploads. | ||
|
||
The best way to build your regular expression is to experiment at Pythex.org with sample paths from the ``s3_path`` field of the CSV files produced by the standalone uploader, which are stored in the ``data`` folder of wherever you have installed the `mp-upload-deed-images-standalone <https://github.com/UMNLibraries/mp-upload-deed-images-standalone>`_ application. | ||
|
||
For example, during the process of ingesting results from S3 into the Deed Machine's database, the following regular expression captures data including the workflow slug, as well as the ``doc_type``, ``batch_id``, ``book_id``, ``doc_num``, and ``split_page_num`` fields that will be saved to the database. | ||
|
||
.. code-block:: python | ||
ZOONIVERSE_QUESTION_LOOKUP = { | ||
'WI Milwaukee County': { | ||
... | ||
}, | ||
'MN Olmsted County': { | ||
'deed_image_regex': r'/(?P<workflow_slug>[A-z\-]+)/OlmstedCounty(?P<doc_type>[A-Za-z]+)/(?P<batch_id>[A-Za-z]+)/?(?P<book_id>[A-Za-z\-\d]+)?/(?P<doc_num>[A-Z\d\.]+)(?:_SPLITPAGE_)?(?P<split_page_num>(?<=_SPLITPAGE_)\d+)?', | ||
} | ||
} | ||
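One way to sanity-check the Olmsted County expression before ingestion is to run it against a few paths with Python's ``re`` module. The sketch below is illustrative only: the sample paths are hypothetical (they are not taken from a real ``s3_path`` column), and ``parse_path`` is a helper name invented for this example rather than part of the Deed Machine.

```python
import re

# Regular expression copied from the 'MN Olmsted County' config entry above.
DEED_IMAGE_REGEX = r'/(?P<workflow_slug>[A-z\-]+)/OlmstedCounty(?P<doc_type>[A-Za-z]+)/(?P<batch_id>[A-Za-z]+)/?(?P<book_id>[A-Za-z\-\d]+)?/(?P<doc_num>[A-Z\d\.]+)(?:_SPLITPAGE_)?(?P<split_page_num>(?<=_SPLITPAGE_)\d+)?'


def parse_path(path):
    """Return the named groups captured from a path, or None if no match."""
    match = re.search(DEED_IMAGE_REGEX, path)
    return match.groupdict() if match else None


# Hypothetical sample paths, invented for illustration.
SAMPLE_PATHS = [
    # Book folder present, split page suffix present.
    'raw/mn-olmsted-county/OlmstedCountyAbstracts/OldDeeds/Book-12/A12345_SPLITPAGE_2',
    # No book folder, so book_id is left uncaptured.
    'raw/mn-olmsted-county/OlmstedCountyAbstracts/OldDeeds/A12345_SPLITPAGE_1',
    # No split page suffix, so split_page_num is left uncaptured.
    'raw/mn-olmsted-county/OlmstedCountyAbstracts/OldDeeds/Book-12/A12345',
]

if __name__ == '__main__':
    for path in SAMPLE_PATHS:
        print(path)
        print('  ', parse_path(path))
```

Groups that are optional in the expression come back as ``None`` when the path does not contain them, which makes it easy to spot which fields a given folder layout fails to provide.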
In order to facilitate correct pagination, for each image you will need to capture, at minimum: | ||
|
||
- Either a ``doc_num``, or both a ``book_id`` and ``page_num`` | ||
- The ``split_page_num`` generated by the initial processing stage when multi-page TIF files are processed. Note that while ``SPLITPAGE`` will not show up in the list of ``s3_path`` values in the CSVs generated by the `mp-upload-deed-images-standalone <https://github.com/UMNLibraries/mp-upload-deed-images-standalone>`_ application, it should still be accounted for in your regular expression. This means that regular expressions will almost always need to end with ``(?:_SPLITPAGE_)?(?P<split_page_num>(?<=_SPLITPAGE_)\d+)?``, as in the example above. | ||
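The minimum capture requirements listed above can be checked mechanically, since a compiled pattern exposes its named groups through ``groupindex``. This is a minimal sketch, not part of the Deed Machine: ``check_pagination_groups`` is a hypothetical helper, and the two short patterns are simplified stand-ins for a real ``deed_image_regex``.

```python
import re

# Suffix recommended in the text for capturing the split page number.
SPLIT_PAGE_SUFFIX = r'(?:_SPLITPAGE_)?(?P<split_page_num>(?<=_SPLITPAGE_)\d+)?'


def check_pagination_groups(pattern):
    """Return a list of problems; an empty list means the minimum groups exist."""
    groups = set(re.compile(pattern).groupindex)
    problems = []
    has_doc_num = 'doc_num' in groups
    has_book_page = {'book_id', 'page_num'} <= groups
    if not (has_doc_num or has_book_page):
        problems.append('need doc_num, or both book_id and page_num')
    if 'split_page_num' not in groups:
        problems.append('need split_page_num')
    return problems


# Simplified stand-in patterns (hypothetical, for illustration only).
GOOD_PATTERN = r'/(?P<book_id>\d+)/(?P<page_num>\d+)' + SPLIT_PAGE_SUFFIX
BAD_PATTERN = r'/(?P<book_id>\d+)/(?P<batch_id>[A-Za-z]+)'

if __name__ == '__main__':
    print(check_pagination_groups(GOOD_PATTERN))
    print(check_pagination_groups(BAD_PATTERN))
```

This only inspects the regular expression itself. A pattern that fails the first check may still be workable if the missing fields are provided through a supplemental CSV.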
|
||
4. (Optional) If the required ``doc_num`` or ``book_id``\/``page_num`` combination cannot be parsed from the folder names and filenames, include a supplemental CSV at the time of ingestion after initial processing. This CSV allows the Deed Machine to link additional information to each image by using a lookup table based on metadata pulled from the image's folder and path name. | ||
|
||
To add data from one or more supplemental CSV files, add a ``deed_supplemental_info`` list to the ``ZOONIVERSE_QUESTION_LOOKUP`` config object: | ||
|
||
.. code-block:: python | ||
ZOONIVERSE_QUESTION_LOOKUP = { | ||
'WI Milwaukee County': { | ||
... | ||
}, | ||
'MN Olmsted County': { | ||
'deed_image_regex': r'/(?P<workflow_slug>[A-z\-]+)/OlmstedCounty(?P<doc_type>[A-Za-z]+)/(?P<batch_id>[A-Za-z]+)/?(?P<book_id>[A-Za-z\-\d]+)?/(?P<doc_num>[A-Z\d\.]+)(?:_SPLITPAGE_)?(?P<split_page_num>(?<=_SPLITPAGE_)\d+)?', | ||
'deed_supplemental_info': [ | ||
{ | ||
'data_csv': '/Users/mcorey/Documents/Deed projects/mn/ramsey/ramsey_recorder_supplemental_info/Abstract_20191106_header.csv', # Absolute path to supplemental CSV | ||
'join_field_deed': 'doc_alt_id', # Join field drawn from imported image path | ||
'join_field_supp': 'itemnum', # Join field in supplemental CSV | ||
'mapping': { | ||
'doc_num': 'mp_doc_num', # deed machine varname: CSV column name | ||
'doc_type': 'landtype' # deed machine varname: CSV column name | ||
} | ||
} | ||
], | ||
} | ||
} | ||
In the example above, the ingestion process expects the regular expression to capture a ``doc_alt_id`` field for each ingested file, with a value that matches a value in the ``itemnum`` column of the supplemental spreadsheet. Based on the ``mapping`` section of the ``deed_supplemental_info`` dictionary, values in the CSV's ``mp_doc_num`` column will be ingested into the Deed Machine's ``doc_num`` field, and values in the CSV's ``landtype`` column will be ingested into the Deed Machine's ``doc_type`` field. | ||
|
||
Sample supplemental CSV with matching data: | ||
|
||
+----------+---------+--------------------------------------+---------+----------+----------------+-------------+ | ||
| itemnum | pagecnt | itemname | docnum | landtype | instrumenttype | mp_doc_num | | ||
+==========+=========+======================================+=========+==========+================+=============+ | ||
| 12117219 | 1 | ABSTRACT - 1483219 - - R-CONVERSION | 1483219 | ABSTRACT | R-CONVERSION | A1483219 | | ||
| 12117223 | 1 | ABSTRACT - 1483678 - - R-CONVERSION | 1483678 | ABSTRACT | R-CONVERSION | A1483678 | | ||
| 12117224 | 1 | ABSTRACT - 1483679 - - R-CONVERSION | 1483679 | ABSTRACT | R-CONVERSION | A1483679 | | ||
| 12117228 | 1 | ABSTRACT - 1485353 - - R-CONVERSION | 1485353 | ABSTRACT | R-CONVERSION | A1485353 | | ||
+----------+---------+--------------------------------------+---------+----------+----------------+-------------+ | ||
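The join described above can be pictured with plain Python dictionaries. The following is a rough sketch of the lookup logic only, not the Deed Machine's actual ingestion code: ``apply_supplemental_info`` and the sample deed records are hypothetical, the CSV text reproduces two rows of the table above, and blank supplemental values are assumed to leave the existing field untouched.

```python
import csv
import io

# Two rows from the sample supplemental CSV above.
SUPPLEMENTAL_CSV = """itemnum,pagecnt,itemname,docnum,landtype,instrumenttype,mp_doc_num
12117219,1,ABSTRACT - 1483219 - - R-CONVERSION,1483219,ABSTRACT,R-CONVERSION,A1483219
12117223,1,ABSTRACT - 1483678 - - R-CONVERSION,1483678,ABSTRACT,R-CONVERSION,A1483678
"""

# Same shape as one entry in the 'deed_supplemental_info' list ('data_csv' omitted).
SUPP_CONFIG = {
    'join_field_deed': 'doc_alt_id',
    'join_field_supp': 'itemnum',
    'mapping': {'doc_num': 'mp_doc_num', 'doc_type': 'landtype'},
}


def apply_supplemental_info(deed_records, csv_file, config):
    """Copy mapped CSV columns onto deed records that share a join value."""
    lookup = {row[config['join_field_supp']]: row for row in csv.DictReader(csv_file)}
    for record in deed_records:
        supp_row = lookup.get(record.get(config['join_field_deed']))
        if supp_row is None:
            continue  # No matching CSV row; leave the record unchanged.
        for deed_field, csv_column in config['mapping'].items():
            if supp_row.get(csv_column):  # Assumption: skip blank values.
                record[deed_field] = supp_row[csv_column]
    return deed_records


if __name__ == '__main__':
    # Hypothetical records, as if parsed from image paths by the regex.
    deeds = [
        {'doc_alt_id': '12117219', 'doc_num': None, 'doc_type': None},
        {'doc_alt_id': '99999999', 'doc_num': None, 'doc_type': None},
    ]
    for deed in apply_supplemental_info(deeds, io.StringIO(SUPPLEMENTAL_CSV), SUPP_CONFIG):
        print(deed)
```

Keying the lookup on the supplemental join column keeps each deed record's match to a single dictionary access, which matters when a workflow contains a very large number of images.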
|
||
|
||
3. Create a Django ZooniverseWorkflow object | ||
|
||
.. code-block:: bash | ||
python manage.py create_workflow --workflow "MN Olmsted County" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
Uploading images/initial processing | ||
=================================== | ||
|
||
Option 1: Upload from local installation | ||
---------------------------------------- | ||
|
||
If the property images to be processed are stored on a drive attached to the same computer where the Deed Machine is installed locally, then the ``upload_deed_images`` command can be run directly from the local installation. | ||
|
||
.. code-block:: bash | ||
python manage.py upload_deed_images --workflow "WI Milwaukee County" | ||
To use this option, the XXXX values must be added to the ``ZOONIVERSE_QUESTION_LOOKUP`` workflow config lookup for this workflow in ``local_settings.py``. | ||
|
||
|
||
Option 2: Upload with the standalone Deed Machine uploader | ||
---------------------------------------------------------- | ||
|
||
(Recommended for large sets of images) | ||
|
||
Deed images are often stored on a local machine or network drive, and moving them is not feasible or efficient. The standalone uploader lets you upload from that computer without doing a full Deed Machine install there, which is particularly useful when moving millions of files would be time-consuming or present storage issues. | ||
|
||
- `mp-upload-deed-images-standalone <https://github.com/UMNLibraries/mp-upload-deed-images-standalone>`_ | ||
|
||
|
||
Related commands | ||
---------------- | ||
|
||
To go back and re-OCR records that had errors: | ||
|
||
.. code-block:: bash | ||
python manage.py trigger_ocr_cleanup --workflow "WI Milwaukee County" | ||
To redo the search term and image optimization steps while skipping the more costly OCR step: | ||
|
||
.. code-block:: bash | ||
python manage.py trigger_lambda_refresh --workflow "WI Milwaukee County" | ||
To delete image files from S3 (Warning: cannot be undone): | ||
|
||
.. code-block:: bash | ||
python manage.py delete_raw_images --workflow "Your workflow here" |