More docs, fillna for optional fields on image ingestion.
mikejcorey committed Jun 27, 2024
1 parent 04f0378 commit a93cc66
Showing 6 changed files with 156 additions and 1 deletion.
4 changes: 4 additions & 0 deletions apps/deed/management/commands/gather_deed_images.py
@@ -224,6 +224,10 @@ def build_django_objects(self, matching_keys, workflow):
         deed_pages_df = self.add_merge_fields(deed_pages_df, workflow)
         deed_pages_df = self.add_supplemental_info(deed_pages_df, workflow)
 
+        # Fill na on optional fields
+        if 'batch_id' in deed_pages_df.columns:
+            deed_pages_df[['batch_id']] = deed_pages_df[['batch_id']].fillna('')
+
         # Drop duplicates again just in case
         deed_pages_df = deed_pages_df.drop_duplicates(subset=['s3_lookup'])

2 changes: 1 addition & 1 deletion apps/deed/management/commands/gather_image_hits.py
@@ -96,7 +96,7 @@ def build_match_report(self, workflow, matching_keys):
         report_df['citizen_count'] = 0
 
         report_df['deathcert_count'] = 0
-        death_certs = ['death certificate', 'certificate of death', 'date of death']
+        death_certs = ['death certificate', 'certificate of death', 'date of death', 'name of deceased']
         for term in death_certs:
             if term in report_df.columns:
                 print(report_df[term].apply(lambda x: self.split_or_1(x)))
4 changes: 4 additions & 0 deletions docs/index.rst
@@ -37,8 +37,12 @@ The Deed Machine was created at Mapping Prejudice at the University of Minnesota
   :maxdepth: 2
   :caption: Common workflows

   modules/starting-a-workflow.rst
   modules/uploading-files.rst
   modules/ingesting-hits.rst
   modules/downloading-new-results.rst
   modules/manual-data-cleaning.rst


.. toctree::
   :maxdepth: 2
19 changes: 19 additions & 0 deletions docs/modules/ingesting-hits.rst
@@ -0,0 +1,19 @@
Ingesting results of initial processing
=======================================

After the images have been uploaded and initial processing is complete, it's time to ingest the results into the Deed Machine's Django component.

Before ingestion, be sure to create a regular expression for filepath data extraction and any needed supplemental info as outlined in :ref:`starting-a-workflow`.

1. Gather the results of the document image uploads into the Django app. Optionally, add supplemental info, such as missing document numbers, that has been provided in a separate CSV.

.. code-block:: bash

    python manage.py gather_deed_images --workflow "WI Milwaukee County"

2. Gather the list of positive matches for racially restrictive language and join them to the deed image records in the Django app.

.. code-block:: bash

    python manage.py gather_image_hits --workflow "WI Milwaukee County"

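The two ingestion steps can also be chained from a small script. The following is a minimal sketch, not part of the Deed Machine itself: the helper names ``build_ingest_commands`` and ``run_ingest`` are hypothetical, and the script assumes it is run from the directory that contains ``manage.py``. Only the two management commands and the ``--workflow`` flag come from the steps above.

```python
import subprocess
import sys

# Management commands from the steps above, in the order they must run.
INGEST_COMMANDS = ["gather_deed_images", "gather_image_hits"]


def build_ingest_commands(workflow):
    """Return one argv list per ingestion step for the given workflow name.

    Hypothetical helper: it only assembles the command lines shown above.
    """
    return [
        [sys.executable, "manage.py", command, "--workflow", workflow]
        for command in INGEST_COMMANDS
    ]


def run_ingest(workflow):
    """Run each step in order; check=True raises on the first failing step."""
    for argv in build_ingest_commands(workflow):
        subprocess.run(argv, check=True)
```

Calling ``run_ingest("WI Milwaukee County")`` would run the image gathering step first and only continue to the hit gathering step if the first one exits successfully.
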
86 changes: 86 additions & 0 deletions docs/modules/starting-a-workflow.rst
@@ -0,0 +1,86 @@
.. _starting-a-workflow:

Starting a workflow
===================

In the Deed Machine, each county or other jurisdiction that provides sets of records to be analyzed is represented as a ``ZooniverseWorkflow``, or workflow for short.

1. For each new workflow, start by adding an entry to the ``ZOONIVERSE_QUESTION_LOOKUP`` config dictionary in ``local_settings.py``. Because ``local_settings.py`` is ignored by git, you will need to create it in the ``racial_covenants_processor/settings/`` folder if you have not already done so. The file is imported at the end of the main settings file, ``common.py`` (which end users should generally not edit), so any settings placed in ``local_settings.py`` override those in ``common.py``.

.. code-block:: python

    ZOONIVERSE_QUESTION_LOOKUP = {
        'WI Milwaukee County': {
        },
        'MN Olmsted County': {
            ...
        }
    }

2. The folder structure and filenames of the records provided by records custodians can carry both required and bonus information about each record. For example, folders and filenames can include the document date (``doc_date``), document number (``doc_num``), or book and page (``book_id`` and ``page_num``). For each county, you will need to write a regular expression to parse the folder and filenames after they have been uploaded to S3 during the initial processing phase. It is not strictly necessary to write this regular expression before uploading files, but it is good practice to check beforehand that the folder structure and filenames as delivered can be generalized into a regular expression. Doing so avoids exceptionally complex regular expressions or costly re-uploads.

The best way to build your regular expression is to experiment at Pythex.org with sample paths from the ``s3_path`` field of the CSV files produced by the standalone uploader, which are stored in the ``data`` folder of wherever you have installed the `mp-upload-deed-images-standalone <https://github.com/UMNLibraries/mp-upload-deed-images-standalone>`_ application.

For example, during the process of ingesting results from S3 into the Deed Machine's database, the following regular expression captures data including the workflow slug, as well as the ``doc_type``, ``batch_id``, ``book_id``, ``doc_num``, and ``split_page_num`` fields that will be saved to the database.

.. code-block:: python

    ZOONIVERSE_QUESTION_LOOKUP = {
        'WI Milwaukee County': {
            ...
        },
        'MN Olmsted County': {
            'deed_image_regex': r'/(?P<workflow_slug>[A-z\-]+)/OlmstedCounty(?P<doc_type>[A-Za-z]+)/(?P<batch_id>[A-Za-z]+)/?(?P<book_id>[A-Za-z\-\d]+)?/(?P<doc_num>[A-Z\d\.]+)(?:_SPLITPAGE_)?(?P<split_page_num>(?<=_SPLITPAGE_)\d+)?',
        }
    }

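To see what the Olmsted County expression captures, it can be tried against sample paths in a Python shell as well as at Pythex.org. The sketch below copies the regular expression from the config above; the two paths and the ``parse_path`` helper are hypothetical, invented only for illustration, and real ``s3_path`` values may look different.

```python
import re

# Regular expression copied from the 'MN Olmsted County' config above.
DEED_IMAGE_REGEX = (
    r'/(?P<workflow_slug>[A-z\-]+)/OlmstedCounty(?P<doc_type>[A-Za-z]+)'
    r'/(?P<batch_id>[A-Za-z]+)/?(?P<book_id>[A-Za-z\-\d]+)?'
    r'/(?P<doc_num>[A-Z\d\.]+)(?:_SPLITPAGE_)?'
    r'(?P<split_page_num>(?<=_SPLITPAGE_)\d+)?'
)

# Hypothetical paths, invented for illustration only.
sample_paths = [
    'raw/mn-olmsted-county/OlmstedCountyAbstracts/batchA/Book-12/A1483219_SPLITPAGE_2.tif',
    'raw/mn-olmsted-county/OlmstedCountyAbstracts/batchA/A1483678_SPLITPAGE_1.tif',
]


def parse_path(path):
    """Return the named groups captured from a path, or None if nothing matches."""
    match = re.search(DEED_IMAGE_REGEX, path)
    return match.groupdict() if match else None


for path in sample_paths:
    print(parse_path(path))
```

Optional groups that do not take part in a match come back as ``None``, so the second path, which has no book folder, yields ``book_id`` of ``None`` while still capturing ``doc_num`` and ``split_page_num``.
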
To facilitate correct pagination, you will need to capture the following for each image, at minimum:

- Either a ``doc_num``, or both a ``book_id`` and ``page_num``
- The ``split_page_num`` generated by the initial processing stage when multi-page TIF files are processed. Note that while ``SPLITPAGE`` will not show up in the ``s3_path`` values in the CSVs generated by the `mp-upload-deed-images-standalone <https://github.com/UMNLibraries/mp-upload-deed-images-standalone>`_ application, it should still be accounted for in your regular expression. This means that regular expressions will almost always need to end with ``(?:_SPLITPAGE_)?(?P<split_page_num>(?<=_SPLITPAGE_)\d+)?``, as in the Olmsted County example above.
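
The minimum requirements listed above can be checked mechanically before ingestion. The following is a rough sketch under stated assumptions: ``pagination_problems`` is a hypothetical helper, not a Deed Machine function, and both test patterns are invented. It inspects only the named groups of a pattern through ``re.compile(...).groupindex`` and does not account for fields that a supplemental CSV might provide instead.

```python
import re

# Tail recommended above for capturing the split page number.
SPLITPAGE_TAIL = r'(?:_SPLITPAGE_)?(?P<split_page_num>(?<=_SPLITPAGE_)\d+)?'


def pagination_problems(pattern):
    """Return a list of problems; an empty list means the minimum is met."""
    # groupindex maps each named group in the pattern to its group number.
    groups = set(re.compile(pattern).groupindex)
    problems = []
    has_doc_num = 'doc_num' in groups
    has_book_and_page = {'book_id', 'page_num'} <= groups
    if not (has_doc_num or has_book_and_page):
        problems.append('need doc_num, or both book_id and page_num')
    if 'split_page_num' not in groups:
        problems.append('need split_page_num')
    return problems


# Hypothetical patterns, for illustration only.
good_pattern = r'/(?P<doc_num>[A-Z\d]+)' + SPLITPAGE_TAIL
bad_pattern = r'/(?P<book_id>\d+)/(?P<doc_name>[a-z]+)'

print(pagination_problems(good_pattern))
print(pagination_problems(bad_pattern))
```

A check like this only confirms that the groups exist in the pattern. It says nothing about whether they capture the right text, so testing against real sample paths is still needed.
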

4. (Optional) If the required ``doc_num`` or ``book_id``\/``page_num`` combination cannot be parsed from the folder and filenames, include a supplemental CSV at the time of ingestion, after initial processing. This CSV allows the Deed Machine to link additional information to each image by using a lookup table based on metadata pulled from the image's folder and path name.

To add data from one or more supplemental CSV files, add a ``deed_supplemental_info`` list to the ``ZOONIVERSE_QUESTION_LOOKUP`` config object:

.. code-block:: python

    ZOONIVERSE_QUESTION_LOOKUP = {
        'WI Milwaukee County': {
            ...
        },
        'MN Olmsted County': {
            'deed_image_regex': r'/(?P<workflow_slug>[A-z\-]+)/OlmstedCounty(?P<doc_type>[A-Za-z]+)/(?P<batch_id>[A-Za-z]+)/?(?P<book_id>[A-Za-z\-\d]+)?/(?P<doc_num>[A-Z\d\.]+)(?:_SPLITPAGE_)?(?P<split_page_num>(?<=_SPLITPAGE_)\d+)?',
            'deed_supplemental_info': [
                {
                    'data_csv': '/Users/mcorey/Documents/Deed projects/mn/ramsey/ramsey_recorder_supplemental_info/Abstract_20191106_header.csv',  # Absolute path to supplemental CSV
                    'join_field_deed': 'doc_alt_id',  # Join field drawn from imported image path
                    'join_field_supp': 'itemnum',  # Join field in supplemental CSV
                    'mapping': {
                        'doc_num': 'mp_doc_num',  # Deed Machine varname: CSV column name
                        'doc_type': 'landtype'  # Deed Machine varname: CSV column name
                    }
                }
            ],
        }
    }

In the example above, the ingestion process expects the regular expression to capture a ``doc_alt_id`` field for each ingested file, whose value matches the ``itemnum`` column in the supplemental spreadsheet. Based on the ``mapping`` section of the ``deed_supplemental_info`` entry, values in the CSV's ``mp_doc_num`` column are ingested into the Deed Machine's ``doc_num`` field, and values in the CSV's ``landtype`` column are ingested into the Deed Machine's ``doc_type`` field.

Sample supplemental CSV with matching data:

+----------+---------+--------------------------------------+---------+----------+----------------+-------------+
| itemnum  | pagecnt | itemname                             | docnum  | landtype | instrumenttype | mp_doc_num  |
+==========+=========+======================================+=========+==========+================+=============+
| 12117219 | 1       | ABSTRACT - 1483219 - - R-CONVERSION  | 1483219 | ABSTRACT | R-CONVERSION   | A1483219    |
+----------+---------+--------------------------------------+---------+----------+----------------+-------------+
| 12117223 | 1       | ABSTRACT - 1483678 - - R-CONVERSION  | 1483678 | ABSTRACT | R-CONVERSION   | A1483678    |
+----------+---------+--------------------------------------+---------+----------+----------------+-------------+
| 12117224 | 1       | ABSTRACT - 1483679 - - R-CONVERSION  | 1483679 | ABSTRACT | R-CONVERSION   | A1483679    |
+----------+---------+--------------------------------------+---------+----------+----------------+-------------+
| 12117228 | 1       | ABSTRACT - 1485353 - - R-CONVERSION  | 1485353 | ABSTRACT | R-CONVERSION   | A1485353    |
+----------+---------+--------------------------------------+---------+----------+----------------+-------------+
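
The lookup described above can be illustrated with plain Python dictionaries. This is a simplified sketch, not the Deed Machine's implementation (the ingestion commands in this commit appear to work on pandas DataFrames instead). The ``apply_supplemental_info`` helper and the two deed records are hypothetical; the CSV rows come from the sample table, and the config keys mirror one entry of ``deed_supplemental_info``, with the CSV supplied inline rather than through ``data_csv``.

```python
import csv
import io

# Two rows from the sample table above, written out as CSV text.
SUPPLEMENTAL_CSV = """itemnum,pagecnt,itemname,docnum,landtype,instrumenttype,mp_doc_num
12117219,1,ABSTRACT - 1483219 - - R-CONVERSION,1483219,ABSTRACT,R-CONVERSION,A1483219
12117223,1,ABSTRACT - 1483678 - - R-CONVERSION,1483678,ABSTRACT,R-CONVERSION,A1483678
"""

# Same join and mapping keys as the deed_supplemental_info entry above.
supplemental_config = {
    'join_field_deed': 'doc_alt_id',
    'join_field_supp': 'itemnum',
    'mapping': {'doc_num': 'mp_doc_num', 'doc_type': 'landtype'},
}


def apply_supplemental_info(deed_records, csv_text, config):
    """Copy mapped CSV columns onto each deed record whose join value matches."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lookup = {row[config['join_field_supp']]: row for row in reader}
    merged = []
    for record in deed_records:
        updated = dict(record)  # Leave the input record untouched.
        supp_row = lookup.get(record.get(config['join_field_deed']))
        if supp_row is not None:
            for deed_field, csv_column in config['mapping'].items():
                updated[deed_field] = supp_row[csv_column]
        merged.append(updated)
    return merged


# Hypothetical records, as if doc_alt_id had been parsed from the image path.
deed_records = [
    {'s3_lookup': 'example/one', 'doc_alt_id': '12117219'},
    {'s3_lookup': 'example/two', 'doc_alt_id': '99999999'},
]
merged = apply_supplemental_info(deed_records, SUPPLEMENTAL_CSV, supplemental_config)
print(merged)
```

In this sketch, a record whose ``doc_alt_id`` has no matching ``itemnum`` is passed through unchanged, so it gains neither ``doc_num`` nor ``doc_type``.
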


3. Create a Django ``ZooniverseWorkflow`` object.

.. code-block:: bash

    python manage.py create_workflow --workflow "MN Olmsted County"

42 changes: 42 additions & 0 deletions docs/modules/uploading-files.rst
@@ -0,0 +1,42 @@
Uploading images/initial processing
===================================

Option 1: Upload from local installation
----------------------------------------

If the property images to be processed are stored on a drive attached to the same computer where the Deed Machine is installed locally, then the ``upload_deed_images`` command can be run directly from the local installation.

.. code-block:: bash

    python manage.py upload_deed_images --workflow "WI Milwaukee County"

To use this option, the XXXX values must be added to the ``ZOONIVERSE_QUESTION_LOOKUP`` workflow config lookup for this workflow in ``local_settings.py``.


Option 2: Upload with the standalone Deed Machine uploader
----------------------------------------------------------

(Recommended for large sets of images)

Deed images are often stored on a local machine or network drive, and it is not feasible or efficient to move them. The standalone uploader lets you upload from that computer without doing a full Deed Machine install on it, which is particularly useful when moving millions of files would be time-consuming or present storage issues.

- `mp-upload-deed-images-standalone <https://github.com/UMNLibraries/mp-upload-deed-images-standalone>`_


Related commands
----------------

To go back and re-OCR records that had errors:

.. code-block:: bash

    python manage.py trigger_ocr_cleanup --workflow "WI Milwaukee County"

To redo the search terms and image optimization steps while skipping the more costly OCR step:

.. code-block:: bash

    python manage.py trigger_lambda_refresh --workflow "WI Milwaukee County"

To delete image files from S3 (warning: this cannot be undone):

.. code-block:: bash

    python manage.py delete_raw_images --workflow "Your workflow here"
