National Monument Audit

Under construction until Spring 2021

The front-end interface is here, and full documentation of the technical process is here.

This repository will contain:

  1. Scripts that process monument data from a number of sources
  2. The latest processed aggregate monument dataset
  3. An interface for exploring that dataset

Code repository organization

  1. ./app/ directory contains the files (HTML, CSS, JS, JSON) for the front-end user interface. The files in ./app/data/ are generated by running the Python scripts outlined further down in this document.
  2. ./config/ directory contains configuration files in JSON format. These are used during the data processing step (via Python scripts) outlined further down in this document.
    1. data-model.json is the main configuration file that contains all the logic for generating the monument study set (see the example sketch after this list).
      1. "fields" key contains a list of all the fields that will be indexed by the search engine
      2. "fieldsForEntities" contains the fields from which PEOPLE and EVENT entities should be extracted
      3. "conditionalFieldsForEntities" is the same as above, but only includes entities that are also part of the object's name
      4. "types" is a list of rules for determining an object's group (e.g. Marker, Building, Monument, etc.) and a monument's type (e.g. pyramid, bust, obelisk, etc.) when applicable.
    2. ingest/ directory contains one JSON file per data source that is processed. Each of these files contains the logic for how the source data set should be parsed and mapped to the study set fields.
  3. ./data/ directory contains all the data (before, during, and after processing):
    1. ./compiled/ contains the output data from the data processing (as a result of running Python scripts outlined below)
    2. ./preprocessed/ contains cached data (e.g. geocoded locations) used during the data processing; this reduces the amount of time it takes to re-run the scripts.
    3. ./vendor/ directory contains all the raw data collected from the data sources. This also contains pre-processed data as a result of custom scripts written for parsing a particular data source (see "./scripts/" description below)
    4. corrections.csv contains manual corrections to the data
    5. entities_add.csv contains entities (PEOPLE or EVENTS) that should be included but were not automatically extracted from the entity recognition process
    6. entities_aliases.csv contains a list of aliases for entities (PEOPLE or EVENTS), e.g. "Cristoforo Colombo" refers to the same person as "Christopher Columbus"
  4. ./lib/ contains custom Python libraries for use in Python scripts
  5. ./scripts/ contains custom scripts that were used to preprocess data sources. This includes scripts for downloading and parsing websites.
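
As referenced above, the sketch below is a rough, hypothetical illustration of the shape of ./config/data-model.json. Only the four top-level keys come from the list above; the field names, and especially the structure of the "types" rules, are made up for illustration and do not reflect the actual configuration:

    {
      "fields": ["Name", "Alternate Name", "City", "State", "Source"],
      "fieldsForEntities": ["Name", "Description"],
      "conditionalFieldsForEntities": ["Alternate Name"],
      "types": [
        {"group": "Monument", "type": "obelisk", "pattern": "obelisk"},
        {"group": "Marker", "type": "plaque", "pattern": "plaque|marker"}
      ]
    }

Refer to the actual data-model.json in ./config/ for the real field list and rule syntax.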

For running scripts

You must use Python 3.x to run the scripts in this repository (developed using 3.6.8). To install requirements:

pip install -r requirements.txt

If you will be doing named entity extraction, you must also download the spaCy language model:

python -m spacy download en_core_web_sm

If you are doing entity linking, you should also unzip ./data/wikidata.zip into ./data/wikidata/. This contains pre-processed Wikidata used for entity linking. If you don't do this, entity linking will be done from scratch and will take a long time to process.
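
For example, from the repository root (assuming the zip archive contains a wikidata/ folder at its top level; adjust the target directory if your archive is laid out differently):

    unzip ./data/wikidata.zip -d ./data/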

Adding new data sources

  1. Create a new .json file in folder ./config/ingest/. You can copy the contents of an existing .json file as a template

    1. You can refer to the existing configuration files in ./config/ingest/ as examples
  2. Run python run.py -entities. This will re-process all the data and update the compiled data files as well as the data for the app. The -entities flag adds a step that searches for new PERSON and EVENT entities in the new data. This can take a while. If you are making only minor tweaks to the data (i.e. nothing that requires reprocessing entities), you can omit that flag.

    1. Behind the scenes this runs a number of scripts sequentially:

      python ingest.py
      python extract_entities.py
      python normalize_entities.py
      python resolve_entities.py
      python visualize_entities.py
      python ingest.py
      
    2. ingest.py is the central script that contains all the logic for transforming the source data into the compiled study set and interface. The next four scripts contain the steps for doing entity recognition and entity linking (to Wikidata entries). ingest.py must be run again after processing entities since entities are used when determining if an object is a monument or not.

    3. You can run each/any of these manually. Note that extract_entities.py takes a very long time and is only needed if you need to re-analyze the data to extract named entities or if you edited data/entities_add.csv (a .csv file that manually adds names to look for in the data). If you only edited data/entities_aliases.csv (alternative names for named entities), you can skip extract_entities.py and start with normalize_entities.py.

  3. Next you will need to index the data for the search interface; you can do this by running:

    python index.py -out "search-index/documents-2020-01-01/"
    

    The output folder name can be anything; I usually use the current date. If you do not pass in a folder name, the output will be written to search-index/documents-latest/ (note, the script will always create a back-up directory at search-index/backup/YYYY-MM-DD-HH-MM/).

    Optionally you can add the path to the previous index output to look for deletions (otherwise, no documents will ever be deleted; only updated or added). You should include this parameter if you made changes that would remove records; otherwise, there will be stale/outdated records in the search index. (If stale records do end up in the index, you can run python index.py -clean and go to the next step.)

    python index.py -out "search-index/documents-2020-01-01/" -prev "search-index/documents-2019-12-01/"
    

    This will generate a number of batch json files in the output directory. Each individual batch file should be under 5MB; otherwise AWS will reject it. You can adjust the batch size by increasing or decreasing the records per batch file, e.g.:

    python index.py -out "search-index/documents-2020-01-01/" -prev "search-index/documents-2019-12-01/" -batchsize 2000
    
  4. If you received no error in the previous step, you can now upload the batch files to the AWS CloudSearch Index. Before you do this, you will need to install AWS CLI and set your credentials (you will need permission to post new CloudSearch documents):

    aws configure --profile monumentlab
    

    Then follow the prompts for entering your key and secret, and use region us-east-1. This will store your credentials under profile "monumentlab". Then you can run the following script to upload the records from the previous step:

    python index_upload.py -in "search-index/documents-2020-01-01/*.json"
    

    If no directory is passed in, search-index/documents-latest/*.json will be uploaded. It may take some time for re-indexing, but it should happen automatically. You can manually refresh the index through the AWS console.

  5. To view the changes locally, you will need to install (only once) and run the node server:

    npm install
    npm start

    1. You can now view the dashboard on localhost:2020/app/map.html
    2. Committing and pushing your changes to the main branch will automatically update the online interface
    3. For debugging purposes, there is also an advanced search interface that exposes the full dataset (before filtering out non-monuments) as well as all the raw fields

Manual corrections

There are three .csv files that track manual corrections to the data:

  1. data/corrections.csv - Manual corrections to any record's field

    • You must provide four things: (1) Record Id (e.g. "osm_3461770102"), (2) Field (e.g. "Entities People"), (3) Correct Value (e.g. "Martin Luther King Jr."), and (4) Action ("set", "add", or "remove"); see the example rows at the end of this section
    • Action "set" will set the field to the "Correct Value", overwriting any existing value. If the value is a list, it should be pipe ( | ) delimited
    • Action "add" will add this value to the existing value, assuming the existing value is a list (e.g. "Entities People", "Subjects", "Object Types")
    • Action "remove" will remove this value from the existing value, assuming the existing value is a list (e.g. "Entities People", "Subjects", "Object Types")
    • Note that if you correct a child record (of a merged record), you must also manually update the parent record (or simply update the parent record alone)
    • If you only edit this file, all you need to run is:
    python ingest.py
    

    Then follow the indexing process starting at step 3 in the previous section.

  2. data/entities_aliases.csv - Alternative names (aliases) of existing people or events

    • E.g. "Gen. U.S. Grant" is an alias of "Ulysses S. Grant"
    • Note this is only for existing names
    • The "target" field should be the "official" spelling of the name as it exists on Wikipedia. The "official" spelling will usually have the highest number on the "People" drop-down list in the interface.
    • If you only edit this file, you need to run:
    python normalize_entities.py
    python resolve_entities.py
    python visualize_entities.py
    python ingest.py
    

    Then follow the indexing process starting at step 3 in the previous section.

  3. data/entities_add.csv - More or less identical to the above file, but this is for names that currently do not exist in the data (were not recognized by the named entity extraction process)

    • If you edit this file, it takes the longest to re-process since we need to re-analyze the entities:
    python extract_entities.py
    python normalize_entities.py
    python resolve_entities.py
    python visualize_entities.py
    python ingest.py
    

    Then follow the indexing process starting at step 3 in the previous section.
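
As referenced in the notes on data/corrections.csv above, the rows below are a hypothetical illustration of two of these files. The column headers are assumptions based on the descriptions in this section (check the actual .csv files for the real headers); the example values are drawn from the examples above.

data/corrections.csv:

    Record Id,Field,Correct Value,Action
    osm_3461770102,Entities People,Martin Luther King Jr.,set

data/entities_aliases.csv:

    alias,target
    Gen. U.S. Grant,Ulysses S. Grant
    Cristoforo Colombo,Christopher Columbus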
