National Monument Audit

Under construction until Spring 2021

The front-end interface is here, and full documentation of the technical process is here.

This repository will contain:

  1. Scripts that process monument data from a number of sources
  2. The latest processed aggregate monument dataset
  3. An interface for exploring that dataset

Code repository organization

  1. ./app/ directory contains the files (HTML, CSS, JS, JSON) for the front-end user interface. The files in ./app/data/ are generated by running the Python scripts outlined further down in this document.
  2. ./config/ directory contains configuration files in JSON format. These are used during the data processing step (via Python scripts) outlined further down in this document.
    1. data-model.json is the main configuration file that contains all the logic for generating the monument study set (see the example sketch after this list).
      1. "fields" key contains a list of all the fields that will be indexed by the search engine
      2. "fieldsForEntities" contains the fields from which PEOPLE and EVENT entities should be extracted
      3. "conditionalFieldsForEntities" is the same as above, but only includes entities that are also part of the object's name
      4. "types" is a list of rules for determining an object's group (e.g. Marker, Building, Monument, etc.) and a monument's type (e.g. pyramid, bust, obelisk, etc.) when applicable.
    2. ingest/ directory contains one JSON file per data source that is processed. Each of these files contains the logic for how the source data set should be parsed and mapped to the study set fields.
  3. ./data/ directory contains all the data (before, during, and after processing):
    1. ./compiled/ contains the output data from the data processing (as a result of running Python scripts outlined below)
    2. ./preprocessed/ contains cached data (e.g. geocoded locations) used during the data processing; this reduces the amount of time it takes to re-run the scripts.
    3. ./vendor/ directory contains all the raw data collected from the data sources. This also contains pre-processed data as a result of custom scripts written for parsing a particular data source (see "./scripts/" description below)
    4. corrections.csv contains manual corrections to the data
    5. entities_add.csv contains entities (PEOPLE or EVENTS) that should be included but were not automatically extracted from the entity recognition process
    6. entities_aliases.csv contains a list of aliases for entities (PEOPLE or EVENTS), e.g. "Cristoforo Colombo" refers to the same person as "Christopher Columbus"
  4. ./lib/ contains custom Python libraries for use in Python scripts
  5. ./scripts/ contains custom scripts that were used to preprocess data sources. This includes scripts for downloading and parsing websites.
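
As referenced above, the sketch below is a rough, hypothetical illustration of the shape of ./config/data-model.json. Only the four top-level keys come from the list above; the field names, and especially the structure of the "types" rules, are made up for illustration and do not reflect the actual configuration:

    {
      "fields": ["Name", "Alternate Name", "City", "State", "Source"],
      "fieldsForEntities": ["Name", "Description"],
      "conditionalFieldsForEntities": ["Alternate Name"],
      "types": [
        {"group": "Monument", "type": "obelisk", "pattern": "obelisk"},
        {"group": "Marker", "type": "plaque", "pattern": "plaque|marker"}
      ]
    }

Refer to the actual data-model.json in ./config/ for the real field list and rule syntax.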

For running scripts

You must use Python 3.x to run the scripts in this repository (developed using 3.6.8). To install requirements:

pip install -r requirements.txt

If you will be doing named entity extraction, you must also download the spaCy language model:

python -m spacy download en_core_web_sm

If you are doing entity linking, you should also unzip ./data/wikidata.zip into ./data/wikidata/. This contains pre-processed Wikidata used for entity linking. If you don't do this, entity linking will be done from scratch and will take a long time to process.
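
For example, from the repository root (assuming the zip archive contains a wikidata/ folder at its top level; adjust the target directory if your archive is laid out differently):

    unzip ./data/wikidata.zip -d ./data/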

Adding new data sources

  1. Create a new .json file in folder ./config/ingest/. You can copy the contents of an existing .json file as a template

    1. You can refer to the existing configuration files in ./config/ingest/ as examples
  2. Run python run.py -entities. This will re-process all the data and update the compiled data files as well as the data for the app. The -entities flag adds a step that searches for new PERSON and EVENT entities in the new data. This can take a while. If you are making only minor tweaks to the data (i.e. nothing that requires reprocessing entities), you can omit that flag.

    1. Behind the scenes this runs a number of scripts sequentially:

      python ingest.py
      python extract_entities.py
      python normalize_entities.py
      python resolve_entities.py
      python visualize_entities.py
      python ingest.py
      
    2. ingest.py is the central script that contains all the logic for transforming the source data into the compiled study set and interface. The next four scripts contain the steps for doing entity recognition and entity linking (to Wikidata entries). ingest.py must be run again after processing entities since entities are used when determining if an object is a monument or not.

    3. You can run each/any of these manually. Note that extract_entities.py takes a very long time and is only needed if you need to re-analyze the data to extract named entities or if you edited data/entities_add.csv (a .csv file that manually adds names to look for in the data). If you only edited data/entities_aliases.csv (alternative names for named entities), you can skip extract_entities.py and start with normalize_entities.py.

  3. Next you will need to index the data for the search interface; you can do this by running:

    python index.py -out "search-index/documents-2020-01-01/"
    

    The output folder name can be anything; I usually use the current date. If you do not pass in a folder name, the output will be written to search-index/documents-latest/ (note, the script will always create a back-up directory at search-index/backup/YYYY-MM-DD-HH-MM/).

    Optionally you can add the path to the previous index output to look for deletions (otherwise, no documents will ever be deleted; only updated or added). You should include this parameter if you made changes that would remove records; otherwise, there will be stale/outdated records in the search index. (If stale records do end up in the index, you can run python index.py -clean and go to the next step.)

    python index.py -out "search-index/documents-2020-01-01/" -prev "search-index/documents-2019-12-01/"
    

    This will generate a number of batch json files in the output directory. Each individual batch file should be under 5MB; otherwise AWS will reject it. You can adjust the batch size by increasing or decreasing the records per batch file, e.g.:

    python index.py -out "search-index/documents-2020-01-01/" -prev "search-index/documents-2019-12-01/" -batchsize 2000
    
  4. If you received no error in the previous step, you can now upload the batch files to the AWS CloudSearch Index. Before you do this, you will need to install AWS CLI and set your credentials (you will need permission to post new CloudSearch documents):

    aws configure --profile monumentlab
    

    Then follow the prompts for entering your key and secret, and use region us-east-1. This will store your credentials under profile "monumentlab". Then you can run the following script to upload the records from the previous step:

    python index_upload.py -in "search-index/documents-2020-01-01/*.json"
    

    If no directory is passed in, search-index/documents-latest/*.json will be uploaded. It may take some time for re-indexing, but it should happen automatically. You can manually refresh the index through the AWS console.

  5. To view the changes locally, you will need to install (only once) and run the node server:

    npm install
    npm start

    1. You can now view the dashboard on localhost:2020/app/map.html
    2. Committing and pushing your changes to the main branch will automatically update the online interface
    3. For debugging purposes, there is also an advanced search interface that exposes the full dataset (before filtering out non-monuments) as well as all the raw fields

Manual corrections

There are three .csv files that track manual corrections to the data:

  1. data/corrections.csv - Manual corrections to any record's field

    • You must provide four things: (1) Record Id (e.g. "osm_3461770102"), (2) Field (e.g. "Entities People"), (3) Correct Value (e.g. "Martin Luther King Jr."), and (4) Action ("set", "add", or "remove"); see the example rows at the end of this section
    • Action "set" will set the field to the "Correct Value", overwriting any existing value. If the value is a list, it should be pipe ( | ) delimited
    • Action "add" will add this value to the existing value, assuming the existing value is a list (e.g. "Entities People", "Subjects", "Object Types")
    • Action "remove" will remove this value from the existing value, assuming the existing value is a list (e.g. "Entities People", "Subjects", "Object Types")
    • Note that if you correct a child record (of a merged record), you must also manually update the parent record (or simply update the parent record alone)
    • If you only edit this file, all you need to run is:
    python ingest.py
    

    Then follow the indexing process starting at step 3 in the previous section.

  2. data/entities_aliases.csv - Alternative names (aliases) of existing people or events

    • E.g. "Gen. U.S. Grant" is an alias of "Ulysses S. Grant"
    • Note this is only for existing names
    • The "target" field should be the "official" spelling of the name as it exists on Wikipedia. The "official" spelling will usually have the highest number on the "People" drop-down list in the interface.
    • If you only edit this file, you need to run:
    python normalize_entities.py
    python resolve_entities.py
    python visualize_entities.py
    python ingest.py
    

    Then follow the indexing process starting at step 3 in the previous section.

  3. data/entities_add.csv - More or less identical to the above file, but this is for names that currently do not exist in the data (were not recognized by the named entity extraction process)

    • If you edit this file, it takes the longest to re-process since we need to re-analyze the entities:
    python extract_entities.py
    python normalize_entities.py
    python resolve_entities.py
    python visualize_entities.py
    python ingest.py
    

    Then follow the indexing process starting at step 3 in the previous section.
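
As referenced in the notes on data/corrections.csv above, the rows below are a hypothetical illustration of two of these files. The column headers are assumptions based on the descriptions in this section (check the actual .csv files for the real headers); the example values are drawn from the examples above.

data/corrections.csv:

    Record Id,Field,Correct Value,Action
    osm_3461770102,Entities People,Martin Luther King Jr.,set

data/entities_aliases.csv:

    alias,target
    Gen. U.S. Grant,Ulysses S. Grant
    Cristoforo Colombo,Christopher Columbus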
