Resources data cleaning

Britta edited this page May 30, 2024 · 1 revision

Data is easier to process and search if it's tidy.

Manual QA:

  • Sort each column A-Z and Z-A to see if anything is missing or inconsistent.
  • Are there any "null"s that shouldn't be nulls?
  • Run the data through a broken link checker. Are there any broken links?
  • Are there any HTTP links? All links should be HTTPS.
  • Check for links to non-.gov websites. Is it a legitimate government website or publication? For example, is it an official CMS site run by contractors, such as PASRR Technical Assistance Center or ResDAC?
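The HTTPS and non-.gov checks above can be partly automated before the manual review. The following is a minimal sketch, assuming the links are available as a plain list of URL strings; the function name `flag_links` and the sample URLs are hypothetical. It only flags candidates. Deciding whether a non-.gov site is legitimate still needs a person.

```python
from urllib.parse import urlsplit


def flag_links(urls):
    """Return (insecure, non_gov) lists of URLs that need manual review."""
    insecure, non_gov = [], []
    for url in urls:
        parts = urlsplit(url.strip())
        host = (parts.hostname or "").lower()
        # urlsplit lowercases the scheme, so "HTTP://" is caught too.
        if parts.scheme == "http":
            insecure.append(url)
        # Anything not ending in .gov is flagged for a human to check.
        if not host.endswith(".gov"):
            non_gov.append(url)
    return insecure, non_gov


# Hypothetical sample data for illustration only.
sample = [
    "https://www.example.gov/page",
    "http://data.example.gov/file",
    "https://example.org/report",
]
insecure, non_gov = flag_links(sample)
print("HTTP links:", insecure)
print("Non-.gov links:", non_gov)
```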

Remove:

  • Duplicate items (multiple items with the same URL)
  • Hidden Unicode control characters
  • Newlines
  • Double spaces
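The removals above could be scripted along these lines. This is a sketch under stated assumptions, not the project's actual tooling: it assumes each resource is a dict with a `"url"` key, and the names `clean_text` and `drop_duplicate_urls` are hypothetical. "Hidden Unicode control characters" is interpreted here as the Unicode categories Cc (control) and Cf (format), which also covers zero-width joiners, so review the results if the data contains emoji or scripts that rely on them.

```python
import re
import unicodedata


def clean_text(value):
    """Remove newlines, hidden control characters, and double spaces."""
    chars = []
    for ch in value:
        if ch in "\r\n\t":
            # Turn line breaks and tabs into spaces so words stay separated.
            chars.append(" ")
        elif unicodedata.category(ch) in ("Cc", "Cf"):
            # Drop control and format characters (e.g. zero-width space).
            continue
        else:
            chars.append(ch)
    # Collapse runs of spaces, then trim the ends.
    return re.sub(r" {2,}", " ", "".join(chars)).strip()


def drop_duplicate_urls(items):
    """Keep only the first item seen for each URL."""
    seen = set()
    unique = []
    for item in items:
        url = item["url"].strip()
        if url in seen:
            continue
        seen.add(url)
        unique.append(item)
    return unique
```

For example, `clean_text("Nursing\u200b home\n  data ")` returns `"Nursing home data"`.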

Consider whether to systematically re-process:

  • Curly quotes
  • Curly apostrophes
  • Em dashes and en dashes (may also need to make sure they have spaces around them, to ensure searchability)
  • Copyright and registered trademark symbols (© and ®)
  • Section symbols
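If the team decides to re-process these characters, one possible approach is sketched below. The choices here are assumptions, not decisions recorded on this page: curly quotes and apostrophes are mapped to straight ones, em and en dashes are kept but padded with single spaces, and copyright, registered trademark, and section symbols are only counted so their frequency can inform the decision. The names `normalize_punctuation` and `count_symbols` are hypothetical.

```python
import re

# Curly quotes and apostrophes mapped to straight equivalents (an assumption).
QUOTE_MAP = str.maketrans({
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
})

# En dash (U+2013) or em dash (U+2014) with any surrounding whitespace.
DASHES = re.compile(r"\s*([\u2013\u2014])\s*")

# Copyright, registered trademark, and section symbols.
SYMBOLS = "\u00a9\u00ae\u00a7"


def normalize_punctuation(text):
    """Straighten quotes and put exactly one space on each side of dashes."""
    text = text.translate(QUOTE_MAP)
    return DASHES.sub(r" \1 ", text)


def count_symbols(text):
    """Count the symbols present in text, to help decide how to handle them."""
    return {ch: text.count(ch) for ch in SYMBOLS if ch in text}
```

Note that padding dashes also changes number ranges written with an en dash, so a range like 2019-2020 typed with an en dash gains spaces around it.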