Tom R edited this page May 11, 2020 · 15 revisions

Havelovewilltravel documentation

Havelovewilltravel is a project to extract, automatically aggregate (deduplication, data normalisation, ...) and manually clean ConcertAnnouncements from gigfinder platforms such as Facebook, Songkick, Bandsintown and Setlist.fm. Artists and their references to these gigfinder platforms are maintained via MusicBrainz.

These gigfinder platforms contain large numbers of ConcertAnnouncements. Some announcements are duplicated across platforms, because artists, their management and the concert halls each promote the same event. Havelovewilltravel makes a best effort to automatically de-duplicate and normalise such ConcertAnnouncements into Concerts, each of which may then refer to one or more ConcertAnnouncements.
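As a rough illustration of what folding ConcertAnnouncements into Concerts could look like, the sketch below groups announcements that share an artist and a date. The grouping key, the `Announcement` tuple and the function name are assumptions made for this example only; the project's actual rules are documented elsewhere and are likely more involved.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, NamedTuple, Tuple


class Announcement(NamedTuple):
    """Hypothetical, simplified stand-in for a ConcertAnnouncement."""
    gigfinder: str     # platform the announcement was harvested from
    artist_mbid: str   # MusicBrainz artist ID
    day: date          # date of the announced concert


def group_into_concerts(
    announcements: List[Announcement],
) -> Dict[Tuple[str, date], List[Announcement]]:
    """Group announcements sharing (artist, day); each group is one candidate Concert."""
    concerts = defaultdict(list)
    for announcement in announcements:
        concerts[(announcement.artist_mbid, announcement.day)].append(announcement)
    return dict(concerts)


# Placeholder data: "mbid-a" and "mbid-b" are not real MusicBrainz IDs.
announcements = [
    Announcement("songkick", "mbid-a", date(2020, 5, 11)),
    Announcement("bandsintown", "mbid-a", date(2020, 5, 11)),
    Announcement("songkick", "mbid-b", date(2020, 5, 12)),
]
concerts = group_into_concerts(announcements)
```

Here the two announcements for the same artist on the same day end up in a single group, while the third announcement forms its own.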

At the same time, other information, namely Venue, Organisation and Location data, is also subjected to automatic best-effort rules to maintain data quality. Despite these automatic rules, some manual quality assurance remains necessary.

The speed and volume at which this information is aggregated make it impractical to maintain the data in spreadsheets. Therefore, we are developing a Data Management Tool for ConcertAnnouncements.

"hlwtadmin" is a Django (Python) web application that provides

  • the automatic extraction of concert announcements (via APIs and screenscraping) from gigfinder websites,
  • a rule engine to improve data quality semi-automatically, and
  • a dashboard and tooling for manual Quality Assurance.

Future developments should include

  • decoupling the rule engine from the interface
  • a Linked Open Data (LOD) endpoint
  • JSON-LD schema markup in the HTML pages
  • a REST API

Model philosophy

The model consists of two intertwined levels. At the bottom, we have a "low level" data model that stays as close as possible to the ConcertAnnouncement data that we capture from data-providing websites such as Songkick or Bandsintown. At the top, we have a "high level" model that aims to represent clean data.

Low level model

To describe the raw data as it comes in from the gigfinder websites, we only need a very simple model with three entities at its core:

  • ConcertAnnouncement: a simplified schema for capturing the temporal information of a concert and linking it to an artist and a venue
  • Artist: basically a copy of a MusicBrainz Artist ID
  • Venue: a string representation of the raw venue information

Each ConcertAnnouncement is related to exactly one Artist and one Venue, through a ForeignKey relation.
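The low-level entities can be sketched in a few lines. The example below uses plain Python dataclasses rather than Django models so that it runs on its own; object references stand in for the ForeignKey relations. All field names (`mbid`, `raw_venue`, `start_date`, `gigfinder`, ...) are assumptions for illustration and may differ from the actual hlwtadmin models.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Artist:
    # Essentially a copy of a MusicBrainz artist ID, plus a display name.
    mbid: str
    name: str


@dataclass(frozen=True)
class Venue:
    # The venue as reported by the gigfinder, kept as one raw string.
    raw_venue: str


@dataclass
class ConcertAnnouncement:
    # Temporal information plus exactly one artist and one venue.
    title: str
    start_date: date
    artist: Artist    # stands in for a ForeignKey to Artist
    venue: Venue      # stands in for a ForeignKey to Venue
    gigfinder: str    # hypothetical field: the platform the data came from
    end_date: Optional[date] = None


# Placeholder values: the MBID and the venue string are made up.
artist = Artist(mbid="00000000-0000-0000-0000-000000000000", name="Hi Hawaii")
venue = Venue(raw_venue="Example Club, Brussels, Belgium")
announcement = ConcertAnnouncement(
    title="Hi Hawaii at Example Club",
    start_date=date(2020, 5, 11),
    artist=artist,
    venue=venue,
    gigfinder="songkick",
)
```

Keeping Venue as a single raw string mirrors the idea that this level stores what the source site reported, without interpretation.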

High level model

To keep track of the source of the low-level ConcertAnnouncement, we need some additional models. The philosophy behind the higher-level data model is inspired by MusicBrainz: we work with a small number of core entities, and relations are possible between all of them.

The core entities are

  • Artist: which is the same as the Artist model for the low-level data
  • Concert: a thin model for holding temporal information about a concert, and which serves as a crossroads for the relations (see below)
  • Organisation: a model to hold information about venues, festivals, arenas, etc. (these types of organisations are available as Organisation Types)
  • Location: the geographical information. For reasons of normalisation, we also employ a Country model.

The core entities are related to one another via separate relation tables.

Relations between Concerts and Artists/Organisations contain a field "credited as". This is useful for expressing that an Artist with an "official name" X performs at a certain concert as Y, e.g. "Hi Hawaii" performed a concert as "Geroezemoes".
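A minimal sketch of such a relation with a "credited as" field follows. It uses plain dataclasses instead of Django models so that it is self-contained; the class name `RelationConcertArtist`, the field `credited_as` and the helper `display_name` are assumed names for this example, not necessarily those used in hlwtadmin.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Artist:
    name: str  # the artist's "official name"


@dataclass(frozen=True)
class Concert:
    title: str


@dataclass(frozen=True)
class RelationConcertArtist:
    """One row of a hypothetical Concert-Artist relation table."""
    concert: Concert
    artist: Artist
    credited_as: Optional[str] = None  # name used for this concert, if different

    def display_name(self) -> str:
        # Prefer the credited name; fall back to the official artist name.
        return self.credited_as or self.artist.name


hi_hawaii = Artist(name="Hi Hawaii")
concert = Concert(title="Example concert")  # placeholder title
relation = RelationConcertArtist(concert, hi_hawaii, credited_as="Geroezemoes")
print(relation.display_name())
```

Storing the credited name on the relation, rather than on the Artist, keeps the official name intact while still recording how the artist was billed for one specific concert.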

Relation between low and high level models

These two models do not live independently from each other. Several links exist, foremost:

  • Artist: this model is shared between the two levels.

Then there are a number of ForeignKey relations:

  • ConcertAnnouncement > Concert: each ConcertAnnouncement should be related to exactly one Concert through a foreign key.
  • Venue > Organisation: a venue as reported on the gigfinder website should resolve to exactly one Organisation through a foreign key.
  • Venue > Location: a venue as reported on the gigfinder website should also be related to exactly one Location through a foreign key.
  • Organisation > Location: an organisation is located at exactly one Location. Because a Venue and an Organisation can both point to the same Location, this shared relation helps the automated deduplication, which is explained elsewhere.
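To make the role of the shared Location concrete, the sketch below models the Venue and Organisation links with plain dataclasses and proposes candidate Organisations for a raw Venue by comparing Locations. The helper `candidate_organisations`, the field names and the sample data are assumptions for illustration; this is not the project's actual deduplication rule.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Location:
    city: str
    country: str


@dataclass
class Organisation:
    name: str
    location: Location  # an organisation is located at exactly one Location


@dataclass
class Venue:
    raw_venue: str
    # Optional only because a freshly harvested venue may not be resolved yet.
    location: Optional[Location] = None
    organisation: Optional[Organisation] = None


def candidate_organisations(
    venue: Venue, organisations: List[Organisation]
) -> List[Organisation]:
    """Return the organisations that share the venue's Location."""
    if venue.location is None:
        return []
    return [org for org in organisations if org.location == venue.location]


# Made-up sample data.
brussels = Location(city="Brussels", country="Belgium")
ghent = Location(city="Ghent", country="Belgium")
organisations = [Organisation("Example Hall", brussels), Organisation("Other Club", ghent)]
venue = Venue(raw_venue="example hall (Brussels)", location=brussels)

candidates = candidate_organisations(venue, organisations)
```

Narrowing the search to Organisations at the same Location leaves a much shorter list for a matching rule or a human reviewer to check.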

Model documentation

Data model

Relations

Semantics

Merge functionalities

Automation

QA Lists and procedures

Batch operations

Development
