Skip to content
Tom R edited this page May 20, 2020 · 15 revisions

Havelovewilltravel documentation

Havelovewilltravel is a project to extract, automatically aggregate (deduplication, data normalisation, ...) and manually clean ConcertAnnouncements from gigfinder platforms such as Facebook, SongKick, BandsInTown and Setlist.fm. Artists and their references to these gigfinder platforms are maintained via Musicbrainz.

The aforementioned gigfinder platforms contain ConcertAnnouncements in great amounts. Some of the announcements are duplicated across the platforms as artists, their management or the concert halls promote the event. Havelovewilltravel does a best effort to automatically de-duplicate and normalise such ConcertAnnouncements into Concerts, which may then refer to (multiple) ConcertAnnouncement(s).

Simultaneously, other information is also subjected to automatic best efforts to maintain data quality, i.e. Venue, Organisation and Location information. However, despite these automatic data quality rules, some manual quality assurance is needed.

The speed and quantity with which the information can be aggregated prohibits us from maintaining the data in spreadsheets. Therefore, we are developing a Data Management Tool for ConcertAnnouncements.

"hlwtadmin" is a Python on Django web application that handles

  • the automatic extraction of concert announcements (via APIs and screenscraping) from gigfinder websites,
  • a rule engine to improve data quality semi-automatically, and
  • a dashboard and tooling for manual Quality Assurance.

Future developments should envisage

  • decoupling the rule engine from the interface
  • LOD endpoint
  • JSON-LD schema in html pages
  • REST-API

Model philospophy

The model consists of two levels that are intertwined. At the bottom, we have a "low leve" data model that is as close as possible to the concertannouncement data that we can capture from the data providing websites such as Songkick or Bandsintown. At the top, we have a "high level" model that aims to model clean data.

Low level model

To describe the raw data as it comes in from the gigfinder websites, we only need a very simple model, with at its core, these three entities:

  • ConcertAnnouncement: a simplified schema for capturing the temporal information of a concert, and link it to an artist and a venue
  • Artist: basically a copy of a MusicBrainz Artist ID
  • Venue: a string representation of the raw venue information

Hereby, a ConcertAnnouncement is related to one Artist and one Venue, through a ForeignKey Relation.

High level model

To keep track of the source of the low-level ConcertAnnouncement, we need some additional models. The philosophy behind the higher level data model is inspired by Musicbrainz. We work with a number of core entities, and there are relations possible between all core entities.

The core entities are

  • Artist: which is the same as the Artist model for the low-level data
  • Concert: a thin model for holding temporal information about a concert, and which serves as a crossroads for the relations (see below)
  • Organisation: a model to hold information about venues, festivals, arenas, etc. (these types of organisations are available as Organisation Types)
  • Location: the geographical information. For reasons of normalisation, we also employ a Country model.

The core entities are among each other related via seperate tables:

Relations between Concerts and Artists/Organisations contain a field "credited as". This is useful for expressing that an Artist with an "official name" X performs at a certain concert as Y, e.g. "Hi Hawaii" performed a concert as "Geroezemoes".

Relation between low and high level models

These two levels do not live independently from each other. Several links exist, foremost:

  • Artist: this model is shared between the two levels.

Then there are a number of ForeignKey relations:

Model visualisation

Below, you find a simplified ER diagram of the application, which visualizes the structure explained above.

TODO: added until_date to concert announcement and concert, and a is_festival boolean to concertannouncement.

The code for this visualisation is as follows:

Table GigFinder {
  id int [pk]
  name varchar
  base_url varchar
  api_key varchar
}

Table Artist {
  mbid varchar [pk]
  name varchar
  disambiguation int
  include bool
  genre_id int [ref: > Genre.id]
}

Table GigFinderUrl {
  id int [pk]
  gigfinder_id int [ref: > GigFinder.id]
  artist_id int [ref: > Artist.mbid]
  url varchar
  last_confirmed_by_musicbrainz datetime
  last_synchronized datetime
}

Table ConcertAnnouncement {
  id int [pk]
  title varchar
  date date
  time time
  gigfinder_id int [ref: > GigFinder.id]
  raw_venue_id int [ref: > Venue.id]
  artist_id int [ref: > Artist.mbid]
  gigfinder_concert_id int
  concert_id int [ref: > Concert.id]
  last_seen_on datetime
  seen_count int
  
  ignore bool
  created_at datetime
  updated_at datetime
  latitude float
  longitude float
}

Table Venue {
  id int [pk]
  raw_venue varchar
  raw_location varchar
  organisation_id int [ref: > Organisation.id]
  non_assignable bool
}

Table Concert {
  id int [pk]
  title varchar
  date date
  time time
  cancelled bool
  verified bool
  ignore bool
  created_at datetime
  updated_at datetime
  genre_id int [ref: > Genre.id]
  latitude float
  longitude float
  annotation varchar
}

Table Organisation {
  id int [pk]
  name varchar
  disambiguation varchar
  latitude float
  longitude float
  location_id int [ref: > Location.id]
  organisation_type_id int [ref: > OrganisationType.id]
  start_date datetime
  start_date_precision int
  end_date datetime
  end_date_precision int
  website varchar
  verified bool
  genre_id int [ref: > Genre.id]
  active bool
  capacity varchar
  annotation varchar
  address varchar
}

Table OrganisationType {
  id int [pk]
  name varchar
}

Table Location {
  id int [pk]
  city varchar
  zipcode varchar
  latitude float
  longitude float
  country_id int [ref: > Country.id]
  subcountry varchar
  verified bool
}

Table RelationConcertArtist {
  id int [pk]
  artist_id int [ref: > Artist.mbid]
  artist_credited_as varchar
  concert_id int [ref: > Concert.id]
  relation_type_id int [ref: > RelationConcertArtistType.id]
}

Table RelationConcertArtistType {
  id int [pk]
  name varchar
}

Table RelationConcertOrganisation {
  id int [pk]
  relation_type_id int [ref: > RelationConcertOrganisationType.id]
  concert_id int [ref: > Concert.id]
  organisation_id int [ref: > Organisation.id]
  organisation_credited_as varchar
  
  verified bool
}

Table RelationConcertOrganisationType {
  id int [pk]
  name varchar
}

Table RelationArtistArtist {
  artist_a int [ref: > Artist.mbid]
  artist_b int [ref: > Artist.mbid]
  start_date date
  start_date_precision int
  end_date date
  end_date_precision int
  relation_type_id int [ref: > RelationArtistArtistType.id]
}

Table RelationArtistArtistType {
  id int [pk]
  name varchar
}

Table RelationOrganisationOrganisation {
  organisation_a int [ref: > Organisation.id]
  organisation_b int [ref: > Organisation.id]
  start_date date
  start_date_precision int
  end_date date
  end_date_precision int
  relation_type_id int [ref: > RelationOrganisationOrganisationType.id]
}

Table RelationOrganisationOrganisationType {
  id int [pk]
  name varchar
}

Table RelationConcertConcert {
  concert_a int [ref: > Concert.id]
  concert_b int [ref: > Concert.id]
  start_date date
  start_date_precision int
  end_date date
  end_date_precision int
  relation_type_id int [ref: > RelationConcertConcertType.id]
}

Table RelationConcertConcertType {
  id int [pk]
  name varchar
}

Table Country {
  id int [pk]
  name varchar
  iso_code varchar
}

Table Genre {
  id int [pk]
  name varchar
}

Data model

Relations

Semantics

Merge functionalities

Automation

QA Lists and procedures

Batch operations

Development

Clone this wiki locally