Home
Havelovewilltravel is a project to extract, automatically aggregate (deduplication, data normalisation, ...) and manually clean ConcertAnnouncements from gigfinder platforms such as Facebook, Songkick, Bandsintown and Setlist.fm. Artists and their references to these gigfinder platforms are maintained via MusicBrainz.
These gigfinder platforms contain large numbers of ConcertAnnouncements. Some announcements are duplicated across platforms, because artists, their management or the concert halls each promote the same event. Havelovewilltravel makes a best effort to automatically de-duplicate and normalise such ConcertAnnouncements into Concerts, each of which may then refer to one or more ConcertAnnouncements.
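As a minimal sketch of what such de-duplication could look like, the snippet below groups announcements that share an artist, a date and a normalised venue string. The `Announcement` record, the `normalise_venue` helper and the grouping key are assumptions made for illustration; they are not the rules that hlwtadmin actually applies.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Announcement:
    # Hypothetical, simplified record; not the project's actual schema.
    source: str        # gigfinder platform, e.g. "songkick"
    artist_mbid: str   # MusicBrainz Artist ID (placeholder values below)
    day: date
    raw_venue: str     # venue string as captured from the platform


def normalise_venue(raw: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(raw.lower().split())


def group_into_concerts(announcements):
    """Group announcements sharing artist, date and normalised venue (assumed key)."""
    concerts = defaultdict(list)
    for announcement in announcements:
        key = (
            announcement.artist_mbid,
            announcement.day,
            normalise_venue(announcement.raw_venue),
        )
        concerts[key].append(announcement)
    return dict(concerts)


announcements = [
    Announcement("songkick", "mbid-1", date(2020, 3, 14), "Example  Hall"),
    Announcement("bandsintown", "mbid-1", date(2020, 3, 14), "example hall"),
    Announcement("songkick", "mbid-1", date(2020, 3, 15), "Example Hall"),
]
concerts = group_into_concerts(announcements)
```

Here the first two announcements end up under the same key and would map to one Concert, while the third (a different date) forms its own group. A real rule set would need to be considerably more tolerant of spelling and date differences.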
Other information, namely Venue, Organisation and Location data, is likewise subjected to automatic best-effort rules to maintain data quality. Despite these automatic rules, some manual quality assurance remains necessary.
The speed and volume at which the information is aggregated make it impractical to maintain the data in spreadsheets. Therefore, we are developing a Data Management Tool for ConcertAnnouncements.
"hlwtadmin" is a Python web application built on Django that provides
- the automatic extraction of concert announcements (via APIs and screenscraping) from gigfinder websites,
- a rule engine to improve data quality semi-automatically, and
- a dashboard and tooling for manual Quality Assurance.
Planned future developments include
- decoupling the rule engine from the interface
- a Linked Open Data (LOD) endpoint
- JSON-LD schema markup in the HTML pages
- a REST API
The model consists of two intertwined levels. At the bottom, we have a "low level" data model that stays as close as possible to the ConcertAnnouncement data that we can capture from the data-providing websites such as Songkick or Bandsintown. At the top, we have a "high level" model that aims to represent clean data.
To describe the raw data as it comes in from the gigfinder websites, we only need a very simple model with three entities at its core:
- ConcertAnnouncement: a simplified schema for capturing the temporal information of a concert and linking it to an artist and a venue
- Artist: basically a copy of a MusicBrainz Artist ID
- Venue: a string representation of the raw venue information
A ConcertAnnouncement is related to one Artist and one Venue, each through a ForeignKey relation.
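The three low-level entities can be pictured with a small sketch. It uses plain Python dataclasses rather than Django models so that it runs without a configured project; object references stand in for the ForeignKey relations, and all field names (`mbid`, `raw_venue`, `day`, ...) are assumptions for illustration, not the actual hlwtadmin schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Artist:
    # Basically a copy of a MusicBrainz Artist ID (field names are assumed).
    mbid: str
    name: str


@dataclass(frozen=True)
class Venue:
    # String representation of the raw venue information.
    raw_venue: str


@dataclass
class ConcertAnnouncement:
    # Simplified schema: temporal information plus one artist and one venue.
    title: str
    day: date
    artist: Artist   # stands in for a ForeignKey to Artist
    venue: Venue     # stands in for a ForeignKey to Venue


artist = Artist(mbid="mbid-1", name="Example Artist")
venue = Venue(raw_venue="Example Hall, Ghent, Belgium")
announcement = ConcertAnnouncement(
    title="Example Artist live",
    day=date(2020, 3, 14),
    artist=artist,
    venue=venue,
)
```

In a Django implementation, `artist` and `venue` would typically be declared as `models.ForeignKey` fields on the ConcertAnnouncement model.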
To keep track of the source of the low-level ConcertAnnouncement, we need some additional models.
The philosophy behind the high-level data model is inspired by MusicBrainz: we work with a number of core entities, and relations are possible between all core entities.
The core entities are
- Artist: the same model as the Artist in the low-level data
- Concert: a thin model that holds temporal information about a concert and serves as a crossroads for the relations (see below)
- Organisation: a model to hold information about venues, festivals, arenas, etc. (these types of organisations are available as Organisation Types)
- Location: the geographical information. For reasons of normalisation, we also employ a Country model.
The core entities are related to each other via separate tables:
- RelationArtistArtist: to express artist to artist relations, e.g. type "also performs as"
- RelationConcertArtist: to express concert to artist relations, e.g. type "main artist"
- RelationConcertOrganisation: to express concert to organisation relations, e.g. type "was held at"
- RelationConcertConcert: to express concert to concert relations, e.g. type "support of"
Relations between Concerts and Artists/Organisations contain a field "credited as". This is useful for expressing that an Artist with an "official name" X performs at a certain concert as Y, e.g. "Hi Hawaii" performed a concert as "Geroezemoes".
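The "credited as" field can be illustrated with a small sketch. The class below is a hypothetical stand-in for the RelationConcertArtist table, written as a plain dataclass; the field names and the `display_name` helper are assumptions, not part of the actual model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RelationConcertArtist:
    # Hypothetical stand-in for the relation table between Concert and Artist.
    concert_title: str
    artist_name: str                    # the artist's "official name"
    relation_type: str                  # e.g. "main artist"
    credited_as: Optional[str] = None   # name used at this particular concert

    def display_name(self) -> str:
        """Return the credited name if present, else the official name."""
        return self.credited_as or self.artist_name


credited = RelationConcertArtist(
    concert_title="Example concert",
    artist_name="Hi Hawaii",
    relation_type="main artist",
    credited_as="Geroezemoes",
)
uncredited = RelationConcertArtist(
    concert_title="Another example concert",
    artist_name="Hi Hawaii",
    relation_type="main artist",
)
```

With this shape, `credited.display_name()` yields the name under which the artist performed, while `uncredited.display_name()` falls back to the official name.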
These two models do not live independently from each other. Several links exist, foremost:
- Artist: this model is shared between the two levels.
Then there are a number of ForeignKey relations:
- ConcertAnnouncement > Concert: each ConcertAnnouncement should be related to exactly one Concert through a foreign key.
- Venue > Organisation: a venue as reported on the gigfinder website should resolve to exactly one Organisation through a foreign key.
- Venue > Location: a venue as reported on the gigfinder website should also be related to exactly one Location through a foreign key.
- Organisation > Location: an organisation is located at exactly one Location. The fact that both an Organisation and a Venue relate to a Location is helpful for the automated deduplication, explained elsewhere.
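One way the shared Location could support automation, offered purely as an assumption and not as the rule hlwtadmin implements, is to narrow down candidate Organisations for an unresolved Venue to those at the same Location. The classes and the `candidate_organisations` helper below are hypothetical, and object references again stand in for foreign keys.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Location:
    city: str
    country: str


@dataclass(frozen=True)
class Organisation:
    name: str
    location: Location   # stands in for the Organisation > Location foreign key


@dataclass
class Venue:
    raw_venue: str
    location: Optional[Location] = None          # Venue > Location
    organisation: Optional[Organisation] = None  # Venue > Organisation


def candidate_organisations(
    venue: Venue, organisations: List[Organisation]
) -> List[Organisation]:
    """Return organisations at the venue's location whose name occurs in the raw venue string."""
    if venue.location is None:
        return []
    raw = venue.raw_venue.lower()
    return [
        organisation
        for organisation in organisations
        if organisation.location == venue.location
        and organisation.name.lower() in raw
    ]


ghent = Location(city="Ghent", country="Belgium")
leuven = Location(city="Leuven", country="Belgium")
organisations = [
    Organisation(name="Example Hall", location=ghent),
    Organisation(name="Example Hall", location=leuven),
]
venue = Venue(raw_venue="Example Hall, Ghent, Belgium", location=ghent)
matches = candidate_organisations(venue, organisations)
```

Two organisations share the same name here, but only the one in Ghent matches the venue's Location, so the name alone would not have been enough to resolve the Venue.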
Data model
- Concert
- Artist
- ConcertAnnouncement
- Country
- Genre
- GigFinder
- GigFinderURL
- Location
- Organisation
- Organisation types model
- Venue
Relations
- Relation Concert - Artist
- Relation Concert - Organisation
- Relation Concert - Concert
- Relation Artist - Artist
- Relation Organisation - Organisation
- Relation Organisation - External Identifier
Semantics
- Organisation types semantics
- Concert - Organisation semantics
- Concert - Artist semantics
- Artist - Artist semantics
- Concert - Concert semantics
- Organisation - Organisation semantics
Merge functionalities
Automation
- Venue to Organisation
- Venue to Location
- Concert Announcement to Concert
- Delete concerts and ignore ConcertAnnouncements
- Delete concerts and delete ConcertAnnouncements
QA Lists and procedures
- Concerts without Organisations
- Concerts without Artists
- Organisations without Concerts
- Concerts with multiple organisations in different locations
- Organisations without latitude and longitude
- Organisations without genre
- Organisations without disambiguation
- Concerts without latitude and longitude
- Concerts without genre
- Concerts without titles
- Artists without genre
- Announcements wo/ concert
- Concerts wo/ announcements
- Venues wo/ organisation
- Organisations wo/ venues
- Organisations without locations
- Organisations without concerts
- Cities without countries
Batch operations
Development