This repository contains scripts for exploring, filtering, depersonalising, and analysing tweet datasets recorded during four livestreamed concerts by BTS in 2021.
Tweets during all four concerts were collected using the Twitter Streaming API. All public tweets were captured, except in cases where rate limits made it impossible to archive all tweets. Data was collected and stored by the Center for an Informed Public at the University of Washington. Streaming APIs capture tweets at the time of posting according to predefined monitoring criteria made up of user ids, keywords, and hashtags. For each captured tweet, the API logged the content and time of the status update, along with information on tweets related by retweet, quote tweet, and reply, and account details for the posting user and the users of related tweets, such as their user id, number of followers, and account language. The Twitter APIs did not capture actions such as favorites or views, but we used the real-time logging of tweet statistics in the streaming API to capture the accumulation of some interactions on posts that were propagating through retweets (RT).
For the two Sowoozoo performances, concert tweets were collected in real time using the concert hashtag exclusively. After capturing all public tweets that included the hashtag #SOWOOZOO (ignoring case) in the days around the 2021 Muster, this collection was cut down to an interval of 3.5 hours around each concert, leaving 222677 and 111512 status updates respectively. While not all concert-related tweets used this concert hashtag, this method of sampling captures nearly complete retweet trajectories for content intentionally associated with the performances, giving a detailed view of network activity within an interested segment of Twitter users.
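The hashtag-and-interval cut described above can be illustrated with a minimal sketch. This is not the code used in the notebooks: the function name `cut_to_window`, the field names `text` and `created_at`, the sample rows, and the choice to start the window at a given `start` time are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def cut_to_window(tweets, hashtag, start, hours=3.5):
    """Keep tweets containing `hashtag` (ignoring case) posted in [start, start + hours)."""
    tag = hashtag.lower()
    end = start + timedelta(hours=hours)
    return [
        t for t in tweets
        if tag in t["text"].lower() and start <= t["created_at"] < end
    ]

# Made-up example rows and a placeholder start time; not taken from the datasets.
start = datetime(2021, 1, 1, 12, 0)
tweets = [
    {"text": "so excited #Sowoozoo", "created_at": datetime(2021, 1, 1, 12, 30)},
    # Posted after the window closes (12:00 + 3.5 h = 15:30), so it is dropped.
    {"text": "#SOWOOZOO encore!", "created_at": datetime(2021, 1, 1, 15, 45)},
    # Inside the window but without the hashtag, so it is dropped.
    {"text": "no hashtag here", "created_at": datetime(2021, 1, 1, 13, 0)},
]
kept = cut_to_window(tweets, "#SOWOOZOO", start)
```

Note that this sketch uses a plain substring match, so a longer hashtag that begins with the same characters would also pass; the matching applied by the streaming API at collection time may differ.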
For the Permission to Dance concerts, the concert hashtags were not used to a comparable degree. Instead of tweets collected with the concert hashtag, we sampled Twitter activity from a previously established stream. In 2020, a project at the Center for an Informed Public had defined a select population of BTS-oriented Twitter users to monitor continuously for activity within this region of Twitter: a partially random selection of users who followed some key accounts within the ARMY network, had posting histories focused on Kpop, and met a minimum rate of posting activity. This group of accounts formed the core of the Stream samples, along with the inconsistently applied concert hashtags and some keywords from other projects. To check whether the tweets captured on this stream during the performances were principally concert-related, we also collected tweets on the same stream on two non-concert days, a week before and a week after the PTD On Stage performance (Alt1 and Alt2). Differences in tweet rates and rate change patterns between the concert and non-performance intervals suggest that the majority of tweets captured during the concert broadcasts were motivated by or related to the live streams.
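As a rough illustration of the rate comparison between a concert interval and a non-concert interval, the sketch below computes a mean tweets-per-minute figure for each. The function name `mean_rate_per_minute`, the timestamps, and the two-minute interval are hypothetical; the analysis in the paper examines rate change patterns in more detail than a single mean.

```python
from datetime import datetime, timedelta

def mean_rate_per_minute(timestamps, start, end):
    """Number of tweets posted in [start, end), divided by the interval length in minutes."""
    minutes = (end - start).total_seconds() / 60
    n = sum(1 for ts in timestamps if start <= ts < end)
    return n / minutes

# Made-up timestamps: a "concert" interval and the same clock interval a week earlier.
concert_start = datetime(2021, 1, 8, 12, 0)
alt_start = concert_start - timedelta(days=7)
length = timedelta(minutes=2)

concert = [concert_start + timedelta(seconds=s) for s in (0, 10, 20, 65, 70, 110)]
alt_day = [alt_start + timedelta(seconds=s) for s in (5, 90)]

concert_rate = mean_rate_per_minute(concert, concert_start, concert_start + length)
alt_rate = mean_rate_per_minute(alt_day, alt_start, alt_start + length)
```

With these invented values, the concert interval averages three tweets per minute against one per minute on the comparison day.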
This repo contains notebooks for processing these tweet datasets:
- Tweet_content_review.ipynb: a notebook demonstrating the code to explore tweet content and user bios using keywords. It was first used on the raw datasets to identify users who had explicitly requested that their tweet content not be used off the platform; these users were then filtered out of subsequent analysis. It also contains an exploration of COVID-related terms in tweet text, with records of example mentions as well as sample code.
- Depersonalising_data.ipynb: a notebook to clean the initial tweet datasets collected by API, first filtering out users requesting exclusion (12), then filtering out tweets by or responding to official accounts, then depersonalising the fan tweets to generate the fan_tweet_X_reduced.csv files in the data folder, and finally depersonalising the subsets of tweets used for content analysis.
- Paper_Plots_Anon.ipynb: scripts to generate the plots included in the paper Audience Reconstructed (under revision) from the depersonalised datasets of tweets and other metadata csv files in the data folder.
- twt.py and twt_red.py: function files for processing the tweets. twt.py is structured to run on CSV files generated directly from the databases of Twitter API data. twt_red.py contains mostly the same functions, adapted to work with the reduced and depersonalised versions of the tweet datasets found in the data folder.
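
The exclusion and depersonalising steps described for the notebooks above can be sketched roughly as follows. This is an assumption-laden illustration, not the functions in twt.py or the notebooks: the names `filter_opt_outs` and `depersonalise`, the field names, the opt-out phrase, and the choice of which columns to keep are all hypothetical.

```python
def filter_opt_outs(tweets, phrases):
    """Drop every tweet by a user whose bio contains any opt-out phrase (ignoring case)."""
    lowered = [p.lower() for p in phrases]
    excluded = {
        t["user_id"] for t in tweets
        if any(p in t["user_bio"].lower() for p in lowered)
    }
    return [t for t in tweets if t["user_id"] not in excluded]

def depersonalise(tweets):
    """Replace user ids with arbitrary labels and keep only non-identifying fields."""
    labels = {}
    reduced = []
    for t in tweets:
        # The same user always receives the same label, in order of first appearance.
        label = labels.setdefault(t["user_id"], f"user_{len(labels) + 1}")
        reduced.append({"user": label, "created_at": t["created_at"]})
    return reduced

# Made-up rows; the opt-out phrase below is an invented example.
raw = [
    {"user_id": "a", "user_bio": "Please DO NOT USE my tweets", "created_at": "t1", "text": "hi"},
    {"user_id": "b", "user_bio": "ARMY", "created_at": "t2", "text": "wow"},
    {"user_id": "c", "user_bio": "", "created_at": "t3", "text": "encore"},
    {"user_id": "b", "user_bio": "ARMY", "created_at": "t4", "text": "again"},
]
fan_tweets = filter_opt_outs(raw, ["do not use my tweets"])
reduced = depersonalise(fan_tweets)
```

In this sketch the tweet text and bio are dropped entirely from the reduced rows; which fields are actually retained in the fan_tweet_X_reduced.csv files is determined by Depersonalising_data.ipynb.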
The folders contain:
- data/: CSV files of the depersonalised tweet datasets, the event timing list for each of the concerts (_setlist.csv), and the depersonalised subsets of tweets with assigned content codes.
- plots/: figures generated by Paper_Plots_Anon.ipynb for the paper.
Related supplementary materials and data are available at: https://doi.org/10.6084/m9.figshare.24260452