Abstract - YouTube, the world-famous content-sharing platform, maintains a list of trending videos that is updated every day. Videos in many different categories are published every minute, and the platform tracks user interactions (number of views, shares, comments and likes) with every video. We used this available data for this project, performing a detailed analysis of the trending videos and the tagged user data.
Keywords - YouTube, Ensemble Learning, Data Cleaning
YouTube is the most popular and most widely used video platform in the world today. It maintains a list of trending videos that is updated constantly. We use Python and its data and numerical computing libraries (Pandas, NumPy), the machine learning libraries scikit-learn (sklearn) and XGBoost, and Matplotlib, Seaborn and json to analyze the dataset. The dataset, taken from Kaggle, is a collection of around 2 lakh (200,000) trending videos, with their likes, dislikes and categories, for 6 different regions (Canada, Denmark, France, India, United Kingdom, United States of America). We analyze this data to gain insights into YouTube trending videos and to see what these trends have in common. Machine learning methods such as ensemble learning are used to predict essential missing data. These insights might be used by people who want to increase the popularity of their videos on YouTube, or simply by people who want to browse the stats of their favorite creator. The main focus of this analysis is to help content creators by giving them a detailed analysis, and to help viewers find interesting stats. We address the following questions and tasks:
- How many views do our trending videos have? Do most of them have a large number of views? Is a large number of views required for a video to become trending?
- What factors contribute to making a particular video trend?
- Which category attracts the most viewers?
- When were trending videos published? On which days of the week? At which times of the day?
- Predicting the category of a video based on its title.
- Categorising YouTube videos based on their comments and statistics.
- Finding the most popular, most liked and most disliked videos on YouTube.
- Analysing which factors affect how popular a YouTube video will be.
- Statistical analysis over time.
- Predicting the categories of videos published in the future.
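Several of the questions above (category, day of the week, time of day) depend on deriving a few extra fields from each raw record. The following is a minimal sketch of that preparation step using only the Python standard library. It assumes a category file shaped like the Kaggle `*_category_id.json` files and an ISO-style `publish_time` string; the sample record, the field names and the helper functions `load_category_names` and `enrich` are hypothetical and are not taken from our code.

```python
import json
from datetime import datetime

# Assumed shape of a category file (only two categories shown).
CATEGORY_JSON = """
{"items": [
  {"id": "10", "snippet": {"title": "Music"}},
  {"id": "24", "snippet": {"title": "Entertainment"}}
]}
"""

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def load_category_names(raw_json):
    """Return a dict mapping integer category_id to its display name."""
    items = json.loads(raw_json)["items"]
    return {int(item["id"]): item["snippet"]["title"] for item in items}


def enrich(row, categories):
    """Return a copy of row with category name, publish weekday and hour."""
    # Assumed timestamp format, e.g. "2018-05-11T14:30:00.000Z".
    published = datetime.strptime(row["publish_time"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return {
        **row,
        "category": categories.get(row["category_id"], "Unknown"),
        "publish_weekday": WEEKDAYS[published.weekday()],  # 0 = Monday
        "publish_hour": published.hour,
    }


categories = load_category_names(CATEGORY_JSON)

# Hypothetical record, for illustration only.
sample_row = {
    "title": "Example video",
    "category_id": 24,
    "publish_time": "2018-05-11T14:30:00.000Z",
    "views": 120000,
}

enriched = enrich(sample_row, categories)
print(enriched["category"], enriched["publish_weekday"], enriched["publish_hour"])
```

With Pandas, the same derivation would normally be applied to whole columns rather than to one record at a time; the per-record form is used here only to keep the sketch self-contained.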
- Paper: Cheng Xu et al., “Understanding the YouTube and their data” [IEEE, 2014]
- Main Claims: As the title suggests, this paper helped us get the gist of YouTube data. It provides insights into, among other things, how social aspects influence published videos, and discusses how YouTube videos have noticeably different statistics from traditional streaming videos, ranging from length and access pattern to active life span.
- Takeaway: The main takeaway from this paper is how to work with YouTube data and which aspects of the data to examine in order to produce a good analysis.
- Paper: The Philosophy of Exploratory Data Analysis, 1987
- Main Claims: This paper attempts to define Exploratory Data Analysis (EDA) more precisely than usual, and to produce the beginnings of a philosophy of this topical and somewhat novel branch of statistics.
- Takeaway: Though this paper dates back to the 1980s, it still helped us grasp the basics of EDA and how to perform it in the intended way.
- Paper: Bhuiyan, Hanif & Ara, Jinat & Bardhan, Rajon & Islam, Dr. MD Rashedul. (2018). Retrieving YouTube Video by Sentiment Analysis on User Comment
- Main Claims: This paper relates to the technique we intend to use for analysing the comments on published videos. It presents a Natural Language Processing (NLP) based sentiment analysis approach applied to user comments.
- Takeaway: This paper helps to find the most relevant and popular YouTube video for a given search.
- Paper: Thomas G. Dietterich, Ensemble Methods in Machine Learning.
- Main Claims: This paper introduces and explains ensemble learning methods and how they help in building models with better performance.
- Takeaway: This paper helps us understand how ensemble methods function.
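To illustrate the idea behind ensemble methods reviewed above, the following is a minimal sketch of unweighted majority voting, one of the simplest ways of combining classifiers. It is not the model used in our analysis: the three base "classifiers" are hypothetical hand-written rules and the four labelled videos are invented toy data, chosen so that each rule makes a mistake on a different video.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine one label per base classifier into a single label."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical base classifiers: each is a deliberately weak rule.
def by_title(video):
    return "Music" if "official video" in video["title"].lower() else "Entertainment"


def by_tags(video):
    return "Music" if "music" in video["tags"] else "Entertainment"


def by_channel(video):
    return "Music" if video["channel"].lower().endswith("vevo") else "Entertainment"


BASE_CLASSIFIERS = [by_title, by_tags, by_channel]


def ensemble_predict(video):
    """Predict a category by majority vote over the base classifiers."""
    return majority_vote([clf(video) for clf in BASE_CLASSIFIERS])


def accuracy(predict, labelled_videos):
    """Fraction of videos for which predict returns the true label."""
    hits = sum(predict(video) == label for video, label in labelled_videos)
    return hits / len(labelled_videos)


# Invented toy data: (video, true category).
LABELLED = [
    ({"title": "Artist - Song (Official Video)", "tags": ["pop"],
      "channel": "ArtistVEVO"}, "Music"),
    ({"title": "Live session", "tags": ["music", "live"],
      "channel": "BandVEVO"}, "Music"),
    ({"title": "Comedy sketch", "tags": ["comedy"],
      "channel": "LaughVEVO"}, "Entertainment"),
    ({"title": "Talk show highlights", "tags": ["comedy"],
      "channel": "Late Show"}, "Entertainment"),
]

for clf in BASE_CLASSIFIERS:
    print(clf.__name__, accuracy(clf, LABELLED))
print("ensemble", accuracy(ensemble_predict, LABELLED))
```

Because the three rules err on different videos, the vote outperforms each rule on this toy data. The benefit depends on the base classifiers making largely independent errors; classifiers that fail on the same examples gain nothing from voting.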
- The month has no consistent effect on the number of videos published; the pattern varies from country to country. In India, the largest number of videos is posted in June.
- Weekends have no effect on the number of videos published, but in all six countries the highest number of videos is published on Friday.
- The time of day appears to be very important for publishing. In our analysis, most videos are posted between 13:00 and 16:00 hrs in all the countries.
- India had the most viewership around 13:00, while all the other countries had the most viewership from around 4:00 to 5:00 hrs. A possible reason is the time zone difference: the other five countries are all Western countries, which may explain why they show a similar result.
- We conclude that videos with more views also tend to have more likes and dislikes: the three are positively correlated. The same holds for videos within a particular category in all the countries.
- In most cases, the top 5 channels in a category also rank in the top 5 for viewership, likes and dislikes.
- ‘Entertainment’ is the most popular category in all 6 countries.
- Countries differ in how long a video stays trending. The United Kingdom has the longest trend duration, at 35 days.
- Friday also has the highest viewership among weekdays.
- “Late night show with Stephen Colbert” is the most popular channel across the dataset.
- “Nicky Jam x J. Balvin - X (EQUIS) | Official video | Prod. Afro Bros & Jeon” is the most viewed video.
- 97.82% of videos have comments enabled.
- Just 0.05% of videos have been removed.
- There was a sudden and large spike in views, likes and comment_count between May and June of 2018.
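The kinds of statistics reported in the findings above (correlation between views and likes, busiest publishing day, share of videos with comments enabled) can be computed as in the following minimal sketch. It uses only the Python standard library; the five records and their field names are hypothetical and are not drawn from the dataset, so the printed numbers do not reproduce our results.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical records, for illustration only.
records = [
    {"views": 1000, "likes": 50, "publish_weekday": "Monday", "comments_disabled": False},
    {"views": 5000, "likes": 260, "publish_weekday": "Friday", "comments_disabled": False},
    {"views": 20000, "likes": 900, "publish_weekday": "Friday", "comments_disabled": False},
    {"views": 80000, "likes": 4100, "publish_weekday": "Wednesday", "comments_disabled": True},
    {"views": 150000, "likes": 7000, "publish_weekday": "Friday", "comments_disabled": False},
]

# Correlation between views and likes (a value near +1 means strongly positive).
views_likes_r = pearson([r["views"] for r in records],
                        [r["likes"] for r in records])

# Day of the week on which the most videos were published.
busiest_day, busiest_count = Counter(
    r["publish_weekday"] for r in records
).most_common(1)[0]

# Percentage of videos that have comments enabled.
enabled_share = 100 * sum(not r["comments_disabled"] for r in records) / len(records)

print(f"views/likes correlation: {views_likes_r:.3f}")
print(f"busiest publishing day: {busiest_day} ({busiest_count} videos)")
print(f"comments enabled: {enabled_share:.2f}%")
```

On a Pandas DataFrame, the correlation would usually be obtained with `df["views"].corr(df["likes"])` and the day counts with `df["publish_weekday"].value_counts()`, assuming columns with those names.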
[2] The Philosophy of Exploratory Data Analysis, 1987
[3] Bhuiyan, Hanif & Ara, Jinat & Bardhan, Rajon & Islam, Dr. MD Rashedul. (2018). Retrieving YouTube Video by Sentiment Analysis on User Comment
[4] Visualizing data using Matplotlib and Seaborn libraries in Python for data science.
[5] Thomas G. Dietterich, Ensemble Methods in Machine Learning
[6] Top 50 matplotlib Visualizations – The Master Plots
[7] Simple guide for ensemble learning methods