diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..278764c
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,3 @@
+.idea
+.idea/*
+.venv
\ No newline at end of file
diff --git a/README.md b/README.md
index 0d869eb..12b1d8c 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,55 @@
-# goodgrief
-Analysis of IMDb datasets
+[🔙 Back to my profile](https://shefaliisharma.github.io/)
+
+
+
+
+
+This project contains the analysis of [IMDb Dataset](https://developer.imdb.com/non-commercial-datasets/)
+
+## Objectives:
+1. How have genres and viewer preferences evolved over the years within the content industry for English-language content, and how do these trends differ across regions and languages?
+
+
+## Dataset & Methodology:
+The analysis was performed using PostgreSQL queries. The dataset was queried to extract relevant information and answer the research questions. The queries used in the analysis are provided in the results section below.
+
+The output from the SQL queries was loaded into Tableau. My local setup for achieving the above consisted of:
+
+- PostgreSQL server running on localhost on macOS Sonoma
+- DataGrip for querying the database and exploratory data analysis
+- Tableau for Visualizations
+
+## Analysis:
+
+### Trends within Genres of English Language Content:
+
+I've crafted a query that dissects the ever-changing trends in genres using IMDb's extensive database. By transforming the genre information from a single string into individual elements, I've prepared the data to showcase the number of films and average ratings for each genre per year.
+
+**The Process:**
+Expand Genres: With a CTE, I convert the list of genres for each title into separate rows using PostgreSQL's UNNEST and STRING_TO_ARRAY functions.
+Aggregate Insights: Joining the expanded genres with the IMDb dataset, I focus on English language titles and known regions, filtering out any unknowns.
+Calculate Metrics: I compute the total number of titles (title_count) and their average IMDb rating (average_rating) for each genre annually.
+
+```sql
+WITH GenreExpansions AS (
+    SELECT
+        imdb_basic.tconst,
+        UNNEST(STRING_TO_ARRAY(genres, ',')) AS genre_split,
+        startyear,
+        averagerating
+    FROM imdb_basic JOIN imdb_ratings ON imdb_basic.tconst = imdb_ratings.tconst
+)
+SELECT
+    genre_split,
+    startyear,
+    COUNT(tconst) AS title_count,
+    AVG(averagerating) AS average_rating
+FROM GenreExpansions
+JOIN imdb_akas ON imdb_akas.titleid = GenreExpansions.tconst
+JOIN imdb_country_codes ON imdb_country_codes.region_code = imdb_akas.region
+WHERE language = 'en' AND region_name != 'Unknown'
+GROUP BY genre_split, startyear
+ORDER BY genre_split, startyear;
+```
+
+[![Visual01](assets/viz1.png)](https://public.tableau.com/views/IMDbdatasetGenreTimeSeries/Ratingsanalysisovertheyears?:language=en-US&:sid=&:display_count=n&:origin=viz_share_link)
\ No newline at end of file
diff --git a/_config.yml b/_config.yml
new file mode 100644
index 0000000..5c2d24e
--- /dev/null
+++ b/_config.yml
@@ -0,0 +1,4 @@
+remote_theme: pages-themes/slate@v0.2.0
+title: [IMDb Dataset Analysis]
+description: [Project to showcase my Data Analysis skills]
+show_downloads: "false"
\ No newline at end of file
diff --git a/assets/viz1.png b/assets/viz1.png
new file mode 100644
index 0000000..fb6baf8
Binary files /dev/null and b/assets/viz1.png differ