cteplovs/dataquest

Project
Python Programming
  • Author: Darryl Buswell
  • Expertise: Exploratory Analysis
  • Tool: Python
  • Industry: Entertainment; Environment; Government Policy and Planning; Information Technology; Securities and Finance; Sports and Recreation; Transportation
Description

Foundations of Python and programming, including modules, enumeration, indexing, scopes, object-oriented programming, lambda functions, and exception handling.

Includes:

  • Employed basic Python syntax on Star Wars script data to determine which character speaks most often.
  • Use of Python to read and parse a raw dataset, convert data types, apply IF statements, and apply for loops in order to find which US city has the lowest rate of violent crime.
  • Application of Python functions to parse data, apply IF statements, and create a dictionary in order to calculate the frequency of different weather conditions in Los Angeles.
  • Use of Python functions which tokenize string data, check for syntax and index errors, and normalize data dictionaries in order to provide a check for spelling errors within text data.
  • Employed modules and classes in Python to determine the number of wins for an American National Football League (NFL) team using data from 2009 to 2013.
  • Use of the enumerate function, list comprehensions, try/except blocks, and the None type in Python, while finding the most common names for US Congressmen and Congresswomen.
  • Application of Python functions to create a 'while' loop, use the 'break' keyword, and add named and optional arguments to a function in order to find which US airlines experience the most delays.
  • Use of scopes and debugging in Python while analyzing student loan defaults in the US.
  • Object-oriented programming in Python, including writing organized, sensible code and implementing comparison operators, to compare the average ages of players on various NBA teams.
  • Example exception handling code in Python applied to recorded chopstick 'food pinching efficiency' data.
  • Advanced string manipulation and anonymous functions in Python in order to assess characteristics of a list of user passwords.
Dataset
  • Star Wars Episode IV script. [link]
  • Number of incidents of 'violent crime' within each US city for 2013. [link]
  • Historic daily weather conditions for Los Angeles. [link]
  • Short story text file with a number of spelling mistakes. [link]
  • National Football League (NFL) win/loss records for each game from 2009 to 2013.
  • Members of the United States Congress (1789-Present) and congressional committees (1973-Present) in YAML. [link]
  • US airline flight delay statistics from the US Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS). [link]
  • Student loan debt data (e.g. number of borrowers and defaulted borrowers) for educational institutions within the US.
  • NBA players data (e.g. player name, position and points per game) from the 2013-2014 season. [link]
  • Recorded 'food pinching efficiency' for 31 male junior college students and 21 primary school pupils who used chopsticks of various lengths.
  • List of 2,151,220 unique ASCII passwords.
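
The weather-frequency exercise listed above (parse values, apply IF statements, build a dictionary) could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `count_frequencies` and the sample values are hypothetical.

```python
def count_frequencies(values):
    """Return a dict mapping each distinct value to how many times it appears."""
    counts = {}
    for value in values:
        if value in counts:
            # Seen before: increment the running tally.
            counts[value] += 1
        else:
            # First occurrence: start the tally at one.
            counts[value] = 1
    return counts


# Hypothetical sample; the real dataset holds daily Los Angeles weather records.
sample_weather = ["Sunny", "Fog", "Sunny", "Rain", "Sunny", "Fog"]
weather_counts = count_frequencies(sample_weather)
print(weather_counts)
```

The same loop works for any hashable values, so it also fits the character-count and name-count exercises in this project.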

Project
Data Analysis with Pandas
  • Author: Darryl Buswell
  • Expertise: Exploratory Analysis
  • Tool: Python
  • Industry: Food, Beverages and Tobacco; Government Policy and Planning; Transportation
Description

Use of the Pandas dataframe object in Python for more efficient data analysis.

Includes:

  • Use of the Numpy library, matrices, and vectors in Python in order to assess alcohol consumption by country.
  • Application of Python and Pandas to index, retrieve, sort, normalize and run calculations on US Department of Agriculture (USDA) data to discover the most and least nutritious foods.
  • Application of Python and Pandas to compute summary statistics, create pivot tables, remove missing values, and reindex rows of passenger survival data from the Titanic.
  • Use of Python and Pandas to manipulate dataframes and calculate summary statistics of employment data from the American Community Survey (ACS) for 2010 to 2012.
Dataset
  • Alcohol consumption data (e.g. type of alcohol and amount consumed) for countries from around the world.
  • Food nutrition data from the US Department of Agriculture (USDA) National Nutrient Database for Standard Reference. [link]
  • Data (e.g. age, gender, fare, cabin) on passengers who were onboard the Titanic. [link]
  • American Community Survey (ACS) results for 2010 to 2012 from a survey on job outcomes for recent college graduates based on the major they studied in college. [link]

Project
Data Visualization
  • Author: Darryl Buswell
  • Expertise: Exploratory Analysis
  • Tool: Python
  • Industry: Education; Environment; Health Care
Description

Implementation of various techniques to visualize data using Python with Matplotlib and Seaborn.

Includes:

  • Use of the Matplotlib module to plot charts in Python of forest fire data from Montesinho National Park.
  • Application of Python, Pandas and Matplotlib to effectively utilize visualization to explore employment data from the American Community Survey for 2010 to 2012.
  • Production of a visually appealing histogram plot in Python using Seaborn and data from the National Survey of Family Growth, 2002 to 2003.
  • Use of different components of the Matplotlib module to create customizable data visualizations in Python.
Dataset
  • Meteorological and forest burning data for the northeast region of Portugal, including temperature, humidity, wind speed and area of forest burned. [link]
  • American Community Survey (ACS) results for 2010 to 2012 from a survey on job outcomes for recent college graduates based on the major they studied in college. [link]
  • Survey data from the National Survey of Family Growth from January 2002 to March 2003, which contains data on mother's age, pregnancy duration, and birth weight. [link]

Project
Data Cleaning
  • Author: Darryl Buswell
  • Expertise: Exploratory Analysis
  • Tool: Python
  • Industry: Entertainment
Description

Basics of using Python to clean and manipulate data.

Includes:

  • Use of Python and Pandas code to clean a dataset of Avengers character deaths with the aim of making the data more useful for analysis.
Dataset
  • Details the deaths of Marvel comic book characters between the time they joined the Avengers and April 30, 2015, the week before Secret Wars #1. [link]

Project
Python for Business Analysts
  • Author: Darryl Buswell
  • Expertise: Exploratory Analysis
  • Tool: Python
  • Industry: Arts and Culture; Housing; Securities and Finance
Description

Use of Python to clean, visualize, and explore data.

Includes:

  • Python code to analyze, clean, and visualize US housing affordability survey data.
  • Generalized Python code applied to housing affordability survey data in order to automate typical data processing tasks using lists, functions, filters and loops.
  • Plots of time series data in Python using Matplotlib to analyze historical changes to the share price of Microsoft and Apple.
  • Use of functions in Python and Pandas (e.g. lstrip(chars) and apply()) to diagnose data quality issues and correct data errors in a Museum of Modern Art dataset.
Dataset
  • Annual housing affordability survey data (e.g. age of household head, housing costs, number of bedrooms and year house was built) collected by the US Department of Housing & Urban Development (HUD). [link]
  • Historical daily share price data for Microsoft, Inc. and Apple Computer, Inc. from the date each company went public.
  • Basic metadata (e.g. title, artist, date made, medium, dimensions, and date acquired) for 120,000 records within the Museum of Modern Art's catalogue of artwork. [link]
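
The string-cleaning step mentioned above (lstrip(chars) applied to each value) might look something like the sketch below. The function name `clean_year` and the sample date strings are hypothetical and are not taken from the Museum of Modern Art dataset.

```python
def clean_year(raw):
    """Strip wrapping parentheses and a leading 'c. ' marker from a date string."""
    text = raw.strip()
    # lstrip/rstrip remove any run of the listed characters, not a literal prefix.
    text = text.lstrip("(").rstrip(")")
    # A multi-character prefix is sliced off explicitly for that reason.
    if text.startswith("c. "):
        text = text[len("c. "):]
    return text


# Hypothetical raw values illustrating common formatting problems.
raw_dates = ["(1912)", "c. 1917", "1936", " (c. 1950) "]
cleaned = [clean_year(value) for value in raw_dates]
print(cleaned)
```

With Pandas, a function like this could be passed to `Series.apply()` to clean a whole column in one call.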

Project
APIs and Web Scraping
  • Author: Darryl Buswell
  • Expertise: Data Management
  • Tool: CSS; HTML; Python
  • Industry: Information Technology; Sciences
Description

Techniques to acquire and process data from APIs and the web using Python.

Includes:

  • Use of a GET request to retrieve information from the OpenNotify API, and JavaScript Object Notation (JSON) to encode data structures, allowing the longitude and latitude position of the International Space Station to be identified.
  • Use of API pagination and POST, PUT, PATCH and DELETE requests in Python to explore GitHub repositories and users.
  • Introduction to HTML web scraping in Python, including use of the Beautiful Soup library and CSS selectors in order to scrape 2014 Super Bowl data.
Dataset
  • International Space Station location data tracked by the OpenNotify API. [link]
  • GitHub repository data tracked by the GitHub API. [link]
  • HTML page of 2014 Super Bowl summary data. [link]
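
As an illustration of the JSON-decoding step in the International Space Station exercise, the sketch below parses a response body into latitude and longitude. The payload shape is an assumption modelled on an OpenNotify-style response, the coordinate and timestamp values are made up, and `parse_iss_position` is a hypothetical helper name.

```python
import json

# Illustrative payload only; the field layout is assumed and the values are invented.
sample_body = """
{
  "message": "success",
  "timestamp": 1441141883,
  "iss_position": {"latitude": "-15.5", "longitude": "28.25"}
}
"""


def parse_iss_position(body):
    """Decode a JSON response body and return (latitude, longitude) as floats."""
    data = json.loads(body)
    position = data["iss_position"]
    # float() accepts both numeric and string-encoded coordinates.
    return float(position["latitude"]), float(position["longitude"])


latitude, longitude = parse_iss_position(sample_body)
print(latitude, longitude)
```

In a live script, the text returned by the GET request would be passed in place of `sample_body`.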

Project
Probability and Statistics in Python
  • Author: Darryl Buswell
  • Expertise: Statistical Inference
  • Tool: Python
  • Industry: Arts and Culture; Food, Beverages and Tobacco; Securities and Finance; Sports and Recreation; Tourism
Description

Application of basic probability and statistical theory using Python.

Includes:

  • Application of discrete, continuous, ordinal and categorical scales; histograms; measures of central tendency; and assessment of skew, kurtosis and modality to explore the characteristics of people who survived the Titanic shipwreck, using Python.
  • Application of a number of statistical measures to NBA player performance data using Python, including computation of median, variance and standard deviation; distribution plots; and calculation of correlation and covariance between variables.
  • Calculation of standard deviation, standard error and linear regression on a wine dataset using Python, with the aim of ascertaining the relationship between quality and density.
  • Use of Python to randomly select a sample from American Community Survey (ACS) data; compute the median and mean of the sample; plot the results in a histogram; and calculate the statistical significance of the relationship between education level and income.
  • Assessment of the probability of specific attributes appearing on national flags, using Python. Specifically, to calculate conjunctive, dependent, disjunctive and disjunctive dependent probabilities.
  • Use of Python to calculate the probability of renting a bike in Washington DC, by applying 'the number of combinations' formula and 'per combination probability formula' for each bike rental combination.
  • Production of distributions of bikesharing data, and computation of the mean, standard deviation, cumulative distribution function and z-scores of the probability distribution, using Python.
Dataset
  • NBA players data (e.g. player name, position and points per game) from the 2013-2014 season. [link]
  • Dataset of white wine chemical properties and subjective taste quality rankings. [link]
  • American Community Survey (ACS) results collected by the US Census Bureau. [link]
  • Collins Gem Guide to Flags. [link]
  • Hourly and daily count of rental bikes between 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information. [link]
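
The 'number of combinations' and 'per combination probability' formulas mentioned above multiply together to give a binomial probability. The sketch below shows that calculation under assumed inputs: the function name `binomial_probability` and the example numbers are hypothetical and do not come from the bikeshare data.

```python
from math import comb


def binomial_probability(n, k, p):
    """Probability of exactly k successes in n independent trials with success rate p."""
    # Number of distinct ways to choose which k of the n trials succeed.
    combinations = comb(n, k)
    # Probability of any one specific arrangement of k successes and n - k failures.
    per_combination = (p ** k) * ((1 - p) ** (n - k))
    return combinations * per_combination


# Hypothetical example: 2 high-rental days out of 5, each with probability 0.5.
example = binomial_probability(5, 2, 0.5)
print(example)
```

Summing the result over every k from 0 to n gives 1, which is a quick sanity check on the formula.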

Project
Linear Algebra in Python
  • Author: Darryl Buswell
  • Expertise: Exploratory Analysis
  • Tool: Python
Description

Application of the basics of linear algebra using Python.

Includes:

  • Use of Python to represent systems of equations as matrices; solve systems of linear equations using Gauss' Method; convert matrices to Echelon form and reduced Echelon form; and test equations for inconsistency, infinite solutions, homogeneity and singularity.
  • Use of Python to add vectors, multiply vectors by scalars, plot vectors, calculate vector length, create a dot product, multiply a matrix by a vector and multiply matrices, with the aim of predicting how many points NBA players scored in 2013 using how many field goals they attempted.
Dataset
  • NBA players data (e.g. player name, position and points per game) from the 2013-2014 season. [link]
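
A rough sketch of reducing an augmented matrix to reduced Echelon form with Gauss' Method is given below. It uses exact fractions from the standard library rather than NumPy, which is a choice made for this illustration; the function name `reduced_echelon_form` and the example system are hypothetical.

```python
from fractions import Fraction


def reduced_echelon_form(matrix):
    """Return the reduced Echelon form of a matrix given as a list of rows."""
    rows = [[Fraction(value) for value in row] for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    pivot_row = 0
    for col in range(n_cols):
        if pivot_row == n_rows:
            break
        # Find a row at or below pivot_row with a nonzero entry in this column.
        candidates = [r for r in range(pivot_row, n_rows) if rows[r][col] != 0]
        if not candidates:
            continue
        swap = candidates[0]
        rows[pivot_row], rows[swap] = rows[swap], rows[pivot_row]
        # Scale the pivot row so the pivot entry becomes 1.
        pivot = rows[pivot_row][col]
        rows[pivot_row] = [value / pivot for value in rows[pivot_row]]
        # Subtract multiples of the pivot row to clear the column elsewhere.
        for r in range(n_rows):
            if r != pivot_row and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
    return rows


# Hypothetical system: 2x + y = 5 and x - y = 1, written as an augmented matrix.
solution = reduced_echelon_form([[2, 1, 5], [1, -1, 1]])
print(solution)
```

The last column of the result holds the solution when the system is consistent and has a unique answer; a row of zeros with a nonzero final entry would signal inconsistency.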

Project
Machine Learning in Python
  • Author: Darryl Buswell
  • Expertise: Machine Learning
  • Tool: Python
  • Industry: Education; Entertainment; Environment; Securities and Finance; Sports and Recreation; Tourism; Transportation
Description

Basics of machine learning using Python, including application of regression and classification algorithms.

Includes:

  • Applied basic regression analysis in Python to predict next day S&P 500 values. Involved cleaning invalid data; use of the linear regression class in the scikit-learn package to predict values; measuring the mean squared error, root mean squared error and mean absolute error of the model; and visualizing the regression model.
  • Use of Python to create a classification model utilizing binary discrimination. The model was used to optimize bank profit from credit card approvals and required the application of techniques such as calculation of the model's predictive power, sensitivity, specificity and fallout, and computation of ROC curves, and precision and recall curves.
  • Created a linear regression model using Python and Sklearn to predict if an applicant will be admitted to a US University. Techniques such as logistic regression, determination of the model's predictive power, computation of the ROC curve and interpretation of results were used to achieve this outcome.
  • Applied one-versus-all multiclass classification techniques using Python and Sklearn to create a logistic regression model and predict the origin of a vehicle. Used techniques such as classification matrices and confusion matrices; calculated average accuracy, precision and recall; and measured the F-score of the model.
  • Applied linear regression, r-squared and t-statistics using Python and SciPy to estimate the leaning rate of the Leaning Tower of Pisa.
  • Applied Scikit-learn tools using Python to run and visualize the results of a robust K-means implementation, and as a result segment NBA players into groups with similar traits.
  • Use of Python and Sklearn to normalize data, fit data to a linear model and apply a gradient descent algorithm in order to predict the accuracy of a golfer's drive using the distance of the drive.
  • Use of Python and Sklearn to apply neural network theory, backpropagation and splitting data techniques to predict the species of iris flowers.
Dataset
  • Daily S&P 500 index price from 2005 to 2015. [link]
  • Sample credit score data for a collection of individuals, including binary data on whether the individual has paid-off their credit in the past and a score of probability of being approved for future credit.
  • Data for 1,000 University applicants, including Graduate Record Exam (GRE) score, Grade Point Average (GPA) and whether the applicant was/was not admitted.
  • Attribute data for 398 automobiles from the StatLib library, including fuel consumption, number of cylinders, displacement, and origin. [link]
  • Yearly data recorded from 1975 to 1987 measuring the lean angle of the Leaning Tower of Pisa.
  • NBA players data (e.g. player name, position and points per game) from the 2013-2014 season. [link]
  • Professional golfers' driving statistics, measuring driving distance and accuracy. [link]
  • Iris flower dataset, including flower sepal length, sepal width, petal length, petal width and species. [link]
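
The sensitivity, specificity, fallout and precision measures named in the credit card classification exercise can be computed from the four confusion counts. The following is a minimal sketch with made-up labels; `classification_rates` is a hypothetical helper, and it assumes both classes appear in the actual and predicted labels so that no denominator is zero.

```python
def classification_rates(actual, predicted):
    """Compute common binary classification rates from 0/1 label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    return {
        "sensitivity": tp / (tp + fn),  # also called recall or true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "fallout": fp / (fp + tn),      # false positive rate
        "precision": tp / (tp + fp),
    }


# Hypothetical labels: 1 means the individual paid off their credit.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0]
rates = classification_rates(actual, predicted)
print(rates)
```

Plotting sensitivity against fallout across a range of score thresholds is what produces a ROC curve.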

Project
Decision Trees
  • Author: Darryl Buswell
  • Expertise: Machine Learning
  • Tool: Python
  • Industry: Government Policy and Planning
Description

Construct and interpret decision trees using Python.

Includes:

  • Use of data from the 1994 census (e.g. marital status, age, type of work) to predict if an individual earns less or more than 50k per year. To achieve this, Python was used to create a modified version of the ID3 algorithm, develop a full decision tree model, print results and make predictions using the tree.
  • Built on the concepts learnt in 'Introduction to Decision Trees' by creating a full decision tree model in Python, printing the results, and making predictions about income based on demographic characteristics using a modified version of the ID3 algorithm.
  • Use of Scikit-learn to fit a decision tree which uses demographic characteristics to predict if a person's income is less than or greater than 50K. Includes evaluation of error with classification using AUC, and assessment of data for overfitting and underfitting.
  • Use of Python to reduce decision tree overfitting by implementing the random forest algorithm, and as a result, improving the accuracy of the decision tree which aims to determine the likelihood of income being less than or greater than 50K based on demographic characteristics.
Dataset
  • Extracted data (marital status, age, type of work) for citizens from the 1994 US Census. [link]
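
ID3-style decision trees choose each split by information gain, which is the drop in entropy after splitting. The sketch below shows that calculation for a categorical column only; it is not the modified ID3 variant described above, and the column names and rows are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(rows, split_key, target_key):
    """Entropy reduction from splitting rows on the values of split_key."""
    base = entropy([row[target_key] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[split_key], []).append(row[target_key])
    # Weight each group's entropy by the share of rows it contains.
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - weighted


# Hypothetical rows; high_income is 1 for more than 50K and 0 otherwise.
rows = [
    {"married": "yes", "private": "yes", "high_income": 1},
    {"married": "yes", "private": "no", "high_income": 1},
    {"married": "no", "private": "yes", "high_income": 0},
    {"married": "no", "private": "no", "high_income": 0},
]
gain_married = information_gain(rows, "married", "high_income")
gain_private = information_gain(rows, "private", "high_income")
print(gain_married, gain_private)
```

In this toy data the `married` column separates the classes perfectly, so it would be chosen as the first split.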

Project
Data Structures and Algorithms
  • Author: Darryl Buswell
  • Expertise: Exploratory Analysis
  • Tool: Python
  • Industry: Sports and Recreation
Description

How computers work and how they work with data.

Includes:

  • Exploration of concepts such as binary numbers, Unicode, UTF-8, hexadecimal and tokenizing statements, including the conversion of characters (e.g. hexadecimal to binary; bytes to strings), using Python.
  • Creation of constant time and linear time algorithms, using Python. For example, creation of a linear algorithm which returns the age of an NBA player given their name.
  • Use of Python to perform binary searches on ordered data, enabling a player's age to be found within an NBA dataset.
  • Use of Python to insert values into arrays, implement a 2D-array, write a hash function and select appropriate data structures.
  • Application of concepts of recursion and linked lists using Python to analyze data.
Dataset
  • NBA player performance statistics for the 2013-14 season, including player name, position, games played and total points scored. [link]
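
The binary search exercise above could be sketched as follows. The player names and ages are hypothetical placeholders, and the list must already be sorted by name for the search to be valid.

```python
def binary_search(sorted_names, target):
    """Return the index of target in sorted_names, or -1 if it is absent."""
    low, high = 0, len(sorted_names) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_names[mid] == target:
            return mid
        if sorted_names[mid] < target:
            # Target sorts after the midpoint: discard the lower half.
            low = mid + 1
        else:
            # Target sorts before the midpoint: discard the upper half.
            high = mid - 1
    return -1


# Hypothetical (name, age) records, sorted by name.
players = [("Avery", 24), ("Blake", 31), ("Casey", 27), ("Devon", 22), ("Emery", 35)]
names = [name for name, _ in players]
index = binary_search(names, "Devon")
age = players[index][1] if index != -1 else None
print(age)
```

Each comparison halves the remaining range, so the lookup takes logarithmic time, compared with the linear scan used in the linear time exercise.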

Project
Exploring Topics in Data Science
  • Author: Darryl Buswell
  • Expertise: Exploratory Analysis; Machine Learning; Statistical Inference
  • Tool: Python
  • Industry: Entertainment; Government Policy and Planning; Health Care; Information Technology; Sports and Recreation
Description

Exploration of data science topics such as Natural Language Processing (NLP) and clustering using Python.

Includes:

  • Use of Python to apply k-means clustering and principal component analysis to group US senators into clusters based on their voting habits.
  • Applied Naïve Bayes theory using Scikit-learn and Python to predict whether a movie review is positive or negative given only the text of the review.
  • Creation of counters that can represent a multiset and probability mass function (e.g. Suite), using Python.
  • Use of Python to plot a Probability Mass Function (PMF) and analyze the differences between PMFs to explore correlations in data from the National Survey of Family Growth.
  • Use of Euclidean distance and the K-nearest neighbours (KNN) algorithm in Python to analyze NBA player performance data from the 2013-14 season.
  • Application of chi-squared testing and ridge regression using Python to predict the number of upvotes that a news headline will receive.
Dataset
  • Recorded votes from the 114th Senate. [link], [link]
  • Movie-review documents labeled with respect to their overall sentiment polarity (positive or negative). [link]
  • Survey data from the National Survey of Family Growth from January 2002 to March 2003, which contains data on mother's age, pregnancy duration, and birth weight. [link]
  • NBA player performance statistics for the 2013-14 season, including player name, position, games played and total points scored. [link]
  • Submissions to Hacker News from 2006 to 2015, including submission time, url, headline and number of upvotes. [link]
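
Euclidean distance and K-nearest neighbours, as used in the NBA exercise above, can be illustrated with a small classifier. This is a sketch under assumptions: the two-feature points, the position labels and the function names are invented for the example.

```python
from collections import Counter
from math import sqrt


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for query by majority vote among its k nearest points."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean_distance(pair[0], query),
    )
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-feature player records and position labels.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["guard", "guard", "guard", "center", "center", "center"]
prediction = knn_predict(points, labels, (2, 2), k=3)
print(prediction)
```

Features on very different scales would normally be normalized first so that no single column dominates the distance.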

Project
Spark and Map-Reduce
  • Author: Darryl Buswell
  • Expertise: Big Data; Data Management
  • Tool: Python; Spark
  • Industry: Education; Entertainment
Description

Use of Apache Spark via PySpark in Python and the map-reduce technique to clean and analyze large datasets.

Includes:

  • Basic use of PySpark in Python to take advantage of distributed processing. Use of the reduceByKey() function and filtering to tally the number of Daily Show guests for each year that the show has been running.
  • Use of PySpark in Python to transform Hamlet text into a format that is suitable for data analysis and to explore an RDD before trying to chain another transformation to the RDD.
  • Use of PySpark in Python, including the map-reduce paradigm, transformations and actions, and data cleaning, in order to transform Hamlet text into a dataset that is suitable for analysis.
Dataset
  • List of guest appearances on the Daily Show from 1999 to 2015.
  • The entire text of Shakespeare's Hamlet. [link]
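
The filter, map and reduce-by-key pattern behind the Daily Show tally can be illustrated without a Spark cluster. The sketch below is a single-machine, pure-Python stand-in and not PySpark code: `reduce_by_key` is a hypothetical helper that mimics the semantics of an RDD's `reduceByKey()`, and the rows are invented.

```python
def reduce_by_key(pairs, func):
    """Combine the values that share a key using func, one pair at a time."""
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = func(result[key], value)
        else:
            result[key] = value
    return result


# Hypothetical (year, guest) rows; an empty year marks a bad record.
rows = [("1999", "Guest A"), ("1999", "Guest B"), ("2000", "Guest C"), ("", "Guest D")]

# Filter step: drop rows with a missing year.
valid = filter(lambda row: row[0] != "", rows)
# Map step: emit a (year, 1) pair for every remaining row.
pairs = map(lambda row: (row[0], 1), valid)
# Reduce step: add up the ones for each year.
tally = reduce_by_key(pairs, lambda a, b: a + b)
print(tally)
```

In PySpark the same chain would typically be written against an RDD as `filter`, then `map`, then `reduceByKey`, with the work distributed across partitions.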


Languages

  • HTML 79.3%
  • Jupyter Notebook 15.6%
  • Python 4.8%
  • SAS 0.3%