Installka/yandex-practicum-data-scientist-projects

yandex-practicum-data-scientist-projects

Projects from the Yandex.Practicum Data Scientist course.

Unfortunately, I can’t post the datasets for most of the projects here because of restrictions in the Yandex.Practicum terms of use (clause 4.1). In some cases, however, the data comes from open sources (project 6, Customer churn), so you can download both the data and the notebook and run the project in full.

Description of projects

Music of big cities

  • Client - Yandex.Music;
  • Input data - music listening data for Moscow and St. Petersburg;
  • Target - research user preferences.

Tools used - pandas.

Borrowers reliability research

  • Client - bank credit department;
  • Input data - customer solvency statistics;
  • Target - determine whether a customer's marital status and number of children affect on-time loan repayment.

Tools used - pandas, pymystem3.

Analysis of apartment advertisements

  • Input data - Yandex.Nedvizhimost service data: archive of ads for the sale of apartments in St. Petersburg and neighboring settlements for several years;
  • Target - identify the parameters that determine the market value of real estate.

Tools used - pandas, pymystem3.

Definition of a prospective tariff for a telecom company

  • Client - federal mobile operator;
  • Input data - data from 500 users: who they are, where they come from, what tariff they use, how many calls each one made and how many messages each one sent in 2018;
  • Target - analyze customer behavior and conclude which tariff is better.

Tools used - pandas, NumPy, Matplotlib, SciPy.

Computer games market research

  • Client - computer games online store;
  • Input data - historical game sales data, user and expert ratings, genres and platforms (e.g. Xbox or PlayStation);
  • Target - identify patterns that determine a game's success.

Tools used - pandas, NumPy, Matplotlib, seaborn, SciPy.

Tariff recommendation

  • Client - federal mobile operator;
  • Input data - data on the behavior of customers who have already switched to these tariffs (from project 3, Definition of a prospective tariff for a telecom company);
  • Target - build a classification model that chooses the appropriate tariff.

Tools used - pandas, scikit-learn.

Models used - Logistic Regression, Decision Tree Classifier, Random Forest Classifier.

Customer churn

  • Client - bank;
  • Input data - historical data on customer behavior and termination of contracts with the bank (this data was pulled from Kaggle, so you can download it from Kaggle or from this repository and run the project's notebook in full);
  • Target - predict whether the client will leave the bank in the near future or not.

Tools used - pandas, NumPy, Matplotlib, scikit-learn, plotly, seaborn.

Models used - Logistic Regression, Decision Tree Classifier, Random Forest Classifier.

Determining the place for a new oil well

  • Client - oil company;
  • Input data - oil samples in 3 regions;
  • Target - build an ML model that helps determine the region where extraction will generate the most profit. Analyze the potential rewards and risks with the bootstrap technique.

Tools used - pandas, NumPy, Matplotlib, scikit-learn.

Models used - Linear Regression.
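
The bootstrap step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the function names (`bootstrap_profit`, `summarize`), the synthetic reserve data, and every number (sample size, number of wells picked, budget, revenue per unit) are hypothetical. It uses only the Python standard library so that it runs without the project's datasets.

```python
import random
import statistics


def bootstrap_profit(predicted, actual, n_picked, revenue_per_unit, budget,
                     sample_size, n_iterations=1000, seed=0):
    """Return a list of simulated profits, one per bootstrap resample."""
    rng = random.Random(seed)
    indices = range(len(predicted))
    profits = []
    for _ in range(n_iterations):
        # Resample candidate wells with replacement.
        sample = rng.choices(indices, k=sample_size)
        # Develop the wells the model ranks highest ...
        best = sorted(sample, key=lambda i: predicted[i], reverse=True)[:n_picked]
        # ... but earn revenue from their actual reserves.
        revenue = sum(actual[i] for i in best) * revenue_per_unit
        profits.append(revenue - budget)
    return profits


def summarize(profits):
    """Mean profit, empirical 95% interval, and share of loss-making resamples."""
    ordered = sorted(profits)
    n = len(ordered)
    lower = ordered[int(0.025 * n)]
    upper = ordered[int(0.975 * n) - 1]
    loss_risk = sum(p < 0 for p in ordered) / n
    return {"mean": statistics.mean(ordered),
            "ci95": (lower, upper),
            "loss_risk": loss_risk}


# Hypothetical data for one region: actual reserves and noisy model predictions.
data_rng = random.Random(42)
actual = [data_rng.uniform(0, 150) for _ in range(2000)]
predicted = [a + data_rng.gauss(0, 20) for a in actual]

profits = bootstrap_profit(predicted, actual, n_picked=20,
                           revenue_per_unit=1.0, budget=2000.0, sample_size=100)
stats = summarize(profits)
print(stats)
```

Running the same procedure for each of the 3 regions and comparing the mean profit against the loss risk is one way to rank them.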

Gold ore recovery rate prediction

  • Client - Zyfra;
  • Input data - data with parameters of mining and purification of gold ore;
  • Target - predict the recovery rate of gold from gold ore.

Tools used - pandas, NumPy, Matplotlib, SciPy, scikit-learn.

Models used - Linear Regression, Decision Tree Regressor, Random Forest Regressor.

Customer data protection

  • Client - insurance company;
  • Input data - customer data;
  • Target - develop a method for transforming the data so that it is difficult to recover personal information from it and the quality of machine learning models does not deteriorate.

Tools used - pandas, NumPy, scikit-learn.

Models used - Linear Regression.
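
One transformation with this property for Linear Regression is multiplying the feature matrix by a random invertible matrix. Whether the notebook uses exactly this method is an assumption; the README does not say. The sketch below uses synthetic data in place of the customer table, and the helper names (`fit_predict`, `r2`) are hypothetical. It checks that the R² of an ordinary least squares fit is the same before and after masking.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the customer table: 200 rows, 4 numeric features.
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -2.0, 0.5, 3.0]) + 4.0 + rng.normal(scale=0.1, size=200)


def fit_predict(features, target):
    """Ordinary least squares with an intercept; returns in-sample predictions."""
    design = np.column_stack([np.ones(len(features)), features])
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ weights


def r2(target, predictions):
    """Coefficient of determination."""
    residual = np.sum((target - predictions) ** 2)
    total = np.sum((target - target.mean()) ** 2)
    return 1.0 - residual / total


# Secret key: a random square matrix, redrawn until it is well conditioned
# (and therefore safely invertible).
while True:
    P = rng.normal(size=(X.shape[1], X.shape[1]))
    if np.linalg.cond(P) < 1e3:
        break

# Masked features: each column is now a mix of all original columns.
X_masked = X @ P

r2_plain = r2(y, fit_predict(X, y))
r2_masked = r2(y, fit_predict(X_masked, y))
print(r2_plain, r2_masked)
```

The two scores agree up to floating-point error because the columns of `X @ P` span the same space as the columns of `X`, so the least squares predictions do not change. Only someone who holds `P` can recover the original features, via `X_masked @ np.linalg.inv(P)`.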

Toxic comments detection

  • Client - online store;
  • Input data - cased comments;
  • Target - build a model for classifying comments into positive and negative.

Tools used - pandas, NumPy, Matplotlib, re, NLTK, SciPy, PyTorch, transformers, LightGBM, scikit-learn.

Models used - LightGBM Classifier, BERT, Logistic Regression, Decision Tree Classifier, Random Forest Classifier.

Determining age from a photograph

  • Client - supermarket chain;
  • Input data - set of photographs of people with their ages indicated;
  • Target - build a model that determines the approximate age of a person from a photograph.

Tools used - pandas, NumPy, Matplotlib, PIL, Keras.

Models used - ResNet50.

Telecom operator customer churn prediction

  • Client - telecom operator;
  • Input data - personal data of clients, information about their tariffs and contracts;
  • Target - learn to predict customer churn.

Tools used - pandas, NumPy, Matplotlib, scikit-learn, CatBoost.

Models used - Logistic Regression, Decision Tree Classifier, Random Forest Classifier, CatBoost Classifier, VotingClassifier.