1101-datascience/finalproject-finalproject_group10

[Group10] Predicting Loyalty Scores

Groups

  • 劉柏毅, 106305022
  • 林藝潔, 108703005
  • 郭沛澐, 108703006
  • 鄭宛薰, 108703003
  • 江宏繹, 108304016

Goal

The goal of our project is to predict the loyalty of customers towards the payment company Elo, one of the largest payment brands in Brazil. The company has built partnerships with merchants in order to offer promotions or discounts to cardholders, and this data science project will help them to assess their business model and know more about the customers' experience.

Demo

Example command to reproduce our results:

Rscript modeling.R --train train_0109_tg.csv --test test_0106.csv --report performance.csv --predict predict.csv

on-line visualization

Folder organization and its related information

docs

  • The preprocessing and modeling scripts can be found in the code folder.
  • The processed data is linked below.

data

  • Kaggle: https://www.kaggle.com/c/elo-merchant-category-recommendation
  • Input format: training data, 201917 customers x 84 (83 features + 1 target + 1 exp(target)); testing data, 123623 x 83 features.
  • There are four datasets: customer profiles, past transaction records, new transaction records, and merchant details.
  • We first clean the merchant details. The main issues here are duplicated merchants with different properties and missing values in some of the merchant features. Next, we process the transaction records by grouping rows with the same card_id and computing their mean, variance, and other statistics. Lastly, we construct new features that measure the ratio or the difference between transaction behaviors in the two time periods.
  • Most of the missing values are imputed with the mode, mean, or median, chosen based on the exploratory data analysis.
  • Since the processed data is too big for the repository, we share it through cloud storage: https://drive.google.com/drive/folders/11UEHfoTjIopv38MvSmCNH7UICLZi84tn?usp=sharing
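The grouping and period-comparison steps described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the project's actual R code. The function names, the `(card_id, amount)` row layout, and the toy rows are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pvariance


def aggregate_by_card(transactions):
    """Group (card_id, amount) rows by card and return per-card summary stats."""
    amounts = defaultdict(list)
    for card_id, amount in transactions:
        amounts[card_id].append(amount)
    return {
        card_id: {"count": len(v), "mean": mean(v), "var": pvariance(v)}
        for card_id, v in amounts.items()
    }


def compare_periods(hist_stats, new_stats):
    """Build difference and ratio features between the two time periods."""
    features = {}
    for card_id, hist in hist_stats.items():
        new = new_stats.get(card_id)
        if new is None:
            # Card has no new-period transactions: leave the features missing.
            features[card_id] = {"mean_diff": None, "mean_ratio": None}
            continue
        diff = new["mean"] - hist["mean"]
        # Guard against division by zero when the historical mean is 0.
        ratio = new["mean"] / hist["mean"] if hist["mean"] != 0 else None
        features[card_id] = {"mean_diff": diff, "mean_ratio": ratio}
    return features


# Toy rows (hypothetical, not taken from the Elo dataset).
historical = [("C1", 10.0), ("C1", 30.0), ("C2", 5.0)]
new = [("C1", 40.0)]

hist_stats = aggregate_by_card(historical)
new_stats = aggregate_by_card(new)
period_features = compare_periods(hist_stats, new_stats)
```

Cards that appear in only one period end up with missing comparison features, which would then need to be imputed like any other missing value.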

code

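  • Which method do you use?
    • Random forest
  • What is a null model for comparison?
    • The benchmark score provided by Kaggle is 3.87852.
  • How do you perform evaluation? i.e. cross-validation, or an additional independent data set
    • The data is split 80/20 into train and test sets.
    • We build five models, drop the largest and smallest predicted values, and use the average of the remaining three predictions as the final prediction.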
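The rule for combining the five models' predictions can be sketched as below. This is an illustrative Python version under assumed names, not the project's R implementation: `trimmed_ensemble` and the toy prediction values are hypothetical.

```python
def trimmed_ensemble(predictions_per_model):
    """Combine per-model predictions row by row.

    predictions_per_model is a list with one list of predictions per model,
    all aligned by row. For each row, the largest and smallest predictions
    are dropped and the rest are averaged.
    """
    final = []
    for row_preds in zip(*predictions_per_model):
        ordered = sorted(row_preds)
        kept = ordered[1:-1]  # drop the minimum and the maximum
        final.append(sum(kept) / len(kept))
    return final


# Five hypothetical models, two rows each.
model_predictions = [
    [0.0, 1.0],
    [1.0, 2.0],
    [2.0, 3.0],
    [3.0, 4.0],
    [10.0, -5.0],
]
final_predictions = trimmed_ensemble(model_predictions)
```

Dropping the extremes keeps a single badly off model (the last one in the toy data) from pulling the final prediction away from the others.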

results

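  • Which metric do you use?
    • MSE, RMSE, R-squared
  • Is your improvement significant?
    • No. R-squared typically improves in steps of about 0.01.
  • What is the challenging part of your project?
    • The values and categories in the raw data are anonymized, which makes them hard to interpret and analyze.
      • Solution: plot the distributions of many features, and compare merchant features with customer features. For example, category_3 takes the values A, B, and C. Our analysis showed that the mean of target follows A > B > C, so we encoded the feature as A: 2, B: 1, C: 0.
    • The raw data is large and scattered, so it needs several layers of processing. The main table has only 3 features; in the end we built 84 features in total.
      • Solution: first handle the duplicated and missing-value entries in the merchant table, then aggregate the past and recent data separately by grouping on customer, and finally compare the past and recent data.
    • The files are large, so deploying to ShinyIO, preprocessing, or modeling can fail because of out-of-memory errors and similar problems.
      • Solution: process the input files before uploading, keeping only the processed data and dropping the raw data.
      • Split functions apart and run them line by line.
      • Reduce the amount of data by random sampling.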
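The category_3 encoding described above (ranking categories by their mean target) can be sketched as follows. This is an illustrative Python version, not the project's R code. The function name `ordinal_map_by_target_mean` and the toy values are assumptions, chosen only so that the mean target follows A > B > C as in the text.

```python
from collections import defaultdict


def ordinal_map_by_target_mean(categories, targets):
    """Map each category to its rank when sorted by mean target (lowest = 0)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    ranked = sorted(means, key=means.get)  # ascending by mean target
    return {cat: rank for rank, cat in enumerate(ranked)}


# Hypothetical values, not taken from the Elo dataset.
category_3 = ["A", "B", "C", "A", "B", "C"]
target = [1.0, 0.0, -1.0, 0.5, -0.5, -2.0]

mapping = ordinal_map_by_target_mean(category_3, target)
encoded = [mapping[c] for c in category_3]
```

In practice the mapping would be derived from the training split only and then applied unchanged to the test data, so that no target information leaks into evaluation.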

About

finalproject-finalproject_group10 created by GitHub Classroom
