3rd Place Solution - Kaggle-OTTO-Comp - Chris' Part

In Kaggle's OTTO competition, we need to build a model that predicts what a user will click, cart, and order in the future at the online e-commerce website https://otto.de. We are given 4.5 weeks of data and must predict the next 0.5 week. More details are explained at Kaggle here, and the final leaderboard is here. Our team of Benny, Chris, Giba, and Theo won 3rd place cash gold! Our detailed writeups are published here (Chris), here (Theo), and here (Benny), and the final submission Kaggle notebook is here.

Code for 15th Place Gold Single Model

The code in this GitHub repo will create Chris' 15th place solo gold single model, which achieves CV 0.601 and LB 0.601. When we ensemble the single models of Benny, Chris, Giba, and Theo, we achieve 3rd place cash gold with Private LB 0.6038 and Public LB 0.6044. The final submission Kaggle notebook is published here.

This model is a "candidate rerank" model explained here, here, and here. Our challenge is to predict, for each user and each target (the 3 targets are click, cart, and order), the 20 items that we suspect the user will engage with in the future. First we generate 100 item candidates (per target per user) using co-visitation matrices and heuristic rules. Next we merge features onto the user-item pairs. Lastly we train a GBT reranker model to select 20 items from the 100 candidates.
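The three stages above can be sketched in a few lines of plain Python. This is only an illustration, not the repo's implementation: the repo uses RAPIDS cuDF and an XGBoost reranker, while the function names (`build_covisit`, `generate_candidates`, `rerank`), the `1 / (rank + 1)` vote weighting, and the `score_fn` stand-in for the trained GBT are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_covisit(sessions, top_n=20):
    """Count how often two items appear in the same session.

    `sessions` maps a session id to its ordered list of item ids.
    Returns {item: [most co-visited items, best first]}.
    """
    pair_counts = defaultdict(Counter)
    for items in sessions.values():
        # Deduplicate so a repeated item does not inflate its own pairs.
        unique_items = list(dict.fromkeys(items))
        for a, b in permutations(unique_items, 2):
            pair_counts[a][b] += 1
    return {a: [b for b, _ in c.most_common(top_n)] for a, c in pair_counts.items()}


def generate_candidates(history, covisit, n_candidates=100):
    """Propose items already in the session first, then co-visited items."""
    votes = Counter()
    for item in history:
        for rank, neighbor in enumerate(covisit.get(item, [])):
            votes[neighbor] += 1.0 / (rank + 1)  # hypothetical weighting
    seen = list(dict.fromkeys(reversed(history)))  # most recent first
    seen_set = set(seen)
    extra = [i for i, _ in votes.most_common() if i not in seen_set]
    return (seen + extra)[:n_candidates]


def rerank(candidates, score_fn, k=20):
    """Keep the k highest-scoring candidates (score_fn stands in for the GBT)."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]


# Toy data: three sessions over four items.
sessions = {1: [10, 11, 12], 2: [10, 11], 3: [11, 13]}
covisit = build_covisit(sessions)
candidates = generate_candidates([10], covisit)
top2 = rerank(candidates, score_fn=lambda item: -item, k=2)
```

In the real pipeline the score would come from a model trained on the merged user-item features, and there would be one candidate list and one reranker per target.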

The following image helps explain the organization of code in this repo. First we train our model using the first 3.5 weeks of data. Then we run inference using all 4.5 weeks of data. Therefore we basically run the same pipeline twice.
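A minimal sketch of a time-based split of this kind is shown below. It assumes events are `(session, aid, ts, type)` tuples with `ts` in seconds, and that events after the cutoff are held out for the training pass; the function name `split_by_time` and the toy data are hypothetical, and the repo's own `make_train_valid.ipynb` may prepare the data differently.

```python
DAY_SECONDS = 24 * 60 * 60
WEEK_SECONDS = 7 * DAY_SECONDS


def split_by_time(events, train_weeks=3.5):
    """Split events at `train_weeks` after the earliest timestamp.

    Returns (history, holdout): events strictly before the cutoff and
    events at or after it.
    """
    start = min(ts for _, _, ts, _ in events)
    cutoff = start + train_weeks * WEEK_SECONDS
    history = [e for e in events if e[2] < cutoff]
    holdout = [e for e in events if e[2] >= cutoff]
    return history, holdout


# Toy events spread over roughly four weeks.
events = [
    (1, 10, 0, "clicks"),
    (1, 11, 7 * DAY_SECONDS, "carts"),
    (2, 12, 21 * DAY_SECONDS, "clicks"),
    (2, 13, 26 * DAY_SECONDS, "orders"),
]
history, holdout = split_by_time(events)
```

For the inference pass, the same pipeline would simply consume every available event instead of only `history`.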

How To Run Code

This code ran successfully on 20xCPU with 256GB RAM and 1xGPU with 32GB memory. Using less memory may cause out-of-memory errors. To run this code, first install the libraries RAPIDS (cuDF, cuML), XGBoost, and Pandarallel in addition to the basic Python libraries Pandas, NumPy, Pickle, Scikit-Learn, Matplotlib, and Tqdm. The script that computes item embeddings also requires PyTorch and Merlin-Dataloader. Next follow these 3 main steps with their substeps:

(1) Download Data from Kaggle
=> Run /data/make_train_valid.ipynb
(2) Train Models
=> compute co-visit matrices by running /train/covisit_matrices/script.ipynb
=> generate candidates and scores with /train/candidates/script.ipynb
=> engineer features with /train/item_user_features/script.ipynb
=> merge candidates and features for click model with /train/make_parquets/script-1.ipynb
=> train click model with /train/ranker_model/XGB-186-CLICKS.ipynb
=> merge candidates and features for cart and order model with /train/make_parquets/script-2.ipynb
=> train cart model with /train/ranker_model/XGB-406-CARTS.ipynb
=> train order model with /train/ranker_model/XGB-412-ORDERS.ipynb
(3) Infer Models
=> compute LB co-visit matrices by running /infer/covisit_matrices_LB/script.ipynb
=> generate LB candidates and scores with /infer/candidates_LB/script.ipynb
=> engineer LB features with /infer/item_user_features_LB/script.ipynb
=> merge LB candidates and features for click model with /infer/make_parquets_LB/script.ipynb
=> infer models with /infer/inference_LB/script.ipynb
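The steps above could be run unattended with a small driver script such as the sketch below. It assumes the notebooks can be executed non-interactively with `jupyter nbconvert --execute`, that the script is started from the repo root, and that the paths match the list above; the names `PIPELINE`, `nbconvert_command`, and `run_all` are made up for this example and are not part of the repo.

```python
import subprocess

# Notebook order copied from the steps above (paths relative to the repo root).
PIPELINE = [
    "data/make_train_valid.ipynb",
    "train/covisit_matrices/script.ipynb",
    "train/candidates/script.ipynb",
    "train/item_user_features/script.ipynb",
    "train/make_parquets/script-1.ipynb",
    "train/ranker_model/XGB-186-CLICKS.ipynb",
    "train/make_parquets/script-2.ipynb",
    "train/ranker_model/XGB-406-CARTS.ipynb",
    "train/ranker_model/XGB-412-ORDERS.ipynb",
    "infer/covisit_matrices_LB/script.ipynb",
    "infer/candidates_LB/script.ipynb",
    "infer/item_user_features_LB/script.ipynb",
    "infer/make_parquets_LB/script.ipynb",
    "infer/inference_LB/script.ipynb",
]


def nbconvert_command(path):
    """Build the command that executes one notebook in place."""
    return ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", path]


def run_all(notebooks, dry_run=True):
    """Run notebooks in order; with dry_run=True only print the commands."""
    commands = [nbconvert_command(path) for path in notebooks]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the pipeline at the first failing notebook.
            subprocess.run(command, check=True)
    return commands


planned = run_all(PIPELINE, dry_run=True)
```

Calling `run_all(PIPELINE, dry_run=False)` would execute the notebooks for real, which requires the hardware and libraries listed earlier.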

After running the steps above, the file /data/submission_final/submission_chris_v186v406v412.csv is generated. This file scores Private LB 0.6012 and Public LB 0.6010. To achieve a better CV and LB, we can train CatBoost models with /train/ranker_model/CAT-200-orders.ipynb and /train/ranker_model/CAT-203-carts.ipynb and change the inference script to use the CatBoost models. The result is Private LB 0.6018 and Public LB 0.6016. We only discovered that CatBoost was better after the competition ended.
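For context on the CV and LB numbers quoted above: the competition's evaluation page defines the score as a weighted Recall@20 over the three targets, with weights 0.10 for clicks, 0.30 for carts, and 0.60 for orders. The sketch below computes that score on toy data. The function name `otto_score` and the dictionary layout are assumptions made for this example, not code from the repo.

```python
WEIGHTS = {"clicks": 0.10, "carts": 0.30, "orders": 0.60}


def otto_score(predictions, ground_truth, k=20):
    """Weighted Recall@k.

    Both arguments map (session, event_type) to a list of item ids.
    Returns (weighted score, per-type recalls).
    """
    hits = {t: 0 for t in WEIGHTS}
    totals = {t: 0 for t in WEIGHTS}
    for (session, event_type), truth in ground_truth.items():
        predicted = predictions.get((session, event_type), [])[:k]
        hits[event_type] += len(set(predicted) & set(truth))
        # At most k ground-truth items can be recovered with k predictions.
        totals[event_type] += min(k, len(set(truth)))
    recalls = {t: hits[t] / totals[t] if totals[t] else 0.0 for t in WEIGHTS}
    score = sum(WEIGHTS[t] * recalls[t] for t in WEIGHTS)
    return score, recalls


# Toy example with a single session.
ground_truth = {(1, "clicks"): [5], (1, "carts"): [6, 7], (1, "orders"): [8, 9]}
predictions = {(1, "clicks"): [5, 1], (1, "carts"): [6], (1, "orders"): [1, 2]}
score, recalls = otto_score(predictions, ground_truth)
```

Because orders carry the largest weight, improvements to the order model move the overall score the most.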

├── train
│   ├── covisit_matrices         # Compute matrices with RAPIDS cuDF
│   ├── candidates               # Generate candidates from matrices
│   ├── item_user_features       # Feature engineering with RAPIDS cuDF
│   ├── make_parquets            # Combine candidates, features, targets
│   └── ranker_models            # Train XGB model
├── infer        
│   ├── covisit_matrices_LB      # Compute matrices with RAPIDS cuDF
│   ├── candidates_LB            # Generate candidates from matrices
│   ├── item_user_features_LB    # Feature engineering with RAPIDS cuDF
│   ├── make_parquets_LB         # Combine candidates, features, targets
│   └── inference_LB             # Infer XGB model with RAPIDS FIL
├── data    
│   ├── make_train_valid.ipynb   # Run to download data
│   ├── train_data               # Train data downloaded to here
│   ├── infer_data               # Infer data downloaded to here
│   ├── covisit_matrices         # Matrices stored here
│   ├── candidate_scores         # Candidate lists and scores here
│   ├── item_user_features       # Item and user features here
│   ├── train_with_features      # Train data with features merged
│   ├── infer_with_features      # Infer data with features merged
│   ├── models                   # Trained models here
│   ├── submission_parts         # Partial submission.csv here
│   └── submission_final         # Final submission.csv here
└── README.md
