This repository contains my code for predicting the quality class of Wikipedia articles.
The code is in the R file in the analysis directory.
The data is stored in the file all_data.tsv.
The data set contains information on about 20,000 Wikipedia articles, collected through Wikipedia projects.
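To take a first look at the data, something like the following sketch could be used. It assumes that all_data.tsv has a header row and sits in the current working directory; load_articles is a hypothetical helper name and is not one of the functions in this repository.

```r
# `load_articles` is a hypothetical helper, not a function from this repository.
# It reads a tab-separated file into a data frame, assuming a header row.
load_articles <- function(path) {
  read.delim(path, header = TRUE, stringsAsFactors = FALSE)
}

# Inspect the data only if the file is present in the working directory.
if (file.exists("all_data.tsv")) {
  articles <- load_articles("all_data.tsv")
  print(dim(articles))  # number of rows (articles) and columns
  str(articles)         # column names and types
}
```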
You need to have R installed. I suggest using RStudio as the IDE, but it is optional.
Please note that the code was tested with R 3.2.3.
The following packages are required:
- caTools
- rpart
- class
- h2o
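One possible way to check whether these packages are available before loading the code is sketched below. missing_packages is a hypothetical helper name, not part of this repository, and the install step is left commented out on the assumption that you want to install from CRAN yourself.

```r
# Packages listed above as required by the analysis code.
required_packages <- c("caTools", "rpart", "class", "h2o")

# Return the subset of `pkgs` that cannot be loaded from the current R library.
# `missing_packages` is a hypothetical helper, not a function from this repository.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required_packages)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment to install the missing packages from CRAN:
  # install.packages(to_install)
}
```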
First, load the code:
setwd("path to the directory containing AnalyzeData.R")
source("AnalyzeData.R")
Then you can run the following analyses.
Linear regression is run by calling the function runRegression.
The CART model is run by calling the function runCART.
The kNN model is run by calling the function runKNNModel.
The multinomial logistic regression predictor is run by calling the function runMultinominalLogisticRegression.
This function requires the packages caret and nnet.
The SVM model is run by calling the function runSVM.
This function requires the packages caret and e1071.
We provide two functions for the randomForest model.
The first function is runRFModel, which loads the data with readability scores and runs the model using k-fold cross-validation (with k = 5).
The second function is runRFModel_withoutReadabilityScore, which runs the model without readability scores, as in [1].
We applied 5-fold cross-validation.
You should observe that the first function gives better predictions.
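For readers unfamiliar with k-fold cross-validation, the idea can be sketched as follows. This is only an illustration under stated assumptions: it uses synthetic data and a plain lm model as a stand-in for the random forest, and make_folds is a hypothetical helper that does not come from this repository.

```r
# Assign each of `n` rows to one of `k` folds of (nearly) equal size, at random.
# `make_folds` is a hypothetical helper, not a function from this repository.
make_folds <- function(n, k = 5) {
  sample(rep(seq_len(k), length.out = n))
}

set.seed(42)                       # make the random split reproducible
toy <- data.frame(x = rnorm(100))  # synthetic data, for illustration only
toy$y <- 2 * toy$x + rnorm(100, sd = 0.1)

folds <- make_folds(nrow(toy), k = 5)
fold_rmse <- numeric(5)
for (i in 1:5) {
  train <- toy[folds != i, ]       # fit on the other four folds
  test <- toy[folds == i, ]        # evaluate on the held-out fold
  fit <- lm(y ~ x, data = train)   # stand-in model, not the random forest
  pred <- predict(fit, newdata = test)
  fold_rmse[i] <- sqrt(mean((test$y - pred)^2))
}
print(mean(fold_rmse))             # average error over the five folds
```

Each row is used for testing exactly once, so the averaged error is less dependent on one particular train/test split.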
We also provide some utility functions, for example to calculate RMSE and NDCG.
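As a reminder of what these two measures compute, here is a minimal sketch. The function names rmse, dcg and ndcg are hypothetical and are not the names used in this repository. The NDCG version shown assumes the exponential gain (2^rel - 1) with a log2 discount; the repository code may use a different variant.

```r
# Root mean squared error between actual and predicted values.
# `rmse`, `dcg` and `ndcg` are hypothetical names, not functions from this repository.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Discounted cumulative gain of relevance values given in ranked order.
# Assumption: exponential gain (2^rel - 1) with a log2 position discount.
dcg <- function(rel) {
  sum((2^rel - 1) / log2(seq_along(rel) + 1))
}

# Normalized DCG: DCG of the ranking induced by `scores`, divided by the
# DCG of the ideal ranking (items sorted by true relevance).
ndcg <- function(relevance, scores) {
  ranked <- relevance[order(scores, decreasing = TRUE)]
  ideal <- sort(relevance, decreasing = TRUE)
  ideal_dcg <- dcg(ideal)
  if (ideal_dcg == 0) {
    return(0)  # no relevant items at all, so the ratio is undefined
  }
  dcg(ranked) / ideal_dcg
}

print(rmse(c(1, 2, 3), c(1, 2, 4)))
print(ndcg(relevance = c(3, 2, 1), scores = c(0.2, 0.9, 0.5)))
```

RMSE is lower for better predictions, while NDCG lies between 0 and 1 and equals 1 when the predicted ranking matches the ideal one.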
[1] Warncke-Wang, M., Ayukaev, V.R., Hecht, B. and Terveen, L.G., 2015, February. The Success and Failure of Quality Improvement Projects in Peer Production Communities. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (pp. 743-756). ACM.