wikipedia_analysis

This repository contains my code to predict the quality class of Wikipedia articles.

You can find the code in the R file in the analysis directory.

Data set

The data is stored in the file all_data.tsv.

The data set contains information on about 20,000 Wikipedia articles, collected through Wikipedia projects.
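As a minimal sketch of how the file could be loaded for inspection, the helper below reads a tab-separated file into a data frame. The helper name `load_articles` is hypothetical, and the sketch assumes the file has a header row and sits in the working directory; neither is confirmed by this README.

```r
# Hypothetical helper: read the tab-separated data file into a data frame.
# Assumption: the file has a header row; set header = FALSE if it does not.
load_articles <- function(path = "all_data.tsv") {
  read.delim(path, header = TRUE, stringsAsFactors = FALSE)
}

# Example usage (assumes the working directory contains all_data.tsv):
# articles <- load_articles()
# str(articles)
```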

Running the code

You need to have R installed. I also suggest using RStudio as the IDE, but it is optional.

Please note that the code has been tested with R 3.2.3.

The following packages are required:

caTools
rpart
class
h2o
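One way to check whether these packages are available is sketched below. It only reports what is missing; the install call is left commented out because it needs network access and a configured CRAN mirror.

```r
# Packages listed in this README as required.
required <- c("caTools", "rpart", "class", "h2o")

# Find which of them are not installed yet.
installed <- rownames(installed.packages())
missing_pkgs <- setdiff(required, installed)

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
}
```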

First, load the code:

setwd("path to the directory containing AnalyzeData.R")
source("AnalyzeData.R")

Then you can run any of the following analyses.

Linear regression

Linear regression is run by calling the function runRegression.

CART

The CART model is run by calling the function runCART.

kNN

The kNN model is run by calling the function runKNNModel.
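This README does not document the arguments of runKNNModel, so the sketch below only illustrates the underlying technique with `knn()` from the class package. It uses the built-in iris data as a stand-in, not the Wikipedia article data, and the choices of k = 5 and a 30% hold-out are assumptions made for illustration.

```r
library(class)  # provides knn()

set.seed(42)  # make the random split reproducible

# Stand-in data: the built-in iris data set, NOT the Wikipedia article data.
features <- scale(iris[, 1:4])  # kNN is distance-based, so scale the features
labels   <- iris$Species

# Hold out 30% of the rows for testing.
test_idx <- sample(nrow(iris), size = round(0.3 * nrow(iris)))

predicted <- knn(train = features[-test_idx, ],
                 test  = features[test_idx, ],
                 cl    = labels[-test_idx],
                 k     = 5)

# Fraction of held-out rows classified correctly.
accuracy <- mean(predicted == labels[test_idx])
print(accuracy)
```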

Multinomial logistic regression

The predictor using multinomial logistic regression can be run by calling the function runMultinominalLogisticRegression.

The function requires the packages caret and nnet.

SVM

Packages required: caret and e1071

Function name: runSVM

Random forest

We provide two functions for the random forest model.

The first function is runRFModel, which loads the data and runs the model with readability scores, using k-fold cross-validation (with k = 5).

The second function is runRFModel_withoutReadabilityScore, which runs the model without readability scores, as in [1].

We applied 5-fold cross-validation.

You should observe that the first function provides better predictions.
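For readers unfamiliar with k-fold cross-validation, a minimal sketch follows. It is not the repository's implementation: the helper `cv_accuracy` is hypothetical, the classifier is an rpart decision tree used as a stand-in for a random forest so that the example runs with packages shipped with R, and the data is the built-in iris set rather than the article data.

```r
library(rpart)  # stand-in classifier; ships with standard R installations

# Hypothetical helper: k-fold cross-validated accuracy for a classifier.
# `data` is a data frame, `target` the name of the factor column to predict.
cv_accuracy <- function(data, target, k = 5, seed = 1) {
  set.seed(seed)
  # Randomly assign every row to one of k folds of near-equal size.
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  model_formula <- as.formula(paste(target, "~ ."))

  sapply(seq_len(k), function(i) {
    train <- data[fold != i, ]
    test  <- data[fold == i, ]
    fit   <- rpart(model_formula, data = train, method = "class")
    pred  <- predict(fit, newdata = test, type = "class")
    mean(pred == test[[target]])  # accuracy on the held-out fold
  })
}

# Example on built-in stand-in data (not the article data):
fold_acc <- cv_accuracy(iris, "Species", k = 5)
print(mean(fold_acc))
```

To mirror the comparison described above, a helper like this could be called twice, once on the full feature set and once with the readability columns dropped, and the mean fold accuracies compared.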

Utilities

We also provide some utility functions, for example to calculate RMSE or NDCG.
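As an illustration of what these two metrics compute, here is a minimal base R sketch. The function names are hypothetical and need not match the repository's utilities, and the NDCG shown uses the common (2^rel - 1) / log2(rank + 1) gain formulation, which is an assumption about the variant intended.

```r
# Root mean squared error between observed and predicted numeric values.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Discounted cumulative gain of relevance scores in the given order,
# using the (2^rel - 1) / log2(rank + 1) formulation.
dcg <- function(relevance) {
  ranks <- seq_along(relevance)
  sum((2^relevance - 1) / log2(ranks + 1))
}

# NDCG: DCG of items ordered by predicted score, divided by the ideal DCG.
ndcg <- function(relevance, predicted_score) {
  ranked <- relevance[order(predicted_score, decreasing = TRUE)]
  ideal  <- sort(relevance, decreasing = TRUE)
  ideal_dcg <- dcg(ideal)
  if (ideal_dcg == 0) return(0)  # no relevant items at all
  dcg(ranked) / ideal_dcg
}

rmse(c(1, 2, 3), c(1, 2, 5))        # equals sqrt(4 / 3)
ndcg(c(3, 2, 1), c(0.9, 0.5, 0.1))  # perfect ranking, so this is 1
```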

[1] Warncke-Wang, M., Ayukaev, V.R., Hecht, B. and Terveen, L.G., 2015, February. The Success and Failure of Quality Improvement Projects in Peer Production Communities. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (pp. 743-756). ACM.
