Android Program Malware Prediction
Abstract:
This project uses 8 machine learning models with parameter tuning to determine the most suitable model for classifying an Android app as malicious or not based on the permissions it has enabled. Conclusions were based on the accuracy of the models at their ideal parameters, and all models were implemented in Python using scikit-learn.
Introduction:
The goal of this project is to train a model to determine whether an installed Android program is malware based on the permissions it has enabled. These permissions range from showing notifications to accessing payment information. This could be used to flag applications as suspicious based on the permissions they request, to help determine which permissions deserve more careful consideration and protection (for example, notifying the user of the specific permissions instead of sending a ToS that nobody will read), and to identify apps that are not worth the risk for the permissions they demand, such as a one-off app you will never use again that requests some fairly invasive permissions.
Data Description:
The data is the NATICUSdroid (Android Permissions) Dataset, a collection of about 30,000 samples with about 70 features, where the samples are Android apps and the features are the permissions they have enabled. All of the data is binary, with 0 meaning off and 1 meaning on, and there are no null or missing values.
Methodology:
Our approach to determining whether an application is malware is to compare the permissions that are enabled for each app. Using machine learning models in scikit-learn we can predict whether applications are malicious, and using a split training and test set we can measure the accuracy of the models. Models with adjustable parameters are run over a large set of parameter values, and we take the best parameter combinations to get each model as accurate as possible. With each model in its ideal state, we can determine which one performs best overall on this dataset.
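As a minimal sketch of the shared setup, assuming the dataset is a CSV of binary permission columns plus a label column; the filename "naticusdroid.csv" and the "Result" column name are illustrative assumptions, not the exact project code:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the NATICUSdroid permission data; filename and label column are assumed here.
df = pd.read_csv("naticusdroid.csv")
X = df.drop(columns=["Result"])   # binary permission features
y = df["Result"]                  # 1 = malware, 0 = benign

# Hold out a test set so every model is scored on data it has not seen during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)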
Data Preprocessing:
The data only has binary values, which helps us in already having normalized values for better prediction. We deliberately inserted missing values into various permissions so that we could replace them via preprocessing (not something one would do realistically, just to show we are capable of it in this project); we fill each gap with the value that occurs most often for that feature to keep it close to the original data. Some permissions are highly correlated with each other, so removing those helps us avoid overfitting the model. One instance where two permissions are highly correlated is android.permission.SET_WALLPAPER_HINTS and android.permission.SET_WALLPAPER; we can remove one of them to avoid overfitting and get better predictions. Since the data is just 0s and 1s, there is no point in adding non-linear combinations of permissions, as they would produce the same values as the existing permissions, and no scaling or normalization is required for such data. Since we have a large number of instances, 29,333 to be exact, we also subsample and keep about 5,000-10,000 instances to improve the speed of our models. Finally, we applied a PCA transform to the data: after testing multiple component counts across multiple models, we determined that condensing the data to 50 components produced the best results, giving an average 2% boost across all of our model results.
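A hedged sketch of those preprocessing steps, reusing the X DataFrame from the setup above; the dropped column is the correlated example named in the text, and the exact steps in the project code may differ:

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

# Fill the deliberately inserted missing values with each permission's most common value.
imputer = SimpleImputer(strategy="most_frequent")
X_filled = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

# Drop one of the two highly correlated wallpaper permissions.
X_filled = X_filled.drop(columns=["android.permission.SET_WALLPAPER_HINTS"])

# Condense the remaining permissions down to 50 principal components.
X_pca = PCA(n_components=50).fit_transform(X_filled)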
Data Analysis:
Using a subset of the models created below, we were able to use scikit-learn's built-in feature analysis to see which features contribute the most to an app being malware. Interestingly, in every model tested, a permission labeled android.permission.READ_APP_BADGE was listed as very likely to be a permission requested by malware. Looking it up, this isn't a permission that exists anywhere in the Android codebase, so having it doesn't actually give the malware any additional access to the base operating system. One permission that was controversial between the models was android.permission.FLASHLIGHT, the permission for controlling the flashlight. Malware apps have historically tried to disguise themselves as some sort of utility app, but we didn't expect flashlight access to be this contested as an indicator of malware. The last interesting permission labeled as a potential malware indicator was the permission to set alarms: com.android.alarm.permission.SET_ALARM. It would make sense, since Android already has a built-in app for manipulating alarms, but we were still surprised to see this permission listed as a red flag for malware.
Evaluation Metrics & Feature Engineering Techniques:
We check all of the models from the list below and determine which gives the best results on our dataset. We will use the model that gets the best results without overfitting, taking into consideration accuracy, precision, recall, specificity, and other metrics suited to specific models. We compare linear and non-linear models to find which performs best. We wrote standard functions for our models to share in order to reduce the work of implementing new models: because they are all classifiers, we made universal "Evaluation" and "Metrics" functions that we can call at the end of every model function, except the non-linear models, which needed their own evaluate function. We tune the model parameters manually with for loops to find the best parameters and compare each model at its best for our final results.
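A rough sketch of what such shared helpers could look like; the function names mirror the "Evaluation" and "Metrics" functions described above but are our own illustration, and they assume the train/test variables from the Methodology sketch:

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

def metrics(y_true, y_pred):
    """Compute the metrics compared across models, including specificity."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),   # recall and sensitivity are the same quantity
        "specificity": tn / (tn + fp),
    }

def evaluation(model, X_train, X_test, y_train, y_test):
    """Fit a classifier and report training accuracy plus the test-set metrics above."""
    model.fit(X_train, y_train)
    return {
        "train_accuracy": model.score(X_train, y_train),
        **metrics(y_test, model.predict(X_test)),
    }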
KNN:
KNN at neighbors = 7
Precision   Recall   Sensitivity   Accuracy
0.81        0.383    0.383         0.882
For KNN we tuned the "neighbors" parameter to find our best results. With KNN you can use a massive number of neighbors and you won't overfit (overfitting happens with fewer neighbors; more neighbors smooth out the data and reduce the weight of outliers). While the sweep shows that 30-40 neighbors can be ideal, we used 7 so that we don't have to sit through massive processing times, and because it still has a high test accuracy overall. We produced a similar report for every model as we tested their parameters.
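A small sketch of the neighbor sweep, reusing the train/test split from the setup sketch; the exact range of k values is illustrative:

from sklearn.neighbors import KNeighborsClassifier

# Sweep odd neighbor counts and compare training vs. test accuracy.
for k in range(1, 41, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))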
LDA:
LDA with ‘svd’ solver
Precision   Recall   Sensitivity   Accuracy
0.802       0.581    0.581         0.906
For LDA, almost all of the parameters are not tunable, so the only parameter we changed between tests was the solver. Because of this, we don't get a very complete picture of how the training of the algorithm went. However, the training and test accuracy are extremely close together despite how the graph makes it look; they are within 0.4% of each other, which is quite good for a model. Because the ‘svd’ solver produced slightly better results, that is the solver we went with.
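A sketch of the solver comparison, reusing the earlier split; scikit-learn's LDA supports the 'svd', 'lsqr', and 'eigen' solvers:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Compare the available solvers on training and test accuracy.
for solver in ("svd", "lsqr", "eigen"):
    lda = LinearDiscriminantAnalysis(solver=solver).fit(X_train, y_train)
    print(solver, lda.score(X_train, y_train), lda.score(X_test, y_test))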
Decision Tree:
Decision Tree at max depth of 7
Precision   Recall   Sensitivity   Accuracy
0.707       0.592    0.593         0.891
A decision tree is a machine learning model that builds a tree of greater-than or less-than comparisons on certain features and classifies based on that. Depth is the parameter we measured: it is the number of features that are compared. At each depth a feature is compared, and the results of the comparison lead toward the leaf nodes, which hold a ratio of samples from each class; this ratio is what decides how samples are classified when they reach that spot in the tree. After 7 layers the tree begins to overfit very badly, because when the number of layers approaches the number of features, the leaf nodes in the training set each end up holding a single sample whose exact feature values they match, so training accuracy approaches 100%. We decided on a max depth of 7 because that is as high as the test accuracy gets before training accuracy diverges from test accuracy. The decision tree had mediocre accuracy because it is a simple model designed to solve problems quickly and be computationally cheap.
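A sketch of the depth sweep under the same assumptions as the earlier snippets; the depth range is illustrative:

from sklearn.tree import DecisionTreeClassifier

# Sweep max_depth and watch for the point where training and test accuracy diverge.
for depth in range(1, 16):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))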
Random Forest:
Random Forest at estimators = 21
Precision   Recall   Sensitivity   Accuracy   R2 score
0.776       0.542    0.542         0.898      0.898
Random Forest is a more complicated decision tree model that runs permutations of the features at each layer to generate many different decision trees. The n_estimators parameter is the number of random decision trees that are generated. Random forest is another model that cannot really be overfitted by adding estimators, because more estimators just make the search process more thorough. With that said, we decided on 21 as the ideal number of estimators because that is right about where the average slope of the test accuracy flattens out; beyond this point, the extra computation of adding more estimators is not worth it.
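A sketch of the estimator sweep, again reusing the earlier split; the step size is illustrative:

from sklearn.ensemble import RandomForestClassifier

# Increase n_estimators until the gain in test accuracy flattens out.
for n in range(1, 52, 5):
    forest = RandomForestClassifier(n_estimators=n, random_state=42).fit(X_train, y_train)
    print(n, forest.score(X_train, y_train), forest.score(X_test, y_test))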
Gradient Boosted:
Gradient Boosted at estimators = 7 and learning rate = .99
Precision   Recall   Sensitivity   Accuracy   Mean Squared
0.735       0.669    0.669         0.905      0.905
Gradient Boosting is a different tree-based model from the previous two: those are inherently parallel models where the information generated in each iteration is not required for later iterations, while Gradient Boosting is sequential, with the results of the previous tree used to generate the next tree. The number of estimators is again the number of trees that are generated, while the learning rate controls how much the previous model's correct and incorrect classifications affect the generation of later models. We settled on a learning rate of .99 because the learning rate did not really affect the accuracy, but a learning rate of 1 was a poorly performing outlier. Because the learning rate is insignificant in the results, we chose a value close to 1, since lower learning rates are more computationally expensive and increase the run time of the program. We also chose 7 estimators because after 7 the training and test accuracy diverge, and just before 7 the test accuracy is higher than the training accuracy, which makes the numbers less trustworthy.
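A sketch of the manual grid over estimators and learning rate, under the same assumptions as the earlier snippets; the candidate values are illustrative:

from sklearn.ensemble import GradientBoostingClassifier

# Try small numbers of trees against a few learning rates and compare test accuracy.
for n in range(1, 16):
    for lr in (0.5, 0.75, 0.99, 1.0):
        gb = GradientBoostingClassifier(n_estimators=n, learning_rate=lr,
                                        random_state=42).fit(X_train, y_train)
        print(n, lr, gb.score(X_train, y_train), gb.score(X_test, y_test))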
Logistic Regression:
Logistic regression is a statistical method used for binary classification: it predicts the probability of a binary outcome (0 or 1, True or False, Yes or No) based on one or more predictor variables. Since we have binary data in our dataset, we can easily apply this model to predict the target values. Logistic regression gives us an overview of whether the target values are true positives or false negatives. The confusion matrix for this model shows the number of predictions made in each quadrant. A darker color in the "True Positive" cell indicates that fewer true positive predictions were made by the model, but there is also a larger number of values in the "True Negative" region, leaving us with a relatively low accuracy, i.e. less than 90%.
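A sketch of fitting logistic regression and plotting its confusion matrix; the class labels in display_labels assume 0 = benign and 1 = malware, which is our own labeling convention:

import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Fit the classifier and visualize its confusion matrix on the test set.
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
cm = confusion_matrix(y_test, logreg.predict(X_test))
ConfusionMatrixDisplay(cm, display_labels=["benign", "malware"]).plot()
plt.show()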
SVM:
SVMs are particularly effective at solving binary and multi-class classification problems. SVM has a hyperparameter called "C" that controls the trade-off between maximizing the margin and minimizing the classification error: a smaller C value encourages a wider margin but may allow some misclassification of training data, while a larger C value results in a narrower margin but fewer training errors. We ran cross-validation for the SVM to find the best parameters and found C=10 and gamma=0.1 to be the best. The model divides the data into 2 classes and separates them accordingly. We also plotted an ROC curve to better display how the SVM performed in terms of true positive and false positive rates. Ultimately, we achieved an accuracy of only 0.401 at a lower value of C, and as we increase the value of C the model tends toward overfitting. Since the accuracy is so low, this is not a good model to rely on for our data.
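A sketch of the cross-validated search over C and gamma described above; the candidate grids are illustrative, and the data variables again come from the earlier setup sketch:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Cross-validate an RBF SVM over C and gamma; C=10, gamma=0.1 came out on top in our runs.
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))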
Neural Network:
The neural network is an excellent model for our type of data because it performs really well and gives us the best accuracy of 0.91, i.e. 91%. Since increasing the number of hidden layers was overfitting our model, we kept it low and still achieved a high score. We get good results with fewer than 4 hidden layers, but still achieve a good score at 7 hidden layers. However, this model takes more time to execute than gradient boosting, which makes it reliable but not as efficient. We also got very low mean errors, giving us more accurate results. The model starts to overfit after 9 hidden layers, and the test accuracy drops drastically. Overall, it is a great model for this kind of data.
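A sketch of the hidden-layer sweep using scikit-learn's MLPClassifier; the layer width of 64 units and the iteration cap are assumptions for illustration:

from sklearn.neural_network import MLPClassifier

# Vary the number of hidden layers; past roughly 9 layers the test accuracy drops off.
for layers in range(1, 11):
    mlp = MLPClassifier(hidden_layer_sizes=(64,) * layers, max_iter=500,
                        random_state=42).fit(X_train, y_train)
    print(layers, mlp.score(X_train, y_train), mlp.score(X_test, y_test))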
Conclusions:
Looking at our models, the best performers were Gradient Boosting and the Neural Network. This makes sense, as these are fairly powerful non-linear models, and since our data is all 0s and 1s, it is likely that these models were better suited to this type of data than our other models.