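A webshell is an intrusion tool commonly used by attackers: a command execution environment that exists as a web page file such as php, asp, jsp, perl, cgi, or py. After compromising a web server, an attacker typically mixes the webshell backdoor file in with the legitimate page files under the server's web directory, then accesses it over the web to perform high-risk operations such as uploading and downloading files, accessing databases, and invoking system commands, with the goal of gaining illegal control of the server. Webshells are both highly dangerous and very well hidden.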
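This repository attempts static detection of webshells with a composite deep neural network that combines a TextCNN with a binary classification network. The TextCNN processes the vectorized token array, while the binary classification network processes manually extracted numeric features (file size, entropy, and so on).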
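The numeric features mentioned above can be illustrated with a minimal sketch. It computes only the two features named in the text (file size and entropy); the function names `shannon_entropy` and `numeric_features` are hypothetical and are not taken from this repository, whose actual feature set may be larger.

```python
import math
from collections import Counter
from pathlib import Path


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of a byte string, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # maps each byte value to its number of occurrences
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def numeric_features(path: Path) -> dict:
    """Collect hand-crafted numeric features for one file (illustrative set only)."""
    data = path.read_bytes()
    return {"size": len(data), "entropy": shannon_entropy(data)}
```

Entropy is a plausible signal because encoded or obfuscated payloads tend to have a more uniform byte distribution than hand-written source code.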
The raw dataset was collected from GitHub. The full list of repositories follows.
- tennc/webshell
- JohnTroony/php-webshells
- xl7dev/webshell
- tutorial0/webshell
- bartblaze/PHP-backdoors
- BlackArch/webshells
- nikicat/web-malware-collection
- fuzzdb-project/fuzzdb
- lcatro/PHP-webshell-Bypass-WAF
- linuxsec/indoxploit-shell
- b374k/b374k
- LuciferoO/webshell-collector
- tanjiti/webshell-Sample
- JoyChou93/webshell
- webshellpub/awsome-webshell
- xypiie/webshell
- leett1/Programe/
- lhlsec/webshell
- feihong-cs/JspMaster-Deprecated
- threedr3am/JSP-Webshells
- oneoneplus/webshell
- fr4nk404/Webshell-Collections
- mattiasgeniar/php-exploit-scripts
- WordPress/WordPress
- yiisoft/yii2
- johnshen/PHPcms
- https://www.kashipara.com
- joomla/joomla-cms
- laravel/laravel
- learnstartup/4tweb
- phpmyadmin/phpmyadmin
- rainrocka/xinhu
- octobercms/october
- alkacon/opencms-core
- craftcms/cms
- croogo/croogo
- doorgets/CMS
- smarty-php/smarty
- source-trace/phpcms
- symfony/symfony
- typecho/typecho
- leett1/Programe/
- rpeterclark/aspunit
- dluxem/LiberumASP
- aspLite/aspLite
- coldstone/easyasp
- amasad/sane
- sextondb/ClassicASPUnit
- ASP-Ajaxed/asp-ajaxed
- https://www.codewithc.com
First, download the repository to your local machine. The code was verified with Python 3.10.13.
pip install -r requirements.txt
Package versions in the verified environment:
Package Name | Version |
---|---|
tensorflow | 2.15.0.post1 |
pandas | 2.2.0 |
nltk | 3.8.1 |
scikit-learn | 1.4.1.post1 |
joblib | 1.3.2 |
To train with the default parameters, run `python train.py` directly. The parameters are described below:
Args Name | Default Value | Description |
---|---|---|
--config.version | v1 | Version |
--data.webshell_folder | Dataset/webshell | Path to the folder which contains webshells |
--data.normal_folder | Dataset/normal | Path to the folder which contains normal files |
--data.file_extensions | ['.php', '.asp', '.aspx', '.jsp', '.java'] | File extension list for training |
--train.max_features | 5000 | Max tokens for TextVectorizer |
--train.sequence_length | 1024 | Output sequence length for TextVectorizer |
--train.embedding_dim | 300 | Output dimensions of the embedding layer |
--train.num_epochs | 5 | Number of training epochs |
--train.batch_size | 32 | Training batch size |
--train.validation_split | 0.2 | Proportion of validation split in train dataset |
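To make the roles of `max_features` and `sequence_length` concrete, here is a simplified pure-Python stand-in for a text vectorizer. It is only a sketch: the tokenization rule and the names `tokenize`, `build_vocab`, and `vectorize` are assumptions for illustration, not this repository's code. It follows the common convention of reserving index 0 for padding and index 1 for out-of-vocabulary tokens.

```python
import re
from collections import Counter

PAD_ID = 0  # index used to pad short sequences
UNK_ID = 1  # index used for tokens outside the vocabulary


def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract identifier-like tokens (assumed rule)."""
    return re.findall(r"[a-z_][a-z0-9_]*", text.lower())


def build_vocab(texts: list[str], max_features: int) -> dict[str, int]:
    """Keep the most frequent tokens; two slots are reserved for PAD and UNK."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    most_common = counts.most_common(max(max_features - 2, 0))
    return {tok: i + 2 for i, (tok, _) in enumerate(most_common)}


def vectorize(text: str, vocab: dict[str, int], sequence_length: int) -> list[int]:
    """Map tokens to ids, then truncate or pad to exactly sequence_length."""
    ids = [vocab.get(tok, UNK_ID) for tok in tokenize(text)][:sequence_length]
    return ids + [PAD_ID] * (sequence_length - len(ids))
```

With the defaults in the table, every file would become a vector of 1024 integers drawn from at most 5000 distinct ids, which the embedding layer then maps to 300-dimensional vectors.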
To run detection with the default parameters, run `python predict.py` directly. The parameters are described below:
Args Name | Default Value | Description |
---|---|---|
--config.version | v1 | Version |
--config.folder | Output/ | Folder which contains the configs and weights of the model/scaler/TextVectorizer |
--data.unknown_folder | Dataset/predict | Folder to be detected |
--data.file_extensions | ['.php', '.asp', '.aspx', '.jsp', '.java'] | File extension list for training |
--predict.max_features | 5000 | Max tokens for TextVectorizer |
--predict.sequence_length | 1024 | Output sequence length for TextVectorizer |
--predict.embedding_dim | 300 | Output dimensions of the embedding layer |
The `config.version`, `data.file_extensions`, `train.max_features`, `train.sequence_length`, and `train.embedding_dim` parameters must match the values used during training.
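One way to guard against a train/predict mismatch is to compare the two configurations before loading the weights. The sketch below assumes both configurations are available as flat dictionaries; the key names in `SHARED_KEYS` and the function `find_mismatches` are hypothetical and do not describe how this repository stores its configuration.

```python
# Hypothetical flat key names for the settings that must agree
# between training and prediction.
SHARED_KEYS = (
    "version",
    "file_extensions",
    "max_features",
    "sequence_length",
    "embedding_dim",
)


def find_mismatches(train_config: dict, predict_config: dict) -> dict:
    """Return {key: (train_value, predict_value)} for each shared key that differs."""
    return {
        key: (train_config.get(key), predict_config.get(key))
        for key in SHARED_KEYS
        if train_config.get(key) != predict_config.get(key)
    }


if __name__ == "__main__":
    train_cfg = {"version": "v1", "max_features": 5000, "sequence_length": 1024}
    predict_cfg = {"version": "v1", "max_features": 5000, "sequence_length": 512}
    for key, (expected, actual) in find_mismatches(train_cfg, predict_cfg).items():
        print(f"{key}: trained with {expected!r}, predicting with {actual!r}")
```

Failing early on a mismatch is preferable to a shape error deep inside the model, or to silently degraded predictions caused by a different vocabulary size.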
The evaluation results are as follows.