Automated discovery and classification of website content through an unsupervised learning approach

Samuele95/WebCat

WebCat

Description

WebCat automates the discovery and classification of websites by content similarity, using an unsupervised learning approach with models trained as needed. The pipeline starts with web crawling and web scraping, which discover pages and extract their textual content, then uses neural networks to vectorize that content and clustering to classify the results. Specifically, vectorization is performed by BERT, a transformer-based neural network, and clustering by a Self-Organizing Map (SOM), a neural-network-based form of unsupervised learning.
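To make the clustering step more concrete, here is a minimal, standard-library-only sketch of how a SOM can group pages once each page has been turned into a vector. It is not WebCat's actual code: the function names (`train_som`, `best_matching_unit`, `cluster_pages`), the hyperparameters, the `*.example` page names, and the 3-dimensional toy vectors (standing in for BERT embeddings, which are much larger) are all assumptions made for illustration.

```python
import math
import random


def best_matching_unit(weights, vector):
    """Return the grid coordinate whose weight vector is closest to `vector`."""
    return min(
        weights,
        key=lambda node: sum((w - x) ** 2 for w, x in zip(weights[node], vector)),
    )


def train_som(vectors, rows=2, cols=2, epochs=100, lr0=0.5, sigma0=1.0, seed=0):
    """Train a small SOM and return {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # One randomly initialised weight vector per grid node.
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    total_steps = epochs * len(vectors)
    step = 0
    for _ in range(epochs):
        for vector in rng.sample(vectors, len(vectors)):  # shuffled pass
            # Learning rate and neighbourhood radius shrink linearly over time.
            decay = 1.0 - step / total_steps
            lr = lr0 * decay
            sigma = max(sigma0 * decay, 1e-3)
            bmu = best_matching_unit(weights, vector)
            for node, w in weights.items():
                # Nodes near the winner on the grid are pulled more strongly.
                grid_dist2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                influence = math.exp(-grid_dist2 / (2.0 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * influence * (vector[i] - w[i])
            step += 1
    return weights


def cluster_pages(embeddings, weights):
    """Group page identifiers by the SOM node they map to."""
    clusters = {}
    for page, vector in embeddings.items():
        clusters.setdefault(best_matching_unit(weights, vector), []).append(page)
    return clusters


# Toy data: made-up 3-dimensional vectors standing in for BERT embeddings.
toy_embeddings = {
    "news-a.example": [0.9, 0.1, 0.1],
    "news-b.example": [0.8, 0.2, 0.1],
    "shop-a.example": [0.1, 0.9, 0.2],
    "shop-b.example": [0.2, 0.8, 0.1],
}
som = train_som(list(toy_embeddings.values()))
for node, pages in sorted(cluster_pages(toy_embeddings, som).items()):
    print(node, pages)
```

Each grid node acts as a cluster label: pages whose vectors land on the same node are treated as similar. In a real pipeline the toy vectors would be replaced by embeddings produced from the scraped page text, and the grid size would be chosen to match the expected number of content categories.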

Installation

Run Docker Compose from the folder that contains the compose.yaml file, using the following command.

docker compose up

Docs

Please refer to the wiki for the overall docs and usage instructions.