Skip to content

A telegram channel parser + binary text classifier utilizing a simple logistic regression model

License

Notifications You must be signed in to change notification settings

samalyarov/telegram_monitor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Telegram monitoring script

A script is based on telethon library and utilizes a personal Telegram account for acquiring messages from open channels or channel to which the user has access. Data can be collected from multiple channels (from a list in separate files). That particular version of the scripts then utilizes text pre-processing and a trained logistic regression model to determine if the message is "of interest" and if so - forwards it to a specified channel. Whether the message is "of interest" is arbitrary and based purely on the samples and interests of one using the script.

In this particular example, the script is used by a concierge company to ease the monitoring of many chats in which they search for potential clients. This helped save a lot of working time (estimated up to 70-80% workload reduction on that particular task) and expand the monitoring range - the script allowed to consistently monitor many more channels and groups, thus substantially increasing the company coverage without incurring any additional costs (a single worker could monitor several times more channels).

Files included in repository:

  • t_channels.xlsx is a list of channels (with names and links) from which the posts are to be loaded. The list is done in a .xlsx file for ease of communicating with colleagues (especially those with no coding experience) and leaving comments and remarks.
  • config.ini is a config file listing username, password, phone number of a user as well as bot API settings for connection.
  • TG_Monitor.py is a script itself
  • Posts_prediction.ipynb is an Jupyter Notebook with preliminary data analysis and model training, as well as prospects for future development and overall analysis of the model.
  • In order to be operational the folder should also include 3 .pkl files - a trained logistic regression model, a trained vectorizer model and vocabulary for said vectorizer model. All three can be created through the use of the "Posts_prediction" notebook.

The folder is designed to later on be compressed into a single .exe file (via the use of pyinstaller) for easy use.

Libraries used: telethon, configparser, datetime, numpy, pandas, pytz, matplotlib, seaborn, pickle, re, pymorphy2, nltk, scikit-learn(sklearn), eli5, tqdm (tqdm_notebook)