
Hacker News crawler 👾📃

This is a small project dedicated to scraping information from the Hacker News website. The project consists of several classes that work together to crawl and parse the relevant content (a minimal sketch of how they might fit together follows the list):

  • Crawler 🐍: The crawler is in charge of extracting information from the website's HTML. It fetches the raw HTML content using requests and BeautifulSoup and then finds the elements that match the specifications.
  • Retriever 🔍: This class is in charge of retrieving the relevant fields from the parsed elements and organizing them into a dictionary. In this case, we are only interested in the following fields: title, rank (position on the front page), number of comments and points.
  • Cleaner 🧹: The cleaner is in charge of formatting and preparing the fields, taking care of removing Unicode characters and converting numerical fields to int or float.
  • Sorter 📚: The sorter class takes care of filtering and sorting the lists of entry dictionaries.
  • Serializer 💾: Finally, the serializer class is used for saving the data to file.
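
Below is a minimal sketch of how these classes could fit together, trimmed to title, rank and points for brevity. It is illustrative only: the method names and the CSS selectors are assumptions based on the current Hacker News markup, not the project's actual API.

import json

import requests
from bs4 import BeautifulSoup


class Crawler:
    """Downloads the page and finds the entry rows."""

    def __init__(self, url="https://news.ycombinator.com/"):
        self.url = url

    def get_entries(self):
        response = requests.get(self.url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # On Hacker News, each entry is a <tr> with the "athing" class
        return soup.find_all("tr", class_="athing")


class Retriever:
    """Collects the raw fields of a single entry into a dictionary."""

    @staticmethod
    def get_fields(entry):
        # Points live in the sibling row right below the entry row
        subtext = entry.find_next_sibling("tr").find("td", class_="subtext")
        score = subtext.find("span", class_="score")
        return {
            "title": entry.find("span", class_="titleline").a.get_text(),
            "rank": entry.find("span", class_="rank").get_text(),
            "points": score.get_text() if score else "0 points",
        }


class Cleaner:
    """Strips decoration and casts numeric fields."""

    @staticmethod
    def clean(fields):
        return {
            "title": fields["title"].replace("\xa0", " ").strip(),
            "rank": int(fields["rank"].rstrip(".")),
            "points": int(fields["points"].split()[0]),
        }


class Sorter:
    """Filters and orders lists of cleaned entries."""

    @staticmethod
    def sort_by_points(entries):
        return sorted(entries, key=lambda e: e["points"], reverse=True)


class Serializer:
    """Writes the final result to disk as JSON."""

    @staticmethod
    def save(entries, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(entries, f, indent=2)


if __name__ == "__main__":
    raw = [Retriever.get_fields(e) for e in Crawler().get_entries()]
    cleaned = [Cleaner.clean(f) for f in raw]
    Serializer.save(Sorter.sort_by_points(cleaned), "entries.json")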

To execute the code, run the following command from the root directory:

python3 news-crawler

The program will automatically create directories for storing data and program logs. Once it has run, you can head to the data/ directory and check the generated output files. If you do not wish to run the code yourself, a couple of examples of the produced output are already provided in the data/ directory. Feel free to check them out!

Future improvements 🚀

Due to time limitations, the implementation so far is limited. Future improvements to this project would include:

  1. Developing automated tests for the different functionalities using libraries such as unittest or pytest (see the sketch after this list)

  2. Implementing classes for modelling entries, which could be useful for storing content in databases using appropriate frameworks (e.g. FastAPI) ✅

  3. Storing data in a database (e.g. PostgreSQL) ✅

  4. Dockerizing the API and database ✅
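
As an illustration of point 1, a pytest module for the Cleaner might look something like the sketch below. The import path and the clean() interface are assumptions about the project's layout, not its actual API.

# test_cleaner.py -- hypothetical test module; the real package
# layout and Cleaner interface may differ
from newscrawler.cleaner import Cleaner


def test_cleaner_casts_numeric_fields_to_int():
    raw = {"title": "Some title", "rank": "1.", "points": "256 points"}
    cleaned = Cleaner.clean(raw)
    assert cleaned["rank"] == 1
    assert cleaned["points"] == 256


def test_cleaner_strips_unicode_whitespace_from_titles():
    raw = {"title": "Some\xa0title", "rank": "2.", "points": "0 points"}
    assert Cleaner.clean(raw)["title"] == "Some title"

Running pytest from the project root would then pick such a module up automatically.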

Instructions for running the code 💻

To run the dockerized version of the code, clone the repository to your machine and run the following command from the terminal:

docker-compose up --build

The above command will build the Docker images and create two new containers: hacker-db (a dockerized PostgreSQL database) and hacker-fastapi (the dockerized FastAPI application). You can then run the main application using the following command:

python3 newscrawler

Once the application has finished running, you can use a cURL command to retrieve the data that has been added to the database:

curl -X GET "http://localhost:8000/entries/" -H "accept: application/json"
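
For reference, the /entries/ endpoint could be implemented along the lines below. This is a sketch only: the model fields mirror the ones listed above, but the repository's actual models and database session handling may differ (an in-memory list stands in for PostgreSQL here).

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class Entry(BaseModel):
    title: str
    rank: int
    comments: int
    points: int


# Stand-in for the PostgreSQL-backed storage used by the real application
fake_db: list[Entry] = []


@app.get("/entries/")
def read_entries() -> list[Entry]:
    # FastAPI serializes the Pydantic models to the JSON the cURL call returns
    return fake_db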
