- 1. Problem Statement
- 2. Steps towards building the solution
- 3. Problems/Challenges Faced/Decisions
- 4. Requirements/Machine Setup
1. Problem Statement
- The backend should receive the URL of the webpage being analyzed as a parameter.
- After processing, the results should be returned to the user. The result comprises the following information:
  - Which HTML version does the document use?
  - What is the page title?
  - How many headings of each level are in the document?
  - How many internal and external links are in the document? Are there any inaccessible links, and if so, how many?
  - Does the page contain a login form?
- If the URL given by the user is not reachable, an error message should be sent as the response. The message should contain the HTTP status code and a useful error description.
- The backend should cache the scraping results for each URL for 24 hours, so that it does not have to redo the scraping for any given URL within the next 24 hours.
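The unreachable-URL requirement could be handled along the following lines. This is only a minimal sketch built on the standard library's `urllib.request`, not the project's actual code; the names `describe_error` and `fetch` are hypothetical and chosen for illustration.

```python
import urllib.error
import urllib.request
from http import HTTPStatus


def describe_error(exc):
    """Turn an exception into a (status_code, description) pair.

    status_code is None when no HTTP response was received at all
    (DNS failure, refused connection, timeout, malformed URL, ...).
    """
    if isinstance(exc, urllib.error.HTTPError):
        # The server answered, but with an error status such as 404 or 500.
        try:
            phrase = HTTPStatus(exc.code).phrase
        except ValueError:
            phrase = str(exc.reason)  # non-standard status code
        return exc.code, f"HTTP {exc.code}: {phrase}"
    if isinstance(exc, urllib.error.URLError):
        return None, f"Could not reach the server: {exc.reason}"
    return None, f"Unexpected error: {exc}"


def fetch(url, timeout=10):
    """Return (html, None) on success or (None, (status, description)) on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace"), None
    except (urllib.error.URLError, ValueError, OSError) as exc:
        return None, describe_error(exc)
```

A view could then render the description and, when present, the status code in the error response instead of the results page.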
2. Steps towards building the solution
I had no prior experience with Django, so my first step was to skim through some video tutorials on it. I then went through the Django documentation thoroughly and implemented its steps as I read.
The problem essentially asks us to:
- Take input from the user.
- If the URL is present in the cache, return the results by directing the user to a new page.
- Otherwise, hit the URL and perform the required analysis.
- Redirect the user to the results page.
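The flow above amounts to a cache lookup followed by analysis on a miss. Here is a minimal sketch of that idea, assuming a plain in-memory dictionary in place of Django's cache framework; `get_results`, the `analyse_page` callback, and the injectable `now` clock are hypothetical names used only for illustration.

```python
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # results stay valid for 24 hours

# Maps url -> (expiry_timestamp, results). In-memory stand-in for a real cache backend.
_cache = {}


def get_results(url, analyse_page, now=time.time):
    """Return cached results for `url`, or analyse the page and cache the outcome."""
    entry = _cache.get(url)
    if entry is not None:
        expires_at, results = entry
        if now() < expires_at:
            return results  # cache hit: no scraping needed
    results = analyse_page(url)  # cache miss or expired entry
    _cache[url] = (now() + CACHE_TTL_SECONDS, results)
    return results
```

In a Django application the same pattern would more likely go through `django.core.cache.cache`, calling `cache.get(url)` and `cache.set(url, results, timeout)` with a 24-hour timeout, so that expiry is handled by the cache backend.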
- Setting up the machine to support Django and Python 3. I got stuck on some trivial machine-setup issues, which ended up consuming some time.
- Creating a model and adding the respective fields. This was followed by creating a superuser and running the migration scripts, and then by creating a form to get the user input. This was fairly easy: I found the documentation easy to navigate and quickly got the desired result.
- Creating the result view. This involved writing the helper functions that perform all the analysis on the web page. I wrote a separate function for each of the fields that needed to be determined.
- Setting up the cache.
- Once the cache was set up, writing the unit tests for the web application.
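To illustrate the idea of one small helper per field, here is a sketch that extracts the doctype, the title, and the heading counts. It uses the standard library's `html.parser` instead of Beautiful Soup so that it runs without extra dependencies; it is not the project's code, and `PageSummary` and `html_version` are hypothetical names.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collects the doctype, the title text, and a count of h1-h6 tags."""

    def __init__(self):
        super().__init__()
        self.doctype = None
        self.title = ""
        self.headings = Counter()
        self._in_title = False

    def handle_decl(self, decl):
        # Called for declarations such as <!DOCTYPE html>; decl excludes "<!" and ">".
        if decl.lower().startswith("doctype"):
            self.doctype = decl

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif re.fullmatch(r"h[1-6]", tag):
            self.headings[tag] += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def html_version(doctype):
    """Guess the HTML version from the doctype declaration text."""
    if doctype is None:
        return "unknown"
    if doctype.strip().lower() == "doctype html":
        return "HTML5"
    # Older doctypes carry the version in the public identifier, e.g. "DTD HTML 4.01".
    match = re.search(r"(X?HTML \d+(?:\.\d+)?)", doctype, re.IGNORECASE)
    return match.group(1) if match else "unknown"
```

Feeding a page's HTML to `PageSummary().feed(...)` fills in `title`, `headings`, and `doctype`, and `html_version` then maps the doctype to a readable version name.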
3. Problems/Challenges Faced/Decisions
I had primarily worked with Java before, and this was the first time I was working with Django and Python 3.
As a result, it took me some extra time to grasp the fundamentals and get going.
If a document has tags nested about 480 levels deep and you try to convert such a tag to a string, Python raises a RuntimeError (a RecursionError in Python 3.5 and later) because the maximum recursion depth is exceeded. Every level of nesting needs two nested method calls, so you soon hit Python's default limit of 1000 nested calls.
As a result, it is not possible to cache soup objects directly.
I was using the field soup.title.string to display the title of a web page, and I had assumed that this field was a plain string. However, soup.title.string does not return a plain Python string: it returns a Beautiful Soup NavigableString, which keeps a reference to the parse tree it came from. As a result, my cache wasn't working for some pages.
After investing some time in the above issue, I figured out that the problem was caused by the use of soup objects, so I thought of converting the result object to a string. This was definitely a hack around the problem I was facing, and the serializing and deserializing ended up adding a lot of unnecessary code. Seeing my code get messy, I decided to drop this approach and investigate the problem further.
On further investigation, I found that the problem was with the soup.title.string field. I also realized the importance of setting up an IDE and a debugger: had I used a debugger earlier, I would have solved this issue much more quickly.
I finally found a way to get the title as a plain string by using soup.title.text. After making this change, the code started working perfectly.
I had to choose between two libraries, Beautiful Soup and Scrapy. The factors I considered were:
- Learning curve: Beautiful Soup is very easy to learn, and you can quickly use it to extract the data you want. In most cases, you will also need a downloader to fetch the HTML source. Since Scrapy deals not only with content extraction but also with many other tasks, such as downloading the HTML, its learning curve is much steeper.
- Extensibility: If the project is small, Beautiful Soup is preferred. If the project needs more customization, such as proxies or data pipelines, Scrapy becomes the obvious choice.
As a result, I decided to use Beautiful Soup for analysing web pages.
During testing, I realized that my code did not work for HTTPS connections because of SSL certificate errors. I ended up adding some code to ignore SSL certificate errors, and the system started working fine.
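One way to ignore certificate errors is sketched below with the standard library's `ssl` and `urllib.request` modules. This is an assumption about the general approach rather than the project's exact code, and the function names `make_unverified_context` and `fetch_ignoring_cert_errors` are hypothetical.

```python
import ssl
import urllib.request


def make_unverified_context():
    """Build an SSL context that skips certificate and hostname checks."""
    context = ssl.create_default_context()
    # check_hostname must be disabled before verify_mode can be set to CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def fetch_ignoring_cert_errors(url, timeout=10):
    """Download `url` without validating the server certificate."""
    context = make_unverified_context()
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return response.read()
```

With urllib3, a comparable effect can be had by creating the pool with `urllib3.PoolManager(cert_reqs="CERT_NONE")`. Either way, skipping verification means the server's identity is no longer checked. That is a reasonable trade-off for a tool that only reads public pages, but it would not be for anything that sends sensitive data.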
Any kind of login form defines a password field. Using this, I was able to detect all the login forms on a web page.
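The password-field heuristic can be sketched as follows. The example uses the standard library's `html.parser` rather than Beautiful Soup so that it is self-contained; `LoginFormDetector` and `contains_login_form` are hypothetical names, not the project's functions.

```python
from html.parser import HTMLParser


class LoginFormDetector(HTMLParser):
    """Sets `has_login_form` when an <input type="password"> appears inside a <form>."""

    def __init__(self):
        super().__init__()
        self._form_depth = 0
        self.has_login_form = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._form_depth += 1
        elif tag == "input" and self._form_depth > 0:
            # Attribute values may be None (e.g. <input type>), hence the `or ""`.
            input_type = (dict(attrs).get("type") or "").lower()
            if input_type == "password":
                self.has_login_form = True

    def handle_endtag(self, tag):
        if tag == "form" and self._form_depth > 0:
            self._form_depth -= 1


def contains_login_form(html):
    """Return True if the HTML contains a form with a password field."""
    detector = LoginFormDetector()
    detector.feed(html)
    return detector.has_login_form
```

Note that this is a heuristic: registration and password-change forms also contain password fields, so they would be reported as login forms too.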
Finding the inaccessible links means that we need to access every link present on the page. This is an expensive operation for the web application: if the page has a lot of links, the application becomes very slow. Caching mitigates this cost, since a URL that has been analysed once does not need to be analysed again for the next 24 hours.
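The link analysis can be split into a cheap step (classifying links) and an expensive one (checking each link over the network). Here is a minimal sketch using only the standard library. It assumes that "internal" means "same host as the analysed page", so subdomains count as external; `LinkCollector`, `classify_links`, `count_inaccessible`, and the `is_reachable` callback are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def classify_links(html, page_url):
    """Split the page's links into internal and external absolute URLs."""
    collector = LinkCollector()
    collector.feed(html)
    page_host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolves relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel:, ...
        if parsed.netloc.lower() == page_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


def count_inaccessible(urls, is_reachable):
    """Count links for which `is_reachable(url)` returns False.

    Each check is a network request, so this is the slow part on pages with many links.
    """
    return sum(1 for url in urls if not is_reachable(url))
```

Passing the reachability check in as a function keeps the classification testable without network access, and makes it easy to swap in a real HTTP request in the application.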
4. Requirements/Machine Setup
To run the web application, your system must fulfill the following requirements:
- Python 3
  $ brew install python3
- Pip 3
  pip3 is installed automatically with Python 3.
- beautifulsoup4 - 4.6.3
  $ pip3 install beautifulsoup4==4.6.3
- urllib3 - 1.23
  $ pip3 install urllib3==1.23
Steps for setting up a virtual environment to run this code:
- Installation
  To install virtualenv via pip, run:
  $ pip3 install virtualenv
- Creation of the virtualenv
  A. Create a directory for the virtual environment:
  $ mkdir HelloWorld
  B. Connect the virtual environment to the directory:
  $ virtualenv -p python3 HelloWorld
  C. Move to the directory:
  $ cd HelloWorld
  D. Activate the virtualenv:
  $ source bin/activate
- Install the Django framework
  After you’ve created and activated the virtual environment, enter the following command at the shell prompt/terminal:
  $ pip3 install Django
Running the project:
- Check out the project: https://github.com/loney7/demo.git
- Move to the project directory
  $ cd ~/Downloads/demo-master
- Start Django
  $ django-admin startproject my_project_0
- Start the server
  $ python3 manage.py runserver
  This will run the server on your local machine. The console reports:
  Starting development server at http://127.0.0.1:8000/
  Stop the server with: control+c
- Visit http://127.0.0.1:8000/ in your web browser to analyse the URLs.
Cheers!!!


