I am going to attempt to simplify all these concepts with a storyline: a story between two friends, Alice and Bob.
Alice worked at a university and Bob worked as an engineer. They often shared their work with each other, including documents and references. However, each time one of them wanted to share an update, they had to go through the tedious process of writing the entire document into an email and sending it to the other. This process was untidy and tiring. To make it easier, Alice proposed a hypertext system over the internet, which would allow Alice to host content (text) along with links (connecting different documents together). Alice thus came up with the following building blocks for the system:
- A server, hosting the files containing the hypertext
- A client, the browser, which provides a way to present the data at the client's (Bob's) end
- A protocol, a set of rules for the transfer of data between client and server: HTTP
- A structure/ format for hypertext
This let Bob write a one-line request to get the content served by Alice's server.
GET /Research_Alice.html
In response to the request, Bob would receive the HTML document containing the hypertext, which would look like:
<html>A lot of what we are going to understand here</html>
However, the browser would use the HTML tags (<html>) for rendering the hypertext and only show us "A lot of what we are going to understand here".
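To make this first exchange concrete, here is a minimal sketch in Python. It is not how Alice's server actually worked: the DOCUMENTS dictionary, handle_request, and render are hypothetical names made up for illustration, and returning an empty string for an unknown path is a simplification.

```python
from html.parser import HTMLParser

# Hypothetical in-memory content; the path and text mirror the example above.
DOCUMENTS = {
    "/Research_Alice.html": "<html>A lot of what we are going to understand here</html>",
}


def handle_request(request_line: str) -> str:
    """Answer an HTTP/0.9-style request: one line, 'GET <path>', no headers."""
    method, _, path = request_line.strip().partition(" ")
    if method != "GET":
        return ""  # the first version of the protocol only knew GET
    # Simplification (assumption): unknown paths yield an empty document.
    return DOCUMENTS.get(path, "")


class TextOnly(HTMLParser):
    """Collect only the text between tags, roughly what a browser displays."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def render(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


response = handle_request("GET /Research_Alice.html")
print(render(response))
```

The response carries the tags, while the rendered output is only the text inside them.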
This was the simple web, which was ultimately called HTTP/0.9. However, as they used the system more, they realized its potential, such as:
- providing more control to the client for editing, uploading, and deleting content,
- decision making by the server for different requests by the client,
- decision making by the client corresponding to the response,
- serving other document types,
- authentication and authorization, etc.
Alice, with other contributors, made the system more extensible and usable. They added more methods (POST, HEAD, etc.), headers (Server, User-Agent, etc.), compatibility with other document types (Content-Type), pipelining (which allows you to send a second request before the response to the first one has arrived from the server), content negotiation (including encoding and language), name-based virtual hosting (the Host header; SNI is its TLS counterpart), status codes (200, 404, 302, etc.), and session management and caching (cookies, cache headers, etc.).
The development of the protocol continued, and more advanced concepts were introduced over time. As it had to, security also caught up with the advancements in the protocol, and some security-specific changes were introduced (CORS, Origin, tokens, encryption, protective headers). Encryption with HTTPS (SSL, TLS) specifically was one of the major developments in the protocol.
Understanding these concepts is important for any pentester. I will try to explain some of these and link resources for the others. So let's continue with our narrative.
Bob liked the research idea and the content of the paper. However, Bob, as usual, wrote an email to Alice with his suggestions about the research. This prompted Alice to add features for modifying the content served. Thus Alice built the POST method.
Now Bob could query and get the research paper with an HTTP GET request and upload his feedback with an HTTP POST request.
Following the back and forth between the two, the document grew in size, which introduced a delay in serving it. Sometimes Bob only wanted to check for updates, and loading a large document just to get the same old content wasn't encouraging. To overcome this, Alice introduced the HEAD method: Bob could send a request using this method, check whether the size of the document had changed since last time, and GET the document in a later request if he wanted to. Much the same is done today with caching (we will come to this later).
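Now that we have learned three methods, let's practice:
Open a terminal and pick a testing site (http://scanme.nmap.org/ can be used; back when Alice started, the server only had an IP address). Plain nc or telnet speaks unencrypted HTTP on port 80; port 443 expects TLS, which these tools do not provide on their own.
nc scanme.nmap.org 80 (or: telnet scanme.nmap.org 80)
#Making a HEAD request (HTTP/1.1 requires a Host header; press Enter twice to send the empty line that ends the request)
HEAD / HTTP/1.1
Host: scanme.nmap.org

#Making a GET request
GET / HTTP/1.1
Host: scanme.nmap.org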
Similarly, you could make a POST request to a server. The server should be expecting some form of data, which is sent in the body of the request (as this requires a little more knowledge about HTTP, we will come to this later).
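If you would rather script the exercise, the sketch below does roughly what typing into nc does, using Python's standard socket module. The helper names build_request and send_request are my own, and the Connection: close header is an addition (an assumption on my part) so that the server closes the connection and the read loop ends.

```python
import socket


def build_request(method: str, path: str, host: str) -> bytes:
    """Build a minimal HTTP/1.1 request without a body."""
    lines = [
        f"{method} {path} HTTP/1.1",
        f"Host: {host}",      # mandatory in HTTP/1.1
        "Connection: close",  # ask the server to close when it is done
        "",                   # the empty line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def send_request(host: str, request: bytes, port: int = 80, timeout: float = 10.0) -> bytes:
    """Send raw bytes over plain TCP and return everything the server sends back."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


# Example (needs network access, so it is left commented out):
# raw = send_request("scanme.nmap.org", build_request("HEAD", "/", "scanme.nmap.org"))
# print(raw.decode("iso-8859-1"))
```

Swapping "HEAD" for "GET" returns the same headers followed by the document body.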
Returning to our narrative: Alice didn't care much about the grammatical mistakes in the documents, but these mistakes irritated Bob, who asked for a feature to fix them. Alice introduced the PUT method, which would allow Bob to upload a document to a given URI (path). Knowing that Bob would upload the same documents multiple times, Alice made PUT idempotent (repeating the same request leaves the server in the same state as sending it once), instead of reusing POST, which might have seemed the obvious alternative.
Since the PUT method allowed both updating and creating files, Bob naturally created random files. Alice had to build another method, DELETE, to remove them from the server. With DELETE, Bob could delete a resource from the server by providing the path of the resource.
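The difference between PUT, POST, and DELETE is easier to see with a toy model. The sketch below is a hypothetical in-memory stand-in for Alice's server, not a real HTTP implementation; the class name DocumentStore and the status codes it returns (201, 200, 204, 404) are common conventions that I am assuming for illustration.

```python
class DocumentStore:
    """Hypothetical in-memory stand-in for the files on Alice's server."""

    def __init__(self):
        self.documents = {}
        self._next_id = 1

    def put(self, path, body):
        """PUT: create or replace the document at exactly this path."""
        created = path not in self.documents
        self.documents[path] = body
        return 201 if created else 200

    def post(self, collection, body):
        """POST: the server picks a new path, so every call adds a document."""
        path = f"{collection}/{self._next_id}"
        self._next_id += 1
        self.documents[path] = body
        return 201, path

    def delete(self, path):
        """DELETE: remove the document; repeating it leaves the same end state."""
        if path in self.documents:
            del self.documents[path]
            return 204
        return 404


store = DocumentStore()
store.put("/paper.html", "fixed typos")
store.put("/paper.html", "fixed typos")   # same request again: still one document
store.post("/feedback", "nice paper")
store.post("/feedback", "nice paper")     # same request again: a second document
print(sorted(store.documents))
```

Repeating the PUT changes nothing, while repeating the POST keeps adding entries, which is the behaviour Alice wanted to avoid for Bob's repeated uploads.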
Having so many methods (after starting with just one) made it difficult to track the functionality available on a server, so Alice built the OPTIONS method, which indicates the methods (GET, PUT, POST, etc.) available on the server.
Several other methods were subsequently introduced by Alice and the team, namely TRACE, PATCH, CONNECT, etc.
With so many functionalities at hand, many of which did not produce an immediate visible change in the current webpage, it was difficult for Bob and Alice to say whether an action had completed successfully. To indicate the success or failure "status" of an action, Alice implemented HTTP status codes. Being a forward thinker, Alice implemented classes of status codes not just for success and failure, but for other outcomes as well.
Alice implemented the following classes of status codes:
- 1xx informational response – the request was received, continuing process
- 2xx successful – the request was successfully received, understood, and accepted
- 3xx redirection – further action needs to be taken in order to complete the request
- 4xx client error – the request contains bad syntax or cannot be fulfilled
- 5xx server error – the server failed to fulfill an apparently valid request
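Since the class is simply the first digit of the code, mapping a status code to its class takes only a few lines. This is a small sketch; the function name status_class is my own.

```python
STATUS_CLASSES = {
    1: "informational response",
    2: "successful",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code: int) -> str:
    """Return the class of an HTTP status code based on its first digit."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    return STATUS_CLASSES[code // 100]


for code in (200, 302, 404, 503):
    print(code, status_class(code))
```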
Let us go through the most common ones:
200 OK (Success): One of the most common HTTP status codes; it means the action for which the request was sent has been successful. What success means depends on the method, though: for a GET it means the resource was fetched, for a POST it means the data was successfully transferred to the server, and so on.
301 Moved Permanently (Redirection): This lets the client know that the requested resource has been moved permanently and asks it to redirect to the URL sent in the response.
302 Found (Redirection): Previously called "Moved Temporarily", it tells the client to make a request to the URL specified in the response (in the Location header).
Note: search engines do not update their links on a 302; they do so on a 301. In practice, clients may change the HTTP method (for example POST to GET) when following a 302, whereas a 307 does not allow this.
401 Unauthorized (Unauthenticated): Despite its name, this status code really means "unauthenticated" and tells the client to authenticate. The response should contain a WWW-Authenticate header, which informs the client about the authentication options.
403 Forbidden (Not allowed): Lets the client know that they are forbidden from accessing the resource. This may look like an authentication problem, but it is about authorization rather than authentication.
401 and 403 may seem a little confusing. As a rule of thumb, a 401 can be resolved by authenticating; a 403 usually cannot.
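The rule of thumb can be written down as a tiny decision function. This is only a sketch of the idea: access_status and its arguments are hypothetical, and real servers decide this in far more elaborate ways.

```python
from http import HTTPStatus


def access_status(user, allowed_users):
    """Pick a status code for a request to a protected resource.

    `user` is None when the request carried no valid credentials
    (an assumption made for this sketch).
    """
    if user is None:
        return HTTPStatus.UNAUTHORIZED  # 401: authenticate and try again
    if user not in allowed_users:
        return HTTPStatus.FORBIDDEN     # 403: we know who you are, still no
    return HTTPStatus.OK                # 200


print(int(access_status(None, {"alice"})))
print(int(access_status("bob", {"alice"})))
print(int(access_status("alice", {"alice"})))
```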
404 Not Found: The most well-known status code, which informs the client that the requested resource doesn't exist; it may have been removed from the server.
500 Internal Server Error: Indicates that the server has encountered an unexpected condition and can't respond to the request.
503 Service Unavailable: Indicates that the server can't handle the request, this could be due to the workload or unavailability of the server. This is sometimes associated with DDoS attacks.
A quick reference to several other Status Codes
Now that the number of methods and functionalities had increased, a lot more decisions (what kind of data to send, which encodings the client supported, etc.) had to be made for each request. To streamline the decision making, Alice standardized HTTP headers: a set of name and value pairs on which decisions about requests and responses could be based.
As the size of documents grew, the need to send data in chunks arose. Alice created the HTTP header "Transfer-Encoding" and used it to tell the recipient that the body is sent in chunks. The recipient accepts the chunks one by one and reassembles them until the last (zero-length) chunk is received.
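On the wire, a body sent with "Transfer-Encoding: chunked" is a series of chunks, each prefixed by its size in hexadecimal and terminated by CRLF, ending with a chunk of size zero. The sketch below encodes and decodes such a body; the function names are my own, and it deliberately ignores optional trailer headers after the last chunk.

```python
def encode_chunked(chunks):
    """Encode an iterable of byte strings as a chunked message body."""
    out = bytearray()
    for chunk in chunks:
        if not chunk:
            continue  # a zero-length chunk would end the body too early
        out += f"{len(chunk):X}\r\n".encode("ascii") + chunk + b"\r\n"
    out += b"0\r\n\r\n"  # last chunk (size 0) followed by an empty line
    return bytes(out)


def decode_chunked(data: bytes) -> bytes:
    """Reassemble a chunked body (trailers after the last chunk are ignored)."""
    body = bytearray()
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        # The size line may carry extensions after ';', which are dropped here.
        size = int(data[pos:line_end].split(b";")[0], 16)
        pos = line_end + 2
        if size == 0:
            break
        body += data[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(body)


wire = encode_chunked([b"Hello, ", b"world!"])
print(wire)
print(decode_chunked(wire))
```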
Multiple other headers were developed by Alice and the team; we will go through some of them here.
- Host
- Cookie (used for session tracking, among other things)
- Connection
- Location
- Server
- Set-Cookie
- User-Agent
- X-Forwarded-For
- X-Forwarded-Host
- Referer
- WWW-Authenticate
- CORS headers (Access-Control-Allow-Origin, etc.)
- Content-Security-Policy (CSP)
The system grew and became much more complicated, and thus needed some standardization. Alice named the fetching of data by the client an "HTTP request" and the sending of data by the server an "HTTP response". Alice also standardized the syntax of requests and responses.
An HTTP request consists of three major sections, namely the request line, the headers, and the body. Refer to the sample HTTP request below, in which each section is labeled in uppercase letters.
POST /savedata.html HTTP/1.1 #REQUEST LINE
Host: hakunamatata.com #HEADER1
Accept-Language: fr #HEADER2
#CRLF
datapoint1=data1&datapoint2=data2 #BODY
The request line can be further broken into three parts: the method, the path/URI, and the version of the protocol. Here the method is POST, the path/URI is /savedata.html, and the version is HTTP/1.1.
HTTP/1.1 200 OK #STATUS LINE
Server: apache2 #HEADER1
Content-length: 80 #HEADER2
#CRLF
Data returned from the server #BODY
Similar to the HTTP request, the HTTP response too has three major components: the status line, the headers, and the body.
Similar to the request line, the status line can be broken into three parts: the version of the protocol, the status code, and the status message. The body in both request and response is optional and depends on the HTTP method used and the server's action for the corresponding request, respectively.
Every header line ends with a CRLF line break, and an empty line (a CRLF on its own) marks the end of the headers and the start of the body. CRLF is a combination of two control characters, CR (\r, carriage return: back to the beginning of the line) and LF (\n, line feed: next line), which together move the cursor to the beginning of the next line.
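To see how these pieces fit together, here is a minimal sketch that splits a raw request into its request line, headers, and body by looking for the empty line. The function name parse_request is my own, and the parser is intentionally naive (no validation, no repeated headers); the sample bytes mirror the request shown earlier.

```python
def parse_request(raw: bytes) -> dict:
    """Split a raw HTTP request into request line parts, headers, and body."""
    # The first empty line (CRLF CRLF) separates the headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Request line: method, path/URI, protocol version.
    method, target, version = lines[0].split(" ", 2)

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # names are case-insensitive

    return {
        "method": method,
        "target": target,
        "version": version,
        "headers": headers,
        "body": body,
    }


raw = (
    b"POST /savedata.html HTTP/1.1\r\n"
    b"Host: hakunamatata.com\r\n"
    b"Accept-Language: fr\r\n"
    b"\r\n"
    b"datapoint1=data1&datapoint2=data2"
)
print(parse_request(raw))
```

A response can be taken apart the same way, with the status line (version, status code, status message) in place of the request line.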
Alice's system gained popularity and many more people wanted to access the papers on the net. However, it was not practical for everyone to remember Alice's IP address every time they wanted to access the site's papers. Users would save the IP address in a text file with Alice's name against it for reference. To overcome this, Alice and the team thought of naming the site. Alice named the website "http://info.cern.ch", using which anyone could access it over the internet.
Users would now save the IP address next to the site's name, i.e. "info.cern.ch", and use the name for accessing the site (this is what OS hosts files are, and it later developed into the DNS system; we will come to this later).
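A hosts file is little more than a lookup table of addresses and names. The sketch below parses text in the usual hosts-file layout (address first, then one or more names, with # starting a comment). The address 192.0.2.10 is a placeholder from the documentation range, not the site's real address, and parse_hosts is a name I made up.

```python
def parse_hosts(text: str) -> dict:
    """Map each host name to the address listed on its line."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table.setdefault(name.lower(), address)  # first entry wins
    return table


SAMPLE = """
# address      names
192.0.2.10     info.cern.ch alice   # placeholder address (assumption)
127.0.0.1      localhost
"""

hosts = parse_hosts(SAMPLE)
print(hosts["info.cern.ch"])
```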
URI: Uniform Resource Identifier, a means to identify a logical or physical resource. It can be used for identifying a person, a place, a concept, a file, a phone number, a book, and a lot more. The syntax for a URI, with some reserved characters (:, /, ?, =):
<scheme>://<authority><path>?<query>#<fragment>
It consists of some optional components (authority, query, fragment) and some mandatory ones (scheme, path). The fragment is handled by the client and is never sent to the server; older specifications did not consider it part of the URI, although it is commonly used with URIs.
Scheme: It defines the semantics of the rest of the URI and often names the protocol to be used to retrieve the resource, for example: http, smb, telnet, tel, urn, etc.
Authority: It defines a top hierarchical authority, which governs the namespace represented by the rest of the URI. For schemes such as http & smb, the authority can be broken into user info, host, and port.
authority = <userinfo>@<host>:<port>
Where the user info can be username:password, the host can be a hostname or an IPv4 address (in dotted notation), and the port can be a TCP/UDP port (*this is an oversimplification for the purpose of this post), example: admin:[email protected]:9090
Path: It is an authority- or scheme-dependent string which contains data identifying the resource, example: /admin/contact_info (reserved chars: ?, /, ;, =)
Query: It is a string containing information that would be interpreted by the resource, example: id=admin
Fragment: It contains reference information that the user agent interprets once the resource has been retrieved, for example to focus on a certain section of the page
URN: Uniform Resource Name, a URI that defines the name of a resource, for example urn:isbn:145890123, the ISBN of a book.
URL: Uniform Resource Locator, a URI that defines the way to retrieve a resource, for example tel:0141266026, the phone number of an individual.
Use the examples below for a better understanding of the components of a URI/URL/URN
#URI as well as a URL
          userinfo@host   port
        ┌───────┴───────┐ ┌┴┐
https://[email protected]:123/forum/questions/?tag=networking&order=newest#top
└─┬─┘   └─────────┬─────────┘└───────┬───────┘ └────────────┬────────────┘ └┬┘
scheme        authority             path                  query         fragment
#URI as well as a URN
urn:isbn:145890123
└┬┘ └──────┬─────┘
scheme    path
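You can check this breakdown yourself with Python's standard urllib.parse module. The URL below is a stand-in that uses the reserved example.com domain and a made-up user name (both are assumptions for illustration); the URN is the one from the text.

```python
from urllib.parse import urlsplit, parse_qs

# Stand-in URL (assumption): example.com host and "user" as user info.
url = "https://user@www.example.com:123/forum/questions/?tag=networking&order=newest#top"
parts = urlsplit(url)

print("scheme   :", parts.scheme)
print("authority:", parts.netloc)    # userinfo@host:port
print("userinfo :", parts.username)
print("host     :", parts.hostname)
print("port     :", parts.port)
print("path     :", parts.path)
print("query    :", parse_qs(parts.query))
print("fragment :", parts.fragment)

# A URN has no authority: everything after the scheme is the path.
urn = urlsplit("urn:isbn:145890123")
print(urn.scheme, urn.path)
```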
HTTP is an application layer protocol, and the connection between the client and the server is maintained at the transport layer via TCP. The connection is thus out of the scope of HTTP to control. By default, HTTP/1.0 requires a new TCP connection for each request; this changed in HTTP/1.1, which made persistent connections the default.
HTTP/1.0 connections are short-lived by default and require a new TCP connection for each request, which consumes more resources and makes them slow.
Alternatively, a persistent HTTP connection can be used to send multiple requests within a single connection. An HTTP/1.0 connection can be made persistent by setting the Connection header in the request to keep-alive, in response to which the server sends a Keep-Alive header specifying the minimum time for which the connection will be kept open. HTTP/1.1 connections, on the other hand, are persistent by default.
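The defaults can be summarized in a few lines of code. This sketch assumes the commonly implemented behaviour: HTTP/1.1 stays open unless one side sends "Connection: close", while HTTP/1.0 closes unless the client explicitly asks for keep-alive. The function name is_persistent is my own.

```python
def is_persistent(version: str, headers: dict) -> bool:
    """Decide whether the connection stays open after this request."""
    # Header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    tokens = {
        token.strip().lower()
        for token in lowered.get("connection", "").split(",")
        if token.strip()
    }
    if "close" in tokens:
        return False
    if version == "HTTP/1.1":
        return True                # persistent by default
    return "keep-alive" in tokens  # HTTP/1.0 must opt in (assumed behaviour)


print(is_persistent("HTTP/1.0", {}))
print(is_persistent("HTTP/1.0", {"Connection": "keep-alive"}))
print(is_persistent("HTTP/1.1", {}))
print(is_persistent("HTTP/1.1", {"Connection": "close"}))
```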
An HTTP/1.1 connection can be made even faster with HTTP pipelining, which enables the client to send multiple requests in succession without waiting for the responses from the server. HTTP pipelining, though faster, comes with limitations of its own.
HTTP is a protocol to fetch resources (hypertext) from the internet, involving a server and a client. The client locates the server (via a URL) and initiates a request (including methods, headers, and body) for resources. The server on its end, processes the request (on the basis of the method, headers and the URL), puts together different components of the resource, and responds (with status code, status message, headers, and body) to the client's request.
The web has evolved a ton from what Alice and Bob started with; it now includes many more components and functionalities, such as asynchronous calls, different formats of data (audio, video, images, etc.), proxies, CDNs, and a lot more. The evolved web is thus called "The Modern Web".