proxy.py/tutorial/http_parser.ipynb

215 lines
6.6 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# HttpParser\n",
"\n",
"`HttpParser` class is at the heart of everything related to HTTP. It is used by `Web server` and `Proxy server` core and their plugin eco-system. As the name suggests, it is capable of parsing both HTTP request and response packets. It can also parse HTTP look-a-like protocols like ICAP, SIP etc. Most importantly, remember that `HttpParser` was originally written to handle HTTP packets arriving in the context of a proxy server and till date its default behavior favors the same flavor."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## HTTP Web Requests\n",
"\n",
"Let's parse a typical HTTP web request using `HttpParser`"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"b'GET / HTTP/1.1\\r\\nHost: jaxl.com\\r\\n\\r\\n'\n"
]
}
],
"source": [
"from proxy.http.methods import httpMethods\n",
"from proxy.http.parser import HttpParser, httpParserTypes, httpParserStates\n",
"from proxy.common.constants import HTTP_1_1\n",
"\n",
"get_request = HttpParser(httpParserTypes.REQUEST_PARSER)\n",
"get_request.parse(memoryview(b'GET / HTTP/1.1\\r\\nHost: jaxl.com\\r\\n\\r\\n'))\n",
"\n",
"# Rebuild the raw request\n",
"print(get_request.build())\n",
"\n",
"assert get_request.is_complete\n",
"assert get_request.method == httpMethods.GET\n",
"assert get_request.version == HTTP_1_1\n",
"assert get_request.host == None\n",
"assert get_request.port == 80\n",
"assert get_request._url != None\n",
"assert get_request._url.remainder == b'/'\n",
"assert get_request.has_header(b'host')\n",
"assert get_request.header(b'host') == b'jaxl.com'\n",
"assert len(get_request.headers) == 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> NOTE:\n",
">\n",
"> `get_request.host` is `None`\n",
"> \n",
"> We use `get_request.header(b'host') == b'jaxl.com'` to get the expected host value"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## HTTP Proxy Requests\n",
"\n",
"Next, let's parse a HTTP proxy request using `HttpParser`"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"b'GET / HTTP/1.1\\r\\nHost: jaxl.com\\r\\n\\r\\n'\n",
"b'GET http://jaxl.com:80/ HTTP/1.1\\r\\nHost: jaxl.com\\r\\n\\r\\n'\n"
]
}
],
"source": [
"proxy_request = HttpParser(httpParserTypes.REQUEST_PARSER)\n",
"proxy_request.parse(memoryview(b'GET http://jaxl.com/ HTTP/1.1\\r\\nHost: jaxl.com\\r\\n\\r\\n'))\n",
"\n",
"print(proxy_request.build())\n",
"print(proxy_request.build(for_proxy=True))\n",
"\n",
"assert proxy_request.is_complete\n",
"assert proxy_request.method == httpMethods.GET\n",
"assert proxy_request.version == HTTP_1_1\n",
"assert proxy_request.host == b'jaxl.com'\n",
"assert proxy_request.port == 80\n",
"assert proxy_request._url != None\n",
"assert proxy_request._url.remainder == b'/'\n",
"assert proxy_request.has_header(b'host')\n",
"assert proxy_request.header(b'host') == b'jaxl.com'\n",
"assert len(proxy_request.headers) == 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> NOTE:\n",
">\n",
"> Notice how `proxy_request.build()` and `proxy_request.build(for_proxy=True)` behave\n",
">\n",
"> Also, here `proxy_request.host` field is populated\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## HTTPS Proxy Requests\n",
"\n",
"To conclude, let's parse a HTTPS proxy request"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"b'CONNECT / HTTP/1.1\\r\\nHost: jaxl.com:443\\r\\n\\r\\n'\n",
"b'CONNECT jaxl.com:443 HTTP/1.1\\r\\nHost: jaxl.com:443\\r\\n\\r\\n'\n"
]
}
],
"source": [
"connect_request = HttpParser(httpParserTypes.REQUEST_PARSER)\n",
"connect_request.parse(memoryview(b'CONNECT jaxl.com:443 HTTP/1.1\\r\\nHost: jaxl.com:443\\r\\n\\r\\n'))\n",
"\n",
"print(connect_request.build())\n",
"print(connect_request.build(for_proxy=True))\n",
"\n",
"assert connect_request.is_complete\n",
"assert connect_request.is_https_tunnel\n",
"assert connect_request.version == HTTP_1_1\n",
"assert connect_request.host == b'jaxl.com'\n",
"assert connect_request.port == 443\n",
"assert connect_request._url != None\n",
"assert connect_request._url.remainder == None\n",
"assert connect_request.has_header(b'host')\n",
"assert connect_request.header(b'host') == b'jaxl.com:443'\n",
"assert len(connect_request.headers) == 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Take Away\n",
"\n",
"- `host` and `port` attributes represent the `host:port` present in the HTTP packet request line. For comparison purposes, below are all the three request lines again. Notice how there is no `host:port` available only for the web server get request\n",
" | Request Type | Request Line |\n",
" | ------------------| ---------------- |\n",
" | `get_request` | `GET / HTTP/1.1` |\n",
" | `proxy_request` | `GET http://jaxl.com HTTP/1.1` |\n",
" | `connect_request` | `CONNECT jaxl.com:443 HTTP/1.1` |\n",
"- `_url` attribute is an instance of `Url` parser and contains parsed information about the URL found in the request line\n",
"\n",
"Few of the other handy properties within `HttpParser` are:\n",
"\n",
"- `is_complete`\n",
"- `is_http_1_1_keep_alive`\n",
"- `is_connection_upgrade`\n",
"- `is_https_tunnel`\n",
"- `is_chunked_encoded`\n",
"- `content_expected`\n",
"- `body_expected`"
]
}
],
"metadata": {
"interpreter": {
"hash": "da9d6927d62b2b95bde149eedfbd5367cb7f465aad65a736f49c99ee3db39df7"
},
"kernelspec": {
"display_name": "Python 3.10.0 64-bit ('venv310': venv)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}