{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# HttpParser\n", "\n", "`HttpParser` class is at the heart of everything related to HTTP. It is used by `Web server` and `Proxy server` core and their plugin eco-system. As the name suggests, it is capable of parsing both HTTP request and response packets. It can also parse HTTP look-a-like protocols like ICAP, SIP etc. Most importantly, remember that `HttpParser` was originally written to handle HTTP packets arriving in the context of a proxy server and till date its default behavior favors the same flavor." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## HTTP Web Requests\n", "\n", "Let's parse a typical HTTP web request using `HttpParser`" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "b'GET / HTTP/1.1\\r\\nHost: jaxl.com\\r\\n\\r\\n'\n" ] } ], "source": [ "from proxy.http.methods import httpMethods\n", "from proxy.http.parser import HttpParser, httpParserTypes, httpParserStates\n", "from proxy.common.constants import HTTP_1_1\n", "\n", "get_request = HttpParser(httpParserTypes.REQUEST_PARSER)\n", "get_request.parse(memoryview(b'GET / HTTP/1.1\\r\\nHost: jaxl.com\\r\\n\\r\\n'))\n", "\n", "# Rebuild the raw request\n", "print(get_request.build())\n", "\n", "assert get_request.is_complete\n", "assert get_request.method == httpMethods.GET\n", "assert get_request.version == HTTP_1_1\n", "assert get_request.host == None\n", "assert get_request.port == 80\n", "assert get_request._url != None\n", "assert get_request._url.remainder == b'/'\n", "assert get_request.has_header(b'host')\n", "assert get_request.header(b'host') == b'jaxl.com'\n", "assert len(get_request.headers) == 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> NOTE:\n", ">\n", "> `get_request.host` is `None`\n", "> \n", "> We use `get_request.header(b'host') == b'jaxl.com'` to get the expected host value" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## HTTP Proxy Requests\n", "\n", "Next, let's parse a HTTP proxy request using `HttpParser`" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "b'GET / HTTP/1.1\\r\\nHost: jaxl.com\\r\\n\\r\\n'\n", "b'GET http://jaxl.com:80/ HTTP/1.1\\r\\nHost: jaxl.com\\r\\n\\r\\n'\n" ] } ], "source": [ "proxy_request = HttpParser(httpParserTypes.REQUEST_PARSER)\n", "proxy_request.parse(memoryview(b'GET http://jaxl.com/ HTTP/1.1\\r\\nHost: jaxl.com\\r\\n\\r\\n'))\n", "\n", "print(proxy_request.build())\n", "print(proxy_request.build(for_proxy=True))\n", "\n", "assert proxy_request.is_complete\n", "assert proxy_request.method == httpMethods.GET\n", "assert proxy_request.version == HTTP_1_1\n", "assert proxy_request.host == b'jaxl.com'\n", "assert proxy_request.port == 80\n", "assert proxy_request._url != None\n", "assert proxy_request._url.remainder == b'/'\n", "assert proxy_request.has_header(b'host')\n", "assert proxy_request.header(b'host') == b'jaxl.com'\n", "assert len(proxy_request.headers) == 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> NOTE:\n", ">\n", "> Notice how `proxy_request.build()` and `proxy_request.build(for_proxy=True)` behave\n", ">\n", "> Also, here `proxy_request.host` field is populated\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## HTTPS Proxy Requests\n", "\n", "To conclude, let's parse a HTTPS proxy request" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "b'CONNECT / HTTP/1.1\\r\\nHost: jaxl.com:443\\r\\n\\r\\n'\n", "b'CONNECT jaxl.com:443 HTTP/1.1\\r\\nHost: jaxl.com:443\\r\\n\\r\\n'\n" ] } ], "source": [ "connect_request = HttpParser(httpParserTypes.REQUEST_PARSER)\n", "connect_request.parse(memoryview(b'CONNECT jaxl.com:443 HTTP/1.1\\r\\nHost: jaxl.com:443\\r\\n\\r\\n'))\n", "\n", "print(connect_request.build())\n", "print(connect_request.build(for_proxy=True))\n", "\n", "assert connect_request.is_complete\n", "assert connect_request.is_https_tunnel\n", "assert connect_request.version == HTTP_1_1\n", "assert connect_request.host == b'jaxl.com'\n", "assert connect_request.port == 443\n", "assert connect_request._url != None\n", "assert connect_request._url.remainder == None\n", "assert connect_request.has_header(b'host')\n", "assert connect_request.header(b'host') == b'jaxl.com:443'\n", "assert len(connect_request.headers) == 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Take Away\n", "\n", "- `host` and `port` attributes represent the `host:port` present in the HTTP packet request line. For comparison purposes, below are all the three request lines again. Notice how there is no `host:port` available only for the web server get request\n", " | Request Type | Request Line |\n", " | ------------------| ---------------- |\n", " | `get_request` | `GET / HTTP/1.1` |\n", " | `proxy_request` | `GET http://jaxl.com HTTP/1.1` |\n", " | `connect_request` | `CONNECT jaxl.com:443 HTTP/1.1` |\n", "- `_url` attribute is an instance of `Url` parser and contains parsed information about the URL found in the request line\n", "\n", "Few of the other handy properties within `HttpParser` are:\n", "\n", "- `is_complete`\n", "- `is_http_1_1_keep_alive`\n", "- `is_connection_upgrade`\n", "- `is_https_tunnel`\n", "- `is_chunked_encoded`\n", "- `content_expected`\n", "- `body_expected`" ] } ], "metadata": { "interpreter": { "hash": "da9d6927d62b2b95bde149eedfbd5367cb7f465aad65a736f49c99ee3db39df7" }, "kernelspec": { "display_name": "Python 3.10.0 64-bit ('venv310': venv)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.0" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }