Network Working Group                                         G. Sneddon
Internet-Draft                                          October 20, 2008
Updates: 2109, 2616, 2965 (if approved)
Intended status: Informational
Expires: April 23, 2009


                         Tolerant HTTP Parsing
                              http-parsing

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.
   This document may not be modified, and derivative works of it may not
   be created.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 23, 2009.

Abstract

   The HyperText Transfer Protocol (HTTP) has been widely used by the
   World Wide Web (WWW) since 1990.  This specification updates RFC
   2616, defining how to parse HTTP requests and responses in a way that
   is compatible with user-agents (UAs) and servers at the time of
   writing.

Sneddon                  Expires April 23, 2009                 [Page 1]

Internet-Draft            Tolerant HTTP Parsing             October 2008

Editorial Note

   [[anchor1: Remove this section upon publication.]]

   This is a work in progress, and may change in part, or in whole.  Do
   not take anything in any draft version to be final.

   Comments are very welcome, and should be sent to
   geoffers@gmail.com [1].

   Known issues as of writing:

   o  The majority of the parsing algorithm is yet to be written.

   o  [RFC2616] isn't properly referenced.

   o  Security Considerations needs:

      A.  "one thing for the security section of that draft is the need
          for implementations to follow the spec exactly lest they be
          vulnerable to content stuffing that abuses differences in
          parsing algorithms" - Hixie

      B.  Most are unchanged from [RFC2616].

   o  Define handling of various things, which should make Appendix B
      obsolete.  This means moving Content-Type sniffing into this spec,
      as part of parsing the Content-Type header, as well as defining
      how to resolve the base IRI of an HTTP document.

   o  Add anchor attributes for each and every section.

   o  Look over and .

   o  Do we really need to expand HTTP anywhere?  It's listed as being
      so well known there is no need to expand it.  Either we expand it
      elsewhere apart from the abstract (i.e., in the title and its
      first occurrence in the body of the document), or nowhere at all.

   o  Fix all xml2rfc warnings (these are currently all related to
      artwork being outdented).

   o  CHAR is different between [RFC2616] and [RFC5234]; a reality check
      is needed.

   o  Look at behaviour for things apart from HTTP/1.

   o  Actually test responses.

   o  Aim: any response starting with http-version should have defined
      behaviour, and not fall back to HTTP/0.9.

Table of Contents

   1.  Introduction
     1.1.  Notational Conventions
       1.1.1.  Basic ABNF Rules
     1.2.  Terminology
     1.3.  Conformance Requirements
   2.  Errors
     2.1.  Fatal Error
   3.  Tokenization
     3.1.  Shared Rules
     3.2.  Requests
     3.3.  Responses
   4.  Parsing
     4.1.  Unescaping Quoted Strings
   5.  Security Considerations
   6.  IANA Considerations
   7.  References
     7.1.  Normative References
     7.2.  Informative References
   Appendix A.  Acknowledgments
   Appendix B.  Further Suggestions
   Author's Address
   Intellectual Property and Copyright Statements

1.  Introduction

   Ever since HTTP's conception, there has never been a standard
   describing how it is parsed in the real world.  [RFC2616] tried to
   improve this situation with a section (19.3) entitled "Tolerant
   Applications", providing advice about parsing requests and responses.
   However, it did not go into the specific details that are needed for
   interoperability with current (non-conformant) user-agents (UAs) and
   servers.

   The lack of any current specification defining such specifics makes
   it hard to create a new UA without first spending large amounts of
   time reverse engineering behaviour that is, in some cases, purely
   bizarre.  An implementer who does not know about such behaviour
   beforehand may not write enough test cases to find the oddest of it.

   This specification aims to address the above-mentioned problem by
   documenting the behaviour of UAs at the time of writing.  Hopefully,
   over time, the real world will align itself with this specification.

1.1.  Notational Conventions

   This specification is defined in terms of the US-ASCII character set,
   as defined in [ANSI.X3-4.1986].

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   This specification is defined in terms of ABNF, as described in
   [RFC5234].

1.1.1.  Basic ABNF Rules

   Rules inherited from [RFC2616] converted to [RFC5234] ABNF:

   LWS            = [ [ CR ] LF ] 1*( SP / HTAB )
                  ; This is changed from RFC2616, as CR is now
                  ; optional within the already optional line
                  ; break sequence (this is suggested in RFC2616's
                  ; section 19.3, "Tolerant Applications").
   separators     = "(" / ")" / "<" / ">" / "@"
                  / "," / ";" / ":" / "\" / DQUOTE
                  / "/" / "[" / "]" / "?" / "="
                  / "{" / "}" / SP / HTAB
   token          = 1*( "!" / "#" / "$" / "%" / "&" / "'" / "*"
                  / "+" / "-" / "." / "^" / "_" / "`" / "|"
                  / "~" / DIGIT / ALPHA )
   comment        = "(" *( ctext / quoted-pair / comment ) ")"
   ctext          = %x21-27 / %x2A-7E / %x80-FF / LWS
   quoted-string  = ( DQUOTE *( qdtext / quoted-pair ) DQUOTE )
   qdtext         = %x21 / %x23-5B / %x5D-7E / %x80-FF / LWS
   quoted-pair    = "\" CHAR

   As well as the above, this specification also inherits all the rules
   from [RFC3986], which are not given here as they are already given in
   ABNF.

1.2.  Terminology

   Terminology is as in [RFC2616] Section 1.3, with the following
   additions:

   interactive user agent
      A type of user agent that directly returns the result to the same
      user that made the request (e.g., web browsers).

   non-interactive user agent
      A type of user agent that does not return the result of the
      request to the user that made the request (e.g., search engine
      spiders).

1.3.  Conformance Requirements

   The conformance requirements of this specification are phrased as
   algorithms and may be implemented in any manner, so long as the end
   result is equivalent (in particular, the algorithms defined in this
   specification are intended to be easy to follow, and not intended to
   be performant).

   Implementations may impose implementation-specific limits on
   otherwise unconstrained inputs, e.g., to prevent denial of service
   attacks, to guard against running out of memory, or to work around
   platform-specific limitations.

   This specification defines two different types of parsers: "strict"
   parsers, and "non-strict" parsers.  It is RECOMMENDED that request
   parsers are strict parsers, and that response parsers are non-strict
   parsers.

2.  Errors

   This section describes the action that MUST be taken on certain types
   of errors.

2.1.  Fatal Error

   The tokenizer/parser MUST stop processing immediately.  If a request
   is being parsed, the server MUST respond with 400 (Bad Request); if a
   response is being parsed, the client SHOULD report the error.

3.  Tokenization

   An HTTP request/response MUST be broken up into header-fields and
   message-body following the request rule for requests, and the
   response rule for responses.  If the appropriate rule fails to match,
   it is a fatal error (Section 2.1).

   Any matches of the LWS rule MUST be replaced by a single 0x20 byte
   (US-ASCII space), except where there are consecutive matches of the
   LWS rule, where they MUST be compressed to a single 0x20 byte.

   If the parser is a strict parser, a fatal error (Section 2.1) MUST be
   thrown in any of the following circumstances:

   o  There are any matches for the invalid-header rule.

   o  There are any matches for header-name that do not also match the
      token rule.

   o  There is a match for sp-garbage.

   o  There is a match for code-garbage.
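
   The following example is informative.  It is a minimal sketch, in
   Python, of the LWS replacement described earlier in this section,
   assuming it is applied to the bytes of a single header field rather
   than to a whole message.  The name collapse_lws is hypothetical and
   is not defined by this specification; the regular expression is one
   possible rendering of the LWS rule from Section 1.1.1.

```python
import re

# LWS = [ [ CR ] LF ] 1*( SP / HTAB ), as given in Section 1.1.1.
# The trailing "+" folds consecutive LWS matches into one match, so a
# whole run of LWS is replaced by exactly one space.
_LWS_RUN = re.compile(rb"(?:(?:\r?\n)?[ \t]+)+")


def collapse_lws(data: bytes) -> bytes:
    """Replace each run of LWS matches with a single 0x20 byte.

    Hypothetical helper for illustration only; "data" is assumed to be
    the bytes of one header field, not an entire message.
    """
    return _LWS_RUN.sub(b" ", data)


if __name__ == "__main__":
    folded = b"text/html;\r\n\t charset=utf-8"
    print(collapse_lws(folded))  # prints b'text/html; charset=utf-8'
```

   Note that a line break that is not followed by SP or HTAB does not
   match the LWS rule, so this sketch leaves it untouched.
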

   If the major-version is "0" or "1" (or has no match although the
   appropriate rule as a whole matches), then the recipient of the
   message MUST follow this specification; otherwise, following this
   specification is RECOMMENDED.

3.1.  Shared Rules

   http-version      = "HTTP/" *"0" major-version "." *"0" minor-version
                     ; Note that strings in ABNF are case-insensitive
   version-number    = %x31-39 *DIGIT
                     ; A version number cannot begin with a "0".
   major-version     = version-number
   minor-version     = version-number
   header            = header-name ":" *LWS header-value *LWS
   header-name       = 1*header-content-nc
   header-value      = header-content
                       [ *( header-content / LWS ) header-content ]
   header-content    = header-content-nc / ":"
   header-content-nc = ( %x00-08 / %x0B-0C / %x0E-1F / %x21-39
                     / %x3B-FF )
   invalid-header    = ( [ ":" *LWS ] 1*header-content-nc [ *LWS ":" ]
                     / 1*":"
                     / 1*header-content-nc 1*LWS ":" *LWS header-content
                       [ *( header-content / LWS ) header-content ] )
                       *LWS

3.2.  Requests

   request        = simple-request / full-request
   simple-request = get absolute-uri / path-absolute [ CR ] LF
   get            = %x47.45.54 ; "GET" case-sensitively
   full-request   = request-line
                    *( ( header / invalid-header ) [ CR ] LF )
                    [ CR ] LF
                    message-body
   request-line   = method SP request-uri SP http-version [ CR ] LF
   method         = token
   request-uri    = "*" / absolute-uri / path-absolute / authority

3.3.  Responses

   response       = status-line [ CR ] LF
                    *( ( header / invalid-header ) [ CR ] LF )
                    [ CR ] LF
                    message-body
   status-line    = http-version ( 1*SP ( status-code
                    ( 1*SP [ reason-phrase ] / sp-garbage )
                    / code-garbage ) / sp-garbage )
   status-code    = 1*DIGIT
   reason-phrase  = 1*( %x00-09 / %x0B-0C / %x0E-7F )
                  ; All US-ASCII except CR and LF
   sp-garbage     = [ ( %x00-09 / %x0B-0C / %x0E-19 / %x21-FF )
                    status-garbage ]
   code-garbage   = [ ( %x00-09 / %x0B-0C / %x0E-2F / %x3A-FF )
                    status-garbage ]
   status-garbage = *( %x00-09 / %x0B-0C / %x0E-FF )

   If there is no reason-phrase, let it be equal to "OK".  If there is
   no status-code, let it be equal to 200.

4.  Parsing

   This section details the processing that follows tokenization.

4.1.  Unescaping Quoted Strings

   To unescape a quoted string (i.e., a string that follows the
   quoted-string specification in [RFC2616]), the following algorithm
   MUST be run:

   1.  Let "input" be the string being parsed.

   2.  If "input" does not match the quoted-string rule, return "input";
       otherwise:

   3.  Let "string" be the unescaped output string, initially set to
       "input".

   4.  Remove the first and last bytes from "string" (these are the
       delimiting 0x22 (US-ASCII quotation mark) bytes).

   5.  Remove from "string" any 0x5C (US-ASCII backslash) byte that is
       not preceded by another 0x5C byte.  This test is made against the
       initial state of the string, so that if the preceding byte is
       itself stripped (which it will be), the byte is still not
       stripped.

   6.  Return "string".

5.  Security Considerations

   [[anchor14: This section is just a very rough draft.]]

   This specification is just a parsing algorithm, and therefore any
   risks (excluding implementation issues such as buffer overflows) are
   inherited from [RFC2616].

6.  IANA Considerations

   This document has no actions for IANA.

7.  References

7.1.  Normative References

   [ANSI.X3-4.1986]
              American National Standards Institute, "Coded Character
              Set - 7-bit American Standard Code for Information
              Interchange", ANSI X3.4, 1986.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, January 2005.

   [RFC5234]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", STD 68, RFC 5234, January 2008.

7.2.  Informative References

   [W3C.WD-html5-20080610]
              Hyatt, D. and I. Hickson, "HTML 5", World Wide Web
              Consortium WD WD-html5-20080610, June 2008.

URIs

   [1]

Appendix A.  Acknowledgments

   Thanks to: Ian Hickson, Philip Taylor.

Appendix B.  Further Suggestions

   This section is informative.

   While the scope of this specification is only parsing of HTTP
   requests and responses, there are several other things that I am
   aware of that should be pointed out to anyone implementing [RFC2616]:

   o  The Content-Location header SHOULD be ignored.  This is due to
      multiple versions of Microsoft Internet Information Services (IIS)
      sending incorrect Content-Location headers.  Implementing this as
      required by [RFC2616] will break a significant number of websites.

   o  The Content-Type SHOULD NOT on its own be trusted.  Content-Type
      sniffing as defined in [W3C.WD-html5-20080610] SHOULD be used to
      determine the true type of the resource.  This is due to a large
      number of websites sending incorrect Content-Type headers, often
      because the maintainer of the website cannot change the header, or
      because the file extension/MIME type database is outdated.

Author's Address

   Geoffrey Sneddon
   Toll Park
   20 Hepburn Gardens
   St Andrews, Fife  KY16 9DE
   GB

   Phone: +44 7807 360 291
   Email: geoffers@gmail.com
   URI:   http://gsnedders.com/

Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.