Network Working Group                                         G. Sneddon
Internet-Draft                                          October 20, 2008
Updates: 2109, 2616, 2965 (if approved)
Intended status: Informational
Expires: April 23, 2009


Tolerant HTTP Parsing
http-parsing

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79. This document may not be modified, and derivative works of it may not be created.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as “work in progress.”

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on April 23, 2009.

Abstract

The HyperText Transfer Protocol (HTTP) has been widely used by the World Wide Web (WWW) since 1990. This specification updates RFC 2616, defining how to parse HTTP requests and responses in a way that is compatible with user-agents (UAs) and servers at the time of writing.

Editorial Note

(Remove this section upon publication.)

This is a work in progress, and may change in part or in whole. Do not take anything in any draft version to be final. Comments are very welcome, and should be sent to geoffers@gmail.com.

Known issues as of writing:



Table of Contents

1.  Introduction
    1.1.  Notational Conventions
        1.1.1.  Basic ABNF Rules
    1.2.  Terminology
    1.3.  Conformance Requirements
2.  Errors
    2.1.  Fatal Error
3.  Tokenization
    3.1.  Shared Rules
    3.2.  Requests
    3.3.  Responses
4.  Parsing
    4.1.  Unescaping Quoted Strings
5.  Security Considerations
6.  IANA Considerations
7.  References
    7.1.  Normative References
    7.2.  Informative References
Appendix A.  Acknowledgments
Appendix B.  Further Suggestions
§  Author's Address
§  Intellectual Property and Copyright Statements





1.  Introduction

Since HTTP's conception, there has been no standard describing how it is parsed in the real world. [RFC2616] (Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, “Hypertext Transfer Protocol -- HTTP/1.1,” June 1999.) tried to improve this situation with a section (19.3) entitled "Tolerant Applications", providing advice about parsing requests and responses. However, it did not go into the specific details that are needed for interoperability with current (non-conformant) user-agents (UAs) and servers. The lack of any current specification defining such specifics makes it hard to create a new UA without first spending large amounts of time reverse engineering behaviour that is, in some cases, purely bizarre. Without knowing about such behaviour beforehand, an implementer is unlikely to write enough test cases to find the oddest of it.

This specification aims to address the above-mentioned problem by documenting the behaviour of UAs at the time of writing. Hopefully, over time, the real world will align itself with this specification.




1.1.  Notational Conventions

This specification is defined in terms of the US-ASCII character set, as defined in [ANSI.X3‑4.1986] (American National Standards Institute, “Coded Character Set - 7-bit American Standard Code for Information Interchange,” 1986.).

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119] (Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” March 1997.).

This specification is defined in terms of ABNF, as described in [RFC5234] (Crocker, D. and P. Overell, “Augmented BNF for Syntax Specifications: ABNF,” January 2008.).




1.1.1.  Basic ABNF Rules

Rules inherited from [RFC2616] (Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, “Hypertext Transfer Protocol -- HTTP/1.1,” June 1999.) converted to [RFC5234] (Crocker, D. and P. Overell, “Augmented BNF for Syntax Specifications: ABNF,” January 2008.) ABNF:

LWS               = [ [ CR ] LF ] 1*( SP / HTAB )
                        ; This is changed from RFC2616, as CR is now
                        ; optional within the already optional line
                        ; break sequence (this is suggested in RFC2616's
                        ; section 19.3, "Tolerant Applications").

separators        = "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" / "\"
                    / DQUOTE / "/" / "[" / "]" / "?" / "=" / "{" / "}"
                    / SP / HTAB

token             = 1*( "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+"
                    / "-" / "." / "^" / "_" / "`" / "|" / "~" / DIGIT /
                    ALPHA )

comment           = "(" *( ctext / quoted-pair / comment ) ")"
ctext             = %x21-27 / %x2A-7E / %x80-FF / LWS

quoted-string     = ( DQUOTE *( qdtext / quoted-pair ) DQUOTE )
qdtext            = %x21 / %x23-5B / %x5D-7E / %x80-FF / LWS

quoted-pair       = "\" CHAR
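
As an illustration only, the token and separators rules above might be transcribed into Python as follows. The names TOKEN, SEPARATORS, is_token and first_separator are hypothetical and are not defined by this specification; this is a sketch of one possible transcription, not a reference implementation.

```python
import re

# token = 1*( "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "."
#             / "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA )
TOKEN = re.compile(rb"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")

# separators, held as a set of byte values (includes DQUOTE, SP and HTAB).
SEPARATORS = frozenset(b'()<>@,;:\\"/[]?={} \t')


def is_token(data: bytes) -> bool:
    """Return True if the whole of data matches the token rule."""
    return TOKEN.fullmatch(data) is not None


def first_separator(data: bytes) -> int:
    """Return the index of the first separator byte in data, or -1 if none."""
    for index, byte in enumerate(data):
        if byte in SEPARATORS:
            return index
    return -1
```

For example, is_token(b"Content-Type") is true, while is_token(b"a b") is false because SP is a separator.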

In addition to the above, this specification inherits all the rules from [RFC3986] (Berners-Lee, T., Fielding, R., and L. Masinter, “Uniform Resource Identifier (URI): Generic Syntax,” January 2005.), which are not repeated here as they are already expressed in ABNF.




1.2.  Terminology

Terminology is as in [RFC2616] (Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, “Hypertext Transfer Protocol -- HTTP/1.1,” June 1999.) Section 1.3, with the following additions:

interactive user agent

This is a type of user agent, which directly returns the result to the same user that made the request (e.g., web browsers).

non-interactive user agent

This is a type of user agent which does not return the result of the request to the user that made the request (e.g., search engine spiders).




1.3.  Conformance Requirements

The conformance requirements of this specification are phrased as algorithms and may be implemented in any manner, so long as the end result is equivalent (in particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant).

Implementations may impose implementation-specific limits on otherwise unconstrained inputs, e.g., to prevent denial of service attacks, to guard against running out of memory, or to work around platform-specific limitations.

This specification defines two different types of parsers: "strict" parsers, and "non-strict" parsers. It is RECOMMENDED that request parsers be strict parsers, and that response parsers be non-strict parsers.




2.  Errors

This section describes the action that MUST be taken when certain types of errors occur.




2.1.  Fatal Error

The tokenizer/parser MUST stop processing immediately. If a request is being parsed, the server MUST respond with 400 (Bad Request); if a response is being parsed, the client SHOULD report the error.
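
One way to model this, sketched here under stated assumptions, is an exception that unwinds the tokenizer/parser as soon as it is raised. FatalError, handle_request, handle_response, and the parse_request, parse_response and report callables are all hypothetical names, and the literal response bytes are illustrative only.

```python
class FatalError(Exception):
    """Raised by a tokenizer/parser to stop processing immediately."""


def handle_request(raw: bytes, parse_request) -> bytes:
    """Run a (hypothetical) request parser; map a fatal error to a 400."""
    try:
        parse_request(raw)
    except FatalError:
        # A server answers a fatal error with 400 (Bad Request).
        return b"HTTP/1.1 400 Bad Request\r\nContent-Length: 0\r\n\r\n"
    return b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n"


def handle_response(raw: bytes, parse_response, report) -> bool:
    """Run a (hypothetical) response parser; report a fatal error."""
    try:
        parse_response(raw)
    except FatalError as error:
        # A client reports the error, here through a caller-supplied callback.
        report(error)
        return False
    return True
```

Raising an exception is only one design choice; an implementation that returns an error value from every parsing step would satisfy the same requirement.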




3.  Tokenization

An HTTP request/response MUST be broken up into header-fields and message-body following the request rule for requests, and the response rule for responses. If the appropriate rule fails to match, it is a fatal error (Section 2.1).

Any matches of the LWS rule MUST be replaced by a single 0x20 byte (US-ASCII space), except where there are consecutive matches of the LWS rule, where they MUST be compressed to a single 0x20 byte.
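
The replacement described above can be sketched with a single regular expression that matches a run of one or more consecutive LWS matches. The function name collapse_lws is hypothetical; the pattern is a direct transcription of the LWS rule from Section 1.1.1.

```python
import re

# One or more consecutive matches of:
#   LWS = [ [ CR ] LF ] 1*( SP / HTAB )
_LWS_RUN = re.compile(rb"(?:(?:\r?\n)?[ \t]+)+")


def collapse_lws(data: bytes) -> bytes:
    """Replace every run of consecutive LWS matches with one 0x20 byte."""
    return _LWS_RUN.sub(b" ", data)
```

For example, collapse_lws(b"a \r\n\t b") yields b"a b", whereas b"a\r\nb" is returned unchanged because a line break that is not followed by SP or HTAB is not LWS.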

If the parser is a strict parser, a fatal error (Section 2.1) MUST be raised in any of the following circumstances:

If the major-version is "0" or "1" (or has no match although the appropriate rule as a whole matches), then the recipient of the message MUST follow this specification; if it is not, it is RECOMMENDED to follow this specification.




3.1.  Shared Rules

http-version      = "HTTP/" *"0" major-version "." *"0" minor-version
                        ; Note that strings in ABNF are case-insensitive

version-number    = %x31-39 *DIGIT
                        ; A version number cannot begin with a "0".
major-version     = version-number
minor-version     = version-number

header            = header-name ":" *LWS header-value *LWS
header-name       = 1*header-content-nc
header-value      = header-content
                    [ *( header-content / LWS ) header-content ]
header-content    = header-content-nc / ":"
header-content-nc = ( %x00-08 / %x0B-0C / %x0E-1F / %x21-39 / %x3B-FF )

invalid-header    = ( [ ":" *LWS ] 1*header-content-nc [ *LWS ":" ] /
                    1*":" / 1*header-content-nc 1*LWS ":" *LWS
                    header-content [ *( header-content / LWS )
                    header-content ] ) *LWS
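
The header rule can be illustrated with a small sketch. It assumes that the line terminator has already been removed and that LWS has already been collapsed as described in Section 3. The name split_header is hypothetical, and lines that fail the header rule are simply reported as None; the sketch does not attempt to classify them against invalid-header.

```python
from typing import Optional, Tuple

# Bytes excluded from header-content-nc, other than ":" itself:
# HTAB (0x09), LF (0x0A), CR (0x0D) and SP (0x20).
_NOT_IN_NAME = b"\t\n\r "


def split_header(line: bytes) -> Optional[Tuple[bytes, bytes]]:
    """Split a line matching the header rule into (name, value), else None."""
    # header-name cannot contain ":", so it ends at the first colon.
    name, colon, rest = line.partition(b":")
    if not colon or not name:
        return None
    if any(byte in _NOT_IN_NAME for byte in name):
        return None
    # *LWS on either side of header-value is not part of the value.
    value = rest.strip(b" \t")
    # header-value needs at least one byte and no bare CR or LF.
    if not value or b"\r" in value or b"\n" in value:
        return None
    return name, value
```

For example, split_header(b"Date: 12:00") returns (b"Date", b"12:00"), since ":" is permitted in header-value, while split_header(b"Bad Name: x") returns None.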



3.2.  Requests

request           = simple-request / full-request

simple-request    = get SP ( absolute-uri / path-absolute ) [ CR ] LF
get               = %x47.45.54
                        ; "GET" case-sensitively

full-request      = request-line *( ( header / invalid-header )
                    [ CR ] LF ) [ CR ] LF message-body

request-line      = method SP request-uri SP http-version [ CR ] LF
method            = token
request-uri       = "*" / absolute-uri / path-absolute / authority
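
As a rough sketch, the request-line of a full-request might be split as shown below. It assumes the line terminator has already been removed, and it checks only the method against the token rule; the request-uri and http-version are returned as raw bytes without being validated against [RFC3986] or the http-version rule. FatalError and split_request_line are hypothetical names.

```python
import re
from typing import Tuple

# method = token (see Section 1.1.1)
_TOKEN = re.compile(rb"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")


class FatalError(Exception):
    """Hypothetical stand-in for the fatal error of Section 2.1."""


def split_request_line(line: bytes) -> Tuple[bytes, bytes, bytes]:
    """Split 'method SP request-uri SP http-version' into its three parts."""
    # Exactly two SP bytes separate the three parts.
    parts = line.split(b" ")
    if len(parts) != 3:
        raise FatalError("expected method SP request-uri SP http-version")
    method, uri, version = parts
    if not _TOKEN.fullmatch(method):
        raise FatalError("method is not a token")
    if not uri:
        raise FatalError("empty request-uri")
    # ABNF strings are case-insensitive, so "http/" is accepted too.
    if version[:5].upper() != b"HTTP/":
        raise FatalError("missing HTTP/ prefix")
    return method, uri, version
```

For example, split_request_line(b"GET /index.html HTTP/1.1") returns (b"GET", b"/index.html", b"HTTP/1.1"); a line with only two parts raises FatalError in this sketch, since simple-request is not handled here.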



3.3.  Responses

response          = status-line [ CR ] LF *( ( header / invalid-header )
                    [ CR ] LF ) [ CR ] LF message-body

status-line       = http-version ( 1*SP ( status-code ( 1*SP
                    [ reason-phrase ] / sp-garbage ) / code-garbage )
                    / sp-garbage )
status-code       = 1*DIGIT
reason-phrase     = 1*( %x00-09 / %x0B-0C / %x0E-7F )
                        ; All US-ASCII except CR and LF
sp-garbage        = [ ( %x00-09 / %x0B-0C / %x0E-19 / %x21-FF )
                    status-garbage ]
code-garbage      = [ ( %x00-09 / %x0B-0C / %x0E-2F / %x3A-FF )
                    status-garbage ]
status-garbage    = *( %x00-09 / %x0B-0C / %x0E-FF )

If there is no reason-phrase, let it be equal to "OK". If there is no status-code, let it be equal to 200.
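
The defaults above can be illustrated with a sketch that processes whatever follows the http-version on the status line. It assumes the line terminator has already been removed, matches status-code and the separating SP bytes greedily, and treats anything after the code that is not 1*SP followed by a well-formed reason-phrase as garbage; these are choices made for the example, as the grammar itself admits more than one reading. The name parse_status_tail is hypothetical.

```python
import re
from typing import Tuple

# 1*SP status-code, with status-code = 1*DIGIT
_CODE = re.compile(rb" +([0-9]+)")
# 1*SP reason-phrase; the phrase is required to start with a non-SP byte
# so that the leading run of SP is not counted as part of it.
_REASON = re.compile(
    rb" +([\x00-\x09\x0b\x0c\x0e-\x1f\x21-\x7f][\x00-\x09\x0b\x0c\x0e-\x7f]*)"
)


def parse_status_tail(rest: bytes) -> Tuple[int, bytes]:
    """Return (status-code, reason-phrase) for the bytes after http-version."""
    code, reason = 200, b"OK"  # defaults when either part is missing
    code_match = _CODE.match(rest)
    if code_match:
        code = int(code_match.group(1))
        reason_match = _REASON.fullmatch(rest[code_match.end():])
        if reason_match:
            reason = reason_match.group(1)
    return code, reason
```

For example, parse_status_tail(b" 404 Not Found") returns (404, b"Not Found"), parse_status_tail(b" 404") returns (404, b"OK"), and parse_status_tail(b" garbage") returns (200, b"OK").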




4.  Parsing

This section details the processing that follows tokenization.




4.1.  Unescaping Quoted Strings

To unescape a quoted string (i.e., a string that follows the quoted-string specification in [RFC2616] (Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, “Hypertext Transfer Protocol -- HTTP/1.1,” June 1999.)), the following algorithm MUST be run:

  1. Let "input" be the string being parsed.
  2. If "input" does not match the quoted-string rule, return "input"; otherwise:
  3. Let "string" be the unescaped output string, initially set to "input".
  4. Remove the first and last bytes from "string" (these are the delimiting 0x22 (US-ASCII quotation mark) bytes).
  5. Remove from "string" every 0x5C (US-ASCII backslash) byte that is not immediately preceded by another 0x5C byte. Whether a byte is preceded by a 0x5C byte is judged against the string as it was at the start of this step, so a backslash that follows a stripped backslash is itself kept.
  6. Return "string".
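
The steps above can be sketched as follows. This is a literal reading of the algorithm, not a definitive implementation: the name unescape_quoted_string is hypothetical, CHAR is taken to be %x01-7F as in [RFC5234], and LWS inside the quoted string is matched one SP or HTAB at a time, which accepts the same strings as repeated LWS.

```python
import re

# quoted-string = DQUOTE *( qdtext / quoted-pair ) DQUOTE
_QUOTED_STRING = re.compile(
    rb'"(?:[\x21\x23-\x5b\x5d-\x7e\x80-\xff]'  # qdtext, single byte
    rb'|(?:\r?\n)?[ \t]'                       # qdtext, LWS
    rb'|\\[\x01-\x7f])*"'                      # quoted-pair
)


def unescape_quoted_string(data: bytes) -> bytes:
    """Apply steps 1 to 6 of the unescaping algorithm to data."""
    # Step 2: anything that is not a quoted-string is returned untouched.
    if not _QUOTED_STRING.fullmatch(data):
        return data
    # Steps 3 and 4: drop the delimiting quotation marks.
    inner = data[1:-1]
    # Step 5: drop each backslash not preceded, in the original string,
    # by another backslash.
    out = bytearray()
    for index, byte in enumerate(inner):
        if byte == 0x5C and (index == 0 or inner[index - 1] != 0x5C):
            continue
        out.append(byte)
    # Step 6.
    return bytes(out)
```

For example, the six-byte input "a\"b" becomes a"b, and an input without delimiting quotation marks is returned unchanged. Because step 5 inspects only the byte that preceded each backslash in the original string, only the first backslash of a run is removed under this reading, so a run of four yields three.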



5.  Security Considerations

(This section is just a very rough draft.)

This specification is just a parsing algorithm, and therefore any risks (excluding implementation issues such as buffer overflows) are inherited from [RFC2616] (Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, “Hypertext Transfer Protocol -- HTTP/1.1,” June 1999.).




6.  IANA Considerations

This document has no actions for IANA.




7.  References




7.1. Normative References

[ANSI.X3-4.1986] American National Standards Institute, “Coded Character Set - 7-bit American Standard Code for Information Interchange,” ANSI X3.4, 1986.
[RFC2119] Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” BCP 14, RFC 2119, March 1997.
[RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, “Hypertext Transfer Protocol -- HTTP/1.1,” RFC 2616, June 1999.
[RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, “Uniform Resource Identifier (URI): Generic Syntax,” STD 66, RFC 3986, January 2005.
[RFC5234] Crocker, D. and P. Overell, “Augmented BNF for Syntax Specifications: ABNF,” STD 68, RFC 5234, January 2008.



7.2. Informative References

[W3C.WD-html5-20080610] Hyatt, D. and I. Hickson, “HTML 5,” World Wide Web Consortium WD WD-html5-20080610, June 2008.



Appendix A.  Acknowledgments

Thanks to: Ian Hickson, Philip Taylor.




Appendix B.  Further Suggestions

This section is informative.

While the scope of this specification is only parsing of HTTP requests and responses, there are several other things that I am aware of that should be pointed out to anyone implementing [RFC2616] (Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, “Hypertext Transfer Protocol -- HTTP/1.1,” June 1999.):




Author's Address

  Geoffrey Sneddon
  Toll Park
  20 Hepburn Gardens
  St Andrews, Fife KY16 9DE
  GB
Phone:  +44 7807 360 291
Email:  geoffers@gmail.com
URI:  http://gsnedders.com/



Full Copyright Statement

Intellectual Property