WHATWG

HTML 5

Draft Recommendation — 7 February 2009

8.2.4 Tokenization

Implementations must act as if they used the following state machine to
tokenise HTML. The state machine must start in the data state. Most
states consume a single character, which may have various side-effects,
and either switches the state machine to a new state to reconsume the
same character, or switches it to a new state (to consume the next
character), or repeats the same state (to consume the next character).
Some states have more complicated behavior and can consume several
characters before switching to another state.

The exact behavior of certain states depends on a content model flag
that is set after certain tokens are emitted. The flag has several
states: PCDATA, RCDATA, CDATA, and PLAINTEXT. Initially it must be in
the PCDATA state. In the RCDATA and CDATA states, a further escape flag
is used to control the behavior of the tokeniser. It is either true or
false, and initially must be set to the false state. The insertion mode
and the stack of open elements also affect tokenization.

The output of the tokenization step is a series of zero or more of the
following tokens: DOCTYPE, start tag, end tag, comment, character,
end-of-file. DOCTYPE tokens have a name, a public identifier, a system
identifier, and a force-quirks flag. When a DOCTYPE token is created,
its name, public identifier, and system identifier must be marked as
missing (which is a distinct state from the empty string), and the
force-quirks flag must be set to off (its other state is on). Start and
end tag tokens have a tag name, a self-closing flag, and a list of
attributes, each of which has a name and a value. When a start or end
tag token is created, its self-closing flag must be unset (its other
state is that it be set), and its attributes list must be empty.
Comment and character tokens have data.

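The token shapes described above can be sketched as plain records. This
is only an illustration in Python, not part of the specification: every
class and field name below (DoctypeToken, TagToken, and so on) is an
assumption made for the example. "Missing" is modelled as None so that
it stays distinct from the empty string.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical names throughout; the specification defines only the
# abstract fields, not a concrete representation.


@dataclass
class DoctypeToken:
    # None stands for "missing", which is distinct from the empty string.
    name: Optional[str] = None
    public_identifier: Optional[str] = None
    system_identifier: Optional[str] = None
    force_quirks: bool = False  # "off" when the token is created


@dataclass
class Attribute:
    name: str
    value: str = ""


@dataclass
class TagToken:
    kind: str  # "start" or "end"
    tag_name: str = ""
    self_closing: bool = False  # unset when the token is created
    attributes: List[Attribute] = field(default_factory=list)


@dataclass
class CommentToken:
    data: str = ""


@dataclass
class CharacterToken:
    data: str
```

Using default_factory gives each tag token its own empty attribute list,
which matches the requirement that the list be empty on creation.
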
When a token is emitted, it must immediately be handled by the tree
construction stage. The tree construction stage can affect the state of
the content model flag, and can insert additional characters into the
stream. (For example, the script element can result in scripts
executing and using the dynamic markup insertion APIs to insert
characters into the stream being tokenised.)

When a start tag token is emitted with its self-closing flag set, if
the flag is not acknowledged when it is processed by the tree
construction stage, that is a parse error.

When an end tag token is emitted, the content model flag must be
switched to the PCDATA state.

When an end tag token is emitted with attributes, that is a parse
error.

When an end tag token is emitted with its self-closing flag set, that
is a parse error.

Before each step of the tokeniser, the user agent must first check the
parser pause flag. If it is true, then the tokeniser must abort the
processing of any nested invocations of the tokeniser, yielding control
back to the caller. If it is false, then the user agent may then check
to see if either one of the scripts in the list of scripts that will
execute as soon as possible or the first script in the list of scripts
that will execute asynchronously has completed loading. If one has,
then it must be executed and removed from its list.

The tokeniser state machine consists of the states defined in the
following subsections.

8.2.4.1 Data state

Consume the next input character:

U+0026 AMPERSAND (&)
    When the content model flag is set to one of the PCDATA or
    RCDATA states and the escape flag is false: switch to the
    character reference data state.
    Otherwise: treat it as per the "anything else" entry below.

U+002D HYPHEN-MINUS (-)
    If the content model flag is set to either the RCDATA state or
    the CDATA state, and the escape flag is false, and there are at
    least three characters before this one in the input stream, and
    the last four characters in the input stream, including this
    one, are U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D
    HYPHEN-MINUS, and U+002D HYPHEN-MINUS ("<!--"), then set the
    escape flag to true.

    In any case, emit the input character as a character token. Stay
    in the data state.

U+003C LESS-THAN SIGN (<)
    When the content model flag is set to the PCDATA state: switch
    to the tag open state.
    When the content model flag is set to either the RCDATA state or
    the CDATA state, and the escape flag is false: switch to the tag
    open state.
    Otherwise: treat it as per the "anything else" entry below.

U+003E GREATER-THAN SIGN (>)
    If the content model flag is set to either the RCDATA state or
    the CDATA state, and the escape flag is true, and the last three
    characters in the input stream including this one are U+002D
    HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN
    ("-->"), set the escape flag to false.

    In any case, emit the input character as a character token. Stay
    in the data state.

EOF
    Emit an end-of-file token.

Anything else
    Emit the input character as a character token. Stay in the data
    state.

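The escape flag rules for U+002D and U+003E in the data state can be
sketched as one small function. This is an illustration under stated
assumptions, not the specification's own code: the function name, the
string values used for the content model flag, and the idea of passing
the input consumed so far as one string are all choices made for this
example.

```python
def update_escape_flag(consumed: str, content_model: str, escape: bool) -> bool:
    """Return the escape flag after processing the last character of `consumed`.

    `consumed` is the input read so far, including the current character.
    `content_model` is one of "PCDATA", "RCDATA", "CDATA", "PLAINTEXT"
    (hypothetical string encoding of the content model flag).
    """
    if content_model not in ("RCDATA", "CDATA") or not consumed:
        return escape
    current = consumed[-1]
    # endswith("<!--") also guarantees three characters before this one.
    if current == "-" and not escape and consumed.endswith("<!--"):
        return True
    if current == ">" and escape and consumed.endswith("-->"):
        return False
    return escape
```

The function only updates the flag. Emitting the character token and
staying in the data state would happen in the caller.
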
8.2.4.2 Character reference data state

(This cannot happen if the content model flag is set to the CDATA
state.)

Attempt to consume a character reference, with no additional allowed
character.

If nothing is returned, emit a U+0026 AMPERSAND character token.

Otherwise, emit the character token that was returned.

Finally, switch to the data state.

8.2.4.3 Tag open state

The behavior of this state depends on the content model flag.

If the content model flag is set to the RCDATA or CDATA states
    Consume the next input character. If it is a U+002F SOLIDUS (/)
    character, switch to the close tag open state. Otherwise, emit a
    U+003C LESS-THAN SIGN character token and reconsume the current
    input character in the data state.

If the content model flag is set to the PCDATA state
    Consume the next input character:

    U+0021 EXCLAMATION MARK (!)
        Switch to the markup declaration open state.

    U+002F SOLIDUS (/)
        Switch to the close tag open state.

    U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
        Create a new start tag token, set its tag name to the
        lowercase version of the input character (add 0x0020 to
        the character's code point), then switch to the tag name
        state. (Don't emit the token yet; further details will be
        filled in before it is emitted.)

    U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
        Create a new start tag token, set its tag name to the
        input character, then switch to the tag name state. (Don't
        emit the token yet; further details will be filled in
        before it is emitted.)

    U+003E GREATER-THAN SIGN (>)
        Parse error. Emit a U+003C LESS-THAN SIGN character token
        and a U+003E GREATER-THAN SIGN character token. Switch to
        the data state.

    U+003F QUESTION MARK (?)
        Parse error. Switch to the bogus comment state.

    Anything else
        Parse error. Emit a U+003C LESS-THAN SIGN character token
        and reconsume the current input character in the data
        state.

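As an illustration of the PCDATA branch of the tag open state, including
the "add 0x0020" lowercasing rule, the dispatch can be written as a
function of one character. The function names and the shape of the
returned tuple are assumptions made for this sketch. Reconsuming the
character in the "anything else" case is left to the caller.

```python
def ascii_lower(ch: str) -> str:
    """Lowercase one character in the range A-Z by adding 0x0020."""
    if "A" <= ch <= "Z":
        return chr(ord(ch) + 0x0020)
    return ch


def tag_open_pcdata(ch: str):
    """Handle one character in the tag open state (PCDATA case).

    Returns (next state, initial tag name or None, characters to emit
    as character tokens, whether this is a parse error).
    An empty string stands for EOF and falls under "anything else".
    """
    if ch == "!":
        return ("markup declaration open", None, "", False)
    if ch == "/":
        return ("close tag open", None, "", False)
    if "A" <= ch <= "Z" or "a" <= ch <= "z":
        # A start tag token would be created here with this name.
        return ("tag name", ascii_lower(ch), "", False)
    if ch == ">":
        return ("data", None, "<>", True)
    if ch == "?":
        return ("bogus comment", None, "", True)
    # Anything else: emit "<" and reconsume `ch` in the data state.
    return ("data", None, "<", True)
```

For example, tag_open_pcdata("D") yields the tag name state with an
initial tag name of "d".
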
8.2.4.4 Close tag open state

If the content model flag is set to the RCDATA or CDATA states but no
start tag token has ever been emitted by this instance of the tokeniser
(fragment case), or, if the content model flag is set to the RCDATA or
CDATA states and the next few characters do not match the tag name of
the last start tag token emitted (compared in an ASCII case-insensitive
manner), or if they do but they are not immediately followed by one of
the following characters:
* U+0009 CHARACTER TABULATION
* U+000A LINE FEED (LF)
* U+000C FORM FEED (FF)
* U+0020 SPACE
* U+003E GREATER-THAN SIGN (>)
* U+002F SOLIDUS (/)
* EOF

...then emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS
character token, and switch to the data state to process the next input
character.

Otherwise, if the content model flag is set to the PCDATA state, or if
the next few characters do match that tag name, consume the next input
character:

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Create a new end tag token, set its tag name to the lowercase
    version of the input character (add 0x0020 to the character's
    code point), then switch to the tag name state. (Don't emit the
    token yet; further details will be filled in before it is
    emitted.)

U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
    Create a new end tag token, set its tag name to the input
    character, then switch to the tag name state. (Don't emit the
    token yet; further details will be filled in before it is
    emitted.)

U+003E GREATER-THAN SIGN (>)
    Parse error. Switch to the data state.

EOF
    Parse error. Emit a U+003C LESS-THAN SIGN character token and a
    U+002F SOLIDUS character token. Reconsume the EOF character in
    the data state.

Anything else
    Parse error. Switch to the bogus comment state.

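The RCDATA/CDATA check at the top of the close tag open state amounts to
a lookahead test. The sketch below shows one way to express it. The
function and parameter names are assumptions for illustration, and the
fragment case is modelled by passing None as the last start tag name.

```python
from typing import Optional

# Characters that may follow the tag name; EOF is handled separately.
_TERMINATORS = "\t\n\x0c >/"


def _ascii_fold(s: str) -> str:
    """Lowercase only A-Z, which is what ASCII case-insensitive means."""
    return "".join(chr(ord(c) + 0x0020) if "A" <= c <= "Z" else c for c in s)


def end_tag_matches(rest: str, last_start_tag: Optional[str]) -> bool:
    """Decide whether "</" followed by `rest` may start a real end tag.

    `rest` is the input after "</". `last_start_tag` is the tag name of
    the last start tag token emitted, or None if none has been emitted.
    """
    if last_start_tag is None:
        return False
    n = len(last_start_tag)
    if _ascii_fold(rest[:n]) != _ascii_fold(last_start_tag):
        return False
    # The name must be followed by a terminator or by the end of input.
    return len(rest) == n or rest[n] in _TERMINATORS
```

When the function returns False, the tokeniser would emit "<" and "/" as
character tokens and return to the data state.
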
8.2.4.5 Tag name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Switch to the before attribute name state.

U+002F SOLIDUS (/)
    Switch to the self-closing start tag state.

U+003E GREATER-THAN SIGN (>)
    Emit the current tag token. Switch to the data state.

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Append the lowercase version of the current input character (add
    0x0020 to the character's code point) to the current tag token's
    tag name. Stay in the tag name state.

EOF
    Parse error. Emit the current tag token. Reconsume the EOF
    character in the data state.

Anything else
    Append the current input character to the current tag token's
    tag name. Stay in the tag name state.

8.2.4.6 Before attribute name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the before attribute name state.

U+002F SOLIDUS (/)
    Switch to the self-closing start tag state.

U+003E GREATER-THAN SIGN (>)
    Emit the current tag token. Switch to the data state.

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Start a new attribute in the current tag token. Set that
    attribute's name to the lowercase version of the current input
    character (add 0x0020 to the character's code point), and its
    value to the empty string. Switch to the attribute name state.

U+0022 QUOTATION MARK (")
U+0027 APOSTROPHE (')
U+003D EQUALS SIGN (=)
    Parse error. Treat it as per the "anything else" entry below.

EOF
    Parse error. Emit the current tag token. Reconsume the EOF
    character in the data state.

Anything else
    Start a new attribute in the current tag token. Set that
    attribute's name to the current input character, and its value
    to the empty string. Switch to the attribute name state.

8.2.4.7 Attribute name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Switch to the after attribute name state.

U+002F SOLIDUS (/)
    Switch to the self-closing start tag state.

U+003D EQUALS SIGN (=)
    Switch to the before attribute value state.

U+003E GREATER-THAN SIGN (>)
    Emit the current tag token. Switch to the data state.

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Append the lowercase version of the current input character (add
    0x0020 to the character's code point) to the current attribute's
    name. Stay in the attribute name state.

U+0022 QUOTATION MARK (")
U+0027 APOSTROPHE (')
    Parse error. Treat it as per the "anything else" entry below.

EOF
    Parse error. Emit the current tag token. Reconsume the EOF
    character in the data state.

Anything else
    Append the current input character to the current attribute's
    name. Stay in the attribute name state.

When the user agent leaves the attribute name state (and before
emitting the tag token, if appropriate), the complete attribute's name
must be compared to the other attributes on the same token; if there is
already an attribute on the token with the exact same name, then this
is a parse error and the new attribute must be dropped, along with the
value that gets associated with it (if any).

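The duplicate-attribute rule above can be illustrated with a short
helper. This is a simplified sketch: the function name and the use of
(name, value) tuples are assumptions, and the check is done once the
value is known rather than at the moment the attribute name state is
left, which gives the same resulting attribute list.

```python
from typing import List, Tuple


def add_attribute(attributes: List[Tuple[str, str]], name: str, value: str) -> bool:
    """Append (name, value) unless an attribute with that name exists.

    Returns True if the attribute was kept, False if it was dropped
    (a case the tokeniser would also report as a parse error).
    """
    if any(existing == name for existing, _ in attributes):
        return False
    attributes.append((name, value))
    return True
```

The first occurrence wins: for input such as class="a" class="b", only
the pair ("class", "a") remains on the token.
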
8.2.4.8 After attribute name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the after attribute name state.

U+002F SOLIDUS (/)
    Switch to the self-closing start tag state.

U+003D EQUALS SIGN (=)
    Switch to the before attribute value state.

U+003E GREATER-THAN SIGN (>)
    Emit the current tag token. Switch to the data state.

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Start a new attribute in the current tag token. Set that
    attribute's name to the lowercase version of the current input
    character (add 0x0020 to the character's code point), and its
    value to the empty string. Switch to the attribute name state.

U+0022 QUOTATION MARK (")
U+0027 APOSTROPHE (')
    Parse error. Treat it as per the "anything else" entry below.

EOF
    Parse error. Emit the current tag token. Reconsume the EOF
    character in the data state.

Anything else
    Start a new attribute in the current tag token. Set that
    attribute's name to the current input character, and its value
    to the empty string. Switch to the attribute name state.

8.2.4.9 Before attribute value state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the before attribute value state.

U+0022 QUOTATION MARK (")
    Switch to the attribute value (double-quoted) state.

U+0026 AMPERSAND (&)
    Switch to the attribute value (unquoted) state and reconsume
    this input character.

U+0027 APOSTROPHE (')
    Switch to the attribute value (single-quoted) state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Emit the current tag token. Switch to the data
    state.

U+003D EQUALS SIGN (=)
    Parse error. Treat it as per the "anything else" entry below.

EOF
    Parse error. Emit the current tag token. Reconsume the character
    in the data state.

Anything else
    Append the current input character to the current attribute's
    value. Switch to the attribute value (unquoted) state.

8.2.4.10 Attribute value (double-quoted) state

Consume the next input character:

U+0022 QUOTATION MARK (")
    Switch to the after attribute value (quoted) state.

U+0026 AMPERSAND (&)
    Switch to the character reference in attribute value state, with
    the additional allowed character being U+0022 QUOTATION MARK
    (").

EOF
    Parse error. Emit the current tag token. Reconsume the character
    in the data state.

Anything else
    Append the current input character to the current attribute's
    value. Stay in the attribute value (double-quoted) state.

8.2.4.11 Attribute value (single-quoted) state

Consume the next input character:

U+0027 APOSTROPHE (')
    Switch to the after attribute value (quoted) state.

U+0026 AMPERSAND (&)
    Switch to the character reference in attribute value state, with
    the additional allowed character being U+0027 APOSTROPHE (').

EOF
    Parse error. Emit the current tag token. Reconsume the character
    in the data state.

Anything else
    Append the current input character to the current attribute's
    value. Stay in the attribute value (single-quoted) state.

8.2.4.12 Attribute value (unquoted) state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Switch to the before attribute name state.

U+0026 AMPERSAND (&)
    Switch to the character reference in attribute value state, with
    no additional allowed character.

U+003E GREATER-THAN SIGN (>)
    Emit the current tag token. Switch to the data state.

U+0022 QUOTATION MARK (")
U+0027 APOSTROPHE (')
U+003D EQUALS SIGN (=)
    Parse error. Treat it as per the "anything else" entry below.

EOF
    Parse error. Emit the current tag token. Reconsume the character
    in the data state.

Anything else
    Append the current input character to the current attribute's
    value. Stay in the attribute value (unquoted) state.

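Taken together, the before attribute value state and the three
attribute value states decide where a value starts and ends. The sketch
below condenses them into one function for illustration. It is a
deliberate simplification: character references are not decoded (a "&"
is kept literally), parse errors are not reported, and the function name
and return shape are assumptions made for this example.

```python
from typing import Tuple

_SPACE = "\t\n\x0c "


def read_attribute_value(text: str, pos: int) -> Tuple[str, int]:
    """Read one attribute value, starting just after the "=" sign.

    Returns (value, index of the first character not consumed as part
    of the value). A closing quote counts as consumed. A terminating
    space or ">" of an unquoted value does not.
    """
    # Before attribute value state: skip whitespace.
    while pos < len(text) and text[pos] in _SPACE:
        pos += 1
    if pos >= len(text) or text[pos] == ">":
        return "", pos
    if text[pos] in "\"'":
        quote = text[pos]
        end = text.find(quote, pos + 1)
        if end == -1:  # EOF inside the quotes
            return text[pos + 1:], len(text)
        return text[pos + 1:end], end + 1
    # Unquoted: the value runs until whitespace, ">" or EOF.
    start = pos
    while pos < len(text) and text[pos] not in _SPACE + ">":
        pos += 1
    return text[start:pos], pos
```

Note that a quoted value may contain ">" and whitespace, while an
unquoted value ends at either of them.
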
8.2.4.13 Character reference in attribute value state

Attempt to consume a character reference.

If nothing is returned, append a U+0026 AMPERSAND character to the
current attribute's value.

Otherwise, append the returned character token to the current
attribute's value.

Finally, switch back to the attribute value state that you were in when
you were switched into this state.

8.2.4.14 After attribute value (quoted) state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Switch to the before attribute name state.

U+002F SOLIDUS (/)
    Switch to the self-closing start tag state.

U+003E GREATER-THAN SIGN (>)
    Emit the current tag token. Switch to the data state.

EOF
    Parse error. Emit the current tag token. Reconsume the EOF
    character in the data state.

Anything else
    Parse error. Reconsume the character in the before attribute
    name state.

8.2.4.15 Self-closing start tag state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
    Set the self-closing flag of the current tag token. Emit the
    current tag token. Switch to the data state.

EOF
    Parse error. Emit the current tag token. Reconsume the EOF
    character in the data state.

Anything else
    Parse error. Reconsume the character in the before attribute
    name state.

8.2.4.16 Bogus comment state

(This can only happen if the content model flag is set to the PCDATA
state.)

Consume every character up to and including the first U+003E
GREATER-THAN SIGN character (>) or the end of the file (EOF), whichever
comes first. Emit a comment token whose data is the concatenation of
all the characters starting from and including the character that
caused the state machine to switch into the bogus comment state, up to
and including the character immediately before the last consumed
character (i.e. up to the character just before the U+003E or EOF
character). (If the comment was started by the end of the file (EOF),
the token is empty.)

Switch to the data state.

If the end of the file was reached, reconsume the EOF character.

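The bogus comment state is essentially a scan for the first ">". A
minimal sketch, assuming the input is held in a string and that the
caller passes the index of the character that caused the switch into
this state. The function name and return shape are made up for the
example.

```python
from typing import Tuple


def consume_bogus_comment(text: str, start: int) -> Tuple[str, int, bool]:
    """Consume a bogus comment whose first character is text[start].

    Returns (comment data, index of the next unconsumed character,
    whether EOF was reached and must be reconsumed in the data state).
    """
    end = text.find(">", start)
    if end == -1:
        # No ">" before the end of the file: everything is comment data.
        return text[start:], len(text), True
    return text[start:end], end + 1, False
```

If `start` already equals the length of the input, the comment was
started by EOF and the returned data is the empty string, as the
parenthetical above requires.
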
8.2.4.17 Markup declaration open state

(This can only happen if the content model flag is set to the PCDATA
state.)

If the next two characters are both U+002D HYPHEN-MINUS (-) characters,
consume those two characters, create a comment token whose data is the
empty string, and switch to the comment start state.

Otherwise, if the next seven characters are an ASCII case-insensitive
match for the word "DOCTYPE", then consume those characters and switch
to the DOCTYPE state.

Otherwise, if the insertion mode is "in foreign content" and the
current node is not an element in the HTML namespace and the next seven
characters are an ASCII case-sensitive match for the string "[CDATA["
(the five uppercase letters "CDATA" with a U+005B LEFT SQUARE BRACKET
character before and after), then consume those characters and switch
to the CDATA section state (which is unrelated to the content model
flag's CDATA state).

Otherwise, this is a parse error. Switch to the bogus comment state.
The next character that is consumed, if any, is the first character
that will be in the comment.

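The lookahead in the markup declaration open state can be sketched as a
chain of prefix tests. This is an illustration only: the function name
is hypothetical, and the single boolean `in_foreign_content` stands in
for both tree-construction conditions (the insertion mode and the
namespace of the current node).

```python
from typing import Tuple


def _ascii_upper(s: str) -> str:
    """Uppercase only a-z, leaving every other character untouched."""
    return "".join(chr(ord(c) - 0x0020) if "a" <= c <= "z" else c for c in s)


def markup_declaration_open(
    text: str, pos: int, in_foreign_content: bool = False
) -> Tuple[str, int]:
    """Pick the next state for the input following "<!" at text[pos:].

    Returns (next state, index after the characters consumed here).
    """
    if text.startswith("--", pos):
        return "comment start", pos + 2
    if _ascii_upper(text[pos:pos + 7]) == "DOCTYPE":
        return "DOCTYPE", pos + 7
    # "[CDATA[" is matched case-sensitively, unlike "DOCTYPE".
    if in_foreign_content and text.startswith("[CDATA[", pos):
        return "CDATA section", pos + 7
    # Parse error: nothing is consumed, so the next character becomes
    # the first character of the bogus comment.
    return "bogus comment", pos
```

The order of the tests matters only in that each later test runs when
the earlier ones have failed, mirroring the "Otherwise" chain above.
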
8.2.4.18 Comment start state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
    Switch to the comment start dash state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Emit the comment token. Switch to the data state.

EOF
    Parse error. Emit the comment token. Reconsume the EOF character
    in the data state.

Anything else
    Append the input character to the comment token's data. Switch
    to the comment state.

8.2.4.19 Comment start dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
    Switch to the comment end state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Emit the comment token. Switch to the data state.

EOF
    Parse error. Emit the comment token. Reconsume the EOF character
    in the data state.

Anything else
    Append a U+002D HYPHEN-MINUS (-) character and the input
    character to the comment token's data. Switch to the comment
    state.

8.2.4.20 Comment state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
    Switch to the comment end dash state.

EOF
    Parse error. Emit the comment token. Reconsume the EOF character
    in the data state.

Anything else
    Append the input character to the comment token's data. Stay in
    the comment state.

8.2.4.21 Comment end dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
    Switch to the comment end state.

EOF
    Parse error. Emit the comment token. Reconsume the EOF character
    in the data state.

Anything else
    Append a U+002D HYPHEN-MINUS (-) character and the input
    character to the comment token's data. Switch to the comment
    state.

8.2.4.22 Comment end state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
    Emit the comment token. Switch to the data state.

U+002D HYPHEN-MINUS (-)
    Parse error. Append a U+002D HYPHEN-MINUS (-) character to the
    comment token's data. Stay in the comment end state.

EOF
    Parse error. Emit the comment token. Reconsume the EOF character
    in the data state.

Anything else
    Parse error. Append two U+002D HYPHEN-MINUS (-) characters and
    the input character to the comment token's data. Switch to the
    comment state.

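The five comment states (comment start through comment end) form a small
state machine of their own. The following sketch runs them over a string
for illustration. The function name, the tuple it returns, and the use
of None for EOF are assumptions made for the example, and the comment
start dash and comment end dash states share one branch because they
differ only in how ">" is handled.

```python
from typing import Tuple


def tokenise_comment(text: str, pos: int) -> Tuple[str, int, int]:
    """Run the comment states on the input following "<!--" at text[pos:].

    Returns (comment data, index after the last consumed character,
    number of parse errors). On EOF the index equals len(text).
    """
    state, data, errors = "comment start", "", 0
    while True:
        ch = text[pos] if pos < len(text) else None  # None stands for EOF
        if ch is None:
            # EOF is a parse error in every one of these states.
            return data, pos, errors + 1
        pos += 1
        if state == "comment start":
            if ch == "-":
                state = "comment start dash"
            elif ch == ">":
                return data, pos, errors + 1
            else:
                data += ch
                state = "comment"
        elif state in ("comment start dash", "comment end dash"):
            if ch == "-":
                state = "comment end"
            elif ch == ">" and state == "comment start dash":
                return data, pos, errors + 1
            else:
                data += "-" + ch
                state = "comment"
        elif state == "comment":
            if ch == "-":
                state = "comment end dash"
            else:
                data += ch
        else:  # comment end
            if ch == ">":
                return data, pos, errors
            errors += 1
            if ch == "-":
                data += "-"
            else:
                data += "--" + ch
                state = "comment"
```

Under these rules a well-formed comment such as "<!-- x -->" produces no
parse errors, while "<!--a--b-->" yields the data "a--b" with one parse
error for the inner "--".
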
8.2.4.23 DOCTYPE state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Switch to the before DOCTYPE name state.

Anything else
    Parse error. Reconsume the current character in the before
    DOCTYPE name state.

8.2.4.24 Before DOCTYPE name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the before DOCTYPE name state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Create a new DOCTYPE token. Set its force-quirks
    flag to on. Emit the token. Switch to the data state.

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Create a new DOCTYPE token. Set the token's name to the
    lowercase version of the input character (add 0x0020 to the
    character's code point). Switch to the DOCTYPE name state.

EOF
    Parse error. Create a new DOCTYPE token. Set its force-quirks
    flag to on. Emit the token. Reconsume the EOF character in the
    data state.

Anything else
    Create a new DOCTYPE token. Set the token's name to the current
    input character. Switch to the DOCTYPE name state.

8.2.4.25 DOCTYPE name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Switch to the after DOCTYPE name state.

U+003E GREATER-THAN SIGN (>)
    Emit the current DOCTYPE token. Switch to the data state.

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Append the lowercase version of the input character (add 0x0020
    to the character's code point) to the current DOCTYPE token's
    name. Stay in the DOCTYPE name state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on.
    Emit that DOCTYPE token. Reconsume the EOF character in the data
    state.

Anything else
    Append the current input character to the current DOCTYPE
    token's name. Stay in the DOCTYPE name state.

8.2.4.26 After DOCTYPE name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the after DOCTYPE name state.

U+003E GREATER-THAN SIGN (>)
    Emit the current DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on.
    Emit that DOCTYPE token. Reconsume the EOF character in the data
    state.

Anything else
    If the six characters starting from the current input character
    are an ASCII case-insensitive match for the word "PUBLIC", then
    consume those characters and switch to the before DOCTYPE public
    identifier state.

    Otherwise, if the six characters starting from the current input
    character are an ASCII case-insensitive match for the word
    "SYSTEM", then consume those characters and switch to the before
    DOCTYPE system identifier state.

    Otherwise, this is a parse error. Set the DOCTYPE token's
    force-quirks flag to on. Switch to the bogus DOCTYPE state.

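The "anything else" entry of the after DOCTYPE name state is a keyword
test on six characters. A minimal sketch follows, with hypothetical
function names and return shape. The ASCII-only helper is used instead
of Python's str.upper() because str.upper() also maps some non-ASCII
letters (for example U+0131) onto ASCII letters, which would not be an
ASCII case-insensitive match.

```python
from typing import Tuple


def _ascii_upper(s: str) -> str:
    """Uppercase only a-z, leaving every other character untouched."""
    return "".join(chr(ord(c) - 0x0020) if "a" <= c <= "z" else c for c in s)


def after_doctype_name_keyword(text: str, pos: int) -> Tuple[str, int, bool]:
    """Handle "anything else" with the current character at text[pos].

    Returns (next state, index after the consumed characters, whether
    the force-quirks flag must be set to on).
    """
    word = _ascii_upper(text[pos:pos + 6])
    if word == "PUBLIC":
        return "before DOCTYPE public identifier", pos + 6, False
    if word == "SYSTEM":
        return "before DOCTYPE system identifier", pos + 6, False
    # Parse error: only the current character has been consumed.
    return "bogus DOCTYPE", pos + 1, True
```

For instance, both "public" and "PuBlIc" lead to the before DOCTYPE
public identifier state, while any other word sets force-quirks and
moves to the bogus DOCTYPE state.
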
8.2.4.27 Before DOCTYPE public identifier state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the before DOCTYPE public identifier state.

U+0022 QUOTATION MARK (")
    Set the DOCTYPE token's public identifier to the empty string
    (not missing), then switch to the DOCTYPE public identifier
    (double-quoted) state.

U+0027 APOSTROPHE (')
    Set the DOCTYPE token's public identifier to the empty string
    (not missing), then switch to the DOCTYPE public identifier
    (single-quoted) state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Set the DOCTYPE token's force-quirks flag to on.
    Emit that DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on.
    Emit that DOCTYPE token. Reconsume the EOF character in the data
    state.

Anything else
    Parse error. Set the DOCTYPE token's force-quirks flag to on.
    Switch to the bogus DOCTYPE state.

8.2.4.28 DOCTYPE public identifier (double-quoted) state

Consume the next input character:

U+0022 QUOTATION MARK (")
    Switch to the after DOCTYPE public identifier state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Set the DOCTYPE token's force-quirks flag to on.
    Emit that DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on.
    Emit that DOCTYPE token. Reconsume the EOF character in the data
    state.

Anything else
    Append the current input character to the current DOCTYPE
    token's public identifier. Stay in the DOCTYPE public identifier
    (double-quoted) state.

8.2.4.29 DOCTYPE public identifier (single-quoted) state

Consume the next input character:

U+0027 APOSTROPHE (')
    Switch to the after DOCTYPE public identifier state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Set the DOCTYPE token's force-quirks flag to on.
    Emit that DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on.
    Emit that DOCTYPE token. Reconsume the EOF character in the data
    state.

Anything else
    Append the current input character to the current DOCTYPE
    token's public identifier. Stay in the DOCTYPE public identifier
    (single-quoted) state.

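The double-quoted and single-quoted public identifier states differ only
in the closing quote character, so both can be sketched with one
function that takes the quote as a parameter. The name and return shape
are assumptions for illustration. The function starts just after the
opening quote, at which point the identifier has already been set to the
empty string (not missing).

```python
from typing import Tuple


def read_quoted_identifier(text: str, pos: int, quote: str) -> Tuple[str, int, bool]:
    """Read a DOCTYPE identifier whose opening quote was just consumed.

    Returns (identifier, index after the last consumed character,
    whether the force-quirks flag must be set to on).
    """
    identifier = ""
    while pos < len(text):
        ch = text[pos]
        pos += 1
        if ch == quote:
            return identifier, pos, False
        if ch == ">":
            # Parse error: the DOCTYPE token is emitted early.
            return identifier, pos, True
        identifier += ch
    # EOF inside the identifier is also a parse error.
    return identifier, pos, True
```

A ">" inside the quotes ends the DOCTYPE immediately, so a public
identifier can never contain that character.
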
8.2.4.30 After DOCTYPE public identifier state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the after DOCTYPE public identifier state.

U+0022 QUOTATION MARK (")
    Set the DOCTYPE token's system identifier to the empty string
    (not missing), then switch to the DOCTYPE system identifier
    (double-quoted) state.

U+0027 APOSTROPHE (')
    Set the DOCTYPE token's system identifier to the empty string
    (not missing), then switch to the DOCTYPE system identifier
    (single-quoted) state.

U+003E GREATER-THAN SIGN (>)
    Emit the current DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on.
    Emit that DOCTYPE token. Reconsume the EOF character in the data
    state.

Anything else
    Parse error. Set the DOCTYPE token's force-quirks flag to on.
    Switch to the bogus DOCTYPE state.

8.2.4.31 Before DOCTYPE system identifier state
|
||
|
||
Consume the next input character:
|
||
|
||
U+0009 CHARACTER TABULATION
|
||
U+000A LINE FEED (LF)
|
||
U+000C FORM FEED (FF)
|
||
U+0020 SPACE
|
||
Stay in the before DOCTYPE system identifier state.
|
||
|
||
U+0022 QUOTATION MARK (")
|
||
Set the DOCTYPE token's system identifier to the empty string
|
||
(not missing), then switch to the DOCTYPE system identifier
|
||
(double-quoted) state.
|
||
|
||
U+0027 APOSTROPHE (')
|
||
Set the DOCTYPE token's system identifier to the empty string
|
||
(not missing), then switch to the DOCTYPE system identifier
|
||
(single-quoted) state.
|
||
|
||
U+003E GREATER-THAN SIGN (>)
|
||
Parse error. Set the DOCTYPE token's force-quirks flag to on.
|
||
Emit that DOCTYPE token. Switch to the data state.
|
||
|
||
EOF
|
||
Parse error. Set the DOCTYPE token's force-quirks flag to on.
|
||
Emit that DOCTYPE token. Reconsume the EOF character in the data
|
||
state.
|
||
|
||
Anything else
|
||
Parse error. Set the DOCTYPE token's force-quirks flag to on.
|
||
Switch to the bogus DOCTYPE state.

8.2.4.32 DOCTYPE system identifier (double-quoted) state

Consume the next input character:

U+0022 QUOTATION MARK (")
    Switch to the after DOCTYPE system identifier state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Set the DOCTYPE token's force-quirks flag to on.
    Emit that DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on.
    Emit that DOCTYPE token. Reconsume the EOF character in the data
    state.

Anything else
    Append the current input character to the current DOCTYPE
    token's system identifier. Stay in the DOCTYPE system identifier
    (double-quoted) state.

8.2.4.33 DOCTYPE system identifier (single-quoted) state

Consume the next input character:

U+0027 APOSTROPHE (')
    Switch to the after DOCTYPE system identifier state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Set the DOCTYPE token's force-quirks flag to on.
    Emit that DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on.
    Emit that DOCTYPE token. Reconsume the EOF character in the data
    state.

Anything else
    Append the current input character to the current DOCTYPE
    token's system identifier. Stay in the DOCTYPE system identifier
    (single-quoted) state.
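
The double-quoted and single-quoted system identifier states differ
only in the closing quote, so one way to picture them is a single
routine parameterized on that character. The following Python sketch
assumes the input is a string already positioned just after the opening
quote; the function name and the returned dictionary keys are
hypothetical.

```python
def consume_quoted_system_identifier(text, pos, quote):
    """Run the quoted system identifier state from text[pos] onward.

    `quote` is '"' or "'". Returns a dict describing the outcome.
    """
    chars = []
    while True:
        if pos >= len(text):
            # EOF: parse error, force-quirks on, emit, reconsume EOF in data.
            return {"system_id": "".join(chars), "force_quirks": True,
                    "parse_error": True, "emit": True,
                    "next_state": "data", "pos": pos}
        ch = text[pos]
        pos += 1
        if ch == quote:
            # Closing quote: the token is not emitted yet.
            return {"system_id": "".join(chars), "force_quirks": False,
                    "parse_error": False, "emit": False,
                    "next_state": "after DOCTYPE system identifier",
                    "pos": pos}
        if ch == ">":
            # Premature '>': parse error, force-quirks on, emit.
            return {"system_id": "".join(chars), "force_quirks": True,
                    "parse_error": True, "emit": True,
                    "next_state": "data", "pos": pos}
        chars.append(ch)  # anything else is appended to the identifier
```

For example, calling it on the text `about:legacy-compat">` with a
double quote yields the identifier `about:legacy-compat` and moves to
the after DOCTYPE system identifier state without a parse error.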

8.2.4.34 After DOCTYPE system identifier state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the after DOCTYPE system identifier state.

U+003E GREATER-THAN SIGN (>)
    Emit the current DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on.
    Emit that DOCTYPE token. Reconsume the EOF character in the data
    state.

Anything else
    Parse error. Switch to the bogus DOCTYPE state. (This does not
    set the DOCTYPE token's force-quirks flag to on.)

8.2.4.35 Bogus DOCTYPE state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
    Emit the DOCTYPE token. Switch to the data state.

EOF
    Emit the DOCTYPE token. Reconsume the EOF character in the data
    state.

Anything else
    Stay in the bogus DOCTYPE state.
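
Taken as a whole, the bogus DOCTYPE state skips input until it meets a
">" or the end of the file, and in both cases emits the DOCTYPE token
with whatever force-quirks flag it already carries. A minimal Python
sketch, assuming string input and a hypothetical function name:

```python
def skip_bogus_doctype(text, pos):
    """Skip the rest of a bogus DOCTYPE starting at text[pos].

    Returns (new_pos, reconsume_eof). The DOCTYPE token is emitted in
    both cases; this state never changes the token or reports an error.
    """
    end = text.find(">", pos)
    if end == -1:
        # No '>' before the end of input: EOF is reconsumed in the data state.
        return len(text), True
    return end + 1, False  # continue in the data state just after '>'
```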

8.2.4.36 CDATA section state

(This can only happen if the content model flag is set to the PCDATA
state, and is unrelated to the content model flag's CDATA state.)

Consume every character up to the next occurrence of the three
character sequence U+005D RIGHT SQUARE BRACKET U+005D RIGHT SQUARE
BRACKET U+003E GREATER-THAN SIGN (]]>), or the end of the file (EOF),
whichever comes first. Emit a series of character tokens consisting of
all the characters consumed except the matching three character
sequence at the end (if one was found before the end of the file).

Switch to the data state.

If the end of the file was reached, reconsume the EOF character.
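
The CDATA section state described above amounts to a search for the
first "]]>". A short Python sketch, assuming the input is a string
positioned just after the opening of the CDATA section; the function
name and the return shape are assumptions of this example:

```python
def consume_cdata_section(text, pos):
    """Consume a CDATA section body starting at text[pos].

    Returns (character_tokens, new_pos, reconsume_eof). The tokenizer
    switches to the data state afterwards in either case.
    """
    end = text.find("]]>", pos)
    if end == -1:
        # No terminator: everything up to EOF becomes character tokens.
        return list(text[pos:]), len(text), True
    # The three-character terminator is consumed but not emitted.
    return list(text[pos:end]), end + 3, False
```

Note that a lone "]" or "]]" inside the section is ordinary content;
only the full three-character sequence ends it.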

8.2.4.37 Tokenizing character references

This section defines how to consume a character reference. This
definition is used when parsing character references in text and in
attributes.

The behavior depends on the identity of the next character (the one
immediately after the U+0026 AMPERSAND character):

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
U+003C LESS-THAN SIGN
U+0026 AMPERSAND
EOF
The additional allowed character, if there is one
    Not a character reference. No characters are consumed, and
    nothing is returned. (This is not an error, either.)

U+0023 NUMBER SIGN (#)
    Consume the U+0023 NUMBER SIGN.

    The behavior further depends on the character after the U+0023
    NUMBER SIGN:

    U+0078 LATIN SMALL LETTER X
    U+0058 LATIN CAPITAL LETTER X
        Consume the X.

        Follow the steps below, but using the range of characters
        U+0030 DIGIT ZERO through to U+0039 DIGIT NINE, U+0061
        LATIN SMALL LETTER A through to U+0066 LATIN SMALL LETTER
        F, and U+0041 LATIN CAPITAL LETTER A through to U+0046
        LATIN CAPITAL LETTER F (in other words, 0-9, A-F, a-f).

        When it comes to interpreting the number, interpret it as
        a hexadecimal number.

    Anything else
        Follow the steps below, but using the range of characters
        U+0030 DIGIT ZERO through to U+0039 DIGIT NINE (i.e. just
        0-9).

        When it comes to interpreting the number, interpret it as
        a decimal number.

    Consume as many characters as match the range of characters
    given above.

    If no characters match the range, then don't consume any
    characters (and unconsume the U+0023 NUMBER SIGN character and,
    if appropriate, the X character). This is a parse error; nothing
    is returned.

    Otherwise, if the next character is a U+003B SEMICOLON, consume
    that too. If it isn't, there is a parse error.

    If one or more characters match the range, then take them all
    and interpret the string of characters as a number (either
    hexadecimal or decimal as appropriate).

    If that number is one of the numbers in the first column of the
    following table, then this is a parse error. Find the row with
    that number in the first column, and return a character token
    for the Unicode character given in the second column of that
    row.

    Number  Unicode character
    0x0D    U+000A LINE FEED (LF)
    0x80    U+20AC EURO SIGN ('€')
    0x81    U+FFFD REPLACEMENT CHARACTER
    0x82    U+201A SINGLE LOW-9 QUOTATION MARK ('‚')
    0x83    U+0192 LATIN SMALL LETTER F WITH HOOK ('ƒ')
    0x84    U+201E DOUBLE LOW-9 QUOTATION MARK ('„')
    0x85    U+2026 HORIZONTAL ELLIPSIS ('…')
    0x86    U+2020 DAGGER ('†')
    0x87    U+2021 DOUBLE DAGGER ('‡')
    0x88    U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT ('ˆ')
    0x89    U+2030 PER MILLE SIGN ('‰')
    0x8A    U+0160 LATIN CAPITAL LETTER S WITH CARON ('Š')
    0x8B    U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK ('‹')
    0x8C    U+0152 LATIN CAPITAL LIGATURE OE ('Œ')
    0x8D    U+FFFD REPLACEMENT CHARACTER
    0x8E    U+017D LATIN CAPITAL LETTER Z WITH CARON ('Ž')
    0x8F    U+FFFD REPLACEMENT CHARACTER
    0x90    U+FFFD REPLACEMENT CHARACTER
    0x91    U+2018 LEFT SINGLE QUOTATION MARK ('‘')
    0x92    U+2019 RIGHT SINGLE QUOTATION MARK ('’')
    0x93    U+201C LEFT DOUBLE QUOTATION MARK ('“')
    0x94    U+201D RIGHT DOUBLE QUOTATION MARK ('”')
    0x95    U+2022 BULLET ('•')
    0x96    U+2013 EN DASH ('–')
    0x97    U+2014 EM DASH ('—')
    0x98    U+02DC SMALL TILDE ('˜')
    0x99    U+2122 TRADE MARK SIGN ('™')
    0x9A    U+0161 LATIN SMALL LETTER S WITH CARON ('š')
    0x9B    U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK ('›')
    0x9C    U+0153 LATIN SMALL LIGATURE OE ('œ')
    0x9D    U+FFFD REPLACEMENT CHARACTER
    0x9E    U+017E LATIN SMALL LETTER Z WITH CARON ('ž')
    0x9F    U+0178 LATIN CAPITAL LETTER Y WITH DIAERESIS ('Ÿ')

    Otherwise, if the number is in the range 0x0000 to 0x0008,
    0x000E to 0x001F, 0x007F to 0x009F, 0xD800 to 0xDFFF, 0xFDD0 to
    0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF,
    0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
    0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF,
    0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE,
    0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF,
    0x10FFFE, or 0x10FFFF, or is higher than 0x10FFFF, then this is
    a parse error; return a character token for the U+FFFD
    REPLACEMENT CHARACTER character instead.

    Otherwise, return a character token for the Unicode character
    whose code point is that number.

Anything else
    Consume the maximum number of characters possible, with the
    consumed characters matching one of the identifiers in the first
    column of the named character references table (in a
    case-sensitive manner).

    If no match can be made, then this is a parse error. No
    characters are consumed, and nothing is returned.

    If the last character matched is not a U+003B SEMICOLON (;),
    there is a parse error.

    If the character reference is being consumed as part of an
    attribute, and the last character matched is not a U+003B
    SEMICOLON (;), and the next character is in the range U+0030
    DIGIT ZERO to U+0039 DIGIT NINE, U+0041 LATIN CAPITAL LETTER A
    to U+005A LATIN CAPITAL LETTER Z, or U+0061 LATIN SMALL LETTER A
    to U+007A LATIN SMALL LETTER Z, then, for historical reasons,
    all the characters that were matched after the U+0026 AMPERSAND
    (&) must be unconsumed, and nothing is returned.

    Otherwise, return a character token for the character
    corresponding to the character reference name (as given by the
    second column of the named character references table).

    If the markup contains "I'm &notit; I tell you", the character
    reference is parsed as "not", as in, "I'm ¬it; I tell you". But
    if the markup was "I'm &notin; I tell you", the character
    reference would be parsed as "notin;", resulting in "I'm ∉ I
    tell you".
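
The whole procedure for consuming a character reference can be
sketched in Python as below. Several things here are assumptions of the
sketch rather than anything this section defines: the function and
helper names, the (character, new position, parse error) return tuple,
the use of None for "nothing is returned", and above all NAMED, which is
a tiny hypothetical stand-in for the named character references table.
The rows 0x80 to 0x9F of the table above happen to coincide with
Python's cp1252 codec when undefined bytes are replaced with U+FFFD, so
the sketch uses that codec instead of spelling the table out.

```python
import string

# Hypothetical, heavily abbreviated stand-in for the named character
# references table; the real table is far larger.
NAMED = {"amp;": "&", "amp": "&", "lt;": "<", "lt": "<",
         "not;": "\u00ac", "not": "\u00ac", "notin;": "\u2209"}


def is_disallowed(n):
    """True for numbers that must be replaced with U+FFFD."""
    return (n <= 0x08 or n == 0x0B or 0x0E <= n <= 0x1F
            or 0x7F <= n <= 0x9F or 0xD800 <= n <= 0xDFFF
            or 0xFDD0 <= n <= 0xFDEF
            or (n & 0xFFFE) == 0xFFFE  # xFFFE and xFFFF in every plane
            or n > 0x10FFFF)


def consume_character_reference(text, pos, additional_allowed=None,
                                in_attribute=False):
    """text[pos] is the character right after the '&'.

    Returns (character or None, new_pos, parse_error).
    """
    nxt = text[pos] if pos < len(text) else None  # None models EOF
    if nxt is None or nxt in "\t\n\f <&" or nxt == additional_allowed:
        return None, pos, False  # not a character reference, not an error

    if nxt == "#":
        i = pos + 1
        digits, base = string.digits, 10
        if i < len(text) and text[i] in "xX":
            i += 1
            digits, base = string.hexdigits, 16
        start = i
        while i < len(text) and text[i] in digits:
            i += 1
        if i == start:
            return None, pos, True  # unconsume '#' (and the X, if any)
        number = int(text[start:i], base)
        missing_semicolon = not (i < len(text) and text[i] == ";")
        if not missing_semicolon:
            i += 1
        if number == 0x0D:
            return "\n", i, True
        if 0x80 <= number <= 0x9F:
            # Shortcut for the remapping table (see the lead-in).
            return bytes([number]).decode("cp1252", errors="replace"), i, True
        if is_disallowed(number):
            return "\ufffd", i, True
        return chr(number), i, missing_semicolon

    # Named reference: take the longest identifier that matches.
    best = None
    for name in NAMED:
        if text.startswith(name, pos) and (best is None or len(name) > len(best)):
            best = name
    if best is None:
        return None, pos, True
    end = pos + len(best)
    missing_semicolon = not best.endswith(";")
    if (in_attribute and missing_semicolon and end < len(text)
            and text[end] in string.ascii_letters + string.digits):
        return None, pos, True  # historical rule: unconsume everything
    return NAMED[best], end, missing_semicolon
```

With this sketch, the text after the ampersand in "I'm &notit; I tell
you" yields U+00AC NOT SIGN with a parse error, leaving "it;" as
ordinary text, while the same input consumed as part of an attribute
returns nothing and consumes nothing.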
|