>>544For mostly the same reason I'm not going to submit code to Firefox, WebKit, etc. - Because I'd have to rewrite (or remove...) most of it anyway.
>>531>>516Hubbub is NetSurf's parser, and I've already mentioned it in
>>110. In detail, here's what's wrong with it (refer to
https://github.com/servo/libhubbub/blob/master/src/tokeniser/tokeniser.c as of the time of this post):
- Using switch statement and explicit case numbers to implement the tokeniser FSM instead of simple
goto
s and implicit position-based FSM. This means it has to go through that switch every time through the main loop, even if the next state is the same, instead of just jumping to the right place.
- Making
every damn state a function. The amount of work that has to be done in each state is tiny, but somehow they've managed to bloat them big enough to look like each one of those functions is nontrivial. There is a ton of duplicate code as a result - for example, compare
hubbub_tokeniser_handle_attribute_value_dq()
,
hubbub_tokeniser_handle_attribute_value_sq()
, and
hubbub_tokeniser_handle_attribute_value_uq()
. (The names are ridiculously long too.) Each one is over 40 lines and performs basically the same thing... can you spot the differences between them?
Here's the corresponding part of my parser's tokeniser; it handles all 3 types of attribute values.
attrv_unquoted_loop:
cmp al, '>'
jz finish_attr_done
call [ebp+htmlp.func_iswhs]
jnz attrv_loop
finish_attr_done:
; omitted code here finishes an attribute and emits a tag or starts a new one (~10 lines)
...
find_attrv:
xor bl, bl ; presume unquoted
; omitted code here scans for start of an attribute value (<10 lines)
...
cmp al, '"'
jz attrv_quoted
cmp al, "'"
jnz attrv_unquoted
attrv_quoted:
mov bl, al ; save quote char
inc edx ; skip quote char
attrv_loop:
call [ebp+htmlp.func_getchar] ; EOF - ignore
attrv_unquoted:
or bl, bl
jz attrv_unquoted_loop
cmp al, bl
jnz attrv_loop
jmp finish_attr_done
I've shown quite a bit more than the actual unquoted/quoted attribute value states, just so you can follow the structure around it, and yet this code
in x86 Asm is still shorter than
one of Hubbub's functions, in C, that does a
tiny fraction of the real work this does! The actual quoted/unquoted attribute value states are represented by these
ten instructions:
attrv_unquoted_loop:
cmp al, '>'
jz finish_attr_done
call [ebp+htmlp.func_iswhs]
jnz attrv_loop
...
attrv_loop:
call [ebp+htmlp.func_getchar] ; EOF - ignore
attrv_unquoted:
or bl, bl
jz attrv_unquoted_loop
cmp al, bl
jnz attrv_loop
jmp finish_attr_done
I'm not claiming to be an expert on HTML parsing by any means, but this code is so short because what it does
really is that simple - unquoted attributes are terminated by > or whitespace, and quoted ones are terminated by their quote character. There's nothing at all deep or complex about this, yet Hubbub's version makes it look like even "lightweight" code (which they claim to be) needs to do all that shit so it gets used as "evidence" by all the people claiming "browsers need to be complex!"
- Character references don't need to be handled by the tokeniser, since the presence or absence of any '&'s doesn't affect tokenisation. There's another chunk of complexity that could be factored out. I mentioned this in
>>110 already. As expected, Hubbub takes the stupid route.
- DOCTYPE tokens, same thing, already mentioned above.
- EOF - this is another bloater. Hubbub checks for EOF in the code of every state(!), but if you read the spec you'll see that the number of operations upon EOF are limited to two: not doing anything or "emit the current token", so my EOF handling is done in
getchar
. The cases that don't do anything don't need any code, the ones that do are recorded in an array (there's only 16 entries) and the right place is jumped to when the EOF function is called.
- Partial input handling: this is all handled in
getchar
- which returns "up two levels" to whatever called the parser when it runs out of input (try doing that in C!) The input is one big buffer which gets resized and moved as needed. I'm not sure what Hubbub does here, but it's probably more complex than this.
- Output interface: Hubbub stuffs everything into a "token" structure and then calls one function,
token_handler()
, to push the results out. It has functions named
emit_current_tag()
,
emit_current_comment()
,
emit_current_chars()
, etc., but all those have to go through that one
token_handler()
function. On the other side, the tree construction has to then switch on that type, just to call the right function, when they could've made the tokeniser call those different functions in the first place. How stupid is that? It's [i]bloody-panties-on-head retarded!![/code] I take the sane route with my tokeniser interface:
extrn @emit_doctype@4:near
extrn @emit_text@4:near
extrn @emit_comment@4:near
extrn @emit_tag@4:near
One function for each token type, as it should be. No need for a "token type" either, because the function being called implicitly determines what token was emitted, and it can directly read from the right fields of the parser/tokeniser structure to get its info. The pointer to the parser structure is in a register already. Simple and efficient. Tree construction is currently in C so I have to fight with stupid HLL calling conventions bullshit (I use ecx for a loop counter - the way it should be used - and ebp for the parser structure, but __fastcall wants to put the first parameter in ecx!!) but I expect that'll change once I rewrite that part in Asm too.
tl;dr: I'm not going to touch NetSurf, because it wouldn't be NetSurf anymore if I did.