I just realised that UTF-8 is stupidly defined as big-endian!
U+20AC = 0010 000010 101100

In UTF-8 -> 1110 0010
            10 000010
            10 101100

Meaning that to convert a codepoint into a series of bytes you have to shift the value
before anding/adding the offset, viz.
b0 = (n >> 12) + 0xe0;
b1 = ((n >> 6) & 63) + 0x80;
b2 = (n & 63) + 0x80;
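(If you want to check this at home, here is a throwaway C sketch; the function name and the little main are mine, and it assumes n is already known to need three bytes, i.e. 0x0800..0xFFFF and not a surrogate. For the euro sign it should print the familiar E2 82 AC.)

#include <stdio.h>

/* Big-endian field order: exactly the three expressions above. */
static void encode3_be(unsigned n, unsigned char *out)
{
    out[0] = (unsigned char)((n >> 12) + 0xe0);
    out[1] = (unsigned char)(((n >> 6) & 63) + 0x80);
    out[2] = (unsigned char)((n & 63) + 0x80);
}

int main(void)
{
    unsigned char b[3];
    encode3_be(0x20ac, b);
    printf("%02X %02X %02X\n", b[0], b[1], b[2]);   /* E2 82 AC */
    return 0;
}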
Just looking at the expressions it doesn't seem so bad, but shifting right first means throwing away perfectly good bits in a register! The worst part is that the bits thrown away are exactly the ones needed by the later steps, so you have to needlessly waste storage to preserve the entire value of the codepoint throughout the computation. Observe:
push eax        ; save n -- the shift below destroys its low bits
shr eax, 12     ; top 4 bits of the codepoint
add al, 224     ; 1110xxxx lead byte
stosb
pop eax         ; recover n...
push eax        ; ...and save it yet again
shr eax, 6      ; middle 6 bits
and al, 63
add al, 128     ; 10xxxxxx continuation byte
stosb
pop eax         ; recover n one more time
and al, 63      ; low 6 bits
add al, 128     ; 10xxxxxx continuation byte
stosb
14 instructions, 23 bytes. Not so bad, but what if we stored the pieces the other way around, i.e. "UTF-8LE"?
U+20AC = 001000 001010 1100

In UTF-8LE -> 1110 1100
              10 001010
              10 001000

b0 = (n & 15) + 224;
b1 = ((n >> 4) & 63) + 128;
b2 = (n >> 10) + 128;
Observe that each time bits are picked off n, the next step's shift only discards bits that have already been emitted, so there is no need to keep a second copy of n around just to preserve bits that will never be used again.
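In C the same idea lets n be consumed in place, something like this sketch (function name mine again):

/* "UTF-8LE": emit the low bits first, shifting them out of n as we go,
   so no saved copy of n is ever needed. */
static void encode3_le(unsigned n, unsigned char *out)
{
    out[0] = (unsigned char)((n & 15) + 224);  n >>= 4;
    out[1] = (unsigned char)((n & 63) + 128);  n >>= 6;
    out[2] = (unsigned char)(n + 128);
}

And the same thing in assembly: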
shl eax, 4      ; pre-shift: al holds the low 4 bits (in its top nibble), ah the next 8
shr al, 4       ; al = n & 15
add al, 224     ; 1110xxxx lead byte
stosb
mov al, ah      ; the next 6 bits are already sitting in ah
and al, 63      ; al = (n >> 4) & 63
add al, 128     ; 10xxxxxx continuation byte
stosb
shr eax, 14     ; al = n >> 10, the remaining 6 bits
add al, 128     ; 10xxxxxx continuation byte
(Post truncated)