
UTF-8LE - a more efficient, saner version of UTF-8

Name: Cudder !MhMRSATORI 2014-08-29 12:03

I just realised that UTF-8 is stupidly defined as big-endian!

U+20AC = 0010 000010 101100
In UTF-8 -> 11100010 10000010 10101100

...Meaning that to convert a codepoint into a series of bytes you have to shift the value before anding/adding the offset, viz.

b0 = (n >> 12) + 0xe0;
b1 = ((n >> 6) & 63) + 0x80;
b2 = (n & 63) + 0x80;


Just looking at the expressions it doesn't seem so bad, but shifting right before masking/adding means throwing away perfectly good bits in a register! The worst part is that the bits thrown away are exactly the ones needed in the upcoming computations, so you have to waste storage keeping the full codepoint around for the whole computation. Observe:

push eax
shr eax, 12
add al, 224
stosb
pop eax
push eax
shr eax, 6
and al, 63
add al, 128
stosb
pop eax
and al, 63
add al, 128
stosb


14 instructions, 23 bytes. Not so bad, but what if we stored the pieces the other way around, i.e. "UTF-8LE"?

U+20AC = 001000 001010 1100
In UTF-8LE -> 11101100 10001010 10001000

b0 = (n & 15) + 224;
b1 = ((n >> 4) & 63) + 128;
b2 = (n >> 10) + 128;


Observe that each group of bits is picked off the low end of n, so the next step's shift discards only bits that have already been emitted; there is no need to keep a second copy of n around just to preserve bits for later.

shl eax, 4
shr al, 4
add al, 224
stosb
mov al, ah
and al, 63
add al, 128
stosb
shr eax, 14
add al, 128
stosb


11 instructions, 22 bytes. Clearly superior.
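
For reference, a minimal C sketch of both three-byte encoders (restricted to codepoints in U+0800..U+FFFF; the function names are just placeholders):

#include <stdint.h>

/* Standard UTF-8, three-byte form: most significant 4 bits go in the lead byte. */
static void utf8_encode3(uint32_t n, uint8_t out[3])
{
    out[0] = (uint8_t)((n >> 12) + 0xE0);
    out[1] = (uint8_t)(((n >> 6) & 63) + 0x80);
    out[2] = (uint8_t)((n & 63) + 0x80);
}

/* "UTF-8LE", three-byte form: lowest 4 bits go in the lead byte,
   then increasingly higher-order 6-bit groups in the continuation bytes. */
static void utf8le_encode3(uint32_t n, uint8_t out[3])
{
    out[0] = (uint8_t)((n & 15) + 0xE0);
    out[1] = (uint8_t)(((n >> 4) & 63) + 0x80);
    out[2] = (uint8_t)((n >> 10) + 0x80);
}

/* U+20AC: utf8_encode3 -> E2 82 AC, utf8le_encode3 -> EC 8A 88. */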

The UTF-8 BOM is EF BB BF; the UTF-8LE BOM similarly will be EF AF BF.
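
Spelled out with the same splits as above:

U+FEFF = 1111 111011 111111
In UTF-8 -> 11101111 10111011 10111111 (EF BB BF)

U+FEFF = 111111 101111 1111
In UTF-8LE -> 11101111 10101111 10111111 (EF AF BF)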

Name: Cudder !MhMRSATORI 2014-08-30 12:35

>>20
UTF-8 and UTF-8LE are identical for the ASCII subset, so a BOM is not required at all for that.

>>21
> At this point in time, your suggestion breaks compatibility with commonly used formats
You can default to standard UTF-8 if there is no UTF-8LE BOM. The only thing that breaks is autodetection for UTF-8 files that start with U+FBFF ARABIC LETTER FARSI YEH MEDIAL FORM, which is about as likely as a Windows-1252 text file starting with the same series of characters as the UTF-8 BOM: "ï»¿".
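
A rough sketch of what the sniffing would look like (detect_encoding is just a made-up name; anything without the UTF-8LE BOM falls through to plain UTF-8):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum text_enc { ENC_UTF8, ENC_UTF8LE };

/* Default to standard UTF-8 unless the buffer starts with the UTF-8LE BOM EF AF BF. */
static enum text_enc detect_encoding(const uint8_t *buf, size_t len)
{
    static const uint8_t bom_le[3] = { 0xEF, 0xAF, 0xBF };
    if (len >= 3 && memcmp(buf, bom_le, 3) == 0)
        return ENC_UTF8LE;
    return ENC_UTF8;
}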

I've got the UTF-8 one down to 8 instructions and 22 bytes:

shl eax, 4
shr ax, 2
shr al, 2
add eax, 0e08080h
ror eax, 16
stosb
bswap eax
stosw


But the UTF-8LE one can be taken down to 7 instructions and 20 bytes:

shl eax, 6
shr ax, 2
shr al, 4
add eax, 8080e0h
stosb
shr eax, 8
stosw
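
The same trick in C, for anyone who wants it without the asm (utf8le_pack3 is a placeholder name): line the three fields up in separate bytes of one word, then a single add sets all the indicator bits at once.

#include <stdint.h>

/* n in U+0800..U+FFFF; bytes 0..2 of the result are the UTF-8LE sequence. */
static uint32_t utf8le_pack3(uint32_t n)
{
    uint32_t v = (n & 15)               /* low 4 bits  -> byte 0 */
               | (((n >> 4) & 63) << 8) /* next 6 bits -> byte 1 */
               | ((n >> 10) << 16);     /* high 6 bits -> byte 2 */
    return v + 0x8080E0;                /* one add sets the 1110 and both 10 indicators */
}

/* utf8le_pack3(0x20AC) == 0x888AEC, i.e. EC 8A 88 written lowest byte first. */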


Yes, I know about http://xkcd.com/927/ . But what I don't know is what the bloody hell the guys who designed UTF-8 were thinking. Put the most significant bits of the codepoint in the byte at the lowest offset!? That's completely backwards.

There is a symmetric issue when decoding UTF-8 back into codepoints: as you read each byte you have to strip its upper "indicator" bits, which means additional masking/shifting operations, whereas with UTF-8LE successive bytes carry increasingly higher-order bits, so each new byte can simply be read in over the previous byte's indicator bits (only the last one needs its indicators killed). Compare:

UTF-8:

1110aaaa 10bbbbbb 10cccccc

Read first byte:
00000000 00000000 00000000 1110aaaa
Kill indicator bits:
00000000 00000000 00000000 0000aaaa
Shift:
00000000 00000000 0000aaaa 00000000
Read second byte:
00000000 00000000 0000aaaa 10bbbbbb
Shift to kill indicators:
00000000 00000000 0000aaaa bbbbbb00
Shift again:
00000000 000000aa aabbbbbb 00000000
Read third byte:
00000000 000000aa aabbbbbb 10cccccc
Shift to kill indicators:
00000000 000000aa aabbbbbb cccccc00
Shift to finish:
00000000 00000000 aaaabbbb bbcccccc

UTF-8LE:

1110aaaa 10bbbbbb 10cccccc

Read first byte:
00000000 00000000 00000000 1110aaaa
Rotate
aaaa0000 00000000 00000000 00001110
Read second byte:
aaaa0000 00000000 00000000 10bbbbbb
Rotate:
bbbbbbaa aa000000 00000000 00000010
Read third byte:
bbbbbbaa aa000000 00000000 10cccccc
Kill indicators (only once):
bbbbbbaa aa000000 00000000 00cccccc
Rotate to finish:
00000000 00000000 ccccccbb bbbbaaaa

7 vs 9 operations, and decoding is independent of CPU endianness since you have to read individual bytes anyway. Clear victory for UTF-8LE.
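
For reference, the same two decoders in plain C (three-byte form only, no validation; the rotate trick is an asm-level thing and doesn't show up at this level):

#include <stdint.h>

/* Standard UTF-8: every byte's indicator bits get stripped, and the
   first byte carries the most significant bits. */
static uint32_t utf8_decode3(const uint8_t b[3])
{
    return ((uint32_t)(b[0] & 0x0F) << 12)
         | ((uint32_t)(b[1] & 0x3F) << 6)
         |  (uint32_t)(b[2] & 0x3F);
}

/* UTF-8LE: successive bytes carry increasingly higher-order bits,
   so each group shifts straight into place. */
static uint32_t utf8le_decode3(const uint8_t b[3])
{
    return  (uint32_t)(b[0] & 0x0F)
         | ((uint32_t)(b[1] & 0x3F) << 4)
         | ((uint32_t)(b[2] & 0x3F) << 10);
}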

The XBTS/IBTS instructions would've been a great way to do this on x86, but unfortunately were only present on the first stepping of the 386 for some reason. Haswell now has similar instructions in BMI2 (PDEP/PEXT)... although XBTS/IBTS didn't make it back into the opcode map where they should've been. :facepalm:
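
A rough sketch of the BMI2 route, assuming the CPU actually has it. Note that PDEP deposits source bits lowest-first into the mask positions, so the LE layout falls out naturally; standard UTF-8 would still need the fields swapped around afterwards.

#include <stdint.h>
#include <immintrin.h>  /* _pdep_u32 / _pext_u32, BMI2 only */

/* Three-byte UTF-8LE via PDEP: scatter the 16 codepoint bits into the payload
   positions of three little-endian bytes, then OR in the indicator bits.
   0x20AC -> 0x888AEC, i.e. EC 8A 88 in memory order. */
static uint32_t utf8le_pack3_bmi2(uint32_t n)
{
    return _pdep_u32(n, 0x003F3F0F) | 0x008080E0;
}

/* Decoding is the mirror image: gather the payload bits back out. */
static uint32_t utf8le_unpack3_bmi2(uint32_t v)
{
    return _pext_u32(v, 0x003F3F0F);
}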
