
UTF-8LE - a more efficient, saner version of UTF-8

Name: Cudder !MhMRSATORI 2014-08-29 12:03

I just realised that UTF-8 is stupidly defined as big-endian!

U+20AC = 0010 000010 101100
In UTF-8 -> 11100010 10000010 10101100

...Meaning that to convert a codepoint into a series of bytes you have to shift the value before anding/adding the offset, viz.

b0 = (n >> 12) + 0xe0;
b1 = ((n >> 6) & 63) + 0x80;
b2 = (n & 63) + 0x80;


Just looking at the expression it doesn't seem so bad, but shifting right before means throwing away perfectly good bits in a register! The worst thing is, the bits thrown away are exactly the ones needed in the upcoming computations, so you have to needlessly waste storage to preserve the entire value of the codepoint throughout the computation. Observe:

push eax
shr eax, 12
add al, 224
stosb
pop eax
push eax
shr eax, 6
and al, 63
add al, 128
stosb
pop eax
and al, 63
add al, 128
stosb


14 instructions, 23 bytes. Not so bad, but what if we stored the pieces the other way around, i.e. "UTF-8LE"?

U+20AC = 001000 001010 1100
In UTF-8LE -> 11101100 10001010 10001000

b0 = (n&15) + 224;
b1 = ((n>>4)&63) + 128;
b2 = (n>>10) + 128;


Observe that each time bits are picked off n, the next step's shift removes them, so there is no need to keep around another copy of n (including bits that wouldn't be used anymore).

shl eax, 4
shr al, 4
add al, 224
stosb
mov al, ah
and al, 63
add al, 128
stosb
shr eax, 14
add al, 128
stosb


11 instructions, 22 bytes. Clearly superior.
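
In C the same property looks roughly like this (a sketch only, shifting n down as its bits are consumed so no saved copy is needed):

b0 = (n&15) + 224;   /* low 4 bits */
n >>= 4;
b1 = (n&63) + 128;   /* next 6 bits */
n >>= 6;
b2 = n + 128;        /* top 6 bits */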

The UTF-8 BOM is EF BB BF; the UTF-8LE BOM similarly will be EF AF BF.
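
Splitting U+FEFF the same way as above:

U+FEFF = 111111 101111 1111
In UTF-8LE -> 11101111 10101111 10111111 = EF AF BF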

Name: Cudder !MhMRSATORI 2014-08-29 12:12

And ditto for UTF-16's placement of bits within surrogates... but since UTF-16 already has UTF-16BE/UTF-16LE, I propose the more logical alternative be named UTF-16LEL.

UTF-16BEL could also be defined, but it makes about as much sense as UTF-16BE, i.e. none.

Name: Anonymous 2014-08-29 12:16

Stop using shitty little endian systems

Name: Anonymous 2014-08-29 12:20

What a coinkydink, my anus is big-endian.

Name: Cudder !MhMRSATORI 2014-08-29 12:27

>>3
Stop using shitty big-endian systems.

It doesn't take a genius to realise that the byte at offset n having value 256^(L-n-1) (where L is the length in bytes) is absolutely bloody-panties-on-head-retarded compared to it having value 256^n.

Name: Anonymous 2014-08-29 12:29

Why is this level of optimization so important for Unicode? It's no longer 1980; we can actually afford multi-GHz CPUs and many gigabytes of RAM.

Name: Anonymous 2014-08-29 12:54

>>6
Unicode is supposed to be used everywhere, not just on the latest and greatest hardware.

Name: Anonymous 2014-08-29 13:57

>>6
Even in 1980 this would not be a problem

Name: Anonymous 2014-08-29 14:57

no! the ship has sailed! i do not want to deal with colleagues struggling to understand that there are multiple versions of Unicode.

all they understand now is that UTF8 = Unicode, that everything that is not UTF8 should be made UTF8, and that the issue gets assigned to me if that fails. i don't want to be dealing with this shit.

even right at this very moment, they're having problems. one of the "apps" (FUCK. THAT. WORD.) isn't outputting "special characters".

one world
one people
one encoding

Name: Anonymous 2014-08-29 15:13

>>9
U MENA ASCII

Name: Anonymous 2014-08-29 17:08

>>1
Shalooooooom!

Name: Anonymous 2014-08-29 18:21

>>6
My toaster shouldn't have to have that sort of hardware just to display it's settings.

Name: Anonymous 2014-08-29 18:35

>>12
An 8086 and PCB should be enough and cost, like, $5 (probably). Do you really want to pay two extra dollars for something more powerful?

Name: Anonymous 2014-08-29 19:01

>>13
x86 for embedded systems? Terrible!

Name: Anonymous 2014-08-29 19:13

>>14
8086 is commonly used for this job

Name: Anonymous 2014-08-29 19:17

>>12
its

Name: Anonymous 2014-08-29 20:51

>>16
it's
my toaster is settings, learn to think

Name: Anonymous 2014-08-29 22:03

Fucking English and it's punctuation overloading.

Name: Anonymous 2014-08-29 22:06

>>15
Just because lots of people use it doesn't make it good.

Name: Anonymous 2014-08-29 22:09

>>1
UTF-8LE is an abomination because it requires a BOM, and including a BOM renders files 7-bit unclean even if they use only the ASCII subset of UTF-8. If the performance of your program is seriously impacted by the text serialization format, something is wrong.

Name: Anonymous 2014-08-29 23:33

>>1

There are other metrics than time that you ought to concern yourself with when designing a character encoding format: simplicity and portability of implementation, compatibility with commonly used formats, size of the encoded data, etc.

At this point in time, your suggestion breaks compatibility with commonly used formats, which is far more important than three fewer instructions on little-endian CPUs that happen to run time-optimised implementations of UTF-8. If you want to talk about sanity, introducing an incompatible character encoding format just so you can squeeze three instructions out of your implementation is fucking crazy.

Name: Anonymous 2014-08-30 3:37

>>21
Hey, this is Cudder we're talking about. No optimization is too premature.

Name: Cudder !MhMRSATORI 2014-08-30 12:35

>>20
UTF-8 and UTF-8LE are identical for the ASCII subset, so a BOM is not required at all for that.

>>21
> At this point in time, your suggestion breaks compatibility with commonly used formats
You can default to standard UTF-8 if there is no UTF-8LE BOM. The only thing that breaks is autodetection for UTF-8 files starting with "ARABIC LETTER FARSI YEH MEDIAL FORM", and that's something whose occurrence has around the same likelihood as '1252 text files starting with the same series of characters as the UTF-8 BOM: "ï»¿".
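
A minimal sketch of that default (the function name and interface here are illustrative, not from any existing library):

#include <stddef.h>
#include <stdint.h>

enum enc { ENC_UTF8, ENC_UTF8LE };

/* Treat the input as UTF-8LE only when the proposed EF AF BF BOM is present;
   anything else (including the regular EF BB BF BOM) is handled as UTF-8. */
static enum enc detect(const uint8_t *p, size_t len, size_t *bom_len)
{
    if (len >= 3 && p[0] == 0xEF && p[1] == 0xAF && p[2] == 0xBF) {
        *bom_len = 3;
        return ENC_UTF8LE;
    }
    *bom_len = 0;
    return ENC_UTF8;
}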

I've got the UTF-8 one down to 8 instructions and 22 bytes:

shl eax, 4
shr ax, 2
shr al, 2
add eax, 0e08080h
ror eax, 16
stosb
bswap eax
stosw
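
Stepping that through with eax = 20ACh:

shl eax, 4         eax = 00020AC0h
shr ax, 2          eax = 000202B0h
shr al, 2          eax = 0002022Ch
add eax, 0e08080h  eax = 00E282ACh
ror eax, 16        eax = 82AC00E2h
stosb              writes E2
bswap eax          eax = E200AC82h
stosw              writes 82 AC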


But the UTF-8LE one can be taken down to 7 instructions and 20 bytes:

shl eax, 6
shr ax, 2
shr al, 4
add eax, 8080e0h
stosb
shr eax, 8
stosw
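
Stepping that through with eax = 20ACh:

shl eax, 6         eax = 00082B00h
shr ax, 2          eax = 00080AC0h
shr al, 4          eax = 00080A0Ch
add eax, 8080e0h   eax = 00888AECh
stosb              writes EC
shr eax, 8         eax = 0000888Ah
stosw              writes 8A 88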


Yes, I know about http://xkcd.com/927/ . But what I don't know is what the bloody hell the guys who designed UTF-8 were thinking. Put the most significant bits of the codepoint in the byte at the lowest offset!? That's completely backwards.

There is a symmetric issue when decoding UTF-8 into codepoints: as you read each byte you have to get rid of the upper "indicator" bits, which means additional masking operations, whereas with UTF-8LE successive bytes contain increasingly higher-order bits, so each one can simply be read in over the indicator bits left by the previous byte (only the last byte's indicators need masking). Compare:

UTF-8:

1110aaaa 10bbbbbb 10cccccc

Read first byte:
00000000 00000000 00000000 1110aaaa
Kill indicator bits:
00000000 00000000 00000000 0000aaaa
Shift:
00000000 00000000 0000aaaa 00000000
Read second byte:
00000000 00000000 0000aaaa 10bbbbbb
Shift to kill indicators:
00000000 00000000 0000aaaa bbbbbb00
Shift again:
00000000 000000aa aabbbbbb 00000000
Read third byte:
00000000 000000aa aabbbbbb 10cccccc
Shift to kill indicators:
00000000 000000aa aabbbbbb cccccc00
Shift to finish:
00000000 00000000 aaaabbbb bbcccccc

UTF-8LE:

1110aaaa 10bbbbbb 10cccccc

Read first byte:
00000000 00000000 00000000 1110aaaa
Rotate
aaaa0000 00000000 00000000 00001110
Read second byte:
aaaa0000 00000000 00000000 10bbbbbb
Rotate:
bbbbbbaa aa000000 00000000 00000010
Read third byte:
bbbbbbaa aa000000 00000000 10cccccc
Kill indicators (only once):
bbbbbbaa aa000000 00000000 00cccccc
Rotate to finish:
00000000 00000000 ccccccbb bbbbaaaa

7 vs 9 operations, and decoding is independent of CPU endianness since you have to read individual bytes anyway. Clear victory for UTF-8LE.
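
The rotate-based decode, written out as a rough C sketch (validation omitted; rotr32 is a plain helper a compiler should turn into a single ror, and the byte loads stand in for mov al, [esi]):

#include <stdint.h>

static inline uint32_t rotr32(uint32_t x, unsigned r)
{
    return (x >> r) | (x << (32 - r));
}

/* Decode one 3-byte UTF-8LE sequence 1110aaaa 10bbbbbb 10cccccc,
   where aaaa are the lowest codepoint bits and cccccc the highest. */
static uint32_t utf8le_decode3(const uint8_t *p)
{
    uint32_t v = p[0];             /* 00000000 00000000 00000000 1110aaaa */
    v = rotr32(v, 4);              /* aaaa0000 00000000 00000000 00001110 */
    v = (v & ~0xFFu) | p[1];       /* aaaa0000 00000000 00000000 10bbbbbb */
    v = rotr32(v, 6);              /* bbbbbbaa aa000000 00000000 00000010 */
    v = (v & ~0xFFu) | p[2];       /* bbbbbbaa aa000000 00000000 10cccccc */
    v &= ~0xC0u;                   /* kill indicators, only once          */
    return rotr32(v, 22);          /* 00000000 00000000 ccccccbb bbbbaaaa */
}

Feeding it EC 8A 88 gives back 20ACh.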

The XBTS/IBTS instructions would've been a great way to do this on x86, but unfortunately were only present on the first stepping of the 386 for some reason. Haswell now has similar instructions... although XBTS/IBTS didn't make it back into the opcode map where they should've been. :facepalm:

Name: Anonymous 2014-08-30 18:05

>>23
Shalom!

Name: Anonymous 2014-08-30 18:27

Cudder, you're like that programmer, who optimized the wait loop
http://c2.com/cgi/wiki?OptimizingTheIdleLoop

UTF8 is fast enough

Name: Anonymous 2014-08-31 1:32

What does Lord Master Gordon Moore-sama-kami-san say about character encodings, Cudder?

Name: Anonymous 2014-08-31 1:37

>>26
Use UCS-2-LE?

Name: Cudder !MhMRSATORI 2014-08-31 3:55

>>25
Idle loops need optimisation too! Putting hlt in a loop is acceptable, but for the lowest idle power consumption you'd ideally want to lower the clock frequencies and voltages too (and put them back up at the right time). This is extremely important for mobile devices, but even on a desktop, where an idle core draws a few watts while one spinning in a tight loop takes more than 10x the power, the savings are significant. I believe this is automatic (or at least accomplished via SMI) on x86, but on ARM SoCs it must be done manually:

http://processors.wiki.ti.com/index.php/AM335x_Linux_Power_Management_User_Guide

https://blogs.oracle.com/bholler/entry/the_most_executed_code_in

And saying that UTF8 is "fast enough" misses the whole point: every little bit of time saved running that code is potentially more time the CPU can go into low-power mode. You're like those idiots who thought petrol was "cheap enough" - it is until it isn't, and then you have a huge problem because by then the old inefficient standard has already become so widely adopted.

Name: Anonymous 2014-08-31 9:38

>>28
Shalom!

Name: Anonymous 2014-08-31 14:07

>>28
> And saying that UTF8 is "fast enough" misses the whole point: every little bit of time saved running that code is potentially more time the CPU can go into low-power mode. You're like those idiots who thought petrol was "cheap enough" - it is until it isn't, and then you have a huge problem because by then the old inefficient standard has already become so widely adopted.

Have you done any profiling?

Name: Anonymous 2014-08-31 14:35

naidne-elttiL si TIHS

Name: Anonymous 2014-08-31 14:39

>>25-san, you're like that guy, who posts unfunny TV Tropes ``articles'' whenever he has the chance.

Stop posting links to C2, aka Know Your Meme Le Brogrammer Edition.

Name: Anonymous 2014-08-31 17:53

>>32

you're like that guy, who is against enterprise quality software, design patterns and extreme programming.

Cum-in-ham 2 is the God's wiki. Prove me wrong.

Name: Anonymous 2014-08-31 18:01

>>32
I like C2

Name: Anonymous 2014-08-31 18:58

>>33,34
/r/programming has lots of epic programming memes too :)

Name: Anonymous 2014-08-31 19:16

>>35
I think you should go back there

Name: Anonymous 2014-08-31 21:19

>>34
So do I, it's like a computer-science/software-developer version of the Talmud.

Name: Anonymous 2014-09-01 1:01

A startling level of anal prowess on the part of Cudder.

Name: Cudder !cXCudderUE 2017-01-09 1:27

>>2
I just realised that UTF-16LEL doesn't require much change to process compared to regular UTF-16LE, since it only means the surrogate pairs are in the other (i.e. saner) order, with the low surrogate at the lower address and the high surrogate at the higher address, as they should've been.

That's right, some idiots at the Unicode Consortium thought the low surrogate should be at the higher address, and vice-versa. Maybe the same ones who came up with UTF-8. :facepalm:
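
As a rough sketch, assuming the only change from UTF-16LE is the surrogate order (the function name is illustrative):

#include <stdint.h>

/* Encode a supplementary codepoint (above U+FFFF) as UTF-16LEL: the usual
   surrogate values, but with the low surrogate at the lower address.
   Each 16-bit unit is still serialised little-endian. */
static int utf16lel_encode_supplementary(uint32_t cp, uint8_t out[4])
{
    cp -= 0x10000;
    uint16_t hi = 0xD800 + (cp >> 10);    /* high surrogate: top 10 bits   */
    uint16_t lo = 0xDC00 + (cp & 0x3FF);  /* low surrogate: bottom 10 bits */
    out[0] = lo & 0xFF;  out[1] = lo >> 8;   /* low surrogate first        */
    out[2] = hi & 0xFF;  out[3] = hi >> 8;   /* high surrogate second      */
    return 4;
}

U+1F600, for example, would come out as 00 DE 3D D8 instead of UTF-16LE's 3D D8 00 DE.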

Name: Anonymous 2017-01-09 5:41

>>39
vectors ain't voxels, dolt
