I've discovered that all of the HTML5 named character references (&xxxx;) are at least as long as the UTF-8 encoding of the codepoints they represent, meaning that I can do the conversion in-place or in an allocation exactly the same size as the original string...
...except for these two bastards, whose UTF-8 encoding is one byte longer than the reference itself:
&nGt; (≫⃒) U+226B U+20D2 ; E2 89 AB E2 83 92
&nLt; (≪⃒) U+226A U+20D2 ; E2 89 AA E2 83 92
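For anyone who wants to check the length claim themselves, here is a rough sketch in Python. It assumes the standard library's `html.entities.html5` table matches the HTML5 list; the function name `oversized_references` is made up for this post. It compares the byte length of each `&name` reference with the UTF-8 length of its expansion and collects the ones where the expansion is longer.

```python
from html.entities import html5  # maps names such as 'amp;' or 'nGt;' to their text


def oversized_references():
    """Return the names whose UTF-8 expansion is longer than the '&name' text."""
    return sorted(
        name
        for name, text in html5.items()
        # Keys already carry the trailing ';' where one is required,
        # so only the leading '&' has to be added back.
        if len(text.encode("utf-8")) > len("&" + name)
    )


if __name__ == "__main__":
    print(oversized_references())
```

If the observation above is right, the printed list should contain only the two names shown.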
Two out of 2000+! I'm not going to support those, because it's absolutely idiotic to have to redo and complicate the whole buffer allocation logic just to handle the 0.1% of them that will probably almost never show up in real pages anyway.
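To make the in-place idea concrete, here is a minimal sketch, not my actual implementation. It is written in Python over a `bytearray` purely for illustration, the name `unescape_in_place` is hypothetical, and it only handles semicolon-terminated named references (no numeric ones, no legacy semicolon-less forms). A read cursor and a write cursor walk the same buffer; since every supported expansion is no longer than its reference, the write cursor can never overtake the read cursor. The two oversized references are simply left as literal text.

```python
from html.entities import html5

# Longest name in the table, counting the trailing ';' (bounds the search for ';').
_MAX_NAME = max(len(name) for name in html5)


def unescape_in_place(buf: bytearray) -> int:
    """Decode named references inside buf and return the new logical length."""
    read = write = 0
    n = len(buf)
    while read < n:
        if buf[read] == 0x26:  # '&'
            semi = buf.find(b";", read + 1, read + 1 + _MAX_NAME)
            if semi != -1:
                # latin-1 never fails to decode; non-ASCII bytes just won't match a name.
                name = buf[read + 1:semi + 1].decode("latin-1")
                enc = html5.get(name, "").encode("utf-8")
                ref_len = semi + 1 - read
                # Only substitute when the expansion fits in the space of the reference.
                if enc and len(enc) <= ref_len:
                    buf[write:write + len(enc)] = enc
                    write += len(enc)
                    read += ref_len
                    continue
        # Ordinary byte, unknown reference, or oversized reference: copy as-is.
        buf[write] = buf[read]
        write += 1
        read += 1
    return write


if __name__ == "__main__":
    data = bytearray(b"a &lt; b &amp;&nGt; c")
    end = unescape_in_place(data)
    print(bytes(data[:end]))
```

The caller truncates to the returned length (for example `del data[end:]`), so no second buffer is needed.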
What's even more idiotic is that if someone had paid some attention to implementation and made the names just a single byte longer, e.g. &nGGt; or &nLLt;, they would've fit in perfectly with the rest of them. But it might be too much to expect of the W3C.