HTML Entity Based Codepage Inference 0.01 review

License: GPL (GNU General Public License)
Developer: Josh Myer

HEBCI is a technique that allows a web form handler to transparently detect the character set its data was encoded with. By using carefully-chosen character references, the browser's encoding can be inferred.

Thus, it is possible to guarantee that data is in a standard encoding without relying on (often unreliable) webserver/browser encoding interactions.

The ideal solution will be entirely browser-neutral and passive. Unfortunately, the HTML spec doesn't define any mechanism for this. We need to find some other, sneakier, way to extract the current character encoding from the browser.

Luckily for us, there is a trick we can use for this: entity codes. Entity codes are strings like &amp;, which were (and still are) used to encode specific characters without using Unicode. When the browser displays a page, it replaces each one with the appropriate character from the current encoding.

Thus, &amp; becomes the character 0x26 in most codepages. By itself, this is merely implementation trivia. However, the same translation also occurs whenever a user submits a form. That is, when the user clicks submit, the browser parses any entities in the form variables and replaces them with the current encoding's representation of those characters. Any entity codes within the form fields are therefore passed along as character values in the browser's current encoding.
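As a rough illustration of why this matters, the short Python sketch below shows the bytes a single entity's character turns into under a few encodings. It uses Python's own codec names (latin-1, mac_roman, utf-8) as stand-ins for the browser's codepages, which is an assumption of this example, not part of HEBCI itself.

```python
import html

# The character a browser substitutes for the entity reference.
char = html.unescape("&deg;")  # U+00B0 DEGREE SIGN

# The same character becomes different bytes depending on the codepage.
for codec in ("latin-1", "mac_roman", "utf-8"):
    print(codec, char.encode(codec).hex())
# latin-1 b0
# mac_roman a1
# utf-8 c2b0

# By contrast, &amp; is 0x26 in all three, so it tells us nothing.
amp_bytes = {codec: html.unescape("&amp;").encode(codec).hex()
             for codec in ("latin-1", "mac_roman", "utf-8")}
```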

So, all we have to do is find an entity that is encoded differently in two different codepages. We slip that into a form field, and then look at its value when the form data comes back. This allows us to differentiate between the two encodings. In fact, we could look at all entities in many codepages, and find the ones that allow us to disambiguate between many codepages. This is what I've done.
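The search for useful entities can be sketched along these lines. The candidate list and the set of codepages here are small, hypothetical samples chosen for the example, and a character that a codepage cannot represent is simply recorded as None (what a real browser sends in that case is not modeled).

```python
import html

# Hypothetical samples; a real survey would cover far more of both.
CANDIDATE_ENTITIES = ["&amp;", "&deg;", "&divide;", "&mdash;"]
CODEPAGES = ["latin-1", "cp1252", "mac_roman", "utf-8"]

def encoded_forms(entity):
    """Map each codepage to the hex bytes for the entity's character,
    or None if the codepage cannot represent that character."""
    char = html.unescape(entity)
    forms = {}
    for cp in CODEPAGES:
        try:
            forms[cp] = char.encode(cp).hex()
        except UnicodeEncodeError:
            forms[cp] = None
    return forms

def distinguishes(entity, cp_a, cp_b):
    """True if the entity's encoded form differs between the two codepages."""
    forms = encoded_forms(entity)
    return forms[cp_a] != forms[cp_b]

# Entities that separate MacRoman from UTF-8 in this small sample.
useful = [e for e in CANDIDATE_ENTITIES if distinguishes(e, "mac_roman", "utf-8")]
```

In this sample, &amp; is useless because it is 0x26 everywhere, while &deg; separates MacRoman from UTF-8 but not latin-1 from cp1252, which is why several entities are needed together.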

We add hidden form elements with values containing various entity codes, such as &deg;, &divide;, and &mdash;. Then, when the user submits the form, we take each of those values and compare it against a table listing the byte value of each character in each codepage. That is, each codepage has a unique fingerprint for the values of &deg;, &divide;, and &mdash;. For MacRoman, it's a1,d6,d1; for UTF-8, c2b0,c3b7,e28094. Thus, we only have to go through our table of codepage-to-fingerprint mappings and see which fingerprint matches.
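A minimal sketch of the fingerprint lookup might look like the following. The field names (hebci_deg and so on), the three codepages, and the function names are all hypothetical, and the code assumes the form handler has already percent-decoded the submitted values into raw bytes.

```python
import html

# Probe entities from the text; the field names are hypothetical.
PROBES = {"hebci_deg": "&deg;", "hebci_divide": "&divide;", "hebci_mdash": "&mdash;"}
CODEPAGES = ("mac_roman", "cp1252", "utf-8")

def hidden_fields():
    """Render the hidden inputs that carry the probe entities."""
    return "\n".join(
        f'<input type="hidden" name="{name}" value="{entity}">'
        for name, entity in PROBES.items()
    )

def build_fingerprints():
    """Map each codepage's tuple of probe bytes (as hex) back to the codepage."""
    table = {}
    for cp in CODEPAGES:
        fingerprint = tuple(
            html.unescape(entity).encode(cp).hex() for entity in PROBES.values()
        )
        table[fingerprint] = cp
    return table

FINGERPRINTS = build_fingerprints()

def infer_codepage(submitted):
    """submitted: dict of field name -> raw bytes received from the form.
    Returns the matching codepage name, or None on a miss."""
    fingerprint = tuple(submitted.get(name, b"").hex() for name in PROBES)
    return FINGERPRINTS.get(fingerprint)
```

For example, a submission carrying the bytes a1, d6, d1 in the three probe fields would come back as mac_roman, and c2b0, c3b7, e28094 as utf-8, matching the fingerprints quoted above.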

Note that, once this table has been built, the cost of fingerprinting a given form submission is very low. And if no fingerprint matches, you can assume whatever your page's default codepage is. This fallthrough case is equivalent to what the code would have done before adding this detection layer.
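The fallthrough behaviour could be expressed roughly as below. The function name, the default of utf-8, and the choice to replace undecodable bytes instead of raising an error are all assumptions made for this sketch.

```python
def decode_field(raw, detected, default="utf-8"):
    """Decode one submitted field to text.

    raw:      the field's bytes as received
    detected: codepage name from fingerprinting, or None on a miss
    default:  the page's default codepage (assumed to be UTF-8 here)
    """
    codepage = detected or default
    # errors="replace" keeps the handler from failing on stray bytes.
    return raw.decode(codepage, errors="replace")

# A hit uses the detected codepage; a miss falls back to the default.
from_mac = decode_field(b"25\xa1C", "mac_roman")
from_miss = decode_field(b"25\xc2\xb0C", None)
```

Once decoded this way, the text can be re-encoded into whatever standard encoding the rest of the application expects.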
