ASCII Encodings

If you haven’t read my last two columns on UTF encodings, I’ll summarize them here in one sentence:

THERE AIN’T NO SUCH THING AS ASCII!

That’s right, there is no such thing as an ASCII string unless you understand the encoding that it uses. And there are a lot of encodings. We live in an international world, and a significant segment of its population speaks, reads and writes something other than English. For your ASCII strings to be understood by those people, you really need to understand ASCII encodings.

In my previous articles, I described how every character on the planet is being assigned a code point by the Unicode Consortium. Yes, that’s right, every symbol in every writing system is getting a code point that looks like U+0639, where U means Unicode and 0639 is a hexadecimal identifier for the character. Don’t make the assumption that code points are limited to 16 bits, or 65,536 values. The Unicode code space runs from U+0000 to U+10FFFF, which leaves room for more than a million code points.
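To make the code point idea concrete, here is a minimal Python sketch. The helper name code_point_label is my own invention for illustration; ord() is the standard built-in that returns a character's code point.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    return f"U+{ord(ch):04X}"

ain = "\u0639"  # ARABIC LETTER AIN, the U+0639 example from the text
print(code_point_label(ain))            # U+0639
print(code_point_label("A"))            # U+0041
print(code_point_label("\U0001F600"))   # U+1F600, which does not fit in 16 bits
```

Notice that nothing here says anything about bytes. The label is just a number written in hexadecimal.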

That’s the next point. A code point has nothing to do with any kind of computer memory storage. It’s just a value. A Unicode value says nothing about how to store a character in memory or how to send it in an email. Encodings do that.

There are many different encodings that specify how to store a code point in memory. UTF-8 (Unicode Transformation Format 8) is the most widely used, especially on the web. UTF-8 resembles the standard way of storing ASCII data that we’ve used forever, with the nice property that a zero byte never appears inside an encoded character, so zero still works as a string terminator. If your strings are all standard 7-bit ASCII data, you don’t have to change a thing: every such string is already valid UTF-8.
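Here is a small Python sketch of that property. The helper utf8_bytes is a name I made up for illustration; str.encode() is the standard library call doing the work.

```python
def utf8_bytes(text: str) -> bytes:
    """Encode text as UTF-8 and return the raw bytes."""
    return text.encode("utf-8")

# Plain 7-bit ASCII text produces exactly the same bytes either way.
print(utf8_bytes("PLC"))                          # b'PLC'
print(utf8_bytes("PLC") == "PLC".encode("ascii")) # True

# A character outside ASCII takes more than one byte in UTF-8.
print(utf8_bytes("\u0639"))                       # b'\xd8\xb9'

# No zero byte appears unless the text itself contains U+0000.
print(0 in utf8_bytes("\u0639"))                  # False
```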

But there are other encodings: UTF-16, which uses one or two 16-bit units per code point, or the older UCS-2 standard, which always uses exactly 2 bytes and so cannot represent code points above U+FFFF (yes, that makes it different from UTF-16). There’s something called UTF-7, where the high bit is always zero, for those systems that use the high bit for some other purpose. And there are probably others that I haven’t run across. The point here is that you can’t transmit a string or process an incoming string unless you know its encoding. That’s why you will occasionally see an email message or some other string that contains a long series of question marks. That usually means that the programmer didn’t bother to detect the encoding designation and interpret the string properly.
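The following Python sketch shows the same text taking a different number of bytes under different encodings, and what happens when the encoding is guessed wrong. The helper encoded_sizes is hypothetical, written only for this example; the codec names are ones Python ships with.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of the same text under three Unicode encodings."""
    names = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in names}

ain = "\u0639"
print(encoded_sizes("A"))   # {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_sizes(ain))   # {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}

# Decoding UTF-8 bytes with the wrong encoding yields unrelated characters.
utf8_data = ain.encode("utf-8")
wrong = utf8_data.decode("latin-1")
print(len(wrong))           # 2 characters instead of 1

# Forcing the character into ASCII is one way question marks get produced.
print(ain.encode("ascii", errors="replace"))  # b'?'
```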

In an email there is an indicator of the form:

Content-Type: text/plain; charset="UTF-8"

in the header of the email that explains to the receiver how to decode it. When a programmer ignores that kind of information, “???????????????” is what you’ll see.
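As a rough sketch of doing it properly, Python's standard email package can read that header for you. The function decode_body and the sample message below are made up for illustration, and falling back to ASCII when no charset is given is my assumption, not a rule from this column.

```python
from email import message_from_bytes

# A tiny hand-built message: one header, a blank line, then a UTF-8 body.
raw = (
    b'Content-Type: text/plain; charset="UTF-8"\r\n'
    b"\r\n"
    b"\xd8\xb9"
)

def decode_body(raw_message: bytes) -> str:
    """Decode a single-part message body using its declared charset."""
    msg = message_from_bytes(raw_message)
    # get_content_charset() returns the charset parameter in lower case,
    # or None when the header does not declare one.
    charset = msg.get_content_charset() or "ascii"
    payload = msg.get_payload(decode=True)  # raw body bytes
    return payload.decode(charset, errors="replace")

print(decode_body(raw) == "\u0639")  # True
```

A real mail reader also has to handle multipart messages and unknown charset names, which this sketch leaves out.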

For websites it gets a little trickier. One method is for each web page to specify an HTML tag that identifies the encoding, but not every page does so. When that declaration is missing, the browser makes its best guess. It has some heuristics about how often certain byte patterns appear in a certain language and encoding, and it uses them to decide which characters to display. It works more often than not.
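For the HTML tag method, here is a deliberately simplified Python sketch. The function sniff_html_charset and its regular expression are my own illustration, not what any browser actually does; the limit of 1024 bytes and the UTF-8 default are assumptions I picked for the example.

```python
import re

# Matches <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_\-]+)""",
    re.IGNORECASE,
)

def sniff_html_charset(page: bytes, default: str = "utf-8") -> str:
    """Return the charset declared near the top of the page, else a default."""
    match = _META_CHARSET.search(page[:1024])
    return match.group(1).decode("ascii").lower() if match else default

page = b'<html><head><meta charset="ISO-8859-1"></head><body>caf\xe9</body></html>'
charset = sniff_html_charset(page)
print(charset)  # iso-8859-1
text = page.decode(charset, errors="replace")
print("caf\u00e9" in text)  # True
```

When no tag is found, this sketch simply assumes UTF-8 rather than attempting the statistical guessing a browser would do.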

And that, my friends, concludes my three-part series on ASCII encoding. I hope the next time you are working with ASCII strings, you’ll make sure to communicate what encoding you’re using and use the proper encodings to decode strings you receive. If you are looking for a device to move ASCII data in and out of a PLC, please visit our website – you’ll find out why we at RTA are known as the ASCII guys.

If you’d like to read a very well-written article on this subject, Joel on Software – ASCII Encodings is a great resource for you.