Tech Stuff I Want To Remember: Unicode - what programmers need to know

Tuesday, 9 September 2014

Unicode - what programmers need to know

From: http://www.joelonsoftware.com/articles/Unicode.html

Unicode assigns each letter a code point (written as U+nnnn). The code point is only the letter's conceptual representation - it doesn't say how the letter is stored in memory. Storage is decided separately by an encoding (UTF-8, UTF-16, etc).
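
To make the code point vs. encoding split concrete, here is a minimal Python sketch. The sample letter and the variable names are my own choices for illustration, not from the article.

```python
# Hypothetical sample letter: e with acute accent, written as an escape
# so this source file stays pure ASCII.
text = "\u00e9"

# The code point is an abstract number, independent of any storage format.
code_point = ord(text)
label = f"U+{code_point:04X}"

# The same code point turns into different bytes under different encodings.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")

print(label, utf8_bytes.hex(), utf16_bytes.hex())
```

One code point, two different byte sequences: the encoding decides the bytes, not Unicode itself.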

UTF-8 uses 1 byte for every code point below 128, with the same byte values as ASCII. Code points from 128 upwards take 2 to 4 bytes.
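UTF-16 uses 2 bytes for code points up to U+FFFF and 4 bytes (a surrogate pair) for anything above that. The older UCS-2 is fixed at 2 bytes and cannot represent code points above U+FFFF, so the two are not strictly the same thing. The 2-byte units can be stored big-endian or little-endian, and an initial Unicode byte order mark (U+FEFF) indicates which: FE FF means big-endian, FF FE means little-endian.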
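
A quick way to check byte counts and the byte order mark is to ask Python. This is a small sketch; `encoded_lengths` is a helper name I made up, and the sample characters are arbitrary.

```python
import codecs


def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for a single character."""
    # "utf-16-be" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


ascii_letter = encoded_lengths("A")          # code point below 128
euro_sign = encoded_lengths("\u20ac")        # code point in the U+0800..U+FFFF range
emoji = encoded_lengths("\U0001F600")        # code point above U+FFFF

# The byte order mark U+FEFF as it appears in each byte order.
bom_big_endian = codecs.BOM_UTF16_BE
bom_little_endian = codecs.BOM_UTF16_LE

# The same letter with its two bytes in each order.
a_big_endian = "A".encode("utf-16-be")
a_little_endian = "A".encode("utf-16-le")

print(ascii_letter, euro_sign, emoji)
print(bom_big_endian.hex(), bom_little_endian.hex())
```

The emoji needs 4 bytes in both encodings; in UTF-16 that is a surrogate pair.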

Unicode code points can be encoded in any encoding scheme you want - could be ASCII, Hebrew ANSI, OEM Greek, etc. But if a code point has no equivalent in the chosen encoding scheme, then you get a little ? or a ? in a rhombus (the replacement character, U+FFFD).
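
Here is a rough Python sketch of where the question marks come from. I am assuming "Hebrew ANSI" means the Windows-1255 code page (`cp1255` in Python), and the sample word is my own.

```python
# Hypothetical sample: a four-letter Hebrew word, written as escapes.
hebrew = "\u05e9\u05dc\u05d5\u05dd"

# ASCII has no equivalent for these letters, so each one becomes "?".
as_ascii = hebrew.encode("ascii", errors="replace")

# Windows-1255 (assumed here to be "Hebrew ANSI") has all four letters,
# one byte each, and the text survives a round trip.
as_cp1255 = hebrew.encode("cp1255")
round_trip = as_cp1255.decode("cp1255")

# Going the other way: a byte that is not valid UTF-8 decodes to U+FFFD,
# which is usually drawn as a question mark in a rhombus.
bad_decode = b"\xff".decode("utf-8", errors="replace")

print(as_ascii, len(as_cp1255), round_trip == hebrew, hex(ord(bad_decode)))
```

Without `errors="replace"` both calls raise an exception instead of silently losing the letters, which is often the better default.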

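A string without an encoding specified doesn't make sense - the same bytes mean different letters in different encodings, so you can't read them correctly without knowing which one was used.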
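
To see why, take a single byte and decode it a few different ways. This is a minimal sketch; the byte value and the `decode_as` helper are my own illustration.

```python
def decode_as(raw, encoding):
    """Decode raw bytes with the given encoding, or return None if invalid."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# One hypothetical byte with no encoding attached.
raw = b"\xe9"

as_latin1 = decode_as(raw, "latin-1")  # a Latin letter (U+00E9)
as_cp1255 = decode_as(raw, "cp1255")   # a Hebrew letter (U+05D9)
as_utf8 = decode_as(raw, "utf-8")      # not valid UTF-8 on its own

for name, value in [("latin-1", as_latin1), ("cp1255", as_cp1255), ("utf-8", as_utf8)]:
    shown = "invalid" if value is None else f"U+{ord(value):04X}"
    print(name, shown)
```

Same byte, three different answers, and nothing in the byte itself says which one is right.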
