From: http://www.joelonsoftware.com/articles/Unicode.html
Unicode assigns each letter a code point, written U+nnnn. A code point doesn't say how the letter is stored in memory, only what its conceptual representation is. Storage is decided separately by an encoding (UTF-8, UTF-16, etc).
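To make the split between code point and encoding concrete, here is a minimal Python sketch. The sample letter (é) and the variable names are my own choices for illustration, not from the article.

```python
# One letter, one code point, several possible byte representations.
letter = "\u00e9"  # e with acute accent

code_point = ord(letter)          # the abstract number Unicode assigns
label = f"U+{code_point:04X}"     # conventional U+nnnn notation

utf8_bytes = letter.encode("utf-8")       # two bytes in UTF-8
utf16_bytes = letter.encode("utf-16-be")  # two bytes in big-endian UTF-16

print(label, utf8_bytes.hex(), utf16_bytes.hex())
```

The code point stays the same in both cases; only the stored bytes differ.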
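UTF-8 uses 1 byte for every code point below 128 (the ASCII range), so plain ASCII text is byte-for-byte identical in UTF-8. Code points from 128 upward take 2 to 4 bytes.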
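A quick way to see the variable width of UTF-8 is to measure the encoded length of a few characters. This is only a sketch; the sample characters are arbitrary picks of mine.

```python
# Encoded length in UTF-8 grows with the code point value.
samples = {
    "A": "A",                # U+0041, ASCII range
    "e-acute": "\u00e9",     # U+00E9
    "euro": "\u20ac",        # U+20AC
    "emoji": "\U0001F600",   # U+1F600, outside the 16-bit range
}

utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

# ASCII text produces the same bytes under ASCII and UTF-8.
same_as_ascii = "hello".encode("utf-8") == "hello".encode("ascii")

print(utf8_lengths, same_as_ascii)
```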
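UCS-2 stores each code point in 2 bytes, which can be big-endian or little-endian. An initial Unicode byte order mark (U+FEFF) indicates which: it appears as FE FF in big-endian data and as FF FE in little-endian data. UTF-16 is often treated as an alias for UCS-2, but the two differ: UTF-16 uses a pair of 2-byte units (4 bytes) for code points above U+FFFF.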
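The byte order mark can be inspected directly in Python. The sketch below assumes the standard `codecs` constants and the generic "utf-16" codec, which reads a leading byte order mark to pick the byte order; the sample string is mine.

```python
import codecs

text = "Hi"

# Prefix each byte order with its own byte order mark (U+FEFF).
big_endian = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little_endian = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The same two letters, stored in opposite byte orders.
print(big_endian.hex(" "))
print(little_endian.hex(" "))

# The generic "utf-16" decoder consumes the mark and recovers the text.
decoded_be = big_endian.decode("utf-16")
decoded_le = little_endian.decode("utf-16")
```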
Unicode code points can be encoded in any encoding scheme you want, including older ones such as ASCII, Hebrew ANSI, or OEM Greek. But if a code point has no equivalent in the encoding scheme being used, you get a little ? or a ? in a rhombus (the replacement character, U+FFFD) in its place.
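Both kinds of question mark can be reproduced in Python. This sketch assumes `cp1255` (Python's name for the Windows Hebrew code page) as a stand-in for "Hebrew ANSI", and uses `errors="replace"` so that failures become substitute characters instead of exceptions.

```python
# A Hebrew word, written with escapes so the source stays ASCII-only.
hebrew_word = "\u05e9\u05dc\u05d5\u05dd"

# A Hebrew code page has a byte for each of these letters.
as_cp1255 = hebrew_word.encode("cp1255")
round_trip = as_cp1255.decode("cp1255")

# ASCII has no byte for them, so each letter becomes a plain "?".
as_ascii = hebrew_word.encode("ascii", errors="replace")

# Going the other way: a byte that is invalid in UTF-8 decodes to
# U+FFFD, which is usually drawn as a "?" inside a rhombus.
bad_decode = b"\xff".decode("utf-8", errors="replace")

print(as_cp1255, as_ascii, repr(bad_decode))
```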
A string without an encoding specified doesn't make sense.
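The reason is that the same bytes mean different things under different encodings. A small sketch, with the byte values chosen by me for illustration:

```python
# Two bytes with no encoding attached.
raw = b"\xc3\xa9"

# Read as UTF-8, they are a single letter: e with acute accent.
as_utf8 = raw.decode("utf-8")

# Read as Latin-1, they are two unrelated characters.
as_latin1 = raw.decode("latin-1")

print(repr(as_utf8), repr(as_latin1))
```

Neither reading is wrong on its own terms; only knowledge of the encoding the writer used settles which text was meant.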