How does Java store UTF-16 characters in its 16-bit char type?

The answer is in the javadoc: The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of …

Correctly reading a UTF-16 text file into a string without external libraries?

The C++11 solution (supported, on your platform, by Visual Studio since 2010, as far as I know) would be:

    #include <fstream>
    #include <iostream>
    #include <locale>
    #include <codecvt>

    int main()
    {
        // open as a byte stream
        std::wifstream fin("text.txt", std::ios::binary);
        // apply BOM-sensitive UTF-16 facet
        fin.imbue(std::locale(fin.getloc(),
            new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
        // read
        for(wchar_t c; fin.get(c); ) …

Emoji value range

The Unicode standard’s Unicode® Technical Report #51 includes a list of emoji (emoji-data.txt):

    …
    21A9 ; text ; L1 ; none ; j # V1.1 (↩) LEFTWARDS ARROW WITH HOOK
    21AA ; text ; L1 ; none ; j # V1.1 (↪) RIGHTWARDS ARROW WITH HOOK
    231A ; emoji ; L1 ; none ; j …

Why does the Java char primitive take up 2 bytes of memory?

When Java was originally designed, it was anticipated that any Unicode character would fit in 2 bytes (16 bits), so char and Character were designed accordingly. In fact, a Unicode character can now require up to 4 bytes. Thus UTF-16, the internal Java encoding, requires that supplementary characters use two code units. Characters in the Basic …
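The split of a supplementary character into two 16-bit code units can be sketched as follows. This is a minimal illustration in Python rather than Java, and the helper name `utf16_code_units` is hypothetical; it only applies the standard UTF-16 surrogate arithmetic and assumes its argument is a single valid character (not a lone surrogate).

```python
def utf16_code_units(ch: str) -> list:
    """Return the UTF-16 code units (as ints) for one Unicode character.

    Hypothetical helper for illustration; assumes `ch` is a single
    valid character, not a lone surrogate.
    """
    cp = ord(ch)
    if cp < 0x10000:
        # Code points below U+10000 fit in a single 16-bit code unit.
        return [cp]
    # Supplementary characters: subtract 0x10000, then split the
    # remaining 20 bits into a high and a low surrogate.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


if __name__ == "__main__":
    for ch in ("A", "\u20ac", "\U0001F600"):
        units = utf16_code_units(ch)
        print(f"U+{ord(ch):04X} ->", " ".join(f"{u:04X}" for u in units))
```

A character that yields two code units here would occupy two char values in a Java String, which is why the length of a string in chars can exceed its number of code points.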

Python UTF-16 CSV reader

At the moment, the csv module does not support UTF-16. In Python 3.x, csv expects a text-mode file, and you can simply use the encoding parameter of open to force another encoding:

    # Python 3.x only
    import csv
    with open('utf16.csv', 'r', encoding='utf16') as csvf:
        for line in csv.reader(csvf):
            print(line)  # do something with the line
    …

Difference between big-endian and little-endian byte order

Big-Endian (BE) / Little-Endian (LE) are two ways to organize multi-byte words. For example, when using two bytes to represent a character in UTF-16, there are two ways to represent the character 0x1234 as a string of bytes (0x00-0xFF):

    Byte Index:     0  1
    ---------------------
    Big-Endian:    12 34
    Little-Endian: 34 12

In order to decide if …
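The two byte orders for the 0x1234 example can be reproduced with a short Python sketch. It assumes Python 3.8 or later (for the separator argument of `bytes.hex`), and the variable names are illustrative only.

```python
# The character U+1234 from the example above, as a one-character string.
text = "\u1234"

# Encode the same 16-bit code unit in both byte orders (no BOM is written
# when the byte order is named explicitly in the codec).
big = text.encode("utf-16-be")
little = text.encode("utf-16-le")

print("Big-Endian:   ", big.hex(" "))     # 12 34
print("Little-Endian:", little.hex(" "))  # 34 12

# The same layout falls out of plain integer-to-bytes conversion.
assert (0x1234).to_bytes(2, "big") == big
assert (0x1234).to_bytes(2, "little") == little
```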