utf-16 – Make Me Engineer

UTF-16 to UTF-8 conversion (for scripting in Windows)

May 31, 2023 by Tarik

The GNU tool recode is also available on Windows. For example, recode utf16..utf8 text.txt converts text.txt in place.
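Where recode is not installed, the same conversion can be sketched in a few lines of Python. This is a minimal illustration, not part of the original answer; the function name utf16_to_utf8 and the file names are hypothetical.

```python
from pathlib import Path


def utf16_to_utf8(src: str, dst: str) -> None:
    """Re-encode a UTF-16 text file as UTF-8 (hypothetical helper)."""
    # The "utf-16" codec consumes a leading byte order mark (BOM), if present,
    # and uses it to pick the byte order.
    text = Path(src).read_bytes().decode("utf-16")
    # Working on bytes keeps line endings exactly as they were in the source.
    Path(dst).write_bytes(text.encode("utf-8"))
```

Unlike the recode command above, this sketch writes to a separate output file rather than converting in place.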

How does Java store UTF-16 characters in its 16-bit char type?

May 29, 2023 by Tarik

The answer is in the Javadoc: The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of … Read more

grepping binary files and UTF16

May 23, 2023 by Tarik

The easiest way is to just convert the text file to UTF-8 and pipe that to grep:

    iconv -f utf-16 -t utf-8 file.txt | grep query

I tried to do the opposite (convert my query to UTF-16), but it seems as though grep doesn’t like that. I think it might have to do with endianness, … Read more
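The same decode-then-search idea can be sketched without iconv. The helper name grep_utf16 is hypothetical, and it does a plain substring match rather than grep's regular-expression matching.

```python
def grep_utf16(path: str, query: str) -> list[str]:
    """Return the lines of a UTF-16 file that contain `query` (hypothetical helper)."""
    # Decoding on read means the query can stay an ordinary string,
    # so byte order never has to be matched by hand.
    with open(path, encoding="utf-16") as f:
        return [line.rstrip("\n") for line in f if query in line]
```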

Correctly reading a utf-16 text file into a string without external libraries?

May 17, 2023 by Tarik

The C++11 solution (supported, on your platform, by Visual Studio since 2010, as far as I know) would be:

    #include <fstream>
    #include <iostream>
    #include <locale>
    #include <codecvt>

    int main()
    {
        // open as a byte stream
        std::wifstream fin("text.txt", std::ios::binary);
        // apply BOM-sensitive UTF-16 facet
        fin.imbue(std::locale(fin.getloc(),
            new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
        // read
        for(wchar_t c; fin.get(c); ) … Read more

Emoji value range

May 15, 2023 by Tarik

The Unicode standard’s Unicode® Technical Report #51 includes a list of emoji (emoji-data.txt):

    …
    21A9 ; text ; L1 ; none ; j # V1.1 (↩) LEFTWARDS ARROW WITH HOOK
    21AA ; text ; L1 ; none ; j # V1.1 (↪) RIGHTWARDS ARROW WITH HOOK
    231A ; emoji ; L1 ; none ; j … Read more

Difference between UTF-8 and UTF-16?

May 7, 2023 by Tarik

I believe there are a lot of good articles about this around the Web, but here is a short summary. Both UTF-8 and UTF-16 are variable length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 character length starts with 16 bits. Main UTF-8 pros: Basic ASCII characters … Read more
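To make the variable-length point concrete, the sketch below counts the bytes each encoding needs for a few sample characters. The sample characters and the name encoded_sizes are chosen here for illustration only.

```python
def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# ASCII letter, accented Latin letter, euro sign, emoji outside the BMP
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
# U+0041 (1, 2)
# U+00E9 (2, 2)
# U+20AC (3, 2)
# U+1F600 (4, 4)
```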

Why does the Java char primitive take up 2 bytes of memory?

May 2, 2023 by Tarik

When Java was originally designed, it was anticipated that any Unicode character would fit in 2 bytes (16 bits), so char and Character were designed accordingly. In fact, a Unicode character can now require up to 4 bytes. Thus, UTF-16, the internal Java encoding, requires supplementary characters to use 2 code units. Characters in the Basic … Read more
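The split of a supplementary character into two 16-bit code units follows the standard UTF-16 surrogate arithmetic, sketched here in Python for brevity. The function name is hypothetical, and the sketch does not validate its input (for example, it does not reject lone surrogate code points).

```python
def to_utf16_code_units(code_point: int) -> tuple[int, ...]:
    """Split a Unicode code point into its UTF-16 code units (illustrative sketch)."""
    if code_point < 0x10000:
        # Code points up to U+FFFF fit in a single 16-bit unit.
        return (code_point,)
    # Supplementary characters: subtract 0x10000, then split the remaining
    # 20 bits into a high (lead) and a low (trail) surrogate.
    v = code_point - 0x10000
    return (0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF))


print([hex(u) for u in to_utf16_code_units(0x41)])     # ['0x41']
print([hex(u) for u in to_utf16_code_units(0x1F600)])  # ['0xd83d', '0xde00']
```

In Java terms, each element of the returned tuple corresponds to one char value.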

Byte and char conversion in Java

May 1, 2023 by Tarik

A character in Java is a Unicode code unit, which is treated as an unsigned number. So if you perform c = (char)b for a byte b holding -56 (0xC8), the value you get is 2^16 – 56, or 65536 – 56 = 65480. Or more precisely, the byte is first converted to a signed integer with the value 0xFFFFFFC8 using sign extension in a … Read more
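The arithmetic can be checked with a small emulation. This is a Python sketch of the two steps described above (sign extension to 32 bits, then truncation to 16 bits), not Java itself; java_byte_to_char is a hypothetical name.

```python
def java_byte_to_char(b: int) -> int:
    """Emulate Java's (char) cast applied to a byte value in -128..127."""
    if not -128 <= b <= 127:
        raise ValueError("not a valid Java byte value")
    # Step 1: sign-extend to a 32-bit int, viewed as two's complement bits.
    as_int = b & 0xFFFFFFFF
    # Step 2: keep only the low 16 bits, read as an unsigned code unit.
    return as_int & 0xFFFF


print(hex(-56 & 0xFFFFFFFF))   # 0xffffffc8
print(java_byte_to_char(-56))  # 65480
```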

Python UTF-16 CSV reader

April 29, 2023 by Tarik

At the moment, the csv module does not support UTF-16. In Python 3.x, csv expects a text-mode file and you can simply use the encoding parameter of open to force another encoding:

    # Python 3.x only
    import csv
    with open('utf16.csv', 'r', encoding='utf16') as csvf:
        for line in csv.reader(csvf):
            print(line)  # do something with the line
    … Read more

Difference between Big Endian and little Endian Byte order

April 16, 2023 by Tarik

Big-Endian (BE) / Little-Endian (LE) are two ways to organize multi-byte words. For example, when using two bytes to represent a character in UTF-16, there are two ways to represent the character 0x1234 as a string of bytes (0x00-0xFF):

    Byte Index:      0  1
    ---------------------
    Big-Endian:     12 34
    Little-Endian:  34 12

In order to decide if … Read more
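The two byte orders for the character 0x1234 can be reproduced with Python's explicit-endianness codecs. This is a small illustration; the variable names are arbitrary.

```python
ch = "\u1234"

# The "-be" and "-le" codec variants fix the byte order and add no byte order mark.
big = ch.encode("utf-16-be")
little = ch.encode("utf-16-le")

print(big.hex())     # 1234
print(little.hex())  # 3412
```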