utf-8 – Make Me Engineer

Bytes in a unicode Python string

June 15, 2023 by Tarik

“In Python 2, Unicode strings may contain both unicode and bytes”: No, they may not. They contain Unicode characters. Within the original string, \xd0 is not a byte that’s part of a UTF-8 encoding. It is the Unicode character with code point 208. u'\xd0' == u'\u00d0'. It just happens that the repr for Unicode strings … Read more
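The equality in this excerpt can be checked directly. This is a minimal sketch, assuming Python 3, where every str is a Unicode string and the u prefix is optional. The variable names are illustrative only.

```python
# Both escapes name the same single character, U+00D0.
s = '\xd0'
same = (s == '\u00d0')
code_point = ord(s)  # 208, a code point, not a byte value

# Bytes only appear once the character is encoded.
# In UTF-8 this one character takes two bytes.
utf8_bytes = s.encode('utf-8')

print(same, code_point, utf8_bytes)
```

The point of the sketch is that \xd0 inside a Unicode literal is just another spelling of a code point; the byte 0xD0 never appears in the UTF-8 form of this character.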

UTF-8 in Windows 7 CMD [duplicate]

June 13, 2023 by Tarik

This question has already been answered in “Unicode characters in Windows command line – how?” You missed one step: in addition to running chcp 65001 in the cmd console, you need to use the Lucida Console font.

Isn’t on big endian machines UTF-8’s byte order different than on little endian machines? So why then doesn’t UTF-8 require a BOM?

June 13, 2023 by Tarik

The byte order differs between big-endian and little-endian machines only for words/integers larger than a byte. For example, on a big-endian machine a 2-byte short integer stores the 8 most significant bits in the first byte and the 8 least significant bits in the second byte. On a little-endian machine the 8 most … Read more
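A short sketch of that difference, assuming Python 3 and its standard struct module. The value 0x1234 and the sample character are arbitrary choices for illustration.

```python
import struct

value = 0x1234

# The same 2-byte integer laid out in each byte order.
big = struct.pack('>H', value)     # most significant byte first
little = struct.pack('<H', value)  # least significant byte first

text = 'é'  # U+00E9

# UTF-16 works in 2-byte units, so the byte order matters.
utf16_be = text.encode('utf-16-be')
utf16_le = text.encode('utf-16-le')

# UTF-8 is defined as a sequence of single bytes, so there is
# only one possible layout regardless of the machine.
utf8 = text.encode('utf-8')

print(big, little, utf16_be, utf16_le, utf8)
```

Because UTF-8's unit is a single byte, there is nothing to swap, which is why a byte order mark is not needed to decode it.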

Make git diff show UTF8 encoded characters properly

June 13, 2023 by Tarik

git is dumping out raw bytes. In this case, it doesn’t care what your file’s encoding is. The highlighted <F6> you’re seeing is coming from less, which is presumably configured as your PAGER. Try setting: LESSCHARSET=UTF-8

PHP json_encode json_decode UTF-8

June 13, 2023 by Tarik

JSON UTF-8 encode and decode: json_encode($data, JSON_UNESCAPED_UNICODE); json_decode($json, false, 512, JSON_UNESCAPED_UNICODE); Forcing UTF-8 might be helpful too: http://pastebin.com/2XKqYU49

What is a multibyte character set?

June 12, 2023 by Tarik

The term is ambiguous, but in my internationalization work, we typically avoided the term “multibyte character sets” to refer to Unicode-based encodings. Generally, we used the term only for legacy encoding schemes that had one or more bytes to define each character (excluding encodings that require only one byte per character). Shift-JIS, JIS, EUC-JP, EUC-KR, … Read more
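To make "one or more bytes per character" concrete, here is a minimal sketch assuming Python 3 and its built-in shift_jis and euc_jp codecs. The helper bytes_per_char is a hypothetical name introduced for this illustration.

```python
def bytes_per_char(text, codec):
    """Return how many bytes each character occupies in the given codec."""
    return [len(ch.encode(codec)) for ch in text]

sample = 'A日'  # one ASCII letter followed by one kanji

for codec in ('shift_jis', 'euc_jp'):
    print(codec, bytes_per_char(sample, codec))
```

In both legacy encodings the ASCII letter takes one byte and the kanji takes two, which is the variable-width behaviour the term usually refers to.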

NSString : easy way to remove UTF-8 accents from a string?

June 12, 2023 by Tarik

NSString *str = @"Être ou ne pas être. C’était là-bas.";
NSData *data = [str dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
NSString *newStr = [[NSString alloc] initWithData:data encoding:NSASCIIStringEncoding];
NSLog(@"%@", newStr);

… or try using NSUTF8StringEncoding instead. List of encoding types here: https://developer.apple.com/documentation/foundation/nsstringencoding

Just FTR, here’s a one-line way to write this great answer: yourString = [[NSString alloc] initWithData: [yourString … Read more
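For comparison, the same idea can be sketched outside Cocoa. This is not the NSString API; it is a rough Python 3 analogue that uses Unicode decomposition instead of a lossy ASCII conversion, and strip_accents is a hypothetical helper name. The sample string uses a plain ASCII apostrophe.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

result = strip_accents("Être ou ne pas être. C'était là-bas.")
print(result)
```

Unlike a lossy ASCII conversion, this approach leaves non-accented, non-ASCII characters untouched, so the two techniques may give different results on other inputs.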

How can I be sure of the file encoding?

June 12, 2023 by Tarik

$ file --mime my.txt
my.txt: text/plain; charset=iso-8859-1
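A programmatic check is also possible, though only in one direction. The sketch below, assuming Python 3, tests whether raw bytes are valid UTF-8. It is not how the file command works, and looks_like_utf8 is a hypothetical helper name.

```python
def looks_like_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

utf8_sample = 'café'.encode('utf-8')         # é stored as two bytes
latin1_sample = 'café'.encode('iso-8859-1')  # é stored as the single byte 0xE9

print(looks_like_utf8(utf8_sample), looks_like_utf8(latin1_sample))
```

A failed decode proves the data is not UTF-8, but a successful one does not prove the author intended UTF-8, so any such result is a guess rather than a guarantee.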

python encoding utf-8

June 11, 2023 by Tarik

You don’t need to encode data that is already encoded. When you try to do that, Python will first try to decode it to unicode before it can encode it back to UTF-8. That is what is failing here:
>>> data = u'\u00c3'  # Unicode data
>>> data = data.encode('utf8')  # encoded to UTF-8
>>> … Read more
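The hidden decode step can be made explicit. This is a sketch assuming Python 3, where bytes objects have no encode method at all, so the implicit ASCII decode that Python 2 performs is written out by hand.

```python
# Encode once: the character U+00C3 becomes two UTF-8 bytes.
encoded = '\u00c3'.encode('utf8')

# Python 2 handled a second .encode() on a byte string by first
# decoding it with the default ASCII codec. Doing that step
# explicitly shows where the error comes from.
try:
    encoded.decode('ascii')
    failed = False
except UnicodeDecodeError:
    failed = True

print(encoded, failed)
```

Both bytes are above 0x7F, so the ASCII decode fails before any encoding can happen.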

How to write a std::string to a UTF-8 text file

June 11, 2023 by Tarik

The only way UTF-8 affects std::string is that size(), length(), and all the indices are measured in bytes, not characters. And, as sbi points out, incrementing the iterator provided by std::string will step forward by byte, not by character, so it can actually point into the middle of a multibyte UTF-8 sequence. There’s no UTF-8-aware … Read more
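The byte-versus-character distinction can be shown with a short sketch. It assumes Python 3, whose bytes type counts and indexes by byte much as the excerpt describes for std::string; it is an analogy, not C++ code.

```python
text = 'naïve'
encoded = text.encode('utf-8')

char_count = len(text)     # counts characters
byte_count = len(encoded)  # counts bytes, one more because of ï

# Cutting after three bytes lands inside the two-byte sequence for ï.
truncated = encoded[:3]
try:
    truncated.decode('utf-8')
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

print(char_count, byte_count, split_ok)
```

Any code that slices or iterates over the raw bytes needs to respect sequence boundaries, or it risks producing invalid UTF-8 like the truncated value here.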