Isn’t UTF-8’s byte order on big-endian machines different from that on little-endian machines? So why doesn’t UTF-8 require a BOM?

The byte order differs between big-endian and little-endian machines for words/integers larger than a byte. For example, on a big-endian machine a 2-byte short integer stores the 8 most significant bits in the first byte and the 8 least significant bits in the second byte. On a little-endian machine the 8 most …
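The difference can be sketched in Python. This is a minimal illustration, not part of the original answer: the value 0x1234 and the sample character are arbitrary choices. A 16-bit integer serializes differently depending on byte order, and so do the 16-bit code units of UTF-16, whereas UTF-8 is defined as a fixed sequence of single bytes, so there is nothing to swap.

```python
import struct

# A 2-byte integer: the byte order decides which half comes first.
value = 0x1234
big = struct.pack(">H", value)      # most significant byte first
little = struct.pack("<H", value)   # least significant byte first

# The same character in three encodings (sample text is an arbitrary choice).
text = "é"  # U+00E9
utf16_be = text.encode("utf-16-be")  # one 16-bit unit, big-endian
utf16_le = text.encode("utf-16-le")  # same unit, bytes swapped
utf8 = text.encode("utf-8")          # two single-byte units in a fixed order

print(big.hex(), little.hex())
print(utf16_be.hex(), utf16_le.hex(), utf8.hex())
```

The UTF-8 bytes come out the same on any machine, which is why a byte order mark carries no byte-order information for UTF-8.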

What is a multibyte character set?

The term is ambiguous, but in my internationalization work we typically avoided using the term “multibyte character sets” for Unicode-based encodings. Generally, we used the term only for legacy encoding schemes that use one or more bytes to define each character (excluding encodings that require only one byte per character). Shift-JIS, JIS, EUC-JP, EUC-KR, …
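As a rough illustration of “one or more bytes per character” in such legacy encodings, the following Python sketch measures how many bytes each character takes. The helper name `byte_widths` and the sample strings are assumptions made for this example; the codec names are the ones Python's standard library ships.

```python
def byte_widths(text, codec):
    """Return the number of bytes each character occupies in the given codec."""
    return [len(ch.encode(codec)) for ch in text]

# ASCII letters take one byte, Japanese or Korean characters take two.
samples = {
    "shift_jis": "aあ",
    "euc_jp": "aあ",
    "euc_kr": "a한",
}

for codec, text in samples.items():
    print(codec, byte_widths(text, codec))
```

Each codec mixes one-byte and two-byte characters in the same string, which is the property the term “multibyte” describes here.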

NSString : easy way to remove UTF-8 accents from a string?

NSString *str = @"Être ou ne pas être. C’était là-bas.";
NSData *data = [str dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
NSString *newStr = [[NSString alloc] initWithData:data encoding:NSASCIIStringEncoding];
NSLog(@"%@", newStr);
… or try using NSUTF8StringEncoding instead. List of encoding types here: https://developer.apple.com/documentation/foundation/nsstringencoding
Just FTR, here’s a one-line way to write this great answer:
yourString = [[NSString alloc] initWithData: [yourString …
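For comparison, the same idea can be sketched outside Cocoa. This Python analogue is not the NSString API. It assumes that decomposing the text (NFD) and then dropping everything that is not ASCII is an acceptable stand-in for the lossy conversion above, and the function name `strip_accents` is made up for the example.

```python
import unicodedata

def strip_accents(text):
    """Decompose accented letters, then drop every non-ASCII character."""
    decomposed = unicodedata.normalize("NFD", text)  # "Ê" becomes "E" + combining accent
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("Être ou ne pas être. C'était là-bas."))
```

Note that this approach silently discards any character with no ASCII base letter, so it only suits text where losing such characters is acceptable.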

python encoding utf-8

You don’t need to encode data that is already encoded. When you try to do that, Python will first try to decode it to unicode before it can encode it back to UTF-8. That is what is failing here:
>>> data = u'\u00c3' # Unicode data
>>> data = data.encode('utf8') # encoded to UTF-8
>>> …
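One way to avoid encoding twice is to encode only text and pass already-encoded bytes through untouched. The sketch below assumes Python 3, where `str` and `bytes` are distinct types and the implicit decode step described above no longer exists. The helper name `to_utf8` is hypothetical.

```python
def to_utf8(data):
    """Return UTF-8 bytes; encode text, leave bytes as they are."""
    if isinstance(data, bytes):
        return data  # already encoded, nothing to do
    return data.encode("utf-8")

text = "\u00c3"             # Unicode data
encoded = to_utf8(text)     # encoded to UTF-8
again = to_utf8(encoded)    # second call is a no-op

print(encoded, again)
```

In Python 3, calling `.encode()` directly on a `bytes` object raises `AttributeError`, so the mistake surfaces immediately instead of as a confusing decode error.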

How to write a std::string to a UTF-8 text file

The only way UTF-8 affects std::string is that size(), length(), and all the indices are measured in bytes, not characters. And, as sbi points out, incrementing the iterator provided by std::string steps forward by one byte, not by one character, so it can end up pointing into the middle of a multibyte UTF-8 sequence. There’s no UTF-8-aware …
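The byte-versus-character distinction is easy to see on a raw UTF-8 buffer. The Python sketch below stands in for the std::string case and is only an illustration: the helper name `utf8_char_starts` and the sample word are assumptions. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```python
def utf8_char_starts(data):
    """Return the byte offsets where a UTF-8 encoded character begins."""
    # Continuation bytes look like 10xxxxxx; every other byte starts a character.
    return [i for i, byte in enumerate(data) if (byte & 0xC0) != 0x80]

raw = "naïve".encode("utf-8")   # "ï" takes two bytes

byte_count = len(raw)                   # what a byte-oriented size() would report
starts = utf8_char_starts(raw)          # offsets that are safe to step to
char_count = len(starts)

print(byte_count, char_count, starts)
```

Offset 3 is missing from the list because it falls inside the two-byte encoding of “ï”, which is exactly where a byte-wise iterator could land.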