Remove diacritical marks (ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ) from Unicode chars

I did this recently in Java:

    import java.text.Normalizer;
    import java.util.regex.Pattern;

    public static final Pattern DIACRITICS_AND_FRIENDS =
        Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    private static String stripDiacritics(String str) {
        // Decompose each character into a base letter plus combining marks.
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        // Remove combining marks, modifier letters, and modifier symbols.
        str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
        return str;
    }

This does what you specified: stripDiacritics("Björn") returns "Bjorn". It fails on, for example, Białystok, because ł is a separate letter with no canonical decomposition, not a base letter plus a diacritic. If …
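For comparison, here is a minimal Python sketch of the same decompose-then-filter idea. It is not the answer's code: the function name `strip_diacritics` is made up for illustration, and it removes only characters in the Unicode mark categories (Mn, Mc, Me), not the modifier letters and modifier symbols that the Java pattern also matches.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: decompose to NFD, then drop all mark characters."""
    decomposed = unicodedata.normalize("NFD", text)
    # Categories starting with "M" are marks (Mn, Mc, Me), i.e. combining characters.
    return "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )


print(strip_diacritics("Björn"))      # ö decomposes to o + U+0308, so the mark is dropped
print(strip_diacritics("Białystok"))  # ł has no decomposition, so it is left as-is
```

As with the Java version, letters such as ł pass through unchanged because NFD has nothing to split off.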

u'\ufeff' in Python string

I ran into this on Python 3 and found this question (and solution). When opening a file, Python 3 supports the encoding keyword to automatically handle the encoding. Without it, the BOM is included in the read result:

    >>> f = open('file', mode='r')
    >>> f.read()
    '\ufefftest'

Giving the correct encoding, the BOM is omitted in …
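The excerpt is cut off before it names the encoding. As a hedged illustration, Python's `utf-8-sig` codec is one that strips a leading BOM on read, while plain `utf-8` keeps it as U+FEFF. The temporary file and variable names below are assumptions made for the demo.

```python
import os
import tempfile

# Write a small UTF-8 file that starts with a byte order mark (EF BB BF).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"\xef\xbb\xbftest")
    path = tmp.name

try:
    # Plain UTF-8 decodes the BOM bytes as the character U+FEFF.
    with open(path, mode="r", encoding="utf-8") as f:
        with_bom = f.read()
    # utf-8-sig consumes a leading BOM if one is present.
    with open(path, mode="r", encoding="utf-8-sig") as f:
        without_bom = f.read()
finally:
    os.remove(path)

print(ascii(with_bom))     # '\ufefftest'
print(ascii(without_bom))  # 'test'
```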

How to remove \xa0 from string in Python?

\xa0 is the non-breaking space in Latin-1 (ISO 8859-1), which is also chr(160). You should replace it with a regular space:

    string = string.replace(u'\xa0', u' ')

Calling .encode('utf-8') encodes the Unicode string as UTF-8, in which every code point is represented by 1 to 4 bytes. In this case, \xa0 is represented by the 2 bytes \xc2\xa0. Read …
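A short Python 3 sketch of both points, using a made-up sample string: the replacement itself, and the two-byte UTF-8 form of the non-breaking space.

```python
import unicodedata

text = "10\xa0km"  # sample string containing a non-breaking space

# U+00A0 and chr(160) are the same character.
print(unicodedata.name("\xa0"))  # NO-BREAK SPACE
print("\xa0" == chr(160))        # True

# Replace it with an ordinary space.
cleaned = text.replace("\xa0", " ")
print(repr(cleaned))             # '10 km'

# In UTF-8 the non-breaking space takes two bytes: C2 A0.
encoded = text.encode("utf-8")
print(encoded)                   # b'10\xc2\xa0km'
```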

Normalizing Unicode

The unicodedata module offers a normalize() function; you want to normalize to the NFC form. An example using the same U+0061 LATIN SMALL LETTER A - U+0301 COMBINING ACUTE ACCENT combination and the U+00E1 LATIN SMALL LETTER A WITH ACUTE code point you used:

    >>> import unicodedata
    >>> print(ascii(unicodedata.normalize('NFC', '\u0061\u0301')))
    '\xe1'
    >>> print(ascii(unicodedata.normalize('NFD', '\u00e1')))
    'a\u0301'

(I used the ascii() …
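One practical consequence, sketched under the assumption that you want to compare strings regardless of how they were composed: the two spellings are unequal as raw strings but equal once both are normalized to the same form. The helper name `same_text` is hypothetical.

```python
import unicodedata

composed = "\u00e1"          # single code point: a with acute
decomposed = "\u0061\u0301"  # a followed by a combining acute accent


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Hypothetical helper: compare two strings after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(composed == decomposed)           # False: different code point sequences
print(len(composed), len(decomposed))   # 1 2
print(same_text(composed, decomposed))  # True
```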

How does Zalgo text work?

The text uses combining characters, also known as combining marks. See section 2.11, Combining Characters, in the Unicode Standard (PDF). In Unicode, character rendering does not use a simple character cell model in which each glyph fits into a box of a given height. Combining marks may be rendered above, below, or inside a base character …
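To make the mechanism concrete, here is a small Python sketch that stacks a few combining marks on each letter and then strips them again. The function names `zalgo` and `unzalgo`, the `depth` parameter, and the particular marks are all choices made for this illustration. Real Zalgo generators typically pick marks at random.

```python
import unicodedata

# A few marks from the Combining Diacritical Marks block (U+0300-U+036F).
MARKS_ABOVE = "\u0300\u0301\u0302\u0303"
MARKS_BELOW = "\u0316\u0317\u0323\u0324"


def zalgo(text: str, depth: int = 3) -> str:
    """Append up to `depth` marks above and below every alphabetic character."""
    out = []
    for ch in text:
        out.append(ch)
        if ch.isalpha():
            out.append(MARKS_ABOVE[:depth])
            out.append(MARKS_BELOW[:depth])
    return "".join(out)


def unzalgo(text: str) -> str:
    """Drop nonspacing marks (category Mn), leaving the base characters."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


noisy = zalgo("hi", depth=2)
print(len(noisy))      # 10: two letters plus four marks on each
print(unzalgo(noisy))  # hi
```

Note that `unzalgo` as written also removes legitimate accents from decomposed text, since it cannot tell a wanted mark from an unwanted one.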

Reference: Why are my “special” Unicode characters encoded weird using json_encode?

First of all: there’s nothing wrong here. This is how characters can be encoded in JSON, and it is in the official standard. It is based on how string literals can be formed in JavaScript (ECMAScript, section 7.8.4 “String Literals”) and is described as such: Any code point may be represented as a hexadecimal number. The …
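The question is about PHP's json_encode, but the same behaviour can be sketched with Python's standard json module, used here only as an illustration. By default it also writes non-ASCII characters as \uXXXX escapes, and both the escaped and the literal form decode back to the same string.

```python
import json

text = "Björn €"

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(text)
print(escaped)  # "Bj\u00f6rn \u20ac"

# With ensure_ascii=False the characters are written literally.
literal = json.dumps(text, ensure_ascii=False)
print(literal)  # "Björn €"

# Both are valid JSON for the same string.
print(json.loads(escaped) == json.loads(literal) == text)  # True

# Code points outside the BMP are escaped as a UTF-16 surrogate pair.
non_bmp = json.dumps("\U0001F600")
print(non_bmp)  # "\ud83d\ude00"
```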