unicode – Page 38 – Make Me Engineer

Remove diacritical marks (ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ) from Unicode chars

May 10, 2022 by Tarik

I have done this recently in Java:

    public static final Pattern DIACRITICS_AND_FRIENDS =
        Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    private static String stripDiacritics(String str) {
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
        return str;
    }

This will do as you specified: stripDiacritics("Björn") returns "Bjorn". It will fail on, for example, Białystok, because the ł character is not a letter with a diacritic. If …
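For comparison, here is a minimal Python sketch of the same idea: decompose to NFD, then drop the combining marks. It is not a port of the Java code above. It removes only characters in the general category Mn (nonspacing marks), not the Lm and Sk categories the Java pattern also covers, and the function name strip_diacritics is made up for this example.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop nonspacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_diacritics("Björn"))      # Bjorn
print(strip_diacritics("Białystok"))  # Białystok (ł has no canonical decomposition)
```

The second call shows the limitation mentioned above: ł is a single letter with no decomposition, so normalization leaves it untouched.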

u’\ufeff’ in Python string

May 10, 2022 by Tarik

I ran into this on Python 3 and found this question (and solution). When opening a file, Python 3 supports the encoding keyword to automatically handle the encoding. Without it, the BOM is included in the read result:

    >>> f = open('file', mode='r')
    >>> f.read()
    '\ufefftest'

Giving the correct encoding, the BOM is omitted in …
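The excerpt is cut off before it names the encoding. The sketch below assumes it is utf-8-sig, the standard Python codec that skips a leading UTF-8 BOM on read. The temporary file and the variable names exist only for this illustration.

```python
import os
import tempfile

# Write a small UTF-8 file that starts with a byte order mark (EF BB BF).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"\xef\xbb\xbftest")
    path = tmp.name

try:
    # Plain utf-8 keeps the BOM as the first character, U+FEFF.
    with open(path, mode="r", encoding="utf-8") as f:
        with_bom = f.read()
    # utf-8-sig drops a leading BOM if one is present.
    with open(path, mode="r", encoding="utf-8-sig") as f:
        without_bom = f.read()
finally:
    os.remove(path)

print(ascii(with_bom))     # '\ufefftest'
print(ascii(without_bom))  # 'test'
```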

How to remove \xa0 from string in Python?

May 10, 2022 by Tarik

\xa0 is the non-breaking space in Latin-1 (ISO 8859-1), also chr(160). You should replace it with a space:

    string = string.replace(u'\xa0', u' ')

Calling .encode('utf-8') encodes the Unicode string as UTF-8, in which every code point is represented by 1 to 4 bytes. In this case, \xa0 is represented by the 2 bytes \xc2\xa0. Read …
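A short Python 3 sketch of both points, the byte representation and the replacement. The sample string is hypothetical.

```python
text = "price:\xa0100"  # hypothetical sample containing U+00A0 NO-BREAK SPACE

# One byte in Latin-1, two bytes in UTF-8.
print(text.encode("latin-1"))  # b'price:\xa0100'
print(text.encode("utf-8"))    # b'price:\xc2\xa0100'

# Replace the non-breaking space with an ordinary space.
cleaned = text.replace("\xa0", " ")
print(cleaned)                 # price: 100
```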

Normalizing Unicode

May 10, 2022 by Tarik

The unicodedata module offers a .normalize() function; you want to normalize to the NFC form. An example using the same U+0061 LATIN SMALL LETTER A – U+0301 COMBINING ACUTE ACCENT combination and U+00E1 LATIN SMALL LETTER A WITH ACUTE code points you used:

    >>> import unicodedata
    >>> print(ascii(unicodedata.normalize('NFC', '\u0061\u0301')))
    '\xe1'
    >>> print(ascii(unicodedata.normalize('NFD', '\u00e1')))
    'a\u0301'

(I used the ascii() …
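One practical reason to normalize is comparison: the two spellings of á render identically but are different strings until both are brought to the same form. A small sketch, with variable names chosen for this example:

```python
import unicodedata

composed = "\u00e1"          # á as a single code point
decomposed = "\u0061\u0301"  # a followed by COMBINING ACUTE ACCENT

print(len(composed), len(decomposed))  # 1 2
print(composed == decomposed)          # False

# After normalizing both sides to NFC, the comparison succeeds.
same = unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
print(same)                            # True
```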

What is the difference between _tmain() and main() in C++?

May 9, 2022 by Tarik

_tmain does not exist in C++. main does. _tmain is a Microsoft extension. main is, according to the C++ standard, the program’s entry point. It has one of these two signatures: int main(); int main(int argc, char* argv[]); Microsoft has added a wmain which replaces the second signature with this: int wmain(int argc, wchar_t* argv[]); …

How does Zalgo text work?

May 9, 2022 by Tarik

The text uses combining characters, also known as combining marks. See section 2.11, Combining Characters, in the Unicode Standard (PDF). In Unicode, character rendering does not use a simple character cell model where each glyph fits into a box with a given height. Combining marks may be rendered above, below, or inside a base character …
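The mechanism can be sketched in a few lines of Python: append several code points from the Combining Diacritical Marks block (U+0300 to U+036F) after each base character, and undo it by dropping every character whose general category is a mark. The function names, the mark count, and the fixed seed are choices made for this illustration only.

```python
import random
import unicodedata

# Code points of the Combining Diacritical Marks block.
COMBINING_MARKS = [chr(cp) for cp in range(0x0300, 0x0370)]


def zalgo(text: str, marks_per_char: int = 3, seed: int = 0) -> str:
    """Append random combining marks after every non-space character."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        out.append(ch)
        if not ch.isspace():
            out.extend(rng.choice(COMBINING_MARKS) for _ in range(marks_per_char))
    return "".join(out)


def unzalgo(text: str) -> str:
    """Drop every character in a mark category (Mn, Mc, Me)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("M"))


noisy = zalgo("hello")
print(len(noisy))      # 20: 5 base letters plus 3 marks each
print(unzalgo(noisy))  # hello
```

Note that unzalgo as written also removes legitimate accents from decomposed text, so it is only suitable where the input is expected to be unaccented.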

WChars, Encodings, Standards and Portability

May 9, 2022 by Tarik

"Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++?" No, and there is no way at all to fulfill all these properties, at least if you want your program to run on Windows. On Windows, you have to ignore the C and C++ standards almost …

Concrete JavaScript regular expression for accented characters (diacritics)

May 9, 2022 by Tarik

The easiest way to accept all accents is this:

    [A-zÀ-ú] // accepts lowercase and uppercase characters
    [A-zÀ-ÿ] // as above, but including letters with an umlaut (includes [ ] ^ \ × ÷)
    [A-Za-zÀ-ÿ] // as above but not including [ ] ^ \
    [A-Za-zÀ-ÖØ-öø-ÿ] // as above, but not including [ ] ^ \ …
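The difference between the loosest and the strictest of these classes can be checked quickly. The sketch below uses Python's re module rather than JavaScript; character classes are code-point ranges in both, so the same patterns apply, but treat it as an illustration and not as the original answer's code. The sample strings are made up.

```python
import re

loose = re.compile(r"[A-zÀ-ÿ]+")             # also lets through [ \ ] ^ _ ` × ÷
strict = re.compile(r"[A-Za-zÀ-ÖØ-öø-ÿ]+")   # letters only within Latin-1

for sample in ("José", "a^b", "a×b"):
    print(sample, bool(loose.fullmatch(sample)), bool(strict.fullmatch(sample)))
# José True True
# a^b True False
# a×b True False
```

Both classes stop at U+00FF, so letters outside Latin-1 such as ł in Białystok are not matched by either.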

How to use Greek symbols in ggplot2?

May 8, 2022 by Tarik

Here is a link to an excellent wiki that explains how to put Greek symbols in ggplot2. In summary, here is what you do to obtain Greek symbols:

Text Labels: Use parse = TRUE inside geom_text or annotate.
Axis Labels: Use expression(alpha) to get Greek alpha.
Facet Labels: Use labeller = label_parsed inside facet.
Legend …

Reference: Why are my “special” Unicode characters encoded weird using json_encode?

May 7, 2022 by Tarik

First of all: There’s nothing wrong here. This is how characters can be encoded in JSON. It is in the official standard. It is based on how string literals can be formed in JavaScript (ECMAScript, section 7.8.4 “String Literals”) and is described as such: Any code point may be represented as a hexadecimal number. The …
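The same escaping can be observed outside PHP. The sketch below uses Python's json module as a stand-in for json_encode, so it illustrates the JSON rule itself and not PHP's behavior. By default the module escapes every non-ASCII character as \uXXXX, and decoding restores the original string.

```python
import json

original = "é"

escaped = json.dumps(original)                  # default: ASCII-only output
raw = json.dumps(original, ensure_ascii=False)  # keep the character as-is

print(escaped)  # "\u00e9"
print(raw)      # "é"

# Both forms are valid JSON and decode to the same string.
print(json.loads(escaped) == json.loads(raw) == original)  # True
```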