surrogate-pairs – Make Me Engineer

Python: Find equivalent surrogate pair from non-BMP unicode char

June 1, 2023 by Tarik

You’ll have to manually replace each non-BMP code point with its surrogate pair. You could do this with a regular expression:

    import re

    # Matches any single code point outside the Basic Multilingual Plane.
    _nonbmp = re.compile(r'[\U00010000-\U0010FFFF]')

    def _surrogatepair(match):
        char = match.group()
        assert ord(char) > 0xffff
        # UTF-16-LE encodes a non-BMP character as two 2-byte code units.
        encoded = char.encode('utf-16-le')
        return (
            chr(int.from_bytes(encoded[:2], 'little')) +
            chr(int.from_bytes(encoded[2:], 'little')))

    def with_surrogates(text):
        return _nonbmp.sub(_surrogatepair, text)

Demo:

    >>> with_surrogates('\U0001f64f')
    '\ud83d\ude4f'

Split JavaScript string into array of codepoints? (taking into account “surrogate pairs” but not “grapheme clusters”)

May 27, 2023 by Tarik

@bobince’s answer has (luckily) become a bit dated; you can now simply use var chars = Array.from(text) to obtain a list of single-code-point strings. This respects astral (non-BMP) characters, which are stored as surrogate pairs.

What are the most common non-BMP Unicode characters in actual use?

November 7, 2022 by Tarik

Emoji are now the most common non-BMP characters by far. 😂, otherwise known as U+1F602 FACE WITH TEARS OF JOY, is the most common one on Twitter’s public stream. It occurs more frequently than the tilde!

JavaScript strings outside of the BMP

July 11, 2022 by Tarik

It depends what you mean by ‘support’. You can certainly put non-UCS-2 characters in a JS string using surrogates, and browsers will display them if they can. But each item in a JS string is a separate UTF-16 code unit. There is no language-level support for handling full characters: all the standard String members (length, split, …
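To make the code-unit point concrete, here is a minimal Python sketch that counts UTF-16 code units, which is the unit JavaScript's String length is measured in. The helper names utf16_code_units and utf16_length are hypothetical and chosen for illustration only; the sketch assumes the input contains no lone surrogates, which Python refuses to encode by default.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers.

    Assumes text contains no lone surrogates (encode() would raise).
    """
    data = text.encode('utf-16-le')  # two bytes per code unit, no BOM
    return [int.from_bytes(data[i:i + 2], 'little')
            for i in range(0, len(data), 2)]


def utf16_length(text):
    """Number of UTF-16 code units needed to store text."""
    return len(utf16_code_units(text))


face = '\U0001f602'  # FACE WITH TEARS OF JOY, a non-BMP character
print(len(face))                                 # code points
print(utf16_length(face))                        # UTF-16 code units
print([hex(u) for u in utf16_code_units(face)])  # the surrogate pair
```

Python 3 strings are indexed by code point, so len(face) counts one character where a UTF-16 based string counts two code units.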

How can I convert surrogate pairs to normal string in Python?

May 14, 2022 by Tarik

You’ve mixed up a literal string \ud83d in a JSON file on disk (six characters: \ u d 8 3 d) with a single character u'\ud83d' (specified using a string literal in Python source code) in memory. It is the difference between len(r'\ud83d') == 6 and len('\ud83d') == 1 on Python 3. If you see '\ud83d\ude4f' …
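The distinction can be checked directly. The sketch below is not necessarily the approach the truncated answer goes on to describe; it shows two standard-library routes under stated assumptions: json.loads for escape text read from disk, and a UTF-16 round trip with the surrogatepass error handler for a string that already holds two surrogate code points in memory. The helper name merge_surrogates is hypothetical.

```python
import json


def merge_surrogates(text):
    """Combine adjacent high/low surrogate code points into one character.

    'surrogatepass' lets the encoder write surrogate code points as
    UTF-16 code units; decoding then reads each valid pair as a single
    character. An unpaired surrogate still raises UnicodeDecodeError.
    """
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


# Escape text as it would appear in a JSON file: plain ASCII characters.
on_disk = r'"\ud83d\ude4f"'
from_json = json.loads(on_disk)   # the JSON parser combines the pair

# Two surrogate code points already in memory.
in_memory = '\ud83d\ude4f'
merged = merge_surrogates(in_memory)

print(len(r'\ud83d'), len('\ud83d'))   # escape text vs. one code point
print(len(in_memory), len(merged))     # before and after merging
print(from_json == merged == '\U0001f64f')
```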

What is a “surrogate pair” in Java?

May 12, 2022 by Tarik

The term “surrogate pair” refers to a means of encoding Unicode characters with high code points in the UTF-16 encoding scheme. In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF. Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. …
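The arithmetic behind a surrogate pair is short enough to sketch. The Python below follows the UTF-16 definition: subtract 0x10000 from the code point, then put the top 10 bits of the result in a high surrogate (0xD800 to 0xDBFF) and the bottom 10 bits in a low surrogate (0xDC00 to 0xDFFF). The function names to_surrogate_pair and from_surrogate_pair are hypothetical, used here only for illustration.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not outside the BMP')
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a valid surrogate pair')
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F64F)
print(hex(high), hex(low))
print(hex(from_surrogate_pair(high, low)))
```

Code points up to 0xFFFF fit in a single 16-bit code unit and need no pair, which is why the first function rejects them.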