Python: Find equivalent surrogate pair from non-BMP unicode char

You’ll have to manually replace each non-BMP point with the surrogate pair. You could do this with a regular expression:

    import re

    _nonbmp = re.compile(r'[\U00010000-\U0010FFFF]')

    def _surrogatepair(match):
        char = match.group()
        assert ord(char) > 0xffff
        encoded = char.encode('utf-16-le')
        return (
            chr(int.from_bytes(encoded[:2], 'little')) +
            chr(int.from_bytes(encoded[2:], 'little')))

    def with_surrogates(text):
        return _nonbmp.sub(_surrogatepair, text)

Demo:

    >>> with_surrogates('\U0001f64f')
    '\ud83d\ude4f'

JavaScript strings outside of the BMP

Depends what you mean by ‘support’. You can certainly put non-UCS-2 characters in a JS string using surrogates, and browsers will display them if they can. But each item in a JS string is a separate UTF-16 code unit. There is no language-level support for handling full characters: all the standard String members (length, split, …
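To make the code-unit point concrete, here is a minimal sketch in Python that counts UTF-16 code units, which is the quantity a JavaScript string's `length` reports. The helper name `utf16_code_units` is hypothetical and introduced only for illustration.

```python
def utf16_code_units(text):
    """Return how many 16-bit UTF-16 code units are needed for text.

    Hypothetical helper for illustration. 'surrogatepass' lets lone
    surrogates through instead of raising UnicodeEncodeError.
    """
    return len(text.encode('utf-16-le', 'surrogatepass')) // 2


s = '\U0001f64f'                  # one non-BMP character
code_points = len(s)              # Python counts code points
code_units = utf16_code_units(s)  # UTF-16 needs a surrogate pair

print(code_points, code_units)
```

A BMP-only string such as `'abc'` gives the same number for both counts; the two only diverge once a character above U+FFFF appears.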

How can I convert surrogate pairs to normal string in Python?

You’ve confused a literal string \ud83d in a JSON file on disk (six characters: \ u d 8 3 d) with a single character u'\ud83d' (specified using a string literal in Python source code) in memory. It is the difference between len(r'\ud83d') == 6 and len('\ud83d') == 1 on Python 3. If you see '\ud83d\ude4f' …
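The distinction can be checked directly in Python 3. The sketch below also shows two common ways to end up with a single non-BMP character: letting the `json` module decode an escaped pair, and round-tripping an in-memory pair through UTF-16 with the `surrogatepass` error handler. The variable names are illustrative, and this is one possible approach rather than necessarily the one the truncated answer goes on to recommend.

```python
import json

# Six literal characters, as they would appear in a JSON file on disk.
escaped = r'\ud83d'
# A single (lone surrogate) code point in memory.
lone = '\ud83d'

# A JSON parser combines an escaped surrogate pair into one character.
decoded = json.loads(r'"\ud83d\ude4f"')

# If a str already holds two separate surrogate code points, encoding
# with 'surrogatepass' and decoding as UTF-16 joins them.
pair = '\ud83d\ude4f'
joined = pair.encode('utf-16', 'surrogatepass').decode('utf-16')

print(len(escaped), len(lone), len(pair), len(joined))
```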

What is a “surrogate pair” in Java?

The term “surrogate pair” refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme. In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF. Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. …
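The standard UTF-16 arithmetic behind a surrogate pair can be sketched as follows. The example is written in Python rather than Java, and the function name `to_surrogate_pair` is hypothetical. A code point above 0xFFFF has 0x10000 subtracted from it, and the resulting 20-bit value is split into two 10-bit halves that are added to 0xD800 (high surrogate) and 0xDC00 (low surrogate).

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not outside the BMP')
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F64F)
print(hex(high), hex(low))  # 0xd83d 0xde4f
```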