Python: Find equivalent surrogate pair from non-BMP unicode char
You’ll have to manually replace each non-BMP point with the surrogate pair. You could do this with a regular expression: import re _nonbmp = re.compile(r'[\U00010000-\U0010FFFF]’) def _surrogatepair(match): char = match.group() assert ord(char) > 0xffff encoded = char.encode(‘utf-16-le’) return ( chr(int.from_bytes(encoded[:2], ‘little’)) + chr(int.from_bytes(encoded[2:], ‘little’))) def with_surrogates(text): return _nonbmp.sub(_surrogatepair, text) Demo: >>> with_surrogates(‘\U0001f64f’) ‘\ud83d\ude4f’