character-properties – Make Me Engineer

Matching only a unicode letter in Python re

June 15, 2023 by Tarik

You can construct a new character class, [^\W\d_], and use it instead of \w. Translated into English, it means “any character that is not a non-alphanumeric character ([^\W] is the same as \w), but that is also not a digit and not an underscore”. Therefore, it will only allow Unicode letters.
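As a rough illustration of the class described above, here is a minimal sketch using Python 3's standard re module. The names LETTER_RUN and letter_runs are made up for this example.

```python
import re

# [^\W\d_] = "word character" minus digits minus the underscore.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def letter_runs(text):
    """Return every maximal run of letters found in text."""
    return LETTER_RUN.findall(text)


print(letter_runs("abc_123 Привет"))  # ['abc', 'Привет']
```

The underscore and the digits split the input, so only the Latin and Cyrillic letter runs come back.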

Does \w match all alphanumeric characters defined in the Unicode standard?

May 29, 2023 by Tarik

perldoc perlunicode says: “Character classes in regular expressions match characters instead of bytes and match against the character properties specified in the Unicode properties database. \w can be used to match a Japanese ideograph, for instance.” So it looks like the answer to your question is “yes”. However, you might want to use the \p{} …

Regex for names with special characters (Unicode)

November 22, 2022 by Tarik

Try the following regular expression:

    ^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$

In PHP this translates to:

    if (preg_match('~^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$~u', $name) > 0) {
        // valid
    }

You should read it like this:

    ^            # start of subject
    (?:          # match this:
      [          # match a:
        \p{L}    # Unicode letter, or
        \p{Mn}   # Unicode accents, or
        \p{Pd}   # Unicode hyphens, or
        \' …
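Python's standard re module has no \p{...} classes, so the character class above cannot be copied over directly. The sketch below swaps in a different technique: it checks each character's Unicode general category with unicodedata. It is deliberately looser than the regex (it only requires two or more whitespace-separated words made of allowed characters), and the names is_name_char and looks_like_name are hypothetical.

```python
import unicodedata

# Apostrophe and right single quotation mark, as in the character class above.
EXTRA_CHARS = {"'", "\u2019"}


def is_name_char(ch):
    """True for letters (L*), nonspacing marks (Mn), dashes (Pd), or an apostrophe."""
    category = unicodedata.category(ch)
    return category.startswith("L") or category in ("Mn", "Pd") or ch in EXTRA_CHARS


def looks_like_name(text):
    """Loose check: at least two words, each built only from allowed characters."""
    words = text.split()
    return len(words) >= 2 and all(is_name_char(ch) for word in words for ch in word)
```

For example, looks_like_name("Jean-Luc O'Brien") is accepted, while a single word or a string containing digits is rejected.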

Match any unicode letter?

September 1, 2022 by Tarik

Python’s re module doesn’t support Unicode properties yet. But you can compile your regex using the re.UNICODE flag (already the default for str patterns in Python 3), and then the character class shorthand \w will match Unicode letters, too. Since \w will also match digits, you then need to subtract those from your character class, along with the underscore: [^\W\d_] will match any Unicode …
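To see what the Unicode behaviour of \w changes, the following sketch compiles the same class twice in Python 3: once with the default Unicode matching and once with re.ASCII, which restricts \w and \d to ASCII. The helper name is_letters is invented for this example.

```python
import re

# Same pattern, two modes: Unicode-aware (Python 3 default) and ASCII-only.
UNICODE_LETTERS = re.compile(r"[^\W\d_]+")
ASCII_LETTERS = re.compile(r"[^\W\d_]+", re.ASCII)


def is_letters(text, ascii_only=False):
    """True if the whole string consists of letters under the chosen mode."""
    pattern = ASCII_LETTERS if ascii_only else UNICODE_LETTERS
    return pattern.fullmatch(text) is not None
```

With the default mode, is_letters("Привет") is true; with ascii_only=True the same string is rejected, because the class then reduces to the ASCII letters.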

How to match Cyrillic characters with a regular expression

June 17, 2022 by Tarik

If your regex flavor supports Unicode blocks or scripts, you can match Cyrillic characters with:

    [\p{IsCyrillic}] or [\p{Cyrillic}]

Otherwise, match the code point range U+0400–U+04FF directly, written in your flavor’s escape syntax. For PHP use:

    [\x{0400}-\x{04FF}]

Explanation: [\p{IsCyrillic}] matches a character from the Unicode block “Cyrillic” (U+0400–U+04FF).

Note: Unicode Characters list and Numeric HTML Entities of [U+0400–U+04FF].
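In Python, whose re module lacks \p{...}, the explicit-range approach might look like the sketch below. It covers only the basic Cyrillic block U+0400–U+04FF named above, and the function name cyrillic_runs is hypothetical.

```python
import re

# Explicit code point range for the basic Cyrillic block (U+0400 to U+04FF).
CYRILLIC_RUN = re.compile(r"[\u0400-\u04FF]+")


def cyrillic_runs(text):
    """Return every maximal run of characters from the Cyrillic block."""
    return CYRILLIC_RUN.findall(text)


print(cyrillic_runs("Hello Привет мир"))  # ['Привет', 'мир']
```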

Python and regular expression with Unicode

May 16, 2022 by Tarik

Are you using Python 2.x or 3.0? If you’re using 2.x, try making the regex string a Unicode string, with the ‘u’ prefix. Since it’s a regex, it’s good practice to also make it a raw string, with ‘r’. Also, putting your entire pattern in parentheses is superfluous.

    re.sub(ur'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+', '', …)

http://docs.python.org/tutorial/introduction.html#unicode-strings

Edit: It’s also good practice …
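The ur'' prefix exists only in Python 2. On Python 3, where every str is Unicode and re understands \uXXXX escapes itself, the same substitution could be sketched as follows. The names ARABIC_MARKS and strip_marks are made up for this example.

```python
import re

# Same character ranges as the pattern above, as a plain raw string (Python 3).
ARABIC_MARKS = re.compile(r"[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+")


def strip_marks(text):
    """Remove every character covered by the ranges above."""
    return ARABIC_MARKS.sub("", text)
```

For instance, passing a word whose letters each carry a fatha (U+064E) returns the bare letters.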

Python regex matching Unicode properties

May 14, 2022 by Tarik

The regex module (an alternative to the standard re module) supports Unicode codepoint properties with the \p{} syntax.
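A small sketch of what that looks like in practice. The regex package is third-party (installed separately from PyPI), so this example falls back to the standard library's unicodedata when it is missing; the fallback and the function name is_all_letters are assumptions of this sketch, not part of the regex module.

```python
import unicodedata

try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None  # use the unicodedata fallback below


def is_all_letters(text):
    """True if text is non-empty and consists only of Unicode letters."""
    if regex is not None:
        # \p{L} matches any code point whose general category is a letter.
        return regex.fullmatch(r"\p{L}+", text) is not None
    # Fallback: general categories starting with "L" are the letter categories.
    return bool(text) and all(
        unicodedata.category(ch).startswith("L") for ch in text
    )
```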

Matching Unicode letter characters in PCRE/PHP

May 1, 2022 by Tarik

I think the problem is much simpler than that: you forgot to specify the u modifier. The Unicode character properties are only available in UTF-8 mode. Your regex should be:

    // unicode letters, apostrophe, hyphen, space
    $namePattern = '/^[-\' \p{L}]+$/u';

How can I use Unicode-aware regular expressions in JavaScript?

April 24, 2022 by Tarik

Situation for ES 6: The ECMAScript language specification, edition 6 (also commonly known as ES2015), includes Unicode-aware regular expressions. Support must be enabled with the u modifier on the regex. See Unicode-aware regular expressions in ES6 for a breakdown of the feature and some caveats. ES6 is widely adopted in both browsers and stand-alone JavaScript …

Unicode equivalents for \w and \b in Java regular expressions?

April 9, 2022 by Tarik

Source code: The source code for the rewriting functions I discuss below is available here.

Update in Java 7: Sun’s updated Pattern class for JDK7 has a marvelous new flag, UNICODE_CHARACTER_CLASS, which makes everything work right again. It’s available as an embeddable (?U) for inside the pattern, so you can use it with the String …