Does \w match all alphanumeric characters defined in the Unicode standard?

perldoc perlunicode says: “Character classes in regular expressions match characters instead of bytes and match against the character properties specified in the Unicode properties database. \w can be used to match a Japanese ideograph, for instance.” So it looks like the answer to your question is “yes”. However, you might want to use the \p{} …

Regex for names with special characters (Unicode)

Try the following regular expression:

    ^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$

In PHP this translates to:

    if (preg_match('~^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$~u', $name) > 0) {
        // valid
    }

You should read it like this:

    ^           # start of subject
    (?:         # match this:
      [         # match a:
        \p{L}   # Unicode letter, or
        \p{Mn}  # Unicode accents, or
        \p{Pd}  # Unicode hyphens, or
        \' …

Match any unicode letter?

Python’s re module doesn’t support Unicode properties yet. But you can compile your regex using the re.UNICODE flag (already the default for str patterns in Python 3), and then the character class shorthand \w will match Unicode letters, too. Since \w also matches digits, you then need to subtract those from your character class, along with the underscore: [^\W\d_] will match any Unicode …
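
As a minimal sketch of the [^\W\d_] idea in Python 3 (the names LETTERS and find_letter_runs are made up for illustration and do not come from the original answer):

```python
import re

# [^\W\d_] reads as "a word character that is not a digit and not an
# underscore", which leaves (approximately) the Unicode letters.
# In Python 3, str patterns are Unicode-aware by default, so no flag is needed.
LETTERS = re.compile(r"[^\W\d_]+")


def find_letter_runs(text):
    """Return every maximal run of letter-like characters in text."""
    return LETTERS.findall(text)


# Sample text written with escapes so the code points are unambiguous:
# "naïve café_42 Привет 日本語"
sample = "na\u00efve caf\u00e9_42 \u041f\u0440\u0438\u0432\u0435\u0442 \u65e5\u672c\u8a9e"
print(find_letter_runs(sample))
```

The run "_42" is dropped because both the underscore and the decimal digits are subtracted from \w. Note that \d only covers decimal digits, so a few other numeric characters that \w accepts (vulgar fractions, for example) can still slip through this class.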

How to match Cyrillic characters with a regular expression

If your regex flavor supports Unicode blocks or scripts, you can match Cyrillic characters with [\p{IsCyrillic}] or [\p{Cyrillic}]. Otherwise, use an explicit character range covering U+0400–U+04FF; for PHP (with the u modifier), that is [\x{0400}-\x{04FF}]. Explanation: [\p{IsCyrillic}] matches a character from the Unicode block “Cyrillic” (U+0400–U+04FF). Note: Unicode Characters list and Numeric HTML Entities of [U+0400–U+04FF].
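
The explicit-range fallback can be sketched in Python, whose standard re module has no \p{...} support. The names CYRILLIC and cyrillic_runs are hypothetical, and the range covers only the basic Cyrillic block, not extensions such as Cyrillic Supplement (U+0500–U+052F):

```python
import re

# Explicit code point range for the Unicode "Cyrillic" block, U+0400-U+04FF.
# The re module resolves the \uXXXX escapes inside the raw string itself.
CYRILLIC = re.compile(r"[\u0400-\u04FF]+")


def cyrillic_runs(text):
    """Return every maximal run of characters from the Cyrillic block."""
    return CYRILLIC.findall(text)


# "Hello Привет, world мир!" with the Cyrillic parts written as escapes.
sample = "Hello \u041f\u0440\u0438\u0432\u0435\u0442, world \u043c\u0438\u0440!"
print(cyrillic_runs(sample))
```

A range like this is tied to one block; a flavor with script properties is usually the better choice when it is available.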

Python and regular expression with Unicode

Are you using Python 2.x or 3.0? If you’re using 2.x, try making the regex string a Unicode string, with the 'u' prefix. Since it’s a regex, it’s also good practice to make the pattern a raw string, with 'r'. Also, putting your entire pattern in parentheses is superfluous. re.sub(ur'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+', '', …) http://docs.python.org/tutorial/introduction.html#unicode-strings Edit: It’s also good practice …
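
For comparison, here is a rough Python 3 version of the same substitution. The ur'' prefix is a syntax error in Python 3, so a plain raw string is used and the re module interprets the \uXXXX escapes. The names ARABIC_MARKS and strip_marks are invented for this sketch; the character class is copied unchanged from the call above:

```python
import re

# Same character class as in the Python 2 call, as a raw str pattern.
ARABIC_MARKS = re.compile(r"[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+")


def strip_marks(text):
    """Remove every character matched by ARABIC_MARKS from text."""
    return ARABIC_MARKS.sub("", text)


# An Arabic word with vowel marks (fatha, sukun, fathatan), written as
# escapes so the combining characters are visible in the source.
marked = "\u0645\u064E\u0631\u0652\u062D\u064E\u0628\u064B\u0627"
print(strip_marks(marked))
```

Compiling the pattern once is a design choice for repeated use; a direct re.sub call with the same raw string behaves the same way.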

How can I use Unicode-aware regular expressions in JavaScript?

Situation for ES 6: The ECMAScript language specification, edition 6 (also commonly known as ES2015), includes Unicode-aware regular expressions. Support must be enabled with the u modifier on the regex. See Unicode-aware regular expressions in ES6 for a breakdown of the feature and some caveats. ES6 is widely adopted in both browsers and stand-alone JavaScript …

Unicode equivalents for \w and \b in Java regular expressions?

Source code: The source code for the rewriting functions I discuss below is available here. Update in Java 7: Sun’s updated Pattern class for JDK7 has a marvelous new flag, UNICODE_CHARACTER_CLASS, which makes everything work right again. It’s available as an embeddable (?U) for inside the pattern, so you can use it with the String …