The UNICODE modifier, usually expressed as u (PHP, Python) or U (Java), makes the regex engine treat both the pattern and the input string as Unicode, and makes shorthand character classes in the pattern, such as \w, \d and \s, Unicode-aware.
is a PHP regex to match strings that consist of 1 or more Unicode letters. See the regex demo.
Note that in PHP, the /u modifier makes the PCRE engine handle strings as UTF-8 (by turning on the PCRE_UTF8 option) and makes the shorthand character classes in the pattern Unicode-aware (by enabling the PCRE_UCP option, see more at pcre.org).
Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.
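For readers working in Python rather than PHP, the same "one or more Unicode letters" check can be approximated as follows. This is a minimal sketch, not an exact port: Python's built-in re module does not support \p{L}, so the class [^\W\d_] (a word character that is neither a digit nor an underscore) stands in for it. The helper name is_letters_only is hypothetical, and Python 3 is assumed.

```python
import re

# [^\W\d_] = "a word character that is neither a digit nor an underscore".
# In Python 3, str patterns are Unicode-aware, so this covers non-ASCII letters.
# It is an approximation of \p{L}: a few non-letter numeric characters that
# count as word characters but not as decimal digits can also match.
LETTERS_ONLY = re.compile(r"[^\W\d_]+")


def is_letters_only(s):
    """Return True if s is non-empty and made only of letter-like characters."""
    return LETTERS_ONLY.fullmatch(s) is not None


print(is_letters_only("Dąb"))
print(is_letters_only("Dąb1"))
```

If exact \p{L} semantics are needed, a regex engine with Unicode property support is the safer choice.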
In Python 2.x, re.UNICODE only affects the pattern itself: it makes \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.
The modifier also has an inline form: (?u) in Python and (?U) in Java. For example:
    # Python 2.x
    import re
    print(re.findall(ur"(?u)\w+", u"Dąb"))  # [u'D\u0105b']
    print(re.findall(r"\w+", u"Dąb"))       # [u'D', u'b']

    // Java
    System.out.println("Dąb".matches("(?U)\\w+")); // true
    System.out.println("Dąb".matches("\\w+"));     // false
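The Python snippet above is Python 2 syntax (the ur"" prefix no longer exists in Python 3). As a hedged comparison, here is a small sketch of how the same input behaves in Python 3, where str patterns are Unicode-aware by default and re.ASCII (inline (?a)) is the opt-out. The variable names are illustrative only.

```python
import re

text = "Dąb"

# Python 3: str patterns are Unicode-aware by default, so re.UNICODE / (?u)
# is accepted but redundant.
unicode_default = re.findall(r"\w+", text)
unicode_inline = re.findall(r"(?u)\w+", text)

# re.ASCII (inline form: (?a)) restricts \w, \d, \s and \b to ASCII.
ascii_only = re.findall(r"\w+", text, flags=re.ASCII)
ascii_inline = re.findall(r"(?a)\w+", text)

print(unicode_default)  # ['Dąb']
print(ascii_only)       # ['D', 'b']
```

So in Python 3 the situation is inverted: the flag worth knowing is the one that switches Unicode matching off, not the one that switches it on.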