regex – Character classes – Character class and common problems faced by beginners

1. Character Class

A character class is denoted by []. It matches exactly one character, which may be any one of the characters listed inside the brackets. For example, suppose we use

[12345]

This matches 1 or 2 or 3 or 4 or 5. Put simply, a character class is an OR condition over single characters (with the stress on single character).
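As a minimal sketch of this behaviour, the snippet below uses Python's standard re module (the choice of Python and the sample text are assumptions made for illustration; other regex engines treat this class the same way).

```python
import re

# [12345] matches exactly one character that is 1, 2, 3, 4 or 5.
pattern = re.compile(r"[12345]")

sample = "a1b22c9"  # hypothetical sample text
digits_found = pattern.findall(sample)

# Each match is a single character; 9 is not in the class, so it is skipped.
print(digits_found)  # ['1', '2', '2']
```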

1.1 Word of caution

  • A character class has no concept of matching a string. The regex [cat] does not match the word cat literally; it matches either c or a or t. This is a very common misunderstanding among people who are new to regex.
  • Sometimes people use | (alternation) inside a character class, thinking it will act as an OR condition. This is wrong: [a|b] actually means match a or | (literally) or b.
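Both cautions can be checked with a short Python sketch. The sample strings are hypothetical, and the corrected patterns (cat and a|b written outside a class) are shown only for contrast.

```python
import re

# [cat] matches ONE character that is c, a or t; it never matches the word "cat".
single_chars = re.findall(r"[cat]", "tack")
print(single_chars)  # ['t', 'a', 'c']

# To match the literal word, write it outside a character class.
whole_word = re.findall(r"cat", "tack cat")
print(whole_word)  # ['cat']

# Inside a class, | is a literal pipe character, not alternation.
pipe_in_class = re.findall(r"[a|b]", "a|b c")
print(pipe_in_class)  # ['a', '|', 'b']

# Alternation only works outside the class.
alternation = re.findall(r"a|b", "a|b c")
print(alternation)  # ['a', 'b']
```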

2. Range in character class

A range in a character class is denoted using the - sign. Suppose we want to match any letter of the English alphabet from A to Z. This can be done with the following character class:

[A-Z]

This can be done for any valid ASCII or Unicode range. The most commonly used ranges are [A-Z], [a-z], and [0-9]. Moreover, these ranges can be combined in a single character class:

[A-Za-z0-9]

This matches any character in the range A to Z, a to z, or 0 to 9. The order of the ranges does not matter, so the above is equivalent to [a-zA-Z0-9], as long as each range you define is correct.
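The equivalence of the two orderings can be illustrated with a small Python sketch (the sample string is a hypothetical example):

```python
import re

upper_first = re.compile(r"[A-Za-z0-9]")
lower_first = re.compile(r"[a-zA-Z0-9]")

sample = "Hi, R2-D2!"  # hypothetical sample text

matches_upper_first = upper_first.findall(sample)
matches_lower_first = lower_first.findall(sample)

# Punctuation and spaces fall outside all three ranges and are skipped.
print(matches_upper_first)  # ['H', 'i', 'R', '2', 'D', '2']

# The order of the ranges inside the class makes no difference.
print(matches_upper_first == matches_lower_first)  # True
```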

2.1 Word of caution

  • Sometimes, when writing the range A to Z, people write [A-z]. This is wrong in most cases because it ends at lowercase z instead of uppercase Z. It matches any character from ASCII 65 (A) to 122 (z), which includes six unintended characters between ASCII 90 (Z) and 97 (a): [, \, ], ^, _ and `. HOWEVER, [A-z] can match just the [a-zA-Z] letters in POSIX-style regex when collation is set for a particular language.
    [[ "ABCEDEF[]_abcdef" =~ ([A-z]+) ]] && echo "${BASH_REMATCH[1]}" on Cygwin with LC_COLLATE="en_US.UTF-8" yields ABCEDEF.
    If you set LC_COLLATE to C (on Cygwin, done with export), it will give the expected ABCEDEF[]_abcdef.

  • The - sign has a special meaning inside a character class: it denotes a range, as explained above. What if we want to match the - character literally? We can't put it just anywhere, because between two characters it denotes a range. Instead, put - at the start of the character class, like [-A-Z], or at the end, like [A-Z-], or escape it if you want it in the middle, like [A-Z\-a-z].
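Here is a hedged Python sketch of both cautions. Python's re module compares str patterns by code point and ignores locale collation, so it behaves like the LC_COLLATE=C case rather than the en_US.UTF-8 one. The short hyphen test string is a hypothetical example.

```python
import re

sample = "ABCEDEF[]_abcdef"

# [A-z] spans code points 65..122, so it also accepts [ \ ] ^ _ and `.
too_wide = re.match(r"[A-z]+", sample).group()
print(too_wide)  # ABCEDEF[]_abcdef

# [A-Za-z] stops at the first character that is not a letter.
letters_only = re.match(r"[A-Za-z]+", sample).group()
print(letters_only)  # ABCEDEF

# Three ways to include a literal hyphen in a class.
hyphen_first = re.findall(r"[-A-Z]", "A-b")       # ['A', '-']
hyphen_last = re.findall(r"[A-Z-]", "A-b")        # ['A', '-']
hyphen_escaped = re.findall(r"[A-Z\-a-z]", "A-b")  # ['A', '-', 'b']
print(hyphen_first, hyphen_last, hyphen_escaped)
```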

3. Negated character class

A negated character class is denoted by [^..]. The caret sign ^ means match any character except the ones listed in the character class. For example,

[^cat]

means match any character except c or a or t.
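A minimal Python sketch of this negated class, using a hypothetical sample string:

```python
import re

# [^cat] matches any single character that is NOT c, a or t.
not_cat = re.findall(r"[^cat]", "catch it")

# The space counts as a character too, so it is matched.
print(not_cat)  # ['h', ' ', 'i']
```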

3.1 Word of caution

  • The caret sign ^ means negation only if it is the first character in the character class. Anywhere else in the character class it is treated as a literal caret with no special meaning.
  • Some people write a regex like [^]. In most regex engines this gives an error, because a ^ in the starting position expects at least one character to negate. In JavaScript, though, this is a valid construct that matches anything but nothing, i.e. any possible symbol (except diacritics, at least in ES5).
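The sketch below checks both points in Python, which is one of the engines that rejects [^]. The sample string is hypothetical, and the JavaScript behaviour is not covered here.

```python
import re

# ^ in first position negates the class: any character except a or b.
negated = re.findall(r"[^ab]", "a^bc")
print(negated)  # ['^', 'c']

# ^ anywhere else is a literal caret.
literal_caret = re.findall(r"[ab^]", "a^bc")
print(literal_caret)  # ['a', '^', 'b']

# Python's re module refuses to compile an empty negated class.
try:
    re.compile(r"[^]")
    empty_negation_ok = True
except re.error:
    empty_negation_ok = False
print(empty_negation_ok)  # False
```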
