Regular Expressions for Data Subscription

Last updated: 2018-08-31 11:55:16

What is a regular expression?

A regular expression is a pattern that describes a set of strings. It is used to find the parts of a text that match that pattern.

It matches a string from left to right. We generally use "regex" or "regexp" for short.
A regular expression can be used to replace text in strings, validate forms, and extract strings from a string based on pattern matching.

Imagine that you are writing an application and want to set rules for the usernames that users can choose. Usernames may contain lowercase letters, numbers, underscores, and hyphens.
To keep usernames tidy, you also want to limit their length. The following regular expression can be used to verify usernames:

Regular expression

The above regular expression can match john_doe, jo-hn_doe, and john12_as. However, it cannot match Jo, which contains an uppercase character and is too short.
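The pattern itself appears only as an image on the original page and is not reproduced here. As a hedged sketch, the hypothetical pattern `^[a-z0-9_-]{3,15}$` is one pattern consistent with the description above. The exact length limits and the use of Python's `re` module are assumptions made for illustration.

```python
import re

# Assumed pattern: 3 to 15 characters, each a lowercase letter,
# a digit, an underscore, or a hyphen. The limits are hypothetical.
USERNAME_PATTERN = re.compile(r"^[a-z0-9_-]{3,15}$")

def is_valid_username(name: str) -> bool:
    """Return True when the whole name satisfies the assumed rule."""
    return USERNAME_PATTERN.fullmatch(name) is not None

results = {name: is_valid_username(name)
           for name in ["john_doe", "jo-hn_doe", "john12_as", "Jo"]}
print(results)
```

Under these assumptions, the first three names are accepted and Jo is rejected.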

Basic Match

Regular expressions are patterns we use to retrieve letters and numbers in text. For example, the regular expression cat indicates: the letter c followed by letters a and t.

"cat" => The cat sat on the mat

The regular expression 123 matches the string "123". Regular matching is done by comparing each character in the regular expression with that in the string to be matched one by one.
Regular expressions are generally case sensitive, so the regular expression Cat does not match the string "cat".
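A minimal sketch of literal, case-sensitive matching, assuming Python's `re` module (the page does not name a specific engine). It records where each literal pattern is found in the sample sentence.

```python
import re

text = "The cat sat on the Cat"

# A literal pattern is compared character by character, case included.
lower_positions = [m.start() for m in re.finditer("cat", text)]
upper_positions = [m.start() for m in re.finditer("Cat", text)]

print(lower_positions)  # [4]
print(upper_positions)  # [19]
```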

"Cat" => The cat sat on the Cat

Metacharacter

Metacharacters are the building blocks of regular expressions. A metacharacter does not stand for itself; it is interpreted in a special way. Some metacharacters take on a special meaning only when written inside square brackets.
Here are the metacharacters:

Metacharacter Description
. Matches any single character except a line break.
[ ] Character class. Matches any one character listed between the square brackets.
[^ ] Negative character class. Matches any one character that is not listed between the square brackets.
* Matches the preceding subexpression zero or more times.
+ Matches the preceding subexpression one or more times.
? Matches the preceding subexpression zero or one time, or makes the preceding quantifier non-greedy.
{n,m} Curly brackets. Matches the preceding subexpression at least n times, but not more than m times.
(xyz) Character group. Matches the characters xyz in that exact order.
| Branch structure. Matches the expression before or the expression after the symbol.
\ Escape character. Removes the special meaning of the next character, allowing you to match the reserved characters [ ] ( ) { } . * + ? ^ $ \ |
^ Matches the start of the input (or of each line in multiline mode).
$ Matches the end of the input (or of each line in multiline mode).

Period

The period . is the simplest example of a metacharacter. It matches any single character except a line break. For example, the regular expression .ar indicates: any single character followed by the letters a and r.
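The behavior of the period can be checked with a short sketch, assuming Python's `re` module. It lists every non-overlapping match of .ar in the sample sentence and confirms that a line break is not matched by default.

```python
import re

text = "The car parked in the garage."

# "." stands for exactly one character, so each match is three characters long.
matches = re.findall(".ar", text)
print(matches)  # ['car', 'par', 'gar']

# By default "." does not match a line break.
newline_matches = re.findall(".ar", "\nar")
print(newline_matches)  # []
```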

".ar" => The car parked in the garage.

Character set

A character set is also called a character class. Square brackets are used to specify the character set. A hyphen inside the brackets specifies a range of characters, such as [a-z]. The order of the characters and ranges inside the square brackets does not matter.
For example, the regular expression [Tt]he indicates: uppercase T or lowercase t followed by the letters h and e.

"[Tt]he" => The car parked in the garage.

However, the period in the character set indicates its literal meaning. The regular expression ar[.] indicates the lowercase letter a followed by the letter r and a period . character.
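A small sketch of both character-set behaviors described above, assuming Python's `re` module: a set that accepts either case of one letter, and a period that is literal inside the brackets.

```python
import re

# [Tt] accepts either an uppercase T or a lowercase t.
the_matches = re.findall("[Tt]he", "The car parked in the garage.")
print(the_matches)  # ['The', 'the']

# Inside square brackets the period is a literal period.
dot_matches = re.findall("ar[.]", "A garage is a good place to park a car.")
print(dot_matches)  # ['ar.']
```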

"ar[.]" => A garage is a good place to park a car.

Negative character set

In general, the caret ^ indicates the start of a string. However, when it appears immediately after the opening square bracket, it negates the character set. For example, the regular expression [^c]ar indicates: any single character other than the letter c, followed by the letters a and r.
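As an illustration, assuming Python's `re` module, the negated set below skips "car" because its first character is c, while the other "ar" words still match.

```python
import re

text = "The car parked in the garage."

# [^c] matches any single character except a lowercase c.
negated_matches = re.findall("[^c]ar", text)
print(negated_matches)  # ['par', 'gar']
```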

"[^c]ar" => The car parked in the garage.

Repetition

The metacharacters +, *, and ? specify how many times the preceding sub-pattern may occur. Each of them behaves differently, as described below.

Asterisk

The symbol * matches the preceding element zero or more times. The regular expression a* indicates that the lowercase letter a can be repeated zero or more times. If * appears after a character set, it repeats the entire character set.
For example, the regular expression [a-z]* indicates: any number of lowercase letters in a row, including none.

"[a-z]*" => The car parked in the garage #21.

The symbol * can be combined with the metacharacter . to form .*, which matches any string of characters. It can also be combined with the whitespace character \s to match a run of whitespace characters.
For example, the regular expression \s*cat\s* indicates: zero or more whitespace characters followed by a lowercase letter c, a lowercase letter a, a lowercase letter t, and zero or more whitespace characters.
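A hedged sketch of the asterisk, assuming Python's `re` module. It shows that [a-z]* accepts an empty string, and that \s*cat\s* includes the surrounding whitespace in each match.

```python
import re

# Zero repetitions are allowed, so the empty string matches [a-z]*.
empty_ok = re.fullmatch("[a-z]*", "") is not None
lower_ok = re.fullmatch("[a-z]*", "garage") is not None
mixed_ok = re.fullmatch("[a-z]*", "Garage") is not None
print(empty_ok, lower_ok, mixed_ok)  # True True False

# Surrounding whitespace, when present, becomes part of the match.
cat_matches = re.findall(r"\s*cat\s*", "The fat cat sat on the cat.")
print(cat_matches)  # [' cat ', ' cat']
```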

"\s*cat\s*" => The fat cat sat on the cat.

Plus sign

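The symbol + matches the preceding element one or more times. For example, the regular expression c.+t indicates: a lowercase letter c, followed by at least one character, followed by a lowercase letter t. Because + is greedy, the match extends to the last t that still allows the pattern to succeed.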
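The greedy behavior of + can be seen in a short sketch, assuming Python's `re` module: the match runs from the first c to the last t in the sentence, and the pattern fails when nothing sits between c and t.

```python
import re

text = "The fat cat sat on the mat."

# ".+" consumes as much as possible, then backs off to the last "t".
greedy_matches = re.findall("c.+t", text)
print(greedy_matches)  # ['cat sat on the mat']

# At least one character is required between "c" and "t".
too_short = re.search("c.+t", "ct")
print(too_short)  # None
```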

"c.+t" => The fat cat sat on the mat.

Question mark

In regular expressions, the metacharacter ? is used to indicate that the previous character is optional. This symbol matches the previous character zero or one time.
For example, the regular expression [T]?he indicates: the optional uppercase letter T followed by a lowercase letter h and a lowercase letter e.
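To compare the required and the optional form side by side, here is a minimal sketch assuming Python's `re` module.

```python
import re

text = "The car is parked in the garage."

# Without "?", the uppercase T is required.
required_matches = re.findall("[T]he", text)
print(required_matches)  # ['The']

# With "?", the T is optional, so the "he" inside "the" also matches.
optional_matches = re.findall("[T]?he", text)
print(optional_matches)  # ['The', 'he']
```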

"[T]he" => The car is parked in the garage.
"[T]?he" => The car is parked in the garage.

Curly bracket

Curly brackets (also called quantifiers) are used in regular expressions to specify the number of times a character or a group of characters can be repeated. For example, the regular expression [0-9]{2,3} indicates: at least 2 but no more than 3 digits (characters ranging from 0 to 9).

"[0-9]{2,3}" => The number was 9.9997 but we rounded it off to 10.0.

We can omit the second number. For example, the regular expression [0-9]{2,} indicates: 2 or more digits. If we also remove the comma, the regular expression [0-9]{2} indicates: exactly 2 digits.
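The three quantifier forms can be compared on the same sentence. This is a sketch assuming Python's `re` module.

```python
import re

text = "The number was 9.9997 but we rounded it off to 10.0."

# {2,3}: between 2 and 3 digits; the leftover single 7 is too short to match.
two_to_three = re.findall("[0-9]{2,3}", text)
print(two_to_three)  # ['999', '10']

# {2,}: 2 or more digits.
two_or_more = re.findall("[0-9]{2,}", text)
print(two_or_more)  # ['9997', '10']

# {2}: exactly 2 digits per match.
exactly_two = re.findall("[0-9]{2}", text)
print(exactly_two)  # ['99', '97', '10']
```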

"[0-9]{2,}" => The number was 9.9997 but we rounded it off to 10.0.
"[0-9]{2}" => The number was 9.9997 but we rounded it off to 10.0.

Character group

A character group is a set of sub-patterns written in parentheses (...). As discussed earlier, a quantifier placed after a single character repeats only that character.
However, a quantifier placed after a character group repeats the entire group.
For example, the regular expression (ab)* indicates: zero or more repetitions of the string "ab". We can also use the metacharacter | inside a character group. For example, the regular expression (c|g|p)ar indicates: the lowercase letter c, g, or p followed by the letters a and r.
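A brief sketch of groups, assuming Python's `re` module. Because `re.findall` returns only the group contents when a pattern has a capturing group, the sketch uses `re.finditer` and `group(0)` to read the full match.

```python
import re

# The quantifier applies to the whole group "ab".
repeated_ok = re.fullmatch("(ab)*", "ababab") is not None
partial_ok = re.fullmatch("(ab)*", "aba") is not None
print(repeated_ok, partial_ok)  # True False

text = "The car is parked in the garage."
group_matches = [m.group(0) for m in re.finditer("(c|g|p)ar", text)]
print(group_matches)  # ['car', 'par', 'gar']
```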

"(c|g|p)ar" => The car is parked in the garage.

Branch structure

The vertical bar | is used to define the branch structure in a regular expression. The branch structure works like a logical OR between multiple expressions. You may think that a character set works in the same way as the branch structure.
The difference is that a character set works only at the character level, while the branch structure works at the expression level.
For example, the regular expression (T|t)he|car indicates: either an uppercase letter T or lowercase letter t followed by a lowercase letter h and a lowercase letter e, or a lowercase letter c followed by a lowercase letter a and a lowercase letter r.
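The alternation described above can be sketched as follows, assuming Python's `re` module. Each match is either The/the or car.

```python
import re

text = "The car is parked in the garage."

# The top-level "|" separates two complete alternatives: (T|t)he and car.
branch_matches = [m.group(0) for m in re.finditer("(T|t)he|car", text)]
print(branch_matches)  # ['The', 'car', 'the']
```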

"(T|t)he|car" => The car is parked in the garage.

Escape special character

Use the backslash \ in a regular expression to escape the next character. This allows you to match the reserved characters { } [ ] / \ + * . $ ^ | ? literally: add a \ before the special character.
For example, the regular expression . matches any character other than a line break. To match a literal . character in the input string, write \. instead. The regular expression (f|c|m)at\.? indicates: the lowercase letter f, c, or m followed by a lowercase letter a, a lowercase letter t, and an optional . character.
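A minimal sketch of escaping, assuming Python's `re` module. The helper `re.escape` is specific to Python and is shown only as one convenient way to escape literal text; other engines may offer different helpers.

```python
import re

text = "The fat cat sat on the mat."

# "\." is a literal period, and "?" makes that period optional.
escaped_matches = [m.group(0) for m in re.finditer(r"(f|c|m)at\.?", text)]
print(escaped_matches)  # ['fat', 'cat', 'mat.']

# re.escape adds the backslashes needed to match a string literally.
escaped_literal = re.escape("3.14")
print(escaped_literal)  # 3\.14
```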

"(f|c|m)at\.?" => The fat cat sat on the mat.

Locator

In regular expressions, locators (also called anchors) check whether a match occurs at the start or at the end of the input string.
There are two types of locators: ^, which checks whether the match begins at the start of the input string, and $, which checks whether the match finishes at the end of the input string.

Caret

The caret ^ is used to check if the matching character is the first character of an input string. If we use the regular expression ^a (if "a" is the start symbol) to match the string abc, it matches a.
But if we use the regular expression ^b, it matches nothing, because "b" in the string abc is not the start character.
Take a look at another regular expression ^(T|t)he, which indicates: the uppercase letter T or lowercase letter t is the start symbol of the input string, followed by a lowercase letter h and a lowercase letter e.
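The caret examples above can be sketched as follows, assuming Python's `re` module.

```python
import re

# "^a" succeeds because "abc" starts with "a"; "^b" finds nothing.
start_a = re.search("^a", "abc")
start_b = re.search("^b", "abc")
print(start_a.group(0), start_b)  # a None

text = "The car is parked in the garage."
unanchored = [m.group(0) for m in re.finditer("(T|t)he", text)]
anchored = [m.group(0) for m in re.finditer("^(T|t)he", text)]
print(unanchored)  # ['The', 'the']
print(anchored)    # ['The']
```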

"(T|t)he" => The car is parked in the garage.
"^(T|t)he" => The car is parked in the garage.

Dollar sign

The dollar sign $ is used to check whether a match finishes at the end of an input string. For example, the regular expression (at\.)$ indicates: the lowercase letter a followed by the lowercase letter t and the character ., and this match must be at the end of the string.
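As a sketch assuming Python's `re` module, the same pattern is run with and without the $ anchor. Only the occurrence that finishes the string survives the anchored version.

```python
import re

text = "The fat cat. sat. on the mat."

# Without "$", every "at." in the sentence matches.
anywhere = [m.start() for m in re.finditer(r"(at\.)", text)]
print(len(anywhere))  # 3

# With "$", only the "at." that ends the string matches.
at_end = [m.start() for m in re.finditer(r"(at\.)$", text)]
print(at_end == [len(text) - 3])  # True
```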

"(at\.)" => The fat cat. sat. on the mat.
"(at\.)$" => The fat cat sat on the mat.

Abbreviated Character Set

Regular expressions provide abbreviations for common character sets and regular expressions. The abbreviated character set is as follows:

Abbreviation Description
. Match any characters other than line breaks
\w Match alphanumeric characters and the underscore: [a-zA-Z0-9_]
\W Match characters that are not alphanumeric or an underscore: [^\w]
\d Match numeric characters: [0-9]
\D Match non-numeric characters: [^\d]
\s Match whitespace characters: [\t\n\f\r\p{Z}]
\S Match non-whitespace characters: [^\s]
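A short sketch of the abbreviated character sets, assuming Python's `re` module. In Python, \w and \d also match non-ASCII letters and digits by default, so the sketch passes `re.ASCII` to mirror the ASCII ranges listed in the table. Python's `re` module also does not support the \p{Z} notation, so \s is used on its own. The sample strings are made up for illustration.

```python
import re

# \d+ collects runs of digits.
digits = re.findall(r"\d+", "Order 66 shipped 2018-08-31", flags=re.ASCII)
print(digits)  # ['66', '2018', '08', '31']

# \w includes the underscore but not the hyphen.
words = re.findall(r"\w+", "john_doe, jo-hn", flags=re.ASCII)
print(words)  # ['john_doe', 'jo', 'hn']

# \s+ matches runs of spaces, tabs, and line breaks.
fields = re.split(r"\s+", "a  b\tc\nd")
print(fields)  # ['a', 'b', 'c', 'd']
```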

Assertion

Lookahead and lookbehind assertions, sometimes called lookarounds, are special non-capturing groups: they are used to test for a pattern, but the text they test is not included in the match. They are used when a pattern must be preceded or followed by another pattern.
For example, suppose we want to obtain all the numbers preceded by the character $ in the input string $4.44 and $10.88. We can use the regular expression (?<=\$)[0-9\.]*, which indicates: get all numbers, including the . character, that are preceded by the character $.
The following assertions are used in regular expressions:

Symbol Description
?= Positive lookahead assertion
?! Negative lookahead assertion
?<= Positive lookbehind assertion
?<! Negative lookbehind assertion

Positive lookahead assertion

A positive lookahead assertion requires that the first part of the expression be followed by the lookahead expression. The returned match contains only the text matched by the first part of the expression.
A positive lookahead is written in parentheses with a question mark and an equal sign, as (?=...). The lookahead expression is placed after the equal sign inside the parentheses.
For example, the regular expression (T|t)he(?=\sfat) indicates: match an uppercase letter T or lowercase letter t, followed by the letters h and e.
The positive lookahead in parentheses then tells the regular expression engine to match The or the only when it is followed by a space and the word fat.

"(T|t)he(?=\sfat)" => The fat cat sat on the mat.

Negative lookahead assertion

A negative lookahead assertion is used when we need matches that are not followed by a particular pattern. It is defined in the same way as a positive lookahead assertion.
The only difference is that we use the negation symbol ! instead of the equal sign =, as in (?!...).
Take a look at the regular expression (T|t)he(?!\sfat), which indicates: get every The or the in the input string that is not followed by a space character and the word fat.
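Both lookahead forms can be compared on the same sentence. This is a sketch assuming Python's `re` module; it records the position and text of each match to show that the looked-ahead text is not part of the result.

```python
import re

text = "The fat cat sat on the mat."

# Positive lookahead: keep The/the only when " fat" follows.
followed = [(m.start(), m.group(0))
            for m in re.finditer(r"(T|t)he(?=\sfat)", text)]
print(followed)  # [(0, 'The')]

# Negative lookahead: keep The/the only when " fat" does not follow.
not_followed = [(m.start(), m.group(0))
                for m in re.finditer(r"(T|t)he(?!\sfat)", text)]
print(not_followed)  # [(19, 'the')]
```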

"(T|t)he(?!\sfat)" => The fat cat sat on the mat.

Positive lookbehind assertion

Positive lookbehind assertions are used to obtain all matches that are preceded by a particular pattern. The positive lookbehind assertion is expressed as (?<=...). For example, the regular expression (?<=(T|t)he\s)(fat|mat) indicates: get every fat and mat in the input string that comes directly after the word The or the and a space.

"(?<=(T|t)he\s)(fat|mat)" => The fat cat sat on the mat.

Negative lookbehind assertion

Negative lookbehind assertions are used to obtain all matches that are not preceded by a particular pattern. Negative lookbehind assertions are expressed as (?<!...). For example, the regular expression (?<!(T|t)he\s)(cat) indicates: get every cat in the input string that does not come directly after The or the and a space.
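The lookbehind forms can be sketched in the same way, assuming Python's `re` module. Python requires a lookbehind to have a fixed width, which holds here because both The and the plus a space are four characters long. The last lines revisit the price pattern from the start of the Assertion section.

```python
import re

# Positive lookbehind: fat or mat directly after "The " or "the ".
preceded = [m.group(0) for m in re.finditer(
    r"(?<=(T|t)he\s)(fat|mat)", "The fat cat sat on the mat.")]
print(preceded)  # ['fat', 'mat']

# Negative lookbehind: cat that is not directly after "The " or "the ".
not_preceded = [m.start() for m in re.finditer(
    r"(?<!(T|t)he\s)(cat)", "The cat sat on cat.")]
print(not_preceded)  # [15]

# Numbers preceded by "$"; the "$" itself is not part of the match.
prices = re.findall(r"(?<=\$)[0-9\.]*", "$4.44 and $10.88")
print(prices)  # ['4.44', '10.88']
```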

"(?<!(T|t)he\s)(cat)" => The cat sat on cat.

Label

A label, also called a modifier or flag, changes how the regular expression is applied and therefore what it returns. The following labels can be used in any order or combination, and are written as part of the regular expression.

Label Description
i Case insensitive: Set the matching rule as case insensitive.
g Global search: Search the entire input string for all matching content.
m Multiline match: The locators ^ and $ match at the start and end of each line of the input string.

Case insensitive

The modifier i is used to perform case-insensitive matching. For example, the regular expression /The/gi indicates: the uppercase letter T followed by a lowercase letter h and a lowercase letter e.
However, the label i at the end of the regular expression tells the regular expression engine to ignore case. As you can see, we also use the label g because we want to search the entire input string for matching content.
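The /.../gi notation is the literal syntax used by engines such as JavaScript. As a hedged sketch in Python, where flags are passed as arguments instead, `re.IGNORECASE` plays the role of the i label, and `re.findall` already returns every match.

```python
import re

text = "The fat cat sat on the mat."

# Case-sensitive by default: only the capitalized word matches.
sensitive = re.findall("The", text)
print(sensitive)  # ['The']

# re.IGNORECASE corresponds to the i label.
insensitive = re.findall("The", text, flags=re.IGNORECASE)
print(insensitive)  # ['The', 'the']
```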

"The" => The fat cat sat on the mat.
"/The/gi" => The fat cat sat on the mat.

Global search

The modifier g is used to perform a global match (it finds all matching items instead of stopping after the first one).
For example, the regular expression /.(at)/g indicates: any character other than a line break, followed by a lowercase letter a and a lowercase letter t.
Because we use the label g at the end of the regular expression, it finds every matching item in the entire input string.
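Python's `re` module has no g label. As an assumed equivalent for illustration, `re.search` stops at the first match, while `re.finditer` walks through all of them.

```python
import re

text = "The fat cat sat on the mat."

# First match only, comparable to omitting the g label.
first_match = re.search(r".(at)", text).group(0)
print(first_match)  # fat

# All matches, comparable to using the g label.
all_matches = [m.group(0) for m in re.finditer(r".(at)", text)]
print(all_matches)  # ['fat', 'cat', 'sat', 'mat']
```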

".(at)" => The fat cat sat on the mat.
"/.(at)/g" => The fat cat sat on the mat.

Multiline match

The modifier m is used to perform a multiline match. As discussed earlier, the locators ^ and $ check whether a match is at the start or the end of the input string. To make the locators apply to each line instead, we use the modifier m.
For example, the regular expression /.at(.)?$/gm indicates: any character other than a line break, followed by the lowercase letter a and the lowercase letter t, optionally followed by one more character other than a line break. Because of the label m, the regular expression engine matches at the end of each line in the string.

"/.at(.)?$/" => The fat
                cat sat
                on the mat.
"/.at(.)?$/gm" => The fat
                  cat sat
                  on the mat.
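The effect of the m label can be sketched in Python, assuming `re.MULTILINE` as its counterpart. The three-line input mirrors the example above.

```python
import re

text = "The fat\ncat sat\non the mat."

# Without MULTILINE, "$" only matches at the end of the whole input.
single_line = [m.group(0) for m in re.finditer(r".at(.)?$", text)]
print(single_line)  # ['mat.']

# With MULTILINE, "$" also matches at the end of each line.
multi_line = [m.group(0)
              for m in re.finditer(r".at(.)?$", text, flags=re.MULTILINE)]
print(multi_line)  # ['fat', 'sat', 'mat.']
```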

Common Regular Expressions

Type                                  Expression
Positive integer                      ^\d+$
Negative integer                      ^-\d+$
Phone number                          ^\+?[\d\s]{3,}$
Phone code                            ^\+?[\d\s]+\(?[\d\s]{10,}$
Integer                               ^-?\d+$
User name                             ^[\w\d_.]{4,16}$
Alphanumeric characters               ^[a-zA-Z0-9]*$
Alphanumeric characters with spaces   ^[a-zA-Z0-9 ]*$
Password                              ^(?=^.{6,}$)((?=.*[A-Za-z0-9])(?=.*[A-Z])(?=.*[a-z]))^.*$
Email                                 ^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})*$
IPv4 address                          ^((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))*$
Lowercase letter                      ^([a-z])*$
Uppercase letter                      ^([A-Z])*$
Website                               ^(((http|https|ftp):\/\/)?([[a-zA-Z0-9]\-\.])+(\.)([[a-zA-Z0-9]]){2,4}([[a-zA-Z0-9]\/+=%&_\.~?\-]*))*$
VISA credit card number               ^(4[0-9]{12}(?:[0-9]{3})?)*$
Date (MM/DD/YYYY)                     ^(0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])[- /.](19|20)?[0-9]{2}$
Date (YYYY/MM/DD)                     ^(19|20)?[0-9]{2}[- /.](0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])$
MasterCard credit card number         ^(5[1-5][0-9]{14})*$
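A few of the listed expressions can be tried out with a short sketch, assuming Python's `re` module. The constant names and sample inputs are hypothetical. Several of the listed patterns wrap their body in (...)*, so they also accept an empty string; the last check shows this for the IPv4 pattern.

```python
import re

# Patterns copied from the table above; the names are illustrative only.
INTEGER = r"^-?\d+$"
IPV4 = (r"^((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
        r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))*$")
DATE_MDY = (r"^(0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])"
            r"[- /.](19|20)?[0-9]{2}$")

def matches(pattern: str, value: str) -> bool:
    """Return True when the value satisfies the pattern."""
    return re.match(pattern, value) is not None

checks = {
    "integer -42": matches(INTEGER, "-42"),
    "integer 4.2": matches(INTEGER, "4.2"),
    "ipv4 192.168.0.1": matches(IPV4, "192.168.0.1"),
    "ipv4 256.1.1.1": matches(IPV4, "256.1.1.1"),
    "date 08/31/2018": matches(DATE_MDY, "08/31/2018"),
    "date 13/01/2018": matches(DATE_MDY, "13/01/2018"),
    "ipv4 empty string": matches(IPV4, ""),
}
for label, ok in checks.items():
    print(label, ok)
```

If empty input should be rejected, one option is to check for it separately before applying such a pattern.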