JavaZone 2014, Regex dissected

An example Regex for validating e-mail addresses

My relationship with Regex (or Regexp) has always been of the type "google it, copy and paste it, never change it, can't understand any of it". But finally, after years of copy-paste, I went to a talk about Regex at JavaZone and actually began to understand it. Now I'll share that introduction with you.

Learning Regex

At this year's JavaZone I attended a talk by Rustam Mehmandarov about Regex. In this talk he broke Regex down into its smallest pieces and explained them all, one by one, with examples. He also went through a few bigger examples. So I'm going to copy his style of thinking and do the same in this article. Thanks to his talk I began to understand, and actually remember, the fundamentals of Regex. I now understand it so well that I could write the last two examples off the top of my head. And for me versus Regex, that's the major breakthrough I've been waiting for.

Normal search

Just writing normal text in a Regex will perform a normal search.

Dan - Matches the word “Dan” in a string.

Hello - Matches the word “Hello” in a string.
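To try this out, here is a minimal sketch in Python (my choice of language for the examples, not something from the talk). The sample sentence is made up.

```python
import re

text = "Hello Dan, how are you?"

# A pattern without special characters is a plain substring search.
found_dan = re.search("Dan", text)
found_bob = re.search("Bob", text)

print(found_dan.group())  # the matched text
print(found_bob)          # None, since "Bob" is not in the string
```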

But the true power of Regex comes when you start using all of its special features.

Square brackets [x]

Square brackets cover the part of a string that you don’t know exactly. If you are searching for a name but you’re not sure of one of its letters, you can put a square bracket in place of that letter. Everything you put inside the brackets is an alternative for that one character.

Some examples:

[D]an - Matches “Dan”, not very useful.

[DB]an - Matches “Dan” and “Ban” (first letter can be "D" or "B").

[DBTP]an - Matches “Dan”, “Ban”, “Tan”, and “Pan”.

Da[ng] - Matches “Dan” and “Dag” (last letter can be "n" or "g").

[Dan] - Matches single character “D”, or “a”, or “n”. Meaning it will not match the entire string "Dan", only one single character (probably not what you wanted).

You can have multiple brackets in one string, like so:

[DB]a[ng] - Matches “Dan” and “Ban”, and also “Dag” and “Bag”.
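The bracket examples above can be checked with a small Python sketch. The word list is made up, and `re.fullmatch` is used so that the whole word has to fit the pattern.

```python
import re

words = ["Dan", "Ban", "Dag", "Bag", "Tan", "an"]

# Two brackets: first letter D or B, last letter n or g.
matched = [w for w in words if re.fullmatch("[DB]a[ng]", w)]
print(matched)

# [Dan] stands for ONE character, so it finds three separate matches.
single_chars = re.findall("[Dan]", "Dan")
print(single_chars)
```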

Ranges [x-x]

You can also extend your brackets with ranges. These are especially handy when we need to validate against the alphabet or numbers. You specify a range by writing the first character, followed by a dash, and ending with the last character.

Some examples:

[A-Z]an - Match “Aan”, “Ban”, “Can”, “Dan”, and all the way up to “Zan”.

[A-C]an - Match “Aan”, “Ban” and “Can”.

We can make it more complex by adding more to our brackets. How about multiple ranges mixed with single characters?

Some examples:

[AC-D]an - Match “Aan”, “Can” and “Dan”.

[A-CF-G]an - Match “Aan”, “Ban”, “Can”, “Fan” and “Gan”.

[A-BF-GZ]an - Match “Aan”, “Ban”, “Fan”, “Gan” and “Zan”.

The Regex treats each character inside the brackets as an alternative to match against, except when you add the dash: then the two surrounding characters and everything between them are matched.

Some examples:

[0-9]an - Match “0an”, “1an”, “2an”, “3an” and so on all the way to “9an”.

[0-9A-Z]an - Match any “an” starting with a number or capital letter.

[a-zA-Z0-9]an - Match any “an” starting with any number, lower case or capital letter.
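A quick way to see which words a range accepts is to run a list of candidates through it. This is a small Python sketch with made-up candidates.

```python
import re

candidates = ["Aan", "Ban", "Can", "Dan", "Fan", "Gan", "7an", "ban"]

# Two ranges in one bracket: A to C, or F to G.
two_ranges = [w for w in candidates if re.fullmatch("[A-CF-G]an", w)]
print(two_ranges)

# A digit or a capital letter first, so lowercase "ban" is left out.
digit_or_capital = [w for w in candidates if re.fullmatch("[0-9A-Z]an", w)]
print(digit_or_capital)
```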

Not [^]

Brackets are powerful indeed, and this is just the start. Maybe you only know what you don’t want.

[^D]an - Match anything that ends with "an" and starts with any character except "D", so “Dan” would not be valid, but "San", "Can" or "man" would.

[^DB]an - Same as above, but also “Ban” is one of the invalid matches.

Neither of these two examples allows an empty first character, so "an" would not be valid.

These “not”-searches can also be used with multiple brackets, characters, and ranges.

[^DB]a[ng] - Match anything that ends with “an” or “ag”, but must not start with “D” or “B”.

[^D-G]an - Match anything *but* “Dan”, “Ean”, “Fan”, “Gan”; "San" and "pan" are, for example, valid.
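Here is a small Python sketch of the not-search, again with a made-up word list. Note how "an" fails both patterns, because the bracket still demands one character.

```python
import re

words = ["Dan", "Ban", "San", "man", "an", "Ean", "pan"]

# Any first character except D.
not_d = [w for w in words if re.fullmatch("[^D]an", w)]
print(not_d)

# Any first character except D, E, F or G.
not_d_to_g = [w for w in words if re.fullmatch("[^D-G]an", w)]
print(not_d_to_g)
```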

Hey this is not bad. You have now taught yourself about the square brackets, multiple alternatives in them, ranges, and the not-search.

Special characters .

It's not always handy to use brackets. The dot (.) placed outside of a bracket will work as the “anything”-character. It will match numbers, letters, question marks, and everything you can think of. 

.an - Match anything that starts with something and then ends with “an”, like “!an”, “7an”, “zan”, “Ran”, “_an” and “[an”. 

The dot only matches one single character (but any type of it). If you need more, you can use quantifiers with it (explained in more detail pretty soon). For example, “.*an” would match any character, zero or more times, followed by “an”. A very wide search, most likely not that useful all the time.
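To compare the single dot with the dot-plus-star combination, here is a short Python sketch with made-up words.

```python
import re

words = ["!an", "7an", "Ran", "[an", "an", "Stan"]

# Exactly one character of any kind, then "an".
one_char = [w for w in words if re.fullmatch(".an", w)]
print(one_char)

# Zero or more characters of any kind, then "an".
any_prefix = [w for w in words if re.fullmatch(".*an", w)]
print(any_prefix)
```

The second pattern accepts every word in the list, including plain "an", which shows how wide that search is.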

Quantifiers *+?

So you might like the bracket-search you wrote earlier and want to re-use it for more than just one character. Then we can use quantifiers. Let’s say for popular example that you want to validate a western style first name (just letters, for simplicity). 

[a-z]* - Match any run of letters between “a” and “z” following each other, zero or more times.

The star/asterisk immediately behind the brackets means “find what I just typed any number of times”. While [a-z] would only match “a”, “b”, etc., [a-z]* matches “”, “a”, “aaa”, “kjh”, “dlfkajhsdflkajhd”, and everything you can monkey-type on your keyboard’s alphabetic keys. It will however also accept a completely empty string since * means “any amount”, thus also “zero”.

[a-z]+ - Match anything with at least one letter between “a” and “z” following each other. 

The plus sign takes care of that empty string problem. It means - almost like the star - that we want to match any amount, but the difference is that it can’t be zero of it!

There’s also another small possibility to improve our Regex for finding names since it doesn’t take capital letters into account. So let’s fix that.

[A-Z][a-z]+ - Match words starting with a capital letter followed by at least one lowercase letter.

This would therefore match things like “Da”, “Boo”, “Gaah”, and “Xksajhds”, but not “xmn”, “762”, “c”, or “” (empty string).

With the ? character we will match zero or one, but not more than that.

a[0-1]? - Match a string being “a”, “a0” or “a1”, but never “a11” as a whole, as that has too many occurrences of the number.

Quantifier summary

Let's recap the basics of quantifiers:

? = Match zero or one occurrence only.

+ = Match one or more occurrences, but never zero.

* = Match zero or more occurrences.
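The three quantifiers can be compared side by side in a short Python sketch. The helper `is_full_match` is a name I made up for this example; it only wraps `re.fullmatch`.

```python
import re

def is_full_match(pattern, text):
    """Return True when the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# The star accepts an empty string, the plus does not.
empty_with_star = is_full_match("[a-z]*", "")
empty_with_plus = is_full_match("[a-z]+", "")

# A capital letter followed by at least one lowercase letter.
samples = ["Da", "Boo", "Gaah", "xmn", "762", "c", ""]
names = [w for w in samples if is_full_match("[A-Z][a-z]+", w)]

# The question mark allows the digit zero or one time.
optional_digit = [w for w in ["a", "a0", "a1", "a11"] if is_full_match("a[0-1]?", w)]

print(empty_with_star, empty_with_plus, names, optional_digit)
```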

Quantifiers, part 2 {x}

But, what if you know the exact amount of times this bracket-match should occur? Like when looking for Norwegian zip-codes (always four digits). In that case we can’t use the special * or + but have to set the number explicitly.

Some examples:

[0-9]{4} - Match any number for exactly four times in a row.

[0-9]{1} - Match any number for exactly one time (same as writing [0-9]).

[12][0-9]{2} - Match anything starting with a “1” or a “2” followed by exactly two more numbers between 0 and 9.

But what if we want to find a year in a string? A year can be written with two or four digits. Can we do that with Regex? Of course!

Example:

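[0-9]{2,4} - Match between two and four numbers following each other. So “99” is as valid as “1999”, but note that three digits like “199” are accepted too. If you need exactly two or exactly four, you can write [0-9]{2}([0-9]{2})? instead.

With the comma we define a minimum and a maximum number of repetitions for that match. Thus covering almost any example of matching we can come up with.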
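One detail worth checking by hand is that {2,4} is a range from two to four, so three digits pass as well. The Python sketch below compares it with one possible stricter pattern for "two or four digits"; that second pattern is my own suggestion, not something from the talk.

```python
import re

samples = ["9", "99", "199", "1999", "19999"]

# {2,4} means two, three or four digits.
two_to_four = [s for s in samples if re.fullmatch("[0-9]{2,4}", s)]
print(two_to_four)

# Two digits, optionally followed by exactly two more.
two_or_four = [s for s in samples if re.fullmatch("[0-9]{2}([0-9]{2})?", s)]
print(two_or_four)

# A four-digit zip code.
zip_ok = re.fullmatch("[0-9]{4}", "0150") is not None
print(zip_ok)
```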

So as you can see with brackets, the dot, and the quantifiers we can do pretty cool things. And this is still just the beginning.

More special characters ^$

Used outside of the square brackets the ^ and $ signs have new meanings. And yes, this is a big part of why Regex is a bit confusing to start with. Instead of the ^ meaning “not these characters” it means “start of the string”.

Let’s see the difference with two examples:

[0-2]+ - This will match any occurrences of at least one of “0”, “1”, or “2” starting anywhere in the string.

^[0-2]+ - This does the same as the above, except the ^ character tells the Regex that the match must be at the very start of your string. It would thus match “12hi”, “2goodbye”, “0” but not “hi0”, “cheers 012” or “one 2 three” (which the first one would’ve matched).

Very much like the start-of-string character we have an end-of-string character: $

It’s used at the end of your Regex to signify that the match must reach the end of the string. If used together they mean that only what you search for is allowed in that string, nothing else.

^[0-2]$ - Match only strings being “0”, “1”, or “2” without any other text.
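Here is a Python sketch of the three patterns above, run against the same sample strings. It uses `re.search`, which looks anywhere in the string, so the anchors do all the restricting.

```python
import re

samples = ["12hi", "2goodbye", "0", "hi0", "cheers 012", "one 2 three"]

# No anchors: a match anywhere in the string is enough.
anywhere = [s for s in samples if re.search("[0-2]+", s)]
print(anywhere)

# ^ forces the match to begin at the start of the string.
at_start = [s for s in samples if re.search("^[0-2]+", s)]
print(at_start)

# ^ and $ together: the string must be exactly one of 0, 1 or 2.
whole_string = [s for s in samples if re.search("^[0-2]$", s)]
print(whole_string)
```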

Grouping (x)

All these matches don’t mean much if you need to find a specific text and extract it, or replace it. The previous examples work fine for validating data: if the Regex finds your “search” it returns true, and the string is valid. But you will get even more out of Regex when you start doing replacements in strings, and to do that you need to store your matches in memory. You do this by wrapping parts of your Regex in parentheses. In Regex this is called grouping.

(hej) - Match the string “hej” and store it in memory.

(hej|hei) - Match the string “hej” or “hei”, and store it in memory.

Use them with special characters:

test(.*)r - Match any string that starts with “test”, ends with “r”, and store whatever it finds between (using the any-character any number of times), be it "" (empty string), “a”, or “osterone”, etc.

Use them with square brackets:

age ([0-9]*) - Match “age ” in a string and any following numbers, storing the numbers.

Only use parentheses for things you need to access later.
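In Python the stored groups are available on the match object, and a replacement can refer back to them. This is a minimal sketch with made-up input strings.

```python
import re

# Alternatives inside a group: each hit is stored.
greetings = re.findall("(hej|hei)", "hej og hei")

# Whatever sits between "test" and the final "r" is stored.
between = re.search("test(.*)r", "tester").group(1)

# Store the digits that follow "age ".
age = re.search("age ([0-9]*)", "I am of age 42 now").group(1)

# \1 in the replacement refers to the first stored group.
replaced = re.sub("age ([0-9]+)", r"\1 years old", "age 42")

print(greetings, between, age, replaced)
```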

Dissecting examples

Well I could go on explaining more and more about all things Regex that I've learnt from Rustam's presentation. But I’ll end now with two real-world examples, going through every little part of it.

[\/?&]test=([^&#]*)

The what now!? So let’s just take this slow. 

[\/?&]

First we have the square brackets, which means the match starts with one of the characters inside them. The first two characters count as only one, as the first is the escape character escaping the second, a forward slash. The forward slash has no special meaning in Regex itself, but in languages such as JavaScript a Regex is written between two forward slashes, so a bare / as in [/?&] could be read as the end of the Regex. To be safe we escape it with the \ backslash.

So basically, find something that starts with a "/", "?" or "&".

After that we see “test=” which basically just finds a normal text inside what we are searching. We will thus so far match on “/test=”, “&test=” and “?test=”. Maybe you can figure out that we want to analyse a URL. But what about the last part, that looks crazy.

([^&#]*)

First off, the parentheses are for grouping and remembering what is found. So the Regex-code inside here will find something and store it in memory so we can use it in some way. If we just remove the parentheses for clarity:

[^&#]*

The ^ means “not” here. So it will not match any character that is the “&” or the “#”. You see how starting with a ^ negates all the following characters.

So we will match any characters that are not a "&" or a "#", any number of times (even zero) - the asterisk, remember. What happens here is that what will be stored in the group is what comes after the “test=” until the string ends, or we come to a "&" or "#".

This Regex finds the QueryString “test” in a URL and stores its content. So if I used the URL labs.enonic.com/page?test=hello I would have stored "hello" with this Regex.
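As a rough sketch, this is how the pattern could be used from Python. The function name `get_test_param` and all URLs except the first are made up for illustration.

```python
import re

# The pattern from the article, written as a raw string.
QUERY = re.compile(r"[\/?&]test=([^&#]*)")

def get_test_param(url):
    """Return the value of the 'test' parameter, or None if it is missing."""
    match = QUERY.search(url)
    return match.group(1) if match else None

simple = get_test_param("labs.enonic.com/page?test=hello")
second_param = get_test_param("labs.enonic.com/page?a=1&test=hi%20there#top")
empty_value = get_test_param("labs.enonic.com/page?test=")
other_name = get_test_param("labs.enonic.com/page?contest=x")

print(simple, second_param, empty_value, other_name)
```

The last call returns None because "contest=" has an "n" in front of "test=", and that is not one of the three allowed characters in the first bracket.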

One more example before I let you go:

^([0-9]{2})\.([0-9]{2})\.([0-9]{4})$

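Let’s break it apart and add some spaces (always recommended since it makes it more readable and easier to understand in my opinion). The spaces are only there for reading: in an actual Regex a space is matched literally, unless your language offers a free-spacing mode.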

^

( [ 0 - 9 ] { 2 } )

\.

( [ 0 - 9 ] { 2 } )

\.

( [ 0 - 9 ] { 4 } )

$ 

Let’s start with the first line. It means “start of string”, so this makes sure that what we match begins at the very start of the string and not just anywhere inside it.

At the end we have the $ character, matching the “end of string”. Together they make sure that our pattern is the only thing allowed in the string we get matches on.

([0-9]{2})

I guess this is somewhat easy to dissect now. First [0-9]{2} means any number exactly two times. Remember? And with the parentheses surrounding it we’ll store that match in a separate place in memory.

\.

After that you’ll see this. Remember “.” was a special character meaning “any character”. But here we really only want to match the normal dot character, as used to end sentences etc, and not any character. To do that we need to escape it with the backslash. Backslash is the escape character in Regex.

([0-9]{2})\.

Now we see the pattern repeating, we’ll match and store in memory another set of two numbers. Followed by a dot.

([0-9]{4})$

It all ends with a match for a set of four numbers. Then that line should end.

This gives us a Regex that will find any date formatted as a Norwegian date (DD.MM.YYYY). Each part of that date is stored on its own in memory, so if we used this in for instance JavaScript we could access what we stored as an array and then reformat that string! Pretty cool huh!? But that is not covered in this little introduction.

NB! Important to note is that this Regex example is very basic and it will allow a huge amount of invalid dates, like the 99th of June in the year 9999, or day 12 of the 79th month in the year 0000, etc. But writing the perfect Regex for dates would make for a pretty overcomplicated example.
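For the curious, here is a rough sketch of reading the three stored groups and reformatting the date, written in Python rather than JavaScript. The function name `to_iso` and the YYYY-MM-DD output format are my own choices for illustration.

```python
import re

# The date pattern from the article, written as a raw string.
DATE = re.compile(r"^([0-9]{2})\.([0-9]{2})\.([0-9]{4})$")

def to_iso(text):
    """Turn DD.MM.YYYY into YYYY-MM-DD, or return None if there is no match."""
    match = DATE.search(text)
    if match is None:
        return None
    day, month, year = match.groups()
    return f"{year}-{month}-{day}"

valid = to_iso("17.05.2014")
nonsense = to_iso("99.79.0000")       # still accepted, as warned above
extra_text = to_iso("on 17.05.2014")  # rejected because of the ^ anchor

print(valid, nonsense, extra_text)
```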

Closing words

That's it for now. If you find this interesting, please comment, I might go for a follow-up article. Also if you have questions or need clarifications just ask.

A very powerful and visual tool for helping you understand Regex better is Regex101. Just paste your Regex there and to the right you'll see an explanation that breaks down your Regex into the smallest possible pieces. I've found this tool extremely helpful when the Regex is a bit too complicated to interpret by hand.
