Tutorial: Creating a regex parser with Scala



    “Any sufficiently advanced technology is indistinguishable from magic.” - Arthur C. Clarke

    Scala is a wonderfully complex and simple language that combines object-oriented programming with functional programming in a way that is especially well suited for working with strings. It is complex because it has so many built-in features, many of which are implementations of concepts that take years to understand. It is simple because the implementation is so elegantly and powerfully made, and fits so well together, that many problems are easily solved with one command where other languages would require complex algorithms.

    One of the tasks Scala is specifically suited for is parsing. In Enonic, we have a query language for use in our datasources that used to be described by a 10-page JavaCC grammar. JavaCC would transform those 10 pages into 7 Java classes that were partially obfuscated and impossible to debug. All of this was replaced by a simple, one-page Scala class.

    The Scala Language

    There are several features of the Scala language that make it particularly suitable for parsing.

    Regular Expressions

    The first and foremost is that regular expressions are a basic part of the language. This makes it easy to match patterns in the string to be parsed and to separate out the different tokens.

    Extractors

    Scala has an advanced set of objects called extractors, which can be applied to a string. If the extractor can do anything useful with the string, it proceeds. If not, the string is passed on to the next extractor.

    A really cool thing is that these can be applied recursively, so unnesting parentheses becomes a breeze. Scala comes with several ready-made extractors, of which Regex is a very advanced and flexible one.

    Simple and effective operators

    Scala has a number of operators that can appear confusing at first glance. For instance, there are multiple uses of normally unused characters, like the hat and the colon: ^, ^^, ^^^, :, :: and ::: all perform different operations. What they do is just a matter of learning, but once each one is understood, it makes for some quite effective programming. Here are two that are especially important to parsing:

    The double hat (^^) can be thought of as defining a function to operate on an argument on the left hand side, like f(x) in mathematics. In Scala, F(x) would be expressed as x ^^ F. Example:

    number ^^ {s => new ValueExpr(s.toDouble)}

    Here, number is a Regex which has been matched by an extractor, and the matched string is applied to the function after the double hat. This function binds it to the variable s, converts it to a double and creates a new ValueExpr object from it. Obviously, the function on the right can be much more complex, so here we already get a glimpse of how easily an advanced operation can be applied to a matched string or part of a string.

    Another important operator that makes the matching simpler is the tilde (~), the basic parser combinator operator. This can be used with extractors to match multiple regular expressions or simple tokens, for instance like this:

    fieldExpression~relationalTokens~computedExpression

    where the 3 elements are Regex objects which have been defined to match separate parts of the query. Each of these may be a token or a set of tokens, however it makes the most sense to split up the input.

    Java integration

    Scala runs on the JVM and has full access to the Java API classes, in addition to its own API. For instance, in the double hat example above, the ValueExpr object that is created is a Java object, and s is an ordinary Java String. Scala does not have a String class of its own, but simply uses the String class from the Java API. However, some additional methods have been added on top of it: s.toDouble is one of them, and it is also possible to turn a String into a Regex, simply by calling the r method: s.r
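
    To make this concrete, here is a small illustration (the values are made up for the example):

    val s = "3.14"                    // an ordinary java.lang.String
    val d = s.toDouble                // toDouble is one of the methods Scala adds on top
    val re = """\d+\.\d+""".r         // .r turns the String into a scala.util.matching.Regex
    val matched = re.findFirstIn(s)   // Some("3.14")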

    Case classes

    These classes could be considered another instance of “simple and effective operators”, but they are classes, not operators, and so unique that they deserve a section of their own. Case classes are regular classes which export their constructor parameters and provide a recursive decomposition mechanism via pattern matching. In parsing, we use pattern matching to recognize tokens, extractors to specify their legal ordering, and case classes to process the tokens once the order has been determined to be legal.
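
    As a minimal illustration of the decomposition mechanism (the Token class here is made up for the example, it is not part of the parser):

    case class Token(kind: String, text: String)

    val t = Token("operator", "NOT")

    t match {
      case Token("operator", text) => println("operator: " + text)   // prints "operator: NOT"
      case Token(_, text)          => println("something else: " + text)
    }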

    Several predefined parser classes as part of the standard API

    scala.util.parsing.combinator.Parsers is a standard base class for several parsers in Scala. Most of these are based on defining a set of tokens and applying rules to define the structure of the language. There is also the scala.util.parsing.combinator.RegexParsers class. This class allows for much more freedom: the developer has to define every element to be parsed as a regular expression, allowing for special variations and any combination conceivable. When choosing a parser for the Enonic CMS query language, I first tried the token parser, but kept running into limitations, so I tried RegexParsers and, as this article will show, found a very clean and straightforward way to parse our query language.
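
    To give an idea of the shape of such a parser, here is a minimal, self-contained RegexParsers skeleton (the names and the toy grammar are made up, they are not the Enonic parser):

    import scala.util.parsing.combinator.RegexParsers

    object MiniParser extends RegexParsers {
      // every element to be parsed is defined as a regular expression
      def number: Parser[Double] = """-?\d+(\.\d+)?""".r ^^ { _.toDouble }

      // a trivial rule combining tokens with the ~ operator
      def sum: Parser[Double] = number ~ "+" ~ number ^^ { case a ~ _ ~ b => a + b }
    }

    MiniParser.parseAll(MiniParser.sum, "1 + 2")   // parses to 3.0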

    Parsing the Enonic CMS query language

    I have already shared some small bits of code from our Scala parser, but let me start from the beginning. The main goal of parsing text is to produce an object graph which can be manipulated and optimized, and then used to produce a query to the database. For this article, let's take the following Enonic CMS query as an example:

    data/firstname NOT IN ('John', 'Joan', "Jim") ORDER BY data/lastname, data/birthdate DESC

    It may be a somewhat weird query which would never be used in real life, but it illustrates some good concepts, including negation and arrays. Parsing this query should produce an object graph like this:

    [Figure: object graph produced by the Scala parser for the example query]

    So, how do we go about the task of producing the object graph from the string? This tutorial will now build an entire parser in small steps. There will be code excerpts along the way, and the entire code may be found at the bottom of this page.

    Recognizing tokens

    The first element of parsing is recognizing the tokens. Let's start with the easy part: recognizing string literals like “John” or “Joan”. In our query language, a text literal must be encapsulated by either single quotes or double quotes, so we generate a Regex for the literal like this:
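
    The original code excerpt is not reproduced here, but based on the description it would look roughly like this (the name literal is assumed):

    val literal = """('[^']*')|("[^"]*")""".r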

    The triple quote is Scala syntax to avoid escaping: anything inside a triple-quoted string is taken exactly as it is written. For those of you not familiar with regular expressions, this will match two single quotes with anything but a single quote inside, or two double quotes with anything but a double quote inside.

    Next, we define a Parser that is used to strip off the quotes and produce a ValueExpr object from the recognized string.
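
    Again, a sketch of what that line probably looks like (the exact original is not shown here):

    def literalExpression: Parser[ValueExpr] =
      literal ^^ { s => new ValueExpr(s.substring(1, s.length - 1)) }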

    This simple line of code does a lot of things that need explaining. First of all, we define a parser that will produce a ValueExpr object. This parser definition is a function that takes regular expressions matching literals and creates a ValueExpr object from the literal.

    After we get the value of the literal, we no longer need the quotes, so they are stripped off in the process, using the substring function. Since Regex is also an extractor, the resulting parser can be applied to strings, and performs this transformation only on the strings that match. Strings that do not match are passed on to the next extractor, which may be a parser that tries to match numbers:
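
    A sketch of the number parser, reusing the snippet shown earlier (the number regex itself is an assumption):

    val number = """-?\d+(\.\d+)?""".r

    def numberExpression: Parser[ValueExpr] =
      number ^^ { s => new ValueExpr(s.toDouble) }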

    In our object parse tree, numbers and literals may both be values, and since we do not have any variable typing in our query language, we can treat them as equal. Again, there is a simple solution: just define a new parser as either of the two:
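
    Something along these lines:

    def valueExpression: Parser[ValueExpr] = numberExpression | literalExpression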

    From now on, wherever the parser should look for a value, it will use this extractor to try to match the token. First it will apply the numberExpression extractor. If that does not match, it will try literalExpression. If neither works, the parse fails at that point.

    The right hand side of a comparison may be a straight value, but in some situations it can be a function, which may or may not operate on values or variables, or a list of values encapsulated by parentheses. I am not going to go through all the details of how they are constructed, as it is just more regular expressions of the kind we have already seen. However, the result is more parsers: functionExpression, listExpression and computedExpression. A computedExpression is either a valueExpression or a functionExpression, just like a valueExpression is either a numberExpression or a literalExpression. For the left hand side, I build a similar fieldExpression.
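
    For example, the computedExpression combination would look something like this (assuming the individual parsers produce objects that share the Java Expression supertype mentioned below):

    def computedExpression: Parser[Expression] = valueExpression | functionExpression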

    Recognizing the valid order of tokens

    So far, quite a few different tokens may be recognized by the different parsers. Also, some sets of tokens have been combined to indicate that they are interchangeable in specific positions, but those positions are not yet defined. So, we need to define the order of the tokens. Again, we break it down into smaller parts, and start by defining a complete compareExpression.

    In our query language, there are really three different types of comparison operators. All take a fieldExpression on the left hand side, but have slight variations in what they allow on the right hand side. There are the basic mathematical operators, =, !=, >, <, >= and <=, which need a computedExpression on the right hand side, the IN operator, which needs a listExpression on the right hand side, and the LIKE operators (LIKE, CONTAINS, STARTS WITH, ENDS WITH), which need a literalExpression on the right hand side. In addition, the IN and LIKE operators may be preceded by a NOT. All these textual operators are supposed to be case insensitive, so the regular expressions become somewhat complex:
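
    Roughly along these lines (the exact expressions from the original are not shown, so treat these as approximations; likeTokens is an assumed name, while notToken, inToken and relationalTokens appear later in the article):

    val relationalTokens = """(!=|>=|<=|=|>|<)""".r
    val notToken         = """(?i)NOT""".r
    val inToken          = """(?i)IN""".r
    val likeTokens       = """(?i)(LIKE|CONTAINS|STARTS\s+WITH|ENDS\s+WITH)""".r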

    Now, we are ready to supply a complex extractor with five alternative extractors for any valid comparison:
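
    The original definition is not reproduced here, but it follows the pattern sketched below. The constructor arguments of CompareExpr and the exact signature of CompareExpr.likeEvaluation are assumptions; the five alternatives and the helper names are taken from the article:

    def compareExpression: Parser[Expression] =
      fieldExpression ~ relationalTokens ~ computedExpression ^^
        { case field ~ op ~ value => new CompareExpr(op, field, value) } |
      fieldExpression ~ inToken ~ listExpression ^^
        { case field ~ op ~ list => new CompareExpr(op, field, list) } |
      fieldExpression ~ notToken ~ inToken ~ listExpression ^^
        { case field ~ not ~ in ~ list => new CompareExpr(not + " " + in, field, list) } |
      fieldExpression ~ likeTokens ~ literalExpression ^^
        { case field ~ op ~ literal => CompareExpr.likeEvaluation(op, field, literal) } |
      fieldExpression ~ notToken ~ likeTokens ~ literalExpression ^^
        { case field ~ not ~ op ~ literal => CompareExpr.likeEvaluation(not + " " + op, field, literal) }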

    Now, that's a lot of code that needs some explaining: compareExpression is defined as a Parser that can deal with five variations of token sequences. These five are extractors that will try to match the input string and apply their function if they match, or pass execution on to the next one if they fail. If a single string can match multiple extractors, the order becomes important, but in our parser the five defined token sequences cannot be mistaken for each other, so the order does not matter.

    Let's look at the NOT IN variation that will be applied to our example, and see what it does in detail. First we have the token sequence, defined as:

    fieldExpression~notToken~inToken~listExpression

    These are 4 different parsers, each representing one or more tokens. They are combined with the parser combinator operator, which is what makes Scala so powerful here. Now the tokens have a legal order in which they have to appear, and if they match this order, the whole stream is applied to the following function.

    Enter the case class. Case classes have many nice features, including another way of doing pattern matching, where separate case classes are defined for each matched pattern. However, in this situation we are only interested in the pattern decomposition feature of case classes, so only one case is provided for each matched extractor, and this case is guaranteed to match the tokens it receives once the extractor has made the first match.

    Now that the tokens have been decomposed, they are passed to a Java object, CompareExpr, that we recognize from the graph we are trying to compose. But this does not happen immediately; some care needs to be given to the execution order here. Each of the elements is a Scala Parser, represented by the regular expression that was used to match it in the extractor, but the CompareExpr object is a Java object that takes Java Expression objects as input. So, before the defined function is executed, the Parsers for the fieldExpression and listExpression are applied to produce a FieldExpr object and an ArrayExpr object. These are the objects passed to the CompareExpr constructor, which is defined to be the result of the function.

    You may have noticed that the two variations with the LIKE tokens are not passed to a CompareExpr constructor, but to the CompareExpr.likeEvaluation method. This method does some early optimization of the LIKE expressions, by transforming the CONTAINS and STARTS WITH / ENDS WITH expressions into LIKE expressions with the necessary wildcard at the start, the end or both sides of the string literal.

    Wrapping up

    What's left to look at now is the logical composition of several compare expressions with AND or OR, with parentheses, and the ORDER BY expression. Neither contains any elements or features of the Scala language beyond what has already been explained, and the ORDER BY expression is really simple, but nesting complicated logical expressions within parentheses needs a few more comments.

    The parsers in Scala are recursive, and the order of their definition is not important. Also, an AND expression has a higher priority than an OR expression and should therefore be applied first, unless parentheses change the natural order.

    All this unwinds very easily because of the recursive behaviour of the parsers. In this parser, there are 4 levels of expressions that each may be an instance of the previous one: orExpression, andExpression, possiblyNegatedExpression and encapsulatedExpression. An encapsulatedExpression may be a compareExpression, or an orExpression in parentheses:
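
    Sketched, that definition would be:

    def encapsulatedExpression: Parser[Expression] =
      compareExpression | "(" ~> orExpression <~ ")"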

    A possiblyNegatedExpression may be an encapsulatedExpression with or without a NOT in front of it:
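
    For example (NotExpr is an assumed name for the object representing a negated expression):

    def possiblyNegatedExpression: Parser[Expression] =
      notToken ~> encapsulatedExpression ^^ { expr => new NotExpr(expr) } |
      encapsulatedExpression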

    And so forth for the andExpression and the orExpression, which means that in four simple statements, all the complexity of nested parentheses has been dealt with. With this notation that we are now quite familiar with, we can define a complete queryExpression as an orExpression, an orderByExpression or both:
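
    Sketched with the same reservations as before (LogicalExpr and QueryExpr are assumed names, and the AND/OR tokens are written inline):

    def andExpression: Parser[Expression] =
      possiblyNegatedExpression ~ rep("""(?i)AND""".r ~> possiblyNegatedExpression) ^^
        { case first ~ rest => rest.foldLeft(first)((left, right) => new LogicalExpr("AND", left, right)) }

    def orExpression: Parser[Expression] =
      andExpression ~ rep("""(?i)OR""".r ~> andExpression) ^^
        { case first ~ rest => rest.foldLeft(first)((left, right) => new LogicalExpr("OR", left, right)) }

    def queryExpression: Parser[QueryExpr] =
      opt(orExpression) ~ opt(orderByExpression) ^^
        { case expr ~ orderBy => new QueryExpr(expr.orNull, orderBy.orNull) }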


    Conclusion

    So, reading text, or parsing, as computers call it, is actually very straightforward.

    Below is the entire code for the whole parser, including both the excerpts above and the rest of the code that shows how it all fits together.

    This is a very basic parser for the fairly simple query language of Enonic CMS. It is a good example of how the flexibility of regular expressions makes it easy to create parsers in Scala, and should be very simple to adapt and expand to fit any other language that needs parsing.
