This post originated from Lab 1 of course Compilers: Principles that I’m currently taking, in which we were required to write a flex program to parse a subset of the C language. The multiline comment /* */ was the most troublesome to handle for most of us (excluding me, for sure).

The process

I’ll assume you’ve already drawn a DFA for the multiline-comment structure, so here it is:

DFA for the multiline comment

We’re first going to turn it into “state transformation equations”, so it looks like this:

The first step we’re taking is to realize that $A=S \mid Aa$ is easily found to be equivalent to $A = Sa^*$, where the superscript asterisk means “repeat 0 or more times”. So $B$ can be turned into

Again, the superscript plus means “repeat 1 or more times” as the same in PCRE.

Now it’s time to substitute $B$ with its simplified expression:

Note that there’s a distributive property here, which described using symbols, is that $Aa \mid Ab = A(a\mid b)$, so now we have

Applying the first transformation $A = S \mid Aa = Sa^*$, we have

Now there’s no recursion in the new “state transformation equation”, so we can substitute $A$ with this final expression and get the regular expression for $C$, the result we want:

Converting the above regular expression to code, we now have

C = \/\*([^*]|\*+[^*/])*\*+\/

Try it online with RegEx101!


Now can you imagine how to use regular expressions to match multiples of 3 (base 10)? Yes, it’s entirely possible. See this fantastic article for details, which uses essentially the same techniques to convert a DFA (or a finite-state machine) to a regular expression that does the job.