Regexp : Presentation, optimization, goroutines

Go has its own regexp implementation. It’s really useful for checking syntax and extracting values in just a few lines.
Let’s take a look !

Basic example : Parsing an email

john@travolta.com

Let’s catch :
-> Name : john
-> Domain : travolta
-> Extension : com

1. Check the syntax

package main

import (
	"fmt"
	"regexp"
)

func main() {
	format := `^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$`

	// Compile the regexp once so that it can be reused
	reg, err := regexp.Compile(format)
	if err != nil {
		panic(err)
	}

	// Some tests
	line := "john@travolta.com"
	line2 := "nicolas2.cage@hotmail.fr"
	line3 := "failure@obvious"
	line4 := " RADICAL FREEDOM !"

	// MatchString returns true if
	// the input matches the regexp
	fmt.Println(reg.MatchString(line))
	fmt.Println(reg.MatchString(line2))
	fmt.Println(reg.MatchString(line3))
	fmt.Println(reg.MatchString(line4))
}

Result :

true
true
false
false
 
Program exited.

2. Extract values

We are going to use regexp named capture groups.
As a reminder, the syntax is :

(?P<name>expression)

Now, let’s extract those values :

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// We add the 3 named capture groups : name, domain, extension.
	// The pattern is split for readability
	name := `(?P<name>[a-zA-Z0-9_.+-]+)`
	domain := `(?P<domain>[a-zA-Z0-9-]+)`
	extension := `(?P<extension>[a-zA-Z0-9-.]+)`

	format := `^` + name + `@` + domain + `\.` + extension + `$`

	// Note : backquotes (raw strings) rather than double quotes prevent Go
	// from interpreting escape sequences such as "\."

	reg, err := regexp.Compile(format)
	if err != nil {
		panic(err)
	}

	line := "john@travolta.com"

	matches := reg.FindStringSubmatch(line)
	fmt.Println("Matches :", matches)

	// If the input matches, FindStringSubmatch
	// returns a slice containing :
	// [0] -> The text matched by the whole expression
	// [1...] -> The submatches (named or not named)

	if len(matches) == 0 {
		// No match at all : FindStringSubmatch returned nil
		panic(fmt.Sprint("Invalid input : ", line))
	}

	// At this step, because the matched values are in the same order as the groups in the regexp,
	// we could blindly print them by calling "matches[1]", etc...
	// But there is a more generic way to do it :
	// print the name of each capture group, followed by what it captured

	names := reg.SubexpNames() // Ordered names of the capture groups ([]string)
	fmt.Println("Names :", names)

	// "matches" and "names" have the same length
	// (index 0 for the whole match, then 1 element for each capture group)
	for i, name := range names {
		fmt.Println(name, "->", matches[i])
	}
}

Result :

Matches : [john@travolta.com john travolta com]
Names : [ name domain extension]
 -> john@travolta.com
name -> john
domain -> travolta
extension -> com
 
Program exited.

About nameless capture

As you can see, the first name is blank : index 0 holds the whole match, which has no name.
Each () is a capture group as well, so you can have more captures without a name ; they also show up as blank entries in the list of names.
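
To illustrate how named and unnamed groups coexist, here is a minimal sketch that keeps only the named captures in a map. The helper namedCaptures, the variable emailRe and the simplified pattern (with an unnamed group around the "@") are assumptions made for this example ; they are not part of the regexp package.

```go
package main

import (
	"fmt"
	"regexp"
)

// The group around "@" is left unnamed on purpose.
var emailRe = regexp.MustCompile(`^(?P<name>[a-z]+)(@)(?P<domain>[a-z]+)\.(?P<extension>[a-z]+)$`)

// namedCaptures is a hypothetical helper : it returns the named
// submatches of s as a map, or nil if s does not match re.
func namedCaptures(re *regexp.Regexp, s string) map[string]string {
	matches := re.FindStringSubmatch(s)
	if matches == nil {
		return nil
	}
	result := make(map[string]string)
	for i, name := range re.SubexpNames() {
		if name == "" {
			// Index 0 (the whole match) and unnamed groups have no name.
			continue
		}
		result[name] = matches[i]
	}
	return result
}

func main() {
	fmt.Println(len(emailRe.SubexpNames())) // whole match + 4 groups
	fmt.Println(namedCaptures(emailRe, "john@travolta.com"))
}
```

Skipping blank names drops both the whole match and the unnamed "@" group, so the map only contains name, domain and extension.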

The need for optimization

Regexps are useful to deal with a specific literal pattern with a minimal amount of code.
However, it’s not their only possible use.

Mass processing & low-latency services

Regexps are an efficient way to deal with a complex pattern in a massive amount of data, or in a service that needs to answer as quickly as possible.

  • Extracting email addresses from the Twitter stream
  • Parsing huge mathematical (or any specific) data files
  • API services with very low latency requirements
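
As a rough illustration of the first use case, the sketch below pulls every email-like token out of a block of text with FindAllString. The pattern is a simplified, unanchored variant of the one used earlier, and the names emailRe and extractEmails as well as the sample text are made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// Unanchored (no ^ and $) so that it can match anywhere in the text.
var emailRe = regexp.MustCompile(`[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+`)

// extractEmails (hypothetical helper) returns every non-overlapping
// match found in text, in order of appearance.
func extractEmails(text string) []string {
	return emailRe.FindAllString(text, -1) // -1 means "no limit"
}

func main() {
	text := "Ping john@travolta.com or nicolas2.cage@hotmail.fr, not failure@obvious"
	for _, email := range extractEmails(text) {
		fmt.Println(email)
	}
}
```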

Basic optimization : Early compilation

The regexp package is mainly composed of methods on the Regexp struct. This means that you must first compile your expression to get a Regexp.

However, some methods have a package-level function equivalent, which makes the explicit compilation optional.

func Match(pattern string, b []byte) (matched bool, err error)
func MatchReader(pattern string, r io.RuneReader) (matched bool, err error)
func MatchString(pattern string, s string) (matched bool, err error)
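
For a one-off check, the package-level form could be used as in the sketch below. The constant emailPattern and the helper looksLikeEmail are assumptions for this example ; only regexp.MatchString comes from the package.

```go
package main

import (
	"fmt"
	"regexp"
)

const emailPattern = `^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$`

// looksLikeEmail (hypothetical helper) relies on the package-level
// function : the pattern is compiled again on each call.
func looksLikeEmail(s string) (bool, error) {
	return regexp.MatchString(emailPattern, s)
}

func main() {
	for _, s := range []string{"john@travolta.com", "failure@obvious"} {
		ok, err := looksLikeEmail(s)
		if err != nil {
			// err is non-nil only when the pattern itself is invalid.
			fmt.Println("bad pattern :", err)
			return
		}
		fmt.Println(s, "->", ok)
	}
}
```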

Regexp compilation is not trivial, and these functions compile the pattern again on every call.

Sourcecode quote :

    // The regexp implementation provided by this package is
    // guaranteed to run in time linear in the size of the input.
    // (This is a property not guaranteed by most open source
    // implementations of regular expressions.) For more information
    // about this property, see
    //	http://swtch.com/~rsc/regexp/regexp1.html
    // or any book about automata theory.

Once the expression is compiled, the resulting Regexp struct is going to “run in time linear in the size of the input”, which is great news !

But only once compiled ! That’s why basic regexp optimization is about early compilation : compile all your regexps when the program starts, and run them only when needed.
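
One common way to get early compilation in Go is a package-level variable initialized with regexp.MustCompile : it runs once when the package is initialized and panics if the pattern is invalid. A minimal sketch, in which the names emailRe and isEmail are illustrative :

```go
package main

import (
	"fmt"
	"regexp"
)

// Compiled once, when the package is initialized.
// MustCompile panics at startup if the pattern is invalid.
var emailRe = regexp.MustCompile(`^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$`)

// isEmail reuses the already compiled expression on every call.
func isEmail(s string) bool {
	return emailRe.MatchString(s)
}

func main() {
	inputs := []string{"john@travolta.com", "nicolas2.cage@hotmail.fr", "failure@obvious"}
	for _, in := range inputs {
		fmt.Println(in, "->", isEmail(in))
	}
}
```

An invalid pattern then fails loudly at startup rather than in the middle of a request.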

Regexp + Goroutines = Unexpected latencies

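Using regexps in goroutines is a good way to improve performance.
However…

Regexps are mutexed !

If you use the same Regexp from several goroutines, they are going to spend a significant amount of time waiting for the mutex to unlock.

That’s why the regexp library offers a Copy method : each goroutine can work on its own copy. Note that this concerns Go releases before 1.12 ; since Go 1.12, a shared Regexp no longer suffers from this lock contention and Copy is deprecated.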

reg := regexp.MustCompile(test) // panics if the pattern is invalid
reg2 := reg.Copy()              // private copy for another goroutine

The code is pretty simple :

func (re *Regexp) Copy() *Regexp {
	r := *re
	r.mu = sync.Mutex{}
	r.machine = nil
	return &r
}
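
Putting this together, a minimal sketch of the one-copy-per-goroutine pattern could look like the following. The function countMatches, the variable emailRe and the way lines are distributed between workers are assumptions made for this illustration, not something prescribed by the regexp package.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

var emailRe = regexp.MustCompile(`^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$`)

// countMatches (hypothetical) spreads lines over `workers` goroutines,
// each one using its own copy of re, and returns how many lines match.
func countMatches(re *regexp.Regexp, lines []string, workers int) int {
	var wg sync.WaitGroup
	counts := make([]int, workers) // one slot per goroutine, nothing shared
	for w := 0; w < workers; w++ {
		wg.Add(1)
		// Copy is marked deprecated in current Go ; it is used here
		// only to mirror the pattern described in the text.
		go func(w int, local *regexp.Regexp) {
			defer wg.Done()
			// Worker w handles lines w, w+workers, w+2*workers, ...
			for i := w; i < len(lines); i += workers {
				if local.MatchString(lines[i]) {
					counts[w]++
				}
			}
		}(w, re.Copy())
	}
	wg.Wait()

	total := 0
	for _, c := range counts {
		total += c
	}
	return total
}

func main() {
	lines := []string{"john@travolta.com", "nicolas2.cage@hotmail.fr", "failure@obvious", " RADICAL FREEDOM !"}
	fmt.Println(countMatches(emailRe, lines, 2))
}
```

Each goroutine writes only to its own slot of counts, so no extra locking is needed to gather the results.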

Go offers a lot of libraries for parsing popular formats.
However, you sometimes need to deal with specific syntaxes : your own file format, a specific security check…
Regexps are a simple and fast way to solve this.

Entrepreneur – Cofounder at Golem.ai (Paris, France)

I enjoy sharing interesting Golang patterns, experiments and tips.

1 thought on “Regexp : Presentation, optimization, goroutines

  1. Compile in each goroutine” solution but it avoids the Compile, which although usually cheap may not always be. RegexpForSingleGoroutine would need all the methods that Regexp has, which is unfortunate.
