Go ships its own regexp implementation in the standard library. It’s really useful for checking syntax and extracting values in just a few lines. Let’s take a look!
…
Basic example : Parsing an email address
john@travolta.com
Let’s capture:
-> Name : john
-> Domain : travolta
-> Extension : com
1. Check the syntax
    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        format := `^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$`

        // Compile the pattern into a reusable *regexp.Regexp
        reg, err := regexp.Compile(format)
        if err != nil {
            panic(err)
        }

        // Some tests
        line := "john@travolta.com"
        line2 := "nicolas2.cage@hotmail.fr"
        line3 := "failure@obvious"
        line4 := " RADICAL FREEDOM !"

        // MatchString returns true if the input matches the regexp
        fmt.Println(reg.MatchString(line))
        fmt.Println(reg.MatchString(line2))
        fmt.Println(reg.MatchString(line3))
        fmt.Println(reg.MatchString(line4))
    }
Output:

    true
    true
    false
    false
2. Extract values
We are going to use named capture groups.
As a reminder, the syntax is :
(?P<name>expression)
Now, let’s extract those values:
    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        // The pattern has 3 named capture groups: name, domain, extension.
        // It is split into pieces for readability.
        name := `(?P<name>[a-zA-Z0-9_.+-]+)`
        domain := `(?P<domain>[a-zA-Z0-9-]+)`
        extension := `(?P<extension>[a-zA-Z0-9-.]+)`
        format := `^` + name + `@` + domain + `\.` + extension + `$`
        // Note: a raw string literal (backquotes) keeps the backslash as-is.
        // With double quotes, "\." is not a valid Go escape sequence and
        // would have to be written "\\.".

        reg, err := regexp.Compile(format)
        if err != nil {
            panic(err)
        }

        line := "john@travolta.com"
        matches := reg.FindStringSubmatch(line)
        fmt.Println("Matches :", matches)

        // If the input matches, FindStringSubmatch returns a slice containing:
        // [0]    -> the text matched by the whole expression
        // [1...] -> the submatches (named or not named)
        if len(matches) == 0 {
            // A nil (empty) result means the expression doesn't match at all
            panic(fmt.Sprint("Invalid rule :", line))
        }

        // The matches are in the same order as the groups in the regexp,
        // so we could blindly print matches[1], matches[2], etc.
        // A more generic way is to print the name of each capture group,
        // followed by what it captured.
        names := reg.SubexpNames() // ordered names of the capture groups ([]string)
        fmt.Println("Names :", names)

        // "matches" and "names" have the same length:
        // one element for the whole match, plus one per capture group
        for i, name := range names {
            fmt.Println(name, "->", matches[i])
        }
    }
Output:

    Matches : [john@travolta.com john travolta com]
    Names : [ name domain extension]
     -> john@travolta.com
    name -> john
    domain -> travolta
    extension -> com
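As you can see, the first name is blank: index 0 holds the match of the entire expression, which cannot be named.
Keep in mind that every pair of parentheses () is a capture group, so a pattern can also contain unnamed groups. Their entry in SubexpNames is an empty string as well.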
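To illustrate the point about unnamed groups, here is a minimal sketch that collects only the named captures into a map. The helper namedGroups and the pattern variant with an optional, unnamed "+tag" group are assumptions made for this example; they are not part of the regexp package.

```go
package main

import (
	"fmt"
	"regexp"
)

// taggedEmailRe mixes three named groups with one unnamed group
// (the optional "+tag" part). Hypothetical pattern for illustration.
var taggedEmailRe = regexp.MustCompile(
	`^(?P<name>[a-zA-Z0-9_.-]+)(\+[a-zA-Z0-9]+)?@(?P<domain>[a-zA-Z0-9-]+)\.(?P<extension>[a-zA-Z0-9.-]+)$`)

// namedGroups returns the named captures of s as a map,
// or nil if s does not match re.
func namedGroups(re *regexp.Regexp, s string) map[string]string {
	matches := re.FindStringSubmatch(s)
	if matches == nil {
		return nil
	}
	groups := make(map[string]string)
	for i, name := range re.SubexpNames() {
		if name == "" {
			// Index 0 (the whole match) and unnamed groups have no name.
			continue
		}
		groups[name] = matches[i]
	}
	return groups
}

func main() {
	fmt.Println(taggedEmailRe.SubexpNames())
	fmt.Println(namedGroups(taggedEmailRe, "john+news@travolta.com"))
	fmt.Println(namedGroups(taggedEmailRe, "failure@obvious"))
}
```

Skipping empty names keeps the caller independent of the position of each group, so adding or removing an unnamed group does not break the lookup.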
The need for optimization
Regexps are useful for handling a specific literal pattern with a minimal amount of code.
However, it’s not the only possible use.
Mass processing & low-latency services
Regexps are an efficient way to deal with a complex pattern in a massive amount of data, or in services that need to answer as quickly as possible.
- Extracting email addresses from the Twitter stream
- Parsing huge mathematical (or any other specific) data files
- API services with very low latency requirements
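As a rough sketch of the first use case, the code below scans a text stream line by line and collects everything that looks like an email address. The function names and the unanchored pattern are assumptions made for this example; in particular, the pattern is a simplified variant of the validation pattern above, not a complete email grammar.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"regexp"
	"strings"
)

// emailFinder has no ^ or $ anchors, so it can match inside longer text.
// The domain part is split on dots so that a sentence-ending period
// is not swallowed. Hypothetical pattern for illustration.
var emailFinder = regexp.MustCompile(`[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+`)

// extractEmails reads r line by line and returns every address found.
// bufio.Scanner limits the line length by default (64 KiB).
func extractEmails(r io.Reader) ([]string, error) {
	var found []string
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		// -1 means "return all matches on this line".
		found = append(found, emailFinder.FindAllString(scanner.Text(), -1)...)
	}
	return found, scanner.Err()
}

// extractEmailsFromText is a small convenience wrapper for in-memory text.
func extractEmailsFromText(text string) ([]string, error) {
	return extractEmails(strings.NewReader(text))
}

func main() {
	text := "Ping john@travolta.com.\nnothing here\nCc: nicolas2.cage@hotmail.fr, failure@obvious\n"
	emails, err := extractEmailsFromText(text)
	if err != nil {
		panic(err)
	}
	for _, e := range emails {
		fmt.Println(e)
	}
}
```

The regexp is compiled once at package level and reused for every line, so the per-line cost is the match itself.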
Basic optimization : Early compilation
The regexp package is mainly composed of methods on the Regexp struct. This means that you normally compile your expression first, to get a *Regexp.
However, some methods have a package-level function equivalent that takes the pattern as a string, so explicit compilation is not mandatory.
    func Match(pattern string, b []byte) (matched bool, err error)
    func MatchReader(pattern string, r io.RuneReader) (matched bool, err error)
    func MatchString(pattern string, s string) (matched bool, err error)
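Here is a short sketch of how one of these package-level functions can be called. The helper looksLikeEmail is a hypothetical name introduced for this example; regexp.MatchString itself compiles the pattern on each call and returns any compile error next to the result.

```go
package main

import (
	"fmt"
	"regexp"
)

// looksLikeEmail calls the package-level regexp.MatchString,
// so no *regexp.Regexp has to be created by the caller.
func looksLikeEmail(s string) (bool, error) {
	return regexp.MatchString(`^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$`, s)
}

func main() {
	ok, err := looksLikeEmail("john@travolta.com")
	fmt.Println(ok, err)

	// An invalid pattern is reported through err, not through a panic.
	_, err = regexp.MatchString(`(unclosed`, "anything")
	fmt.Println(err)
}
```

This form is convenient for a one-off check; for repeated checks the pattern is parsed again every time.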
Regexp compilation is not trivial, and these functions compile the pattern again on every call.
Source code quote:
    // The regexp implementation provided by this package is
    // guaranteed to run in time linear in the size of the input.
    // (This is a property not guaranteed by most open source
    // implementations of regular expressions.) For more information
    // about this property, see
    // http://swtch.com/~rsc/regexp/regexp1.html
    // or any book about automata theory.
Once the expression is compiled, the resulting Regexp struct is going to “run in time linear in the size of the input”, which is great news!
But that guarantee only covers matching: the cost of compilation comes on top of it. That is why the basic regexp optimization is early compilation. Try to compile all your regexps once at program start, and run them only when needed.
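A minimal sketch of early compilation, assuming the email pattern from the first example: the regexp is compiled once in a package-level variable and reused by every call. The names validEmail and countValid are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// validEmail is compiled once, when the package is initialised.
// MustCompile panics on an invalid pattern, so a typo in the
// expression is caught at program start rather than mid-request.
var validEmail = regexp.MustCompile(`^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$`)

// countValid reuses the compiled regexp for every line.
func countValid(lines []string) int {
	n := 0
	for _, line := range lines {
		if validEmail.MatchString(line) {
			n++
		}
	}
	return n
}

func main() {
	lines := []string{
		"john@travolta.com",
		"nicolas2.cage@hotmail.fr",
		"failure@obvious",
		" RADICAL FREEDOM !",
	}
	fmt.Println(countValid(lines))
}
```

Calling regexp.MatchString(pattern, line) inside the loop would give the same result, but it would compile the pattern once per line.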
Regexp + Goroutines = Unexpected latencies
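Using regexps in goroutines is a good idea to improve performance.
However…
Regexps are mutexed!
If you use the same Regexp from several goroutines, they are going to spend a significant amount of time waiting for the mutex to unlock.
That is why the regexp library offers a Copy method.
Note that this applies to Go releases before 1.12. Since Go 1.12, the package documentation states that Copy is no longer necessary to avoid lock contention, and the method is marked as deprecated.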
    reg, err := regexp.Compile(test)
    reg2 := reg.Copy()
The code of Copy is pretty simple:

    func (re *Regexp) Copy() *Regexp {
        r := *re
        r.mu = sync.Mutex{}
        r.machine = nil
        return &r
    }
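The one-copy-per-goroutine approach can be sketched as follows. The worker layout and the name countMatches are assumptions made for this example. Also keep in mind that a *regexp.Regexp is documented as safe for concurrent use, and that Copy is deprecated since Go 1.12, so on a recent toolchain the workers could simply share re.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

var sharedEmailRe = regexp.MustCompile(`^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$`)

// countMatches spreads the inputs over several goroutines and
// returns how many of them match re.
func countMatches(re *regexp.Regexp, inputs []string, workers int) int {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	counts := make([]int, workers) // one slot per worker, so no locking is needed
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// Each goroutine works on its own copy, as described above.
			// Deprecated since Go 1.12: using re directly is fine there.
			local := re.Copy()
			for s := range jobs {
				if local.MatchString(s) {
					counts[id]++
				}
			}
		}(w)
	}

	for _, s := range inputs {
		jobs <- s
	}
	close(jobs)
	wg.Wait()

	total := 0
	for _, c := range counts {
		total += c
	}
	return total
}

func main() {
	inputs := []string{"john@travolta.com", "nicolas2.cage@hotmail.fr", "failure@obvious"}
	fmt.Println(countMatches(sharedEmailRe, inputs, 4))
}
```

The copy is made once per worker, outside the matching loop, so its cost is paid once per goroutine rather than once per input.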
…
Go offers a lot of libraries for parsing popular formats.
However, you sometimes need to deal with specific syntaxes: your own file format, a specific security check…
Regexps are a simple and fast way to solve this.
… “Compile in each goroutine” solution, but it avoids the Compile, which, although usually cheap, may not always be. RegexpForSingleGoroutine would need all the methods that Regexp has, which is unfortunate.