It sounds easy enough. How do you use regular expressions to match URLs from anywhere except
If you know a little about regular expressions you will confidently roll up your sleeves and start typing. You may soon discover it’s harder than it looks. I did.
The use case is pretty simple. In JIRA, administrators can specify regular expressions to exclude URLs from acceptable trackbacks. A support request came to me from a guy who was wondering what he was missing. He just wanted to match everything except! Why was this hard?
For the purposes of this challenge, as it met me, we are restricted to the Oro regex library v2.0.8, which implements Perl 5 regexes. Perl 5.6 has some nicer syntax for this problem, but excluded strings can still be tricky. Often in a program you will just negate the match in the main programming language rather than force the regex to “match things that don’t match”. This option wasn’t open to me today.
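For contrast, here is a rough sketch of the "negate it in the host language" option that wasn't available to me. It assumes `java.util.regex` rather than Oro, and `example.com` is a made-up stand-in for the excluded domain; the class and method names are hypothetical too.

```java
import java.util.regex.Pattern;

public class NegateInHost {
    // Hypothetical excluded domain (example.com), used for illustration only.
    // Matches URLs whose host is example.com or any subdomain of it.
    private static final Pattern EXCLUDED = Pattern.compile(
            "^https?://([^/]*\\.)?example\\.com(?:[:/]|$)",
            Pattern.CASE_INSENSITIVE);

    // The regex matches what we want to exclude; Java does the negation.
    static boolean isAllowed(String url) {
        return !EXCLUDED.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("http://www.example.com/page"));
        System.out.println(isAllowed("http://other.org/page"));
    }
}
```

The regex stays a plain positive match, and the `!` does the work that is so awkward to express inside the pattern itself.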
I tried a couple of other approaches, but here’s what I came up with. It’s not perfect, but it works:
The astute readers of Friedl’s owl book will spot the weakness. It could be tricked into not matching URLs from non domains by putting that substring into the path part of the URL like this:
But I’m not too worried about that, as those cases can be handled properly with more of the same gunk.
What’s interesting to me is the combination of a start-anchored non-capturing group with an end-anchored zero-width negative look-ahead assertion.
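To make that combination concrete, here is a minimal sketch of the general technique. It is not the exact pattern from my test harness: it assumes `java.util.regex` instead of Oro, and `example.com` is a hypothetical placeholder for the excluded domain. The second pattern is one guess at what "more of the same gunk" could look like, keeping the look-ahead inside the host part of the URL.

```java
import java.util.regex.Pattern;

public class ExcludeDomain {
    // (?:https?://) is a non-capturing group anchored at the start.
    // (?!.*example\.com) is a zero-width negative look-ahead: it fails the
    // match if "example.com" appears anywhere in the rest of the URL.
    static final Pattern LOOSE = Pattern.compile(
            "^(?:https?://)(?!.*example\\.com).*$");

    // Same idea, but [^/?#]* cannot cross into the path, query, or fragment,
    // so only the host part is inspected. It still rejects lookalike hosts
    // such as notexample.com, so it is a sketch rather than a complete fix.
    static final Pattern HOST_ONLY = Pattern.compile(
            "^(?:https?://)(?![^/?#]*example\\.com).*$");

    static boolean accepts(Pattern p, String url) {
        return p.matcher(url).matches();
    }

    public static void main(String[] args) {
        String tricky = "http://other.org/example.com";
        System.out.println(accepts(LOOSE, tricky));     // weakness: rejected
        System.out.println(accepts(HOST_ONLY, tricky)); // accepted
    }
}
```

With the loose pattern, a URL from another site that merely mentions the excluded domain in its path fails to match, which is the weakness described above; the host-only variant accepts it.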
Does that make me sound smart?
Update: Thanks to Meticulous Matt, I corrected two things. First, the double backslash was actually a Java escape from my test harness. Second, the (?: is actually a non-capturing group, not an assertion. I would hate for everyone to go to their next cocktail party getting that one wrong!
