Regular expressions are extremely powerful. This is something I read at least once or twice every day while reading articles and blogs on the Web.

While browsing today, I found this page which thoroughly describes the use of the regular expression /^1?$|^(11+?)\1+$/ in Perl to check if a number is prime or not!!!

To be frank, I was skeptical. The regular expression looks like magic! And I wanted to understand it better. I rewrote it in Ruby using irb and I tested it:

`Avinash@noulakaz:~$ irb`

irb(main):001:0> def is_prime(n)

irb(main):002:1> ("1" * n) !~ /^1?$|^(11+?)\1+$/

irb(main):003:1> end

=> nil

irb(main):004:0> is_prime(10)

=> false

irb(main):005:0> is_prime(11)

=> true

irb(main):006:0> is_prime(12)

=> false

irb(main):007:0> is_prime(13)

=> true

irb(main):008:0> is_prime(99)

=> false

irb(main):009:0> is_prime(100)

=> false

irb(main):010:0> is_prime(101)

=> true

Great! It also works in Ruby! This means that there is no (Perl) magic going on. The regular expression really works. But how? Let’s try to follow the logic behind it.

**Is 7 prime?**

To know this, the function first generates “1111111” (from “1” * 7) and tries to see if that string **does not match** /^1?$|^(11+?)\1+$/. If there is no match, then the number is prime.

Notice that the regular expression has two parts (separated with the vertical bar |).

The first part is /^1?$/ is trivial and matches with beginning of line (^), an optional 1 (1?)

and end of line ($) which implies that it matches either the empty string or “1”. This simply indicates that calling that function with n==0 or n==1 will correctly return false (as the “1” * n will match with the first part of the regular expression)

The second part is where the magic occurs…

/^(11+?)\1+$/ matches with beginning of line (^) then by (11+?) then by \1+ and finally by end of line ($). I guess that you know that \1 is a variable which is bound to whatever was matched previously (in our case by (11+?)).

Let’s proceed slowly…

(11+?) does two things

- It matches with “1” followed by one or more ones
**minimally**. This means that it matches with “11” initially (notice that if there was no ? (i.e. (11+) was used instead, the whole string would have matched) - The string obtained (“11” initially) is bound to the variable \1.

\1+ then matches with whatever has been matched above (“11” initially) **repeated one or more times**. If this match succeeds then the number is not prime.

If you are following, you’ll realise that this eliminates all even numbers except 2 (for example, 8 is “11111111” and therefore (11+?) will match with “11” and \1+ will match with “111111”).

As for odd numbers (in our case 7), the (11+?) matches with “11” initially but \1+$ cannot be true (notice the $) as there are 5 remaining ones. The regular expression engine will **backtrack** and will make (11+?) match “111” and here also \1+$ won’t be true because there will be 4 remaining ones (and \1+$ will only match with a number of ones which is a multiple of 3 followed by end of line) etc. hence “1111111” will not match the regular expression which implies that 7 will be considered as being prime :-)

When I showed this to Christina this morning (true), she told me that this only checked for a number being odd or not. This is also what I felt at the beginning. But it really works. For instance, let’s try to apply it to 9 (which is obviously not even), “1” * 9 should match the regular expression…

“1” * 9 = “111111111”. (11+?) matches “11” initially. \1+$ cannot match because there are 7 remaining ones. Backtracking occurs. (11+?) now matches “111”. And here \1+$ matches the remaining 6 remaining ones! Hence, 9 is not prime.

Easy… and beautiful at the same time ;-)

[Join the discussion on Hacker News]

unfortunately, I am quite allergic to regular expressions… too many bloody flavors to make any sense in the end.

I fully understand. I was quite allergic to regexpr too. But little by little I have come to like and use them a lot.

Let’s see, I use regexpr when working with vi, grep, sed, awk and, now, ruby…

I’ve been trying to write a bbcode parser for wordpress µ to cure it of kses.php while keeping the entries of the bloggers “safe”. So I have to figure out how to use regex.

So i guees i too will have to learn to like it.

It’s like voodoo.

And I suppose you have realised that it is a small programming language… so it’s bound to be cool :-)

Programming Quiz No. 2

A serial file named contains a series of values, each of which is how many prime numbers ther are below a particular value. Write a program to read the number of primes between and values where and are requested interactively and compare it to the value obtained from FUNCT(limit) which is defined as limit / LOG(limit).

Hi,

Seems that some of the text went missing… or you were seriously drunk :-)

Cool.. thx for your post. I like matematics quiz :P

nice trick!

I find that the “backtracking” part was equivalence to finding out if the number is prime by brute-forcing

e.g. going from 3 to half the number and if any of that is divisible, its not a prime.

thus the ^(11+?)\1+& trick (anyone got it?)

i believe there are a lot of fast algorithms out there to use instead of generating 1s and parsing them with a regular expression which is kinda slow.

however this is truly magic ^^

hey avinash, what is the exact input that goes to the regex constructor. I am not a ruby programmer but my attempt to use this regex in javascript failed. The test always failed for both prime and composite numbers. If you know javascript please help show me a sample to use this in javascript pls

the trick is to apply the regex to a string with n 1’s, for example, for 5, apply the regex to 11111.

A PHP Version:

!preg_match(‘/^1?$|^(11+?)\1+$/x’, str_repeat(‘1’, $n))

Example:

<?php

$n = 1;

while ($n < 1000000) {

if (!preg_match(‘/^1?$|^(11+?)\1+$/x’, str_repeat(‘1’, $n))) {

echo “$n: Prime\n”;

}

$n++;

}

?>

C# version:

static void Main(string[] args)

{

int c;

if (args.Length == 1 && int.TryParse(args[0], out c))

Console.WriteLine(!Regex.IsMatch(new String(‘1’, c), @”^1?$|^(11+?)\1+$”));

}

Java Version, nice tip btw, guess I should learn more about regular expressions. Java requires that the \ be \’d out, otherwise it sees it as a smiley face character and problems happen :P.

public static void main(string[] args){

if (args.length >= 1)

{

int c = Integer.parseInt(args[0]);

String s = “”;

while (c > 0) {c–; s+=”1″;}

System.out.println(!Pattern.matches(“^1?$|^(11+?)\\1+$”, s));

}}

This is just awesome…

Hello,

To be honest – regular expressions are “overused”… And are NOT that good. For example in PHP using just common str_replace(); would be in many cases much faster then preg_replace();

Problem is that people tend to use a lot of “sophisticated” code where it’s not necessary, only because it looks more “professional”. My thought on this is “I don’t care what others think – I just need my code to be fast, clean and as simple as it is possible”.

True.

Sometimes regular expressions are extremely cryptic (like the one above in fact ;-) ) but they are also interesting to understand because they are essentially small programs written in a domain specific language.

The magic is in the backtrack…. if it can find a number a=(11+?) that multiplied N times gives our number, it’s obvious that our number can’t be prime (our number = a*N, i.e. not prime)

How many numbers does this count up to? whats the largest prime that can be found in *real* time?

I’m interested so mail me please.

Hi Roscoe,

I’ve not done any performance test but I guess that if you use a compiled language like C with a good Regex implementation, then, maybe, you can generate prime numbers real fast.

On the other hand, if you really want to generate primes then have a look at The Sieve of Eratosthene.

Avinash. Super, I’ve been looking around for a Regex similar to this for a while. I’ve modified it to suit.

Hi Paul,

Great to know that you find this regular expression useful. I’d like to point out that I’m not the original author though ;-)

if primenumber mod 2 is 0 then

isprime = true

wouldnt that be easier and faster? ^^

sorry, its getting late here heh

if primenumber mod 2 != 0 then

isprime = true

Hi Rasmus,

A prime number is not simply odd. Check this.

I have never been able to understand regular expressions, what is the logic behind them?

Hi Celina,

Regular expressions are not trivial. But they are very powerful. If you want to know more about them, this is a good start.

A lot of people also recommend this book.

While it’s neat, it’s not a great idea to use for big numbers since it uses O(N^2) time and O(N) memory where even a trivial solution uses O(sqrt(N)) time and O(1) memory.

True :-)

I wonder if someone has actually done some benchmarking with large numbers?

This regular expression does not use O(1) and O(N^2), its actually exponential in the size of the input. If it unpacks the number 7 (32 bit) to 7 1’s, thats 7*(32). So a 32 bit number 2147483648 would take up half a gig of memory.

Please follow the link below for fact about primes

http://kadiprimality.blogspot.com/

Pls give me regex to match filename that doesn’t start with a number and have html extension.

Maybe something like “^[^0-9].*html$”

Notice that the two ^ have very different meanings. The first one is an anchor which represents the beginning of the line. The second one is the “inverse” operator i.e. [^0-9] matches with any character excepts a digit.

You could also try this: “^\D\w+.html”

((^)Start of line – (\D)Not a digit – (\w+)several letters – (.html)end with .html)

Works with the .NET flavor of regex…

for calculations with big numbers, try F#

try this: 999999 on the above php regexp it says prime….

it isn’t ….

So it seems that there is a bug lurking somewhere…

I don’t think this is formally a regular expression, because of “?” character. The processing of a real regular expression never requires backtracking (i.e. never requires reading the input multiple times) and hence they are very fast once you generate the deterministic finite automata (DFA) for the given regular expression.

Programming languages usually provide extensions to regular expressions (like ? in this case) for convenience to the programmer. There are only 3 operators in formal regular languages, which are star (“*”), union (“|”) and concatenation. All other operators used in a regular expressions must be representable by using these 3, for example you can represent 0+ as 00* and so on. The operator “?” used in the above expression is something totally different and which in fact violates the rule that machines that process a regular expression can do so with finite amount of memory. Hence this expression is actually not regular and the language generated by it is not a regular language, mathematically.

Mathematically, you may be right (but I am not so sure.) But, from a programming point of view, regular expressions have always allowed backtracking…

php may reject large numbers silently due to recursion depth limit (which I think is 32767 levels), that’s why 999999 fails.

Thanks for pointing this out…

WOW, pretty cool! I have to revisit regular expressions it seems. :)

a language such that a string is made of binary alphabet( of 0s and 1s) and the number of zeros in the string is a 3 digit prime number…….is this language regular…??

can u pls help me out in this ……its a ques appeared in GATE entrance

this was good shit. thanks!

This is really intriguing, but I’m unclear about your explanation about how it works (it definitely works). For example, try the number 49 (not prime) – 1111111111111111111111111111111111111111111111111 – neither (11)+ matches nor (111)+ matches, so you would think the regex would classify 49 as prime (it does not). So how does it really work?

I would rather see pattern like this: “^\d*[13579]$” in my scripts.

It’s clear, easy and I think it’s also faster then the one described. If you want negative numbers it’s easy to change.

Anyway it’s nice work and idea!

David

Oh, hell – my bloody English. Forgot my previous note :( Shit

Wow, that’s impressive. A colleague of mine showed me this regular expression, and I couldn’t really believe it was possible. I’ve never really learned regular expressions, but this has definitely encouraged me to start learning.

@Nithin

Yeah. They are cool. Personally, I practically never use the very complex ones (like the one above) but I routinely use the more usual ones especially when I’m using vi or TextMate on my MacBook.

Nice!!

I implemented it on my website: http://ap2cu.com/tools/prime

This is working great

You can also check my REGEX online tool: http://ap2cu.com/tools/regex

Thanks for this post!

@Pras Sarker: For your example 49, it works because it matches (1111111+) and \1+ (7×7). Basically it starts by matching 11+ (2 times something), then 111+ (3 times something) then 1111+ (4 times something). It starts at 2 and looks at each succeeding number to see if it is a factor (by matching it and then matching multiples of it). But like everyone said this method is very very slow, the bigger the number you are checking the worse the performance, by a LOT.

@Makovec: You and several other people don’t seem to understand what a prime number is. A prime number is not an odd number. The number 21 would match your simplified regexp because it is odd (ends in 1, matching your expression) but it should FAIL the prime number check because it has factors 3 and 7 (3×7=21). You can’t simply look for odd numbers.

This is probably the simplest way to check for prime numbers within a single regexp for sure, but because there are so many non-regexp ways to do it (which are MUCH more efficient), this is just a novelty and not actually useful for any real world application.

Still WAY cool though!! :)

Exactly :-)

Did you notice that you could do this also:

def is_prime(n)

(100-2)%2.0 != 0

end

it’s just amazing, and really easy

awesome..cool..another feather to my knowledge…thanks for sharing..

wait, my code doesn’t work.

This (11+?)\1+$ is more complex than i thought

You’re right.

The regular expression does not test for an odd number. Instead, it tests all divisors which is way more complex and, dare I say, more beautiful :-)

Interesting expression, but I think the real power of regular expressions is that they are effective, because they are implemented as finite state automates (FSA). Although this one cannot be implemented like such, because of this backtracking. In theory FSA and regular expressions CANNOT check for prime numbers (check Pumping Lemma ;) ). I think such implementations of RE are wrong, because they obviously don’t guarantee you the O(n) complexity of FSA and can become very inefficient :(

@Paulo:

Assuming you intend to return the result of that equation, you are effectively returning false all the time, as you simply take the modulus of 98 with respect to 2.0.

I’m not really sure where you’re headed with that but it looks like you too have confused primes with odd numbers.

JavaScript one:

var n = 5; !/^1?$|^(11+?)\1+$/.test((function(n){s=”;while(n–)s+=’1′;return s;})(n))

For the love of God … stop calling it a regular expression because it is NOT a regular expression. Awesome as it may be, this is just a recursive function written using (mostly) only symbols that are often associated with regular expressions.

It REALLY is a proven fact that no regular expression can be devised that accepts all and only prime numbers.

I truly love the idea though.

@Craig

after i wrote the code, I’ve noticed that it just tests for odd numbers. I should pay more attention before posting.

Actually, I didn’t understand how (11+?)\1+$ really works, there is any math paper about this approach to test the primality of number? I’ve never seen that before

@Paula The key is to understand that the matching is being done with a string of ones of lenght N, the number being tested for primality.

Doesn’t work? http://rubular.com/r/PGckdeL1Gq

The RE should be matched with a string of ones of length N, the number being tested.

ahhh, I wan’t paying attention. That’s what I get for skimming over your blog post…

Great work!

Regular expressions aren’t powerful. http://en.wikipedia.org/wiki/Chomsky_hierarchy#The_hierarchy In fact they’re strictly less powerful than most languages which are turing-complete.

:|

Is there anything regex can’t do?

:|

I had a hard time understanding the algorithm used. Your article doesn’t make it clear.

The algorithm detects if the input string of 1 can be split in 2 or more shorter strings of same length and with at least two 1.

It will obviously work, but it’s just a curiosity. Not very efficient.

You might also find amuzing my rather silly article on calculating fibonacci number with regexes:

http://nick.zoic.org/2010/06/01/fibonaci-regex-perversity/

Bogdan @72 is right, it’s not really a “Regular Expression” as such, but I think it is still appropriate to call it a Regex because most regex engines support this kind of silliness :-).

This is a terrible idea. Just because you can do it, it does not mean you should. I am sure I can hammer down a nail with my $2000 guitar :)

Yeah. Regular expressions are not very efficient. But you have to agree that they have a great “wow” effect :-)

fan++ to this blog. :) Awesome post, i’m recently into extensive usage of reg-ex and this works like a charm :) thank you.

I tried the Java version(given by Binks) –

public static void main(string[] args){

if (args.length >= 1)

{

int c = Integer.parseInt(args[0]);

String s = “”;

while (c > 0) {c–; s+=”1″;}

System.out.println(!Pattern.matches(“^1?$|^(11+?)\\1+$”, s));

}}

Doesnt work for me. Can anyone help?

@Ketwaroo

Just use HTMLPurifier.

A very silly, useless algorithm, beautifully expressed. :-p

Try this instead.

public static void main(String[] args){

String[] nums = {“0”, “1”, “3”};

if (nums.length >= 1)

{

int c = Integer.parseInt(nums[2]);

String s = “”;

while (c > 0)

{

c –;

s += “1”;

}

System.out.println(!Pattern.matches(“^1?$|^(11+?)\\1+$”, s));

}

}

Hi,

I think there’s a lot of confusion about what a Regular Expression is (and what it can do)

I don’t know where to start.

The “magic pattern” shown above is not a Regular Expression, but a “regex”, just an input string that a “specific” pattern-patching-engine can process.

“Regular expressions describe regular languages in formal language theory. They have thus the same expressive power as regular grammars. Regular expressions consist of constants and operators that denote sets of strings and operations over these sets, respectively.” – Wikipedia

“Many features found in modern regular expression libraries provide an expressive power that far exceeds the regular languages. For example, many implementations allow grouping subexpressions with parentheses and recalling the value they match in the same expression (backreferences). This means that a pattern can match strings of repeated words like “papa” or “WikiWiki”, called squares in formal language theory. The pattern for these strings is (.*)\1.

…

Many tools, libraries, and engines that provide such constructions still use the term regular expression for their patterns. This has led to a nomenclature where the term regular expression has different meanings in formal language theory and pattern matching. For this reason, some people have taken to using the term regex or simply pattern to describe the latter” ” – Wikipedia

This is not just a linguistic problem.

Looks at algorithm to evaluate “regexes”:

“The third algorithm is to match the pattern against the input string by backtracking. This algorithm is commonly called NFA, but this terminology can be confusing. Its running time can be exponential, which simple implementations exhibit when matching against expressions like (a|aa)*b that contain both alternation and unbounded quantification and force the algorithm to consider an exponentially increasing number of sub-cases. This behavior can cause a security problem called Regular expression Denial of Service – ReDoS, which might be used by hackers who want to attack a regular expression engine.” – Wikipedia

And our “magic pattern” is (11+?)\1+

Now. because an input string, to be declared as “not matching”, must not match ALL of possible paths, the poor engine must evaluate all of them. Here is the problem.

Imagine now to insert a very low number (around a million).

Poor engine should evaluate a string long a million 1s!! Oh my god…

So please, use this funny regex just for study, or to understand how a backtracking engine works, but again, please, think of your children, don’t try this at home.

You’re 100% right.

This regular expression, or more precisely, this

regexis pathetically inefficient yet so beautiful in a geekish sort of way :-)WOW, this is awesome!

Amazing, but not fast at all with big numbers =p

Imagine you have to verify a larger number, like 4294967293 =D

Even I am pretty interested in regular expressions. I learned them by using in VI and linux but now I find its use a lot while coding in php.

Hi,

I didn’t go through all the comments above, but I don’t think anyone actually explained why the regexp works, so I’ll give it a try.

The secret is in the representation of the number. The number is represented as a sequence of ones (‘1’) as long as the number. What the regexp does is try to see if this

sequence of ones “can be broken down into 2 or more equal parts”.

Wait a second, that’s exactly like saying the number is divisible. The dividers are the length of the substring matched by (11+?) and the number of times it occurs in the original string.

By the way, I agree that this is extremely inefficient.

I also agree with @avinash that it’s “beautiful in a geekish sort of way”.

I ran some benchmarks: http://www.chipsquips.com/?p=2418&cpage=1#comment-265857

It’s far too inefficient for practical use, but it’s mathematically brilliant.

Awesome. I love regex. Your regex makes perfect sense from the start. Have you ever tried to draw a table 7-cells wide and fill it row by row with numbers 1-2-3-4 (to infinity – lol) then circle primes? You’ll see beautiful diagonal patterns.

I’ll do that tomorrow, Svetlana :-)

Very cool trick and probably useful for small prime number tasks but of course it’s just not feasible for any really heavy math. Neat trick though.

Everybody loves primes!

Better JS version:

function isPrime(val) {

return !/^1?$|^(11+?)\1+$/.test((new Array(val + 1)).join('1'));

}

# script to check wheather no. is ‘composite no’.

# if nothing print’s then given no. is ‘prime’.

my $no = $ARGV[0];

for(my $i=2; $i<=$no; $i++){

my $result=($no/$i);

if($result=~/^[0-9]+$/ && $result=~/[^1]/){

print"$result \* $i\n";

print"$no is composite number";

exit;

}

}

Thanks for sharing!

I think this is cool. But it’s not magic. What the regex matcher is doing is that it’s trying to split a unary number into equal parts. 11+? is like a dividend and \1+ is repeater of the dividend. If no dividend exists for a unary representation of a number then it is prime. So it will always work, although it’s not different from checking whether given number is divisible first by 2 then by 3 and then by 4 and so on.

Almost agree with @cetin, but he misses some points. He is right that the regular language is some language for whom we are able to construct the DFA. Most of today’s regular expression flavours use NFA, but that doesn’t make the irregular since for every NFA we can construct the corresponding DFA. Regarding the backtracking, this is naturally behavior in NFA, since we have to evaluate all the paths of the NFA.

I am a little bit frustrated that only one person mention UVW theorem. Indeed you can use the Pumping lemma to prove that is irregular or regular language: http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages

Thanks Asen for your comment. I don’t understand everything you write but I feel you are right :-)

That is so sexy. OMG.

I love this, but it’s not a regular expression (not formally, and not even by Perl standards). The regex backtracks and counts, which is not possible for a formal regular expression, and wouldn’t work at all without the implicit loop that turns the number into a string of 1’s.

But a thing of beauty nontheless.

http://blog.kenficara.com/2013/06/30/irregular-language-and-regular-expressions/

Since I’m a web developer (Front-end and Back-end), I always read articles concerning my field. So I was reading about the “Adventures in Javascript development” by Rebecca Murphey. Her blog was powered by Octopress, a blogging framework for hackers. I went to find out more about Octopress and while going through it’s documentation and plugins, I found the actor in this Noulakaz article, that is the regex, being mentioned + a link to the source article (this one) added in the comments.

Here’s the plugin page: http://octopress.org/docs/plugins/backtick-codeblock/

Congrats Avinash !

Thanks for noticing. This post on regular expressions is definitely the most popular post of this blog. I’m happy I’ve managed to write something people find useful. I’m not the one who found the regular expression though :-)

I’m currently working on a ruby gem to generate *examples* of given regular expression. For example:

/this|is|awesome/.examples # => [“this”, “is”, “awesome”]

So, I tried it out on this “prime number validator” regex:

/1?|(11+?)\1+/.examples

#=> [“”, “1”, “1111”, “111111”, “11111111”]

/1?|(11+?)\1+/.examples(max_repeater_variance: 50, max_group_results: 100).uniq.sort.map(&:length).first(10)

#=> [0, 1, 4, 6, 8, 9, 10, 12, 14, 15]

require ‘prime’

Array(1..100) – /1?|(11+?)\1+/.examples(max_repeater_variance: 50, max_group_results: 3000).uniq.sort.map(&:length) == Prime.first(25)

#=> true

Horribly inefficient, but pretty cool! Here’s my gem if case anyone’s interested: https://github.com/tom-lord/regexp-examples

Thanks Tom for sharing.

This thing does not describe a regular language. I refuse to call it a regular expression. Backreferences are not part or ordinary regular expressions.

The language of strings that have prime length over the alphabet {1} is not regular. This follows directly from the pumping lemma.

This means, that a “regex” that find primes relies on extensions and we cannot expect it to work in general.