Kinda related but I'm looking for something that could give me the number of pos...

contravariant · on Sept 11, 2023

I feel like it shouldn't be too hard to calculate from the finite automaton that encodes the regular expression, but surely in most cases it will simply be infinite?

tetha · on Sept 11, 2023

This is hitting back a long time. But the algorithm - if I recall right - is a simple DFS on the determinstic automaton for the regular expression and it can output the full set of matching strings if you're allowed to use *s in the output.

Basically, you need an accumulator of "stuff up to here". If you move from a node to a second node, you add the character annotating that edge to the accumulator. And whenever you end up with an edge to a visited node, you add a '*' and output that, and for leaf nodes, you output the accumulator.

And then you add a silly jumble of parenthesis on entry and output to make it right. This was kinda simple to figure out with stuff like (a(ab)*b)* and such.

This is in O(states) for R and O(2^states) for NR if I recall right.

kadoban · on Sept 11, 2023

Maybe the number of possible matchings for a given length (or range of lengths) might be interesting?

microtonal · on Sept 11, 2023

Say you want to compute all strings of length 5 that the automaton can generate. Conceptually the nicest way is to create an automaton that matches any five characters and then compute the intersection between that automaton and the regex automaton. Then you can generate all the strings in the intersection automaton. Of course, IRL, you wouldn't actually generate the intersection automaton (you can easily do this on the fly), but you get the idea.

Automata are really a lost art in modern natural language processing. We used to do things like store a large vocabulary in an deterministic acyclic minimized automaton (nice and compact, so-called dictionary automaton). And then to find, say all words within Levenshtein distance 2 of hacker, create a Levenshtein automaton for hacker and then compute (on the fly) the intersection between the Levenshtein automaton and the dictionary automaton. The language of the automaton is then all words within the intersection automaton.

I wrote a Java package a decade ago that implements some of this stuff:

https://github.com/danieldk/dictomaton

contravariant · on Sept 11, 2023

> deterministic acyclic minimized automaton

That's basically a Trie right? To be fair I have only heard of them and know they can be used to do neat tricks, I've rarely used one myself.

Someone · on Sept 11, 2023

If you do not plan to update it often and don’t need to store extra data with each word, a dawg (https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_s...) is more compact. You often can merge leaf nodes.

For example, if you have words

   talk
   talked
   talking
   talks
   walk
   walked
   walking
   walks

there’s no need to repeat the “”, “ed”, “ing”, “s” parts.

microtonal · on Sept 11, 2023

No. In a minimized automaton shared string suffixes also share states/transitions.

abecedarius · on Sept 11, 2023

That was one of the short examples in Norvig's Python program-design course for Udacity. https://github.com/darius/cant/blob/master/library/regex-gen... (I don't have the Python handy.)

aebtebeten · on Sept 11, 2023

see https://www.cs.dartmouth.edu/~doug/nfa.pdf

d66 · on Sept 11, 2023

the page actually does give these. for α := [a-z]{2,4} the page gives |α| = 475228.

however, as others have pointed out any non-trivial use of the kleene star means the result will be ∞. in this case the page will list numbers that roughly correspond to "number of strings with N applications of kleene star" in addition to infinity.

rntz · on Sept 11, 2023

Here's a simple Haskell program to do it:

(EDIT: this code is completely wrongheaded and does not work; it assumes that when sequencing regexes, you can take the product of their sizes to find the overall size. This is just not true. See reply, below, for an example.)

    -- https://gist.github.com/rntz/03604e36888a8c6f08bb5e8c665ba9d0

    import qualified Data.List as List

    data Regex = Class [Char]   -- character class
               | Seq [Regex]    -- sequence, ABC
               | Choice [Regex] -- choice, A|B|C
               | Star Regex     -- zero or more, A*
                 deriving (Show)

    data Size = Finite Int | Infinite deriving (Show, Eq)

    instance Num Size where
      abs = undefined; signum = undefined; negate = undefined -- unnecessary
      fromInteger = Finite . fromInteger
      Finite x + Finite y = Finite (x + y)
      _ + _ = Infinite
      Finite x * Finite y = Finite (x * y)
      x * y = if x == 0 || y == 0 then 0 else Infinite

    -- computes size & language (list of matching strings, if regex is finite)
    eval :: Regex -> (Size, [String])
    eval (Class chars) = (Finite (length cset), [[c] | c <- cset])
      where cset = List.nub chars
    eval (Seq regexes) = (product sizes, concat <$> sequence langs)
      where (sizes, langs) = unzip $ map eval regexes
    eval (Choice regexes) = (size, lang)
      where (sizes, langs) = unzip $ map eval regexes
            lang = concat langs
            size = if elem Infinite sizes then Infinite
                   -- finite, so just count 'em. inefficient but works.
                   else Finite (length (List.nub lang))
    eval (Star r) = (size, lang)
      where (rsize, rlang) = eval r
            size | rsize == 0 = 1
                 | rsize == 1 && List.nub rlang == [""] = 1
                 | otherwise = Infinite
            lang = [""] ++ ((++) <$> [x | x <- rlang, x /= ""] <*> lang)

    size :: Regex -> Size
    size = fst . eval

NB. Besides the utter wrong-headedness of the `product` call, the generated string-sets may not be exhaustive for infinite languages, and the original version (I have since edited it) was wrong in several cases for Star (if the argument was nullable or empty).

sebzim4500 · on Sept 11, 2023

Surely that fails for e.g. a?a?a?. I'd imagine you could do some sort of simplification first though to avoid this redundancy.

rntz · on Sept 12, 2023

You're correct, and I don't see any good way to avoid this that doesn't involve enumerating the actual language (at least when the language is finite).

Oof, my hubris.

rntz · on Sept 12, 2023

It turns out to be not that hard to just compute the language of the regex, if it is finite, and otherwise note that it is infinite:

    import Prelude hiding (null)
    import Data.Set (Set, toList, fromList, empty, singleton, isSubsetOf, unions, null)

    data Regex = Class [Char]   -- character class
               | Seq [Regex]    -- sequence, ABC
               | Choice [Regex] -- choice, A|B|C
               | Star Regex     -- zero or more, A*
                 deriving (Show)

    -- The language of a regex is either finite or infinite.
    -- We only care about the finite case.
    data Lang = Finite (Set String) | Infinite deriving (Show, Eq)

    zero = Finite empty
    one = Finite (singleton "")

    isEmpty (Finite s) = null s
    isEmpty Infinite = False

    cat :: Lang -> Lang -> Lang
    cat x y | isEmpty x || isEmpty y = zero
    cat (Finite s) (Finite t) = Finite $ fromList [x ++ y | x <- toList s, y <- toList t]
    cat _ _ = Infinite

    subsingleton :: Lang -> Bool
    subsingleton Infinite = False
    subsingleton (Finite s) = isSubsetOf s (fromList [""])

    eval :: Regex -> Lang
    eval (Class chars) = Finite $ fromList [[c] | c <- chars]
    eval (Seq rs) = foldr cat one $ map eval rs
    eval (Choice rs) | any (== Infinite) langs = Infinite
                     | otherwise = Finite $ unions [s | Finite s <- langs]
      where langs = map eval rs
    eval (Star r) | subsingleton (eval r) = one
                  | otherwise = Infinite

mikhailfranco · on Sept 12, 2023

Another interesting question is: how many possible successful matches are there for a given input string. For example:

How many ways can (a?){m}(a*){m} match the string a{m}

i.e. input m repetitions of the letter 'a'.

https://github.com/mike-french/myrex#ambiguous-example

The answer is a dot product of two vectors sliced from Pascal's Triangle.

For m=9, there are 864,146 successful matches.

someguy101010 · on Sept 11, 2023

might be something like this https://jvns.ca/blog/2016/04/24/how-regular-expressions-go-f... which refs https://en.wikipedia.org/wiki/Brzozowski_derivative

Drup · on Sept 11, 2023

https://regex-generate.github.io/regenerate/ (I'm one of the authors) enumerates all the matching (and non-matching) strings, which incidentally answers the question, but doesn't terminate in the infinite case.

clord · on Sept 11, 2023

I feel like it might be possible with dataflow analysis. Stepping through the regex maintaining a liveness set or something like that. Sort of like computing exemplar inputs, but with repetition as permitted exemplars. Honestly probably end up re-encoding the regex in some other format, perhaps with 'optimizations applied.'

stvltvs · on Sept 11, 2023

The answer is usually an infinite number, except for very, very simple cases. Anything involving * for example means infinity is your answer.

skulk · on Sept 11, 2023

I wonder if it makes sense to compute an "order type" for a regexp. For example, a* is omega, a*b* is 2 omega.

https://en.m.wikipedia.org/wiki/Order_type

https://en.wikipedia.org/wiki/Ordinal_number

contravariant · on Sept 11, 2023

I think that's just the ordinal corresponding to lexicographic order on the words in the language, so yeah that should work. I wonder how easy it is to calculate...

d66 · on Sept 11, 2023

normally you would use an ordinal number [1] to label individual elements of an infinite set while using a cardinal number [2] to measure the size of the set.

i believe the cardinality of a set of words from a finite alphabet (with more than one member) is equivalent to the cardinality of the real numbers. this means that the cardinality of .* is c.

unfortunately, i don't think that cardinality gets us very far when trying to differentiate the "complexity" of expressions like [ab]* from ([ab]*c[de]*)*[x-z]*. probably some other metric should be used (maybe something like kolmogorov complexity).

[1] https://en.wikipedia.org/wiki/Ordinal_number

[2] https://en.wikipedia.org/wiki/Cardinal_number

contravariant · on Sept 11, 2023

I wouldn't say that's their 'normal' usage, I mean sure you can use them like that but fundamentally ordinal numbers are equivalence classes of ordered sets in the same way that cardinal numbers are equivalence classes of sets.

As you've rightly noted the latter equivalence class gets us nothing so throwing away the ordering is a bit of a waste. Of all mathematical concepts 'size' is easily the most subjective so picking one that is interesting is better than trying to be 'correct'.

In particular a*b* is exactly equivalent to ω^2, since a^n b^m < a^x b^y iff n < x or n=x and m<y. This gives an order preserving isomorphism between words of the form a^n b^m and tuples (n,m) with lexicographic ordering.

d66 · on Sept 11, 2023

interesting!

what would [ab]* be? for computing an ordinal number the only real difficulty is how to handle kleene star: given ord(X) how do we calculate ord(X*)?

but as you probably noticed i'm a bit out of my depth when dealing with ordinals.

contravariant · on Sept 11, 2023

Unless I'm very mistaken adding onto the end of a string doesn't affect lexicographic order, so that's effectively [ab]*. The ordinality of [ab] is simple, it's 2, but the kleene star is a bit of an odd one, it's not quite exponentiation.

To reason about the kleene star it's a bit simpler to consider something like R^*n, where you repeat up to n times. Obviously R^*0 = 1 and R^*S(n) can be built from R^*n by picking an element of R^*n and appending either nothing or an element of R, here the element of R^*n determines most of the order and 'nothing' orders in front. For technical reasons the corresponding ordinal is (1+R) R^*n, which is backwards from how you'd expect it and how you'd normally define exponentiation.

The kleene star can be recovered by taking the limit, identifying R^*n with it's image in R^*S(n). Which also doesn't quite work as nicely as you'd hope (normally the image is just a downward closed subset, it's not in this case).

I think [ab]* is equivalent to something like the rational part of the Cantor set. Not sure if there's a simpler way to describe it, it's nowhere near as simple as 2^ω, which is just ω.

contravariant · on Sept 12, 2023

Hmm, turns out this fails to be an ordinal because regular languages aren't well-ordered (except if you reverse the lexicographic order, maybe?). They are what is called an order type, and it looks like it should be possible to identify them with subsets of the rationals, if you wanted to.

Perhaps reversing the lexicographic order makes more sense, in that case longer tuples simply order last so R^* = 1 + R + R^2 + ..., the limit here is much easier since R^*n = 1 + R + ... + R^n is downwards closed as a subset of R^*S(n).

Then again in that scenario [ab]* is simply ω because it is effectively the same as just writing an integer in binary, so it is less interesting in a way.

skulk · on Sept 12, 2023

I would say [ab]* has order type omega. That's because the set of strings it represents is: {epsilon, ab, abab, ababab, abababab, ...} which looks a lot like omega. (epsilon = empty string)

What this exercise really is is finding a canonical way to order a regular language (the set of strings a regexp matches). For example, a*b* could be {epsilon, a, aa, aaa, aaa, aaaa, ..., b, bb, bbb, bbbb, bbbbb, bbbbbb, ..., ab, aab, aaab, aaaab, ..., abb, aabb, ..., ...} which looks a lot like omega ^ 2 (not 2 * omega like I said before). However, you could also re-arrange the set to look like omega: {epsilon, a, b, aa, bb, ab, aaa, bbb, aab, abb, bbb, ...} (strings of length 1, length 2, length 3, etc)

I propose the following: for any two strings in the regular language, the one that comes first is the one whose kleene-star repetition counts come first lexicographically. More concretely, for the language a*b*, aaaab represents a repetition count of (4, 1) which and bbb represents (0, 3). (0, 3) comes before (4, 1) lexicographically, so bbb comes before aaaab. This admits the following ordering: {epsilon, b, bb, bbb, bbbb, ..., a, abb, abbb, abbbb, ..., aa, aab, aabb, aabbb, ..., ...} which is omega ^ 2 which "feels" right to me. Another rule is for the regular language (X|Y) and two strings x from X and y from Y, x should always come before y in our ordered set representation.

Hold on, what about nested kleene-stars? (a*)* is just a*, but (a*b*)* is distinct from a*b*. However, the "kleene-star counts" analysis from above breaks down because there are now multiple ways to parse strings like aab. I don't really know how to classify these regular languages as ordinals yet.

I don't really see any useful applications of this, but it's still fun to think about. The game I'm playing is thinking about a random ordinal and trying to come up with a regular language that, under my ordering rule above, looks like that ordinal. Let's try 2 * omega (which looks like this: {0, 1, 2, 3, 4, 5, ..., omega, omega + 1, omega + 2, omega + 3, ...} e.g. 2 copies of omega "concatenated"):

a*|b* = {epsilon, a, aa, aaa, aaaa, aaaaa, ..., b, bb, bbb, bbbb, ...} => 2 * omega.

Some more examples:

omega ^ 3: a*b*c*

omega ^ 2 + omega: a*b*|c*

Maybe we can write down some composition rules:

let X and Y be regular languages and ord(X) and ord(Y) be their ordinal representations. Then,

X|Y => ord(X) + ord(Y)

XY => ord(X) * ord(Y)

X* => ord(X) * omega

I haven't checked if these actually work, this is just a long rambly comment of dubious mathematical value.

wnoise · on Sept 20, 2023

[ab] is a character class, not a parenthesized catenation. It's (a|b), not (ab). So [ab]* is {epsilon, a, b, aa, ab, ba, bb, ...}

pimlottc · on Sept 11, 2023

What’s your use case?

simlevesque · on Sept 13, 2023

Calculate how long it takes to bruteforce something matching a regexp.