I feel like it shouldn't be too hard to calculate from the finite automaton that encodes the regular expression, but surely in most cases it will simply be infinite?
This is hitting back a long time. But the algorithm - if I recall right - is a simple DFS on the determinstic automaton for the regular expression and it can output the full set of matching strings if you're allowed to use *s in the output.
Basically, you need an accumulator of "stuff up to here". If you move from a node to a second node, you add the character annotating that edge to the accumulator. And whenever you end up with an edge to a visited node, you add a '*' and output that, and for leaf nodes, you output the accumulator.
And then you add a silly jumble of parenthesis on entry and output to make it right. This was kinda simple to figure out with stuff like (a(ab)*b)* and such.
This is in O(states) for R and O(2^states) for NR if I recall right.
Say you want to compute all strings of length 5 that the automaton can generate. Conceptually the nicest way is to create an automaton that matches any five characters and then compute the intersection between that automaton and the regex automaton. Then you can generate all the strings in the intersection automaton. Of course, IRL, you wouldn't actually generate the intersection automaton (you can easily do this on the fly), but you get the idea.
Automata are really a lost art in modern natural language processing. We used to do things like store a large vocabulary in an deterministic acyclic minimized automaton (nice and compact, so-called dictionary automaton). And then to find, say all words within Levenshtein distance 2 of hacker, create a Levenshtein automaton for hacker and then compute (on the fly) the intersection between the Levenshtein automaton and the dictionary automaton. The language of the automaton is then all words within the intersection automaton.
I wrote a Java package a decade ago that implements some of this stuff:
the page actually does give these. for α := [a-z]{2,4} the page gives |α| = 475228.
however, as others have pointed out any non-trivial use of the kleene star means the result will be ∞. in this case the page will list numbers that roughly correspond to "number of strings with N applications of kleene star" in addition to infinity.
(EDIT: this code is completely wrongheaded and does not work; it assumes that when sequencing regexes, you can take the product of their sizes to find the overall size. This is just not true. See reply, below, for an example.)
-- https://gist.github.com/rntz/03604e36888a8c6f08bb5e8c665ba9d0
import qualified Data.List as List
data Regex = Class [Char] -- character class
| Seq [Regex] -- sequence, ABC
| Choice [Regex] -- choice, A|B|C
| Star Regex -- zero or more, A*
deriving (Show)
data Size = Finite Int | Infinite deriving (Show, Eq)
instance Num Size where
abs = undefined; signum = undefined; negate = undefined -- unnecessary
fromInteger = Finite . fromInteger
Finite x + Finite y = Finite (x + y)
_ + _ = Infinite
Finite x * Finite y = Finite (x * y)
x * y = if x == 0 || y == 0 then 0 else Infinite
-- computes size & language (list of matching strings, if regex is finite)
eval :: Regex -> (Size, [String])
eval (Class chars) = (Finite (length cset), [[c] | c <- cset])
where cset = List.nub chars
eval (Seq regexes) = (product sizes, concat <$> sequence langs)
where (sizes, langs) = unzip $ map eval regexes
eval (Choice regexes) = (size, lang)
where (sizes, langs) = unzip $ map eval regexes
lang = concat langs
size = if elem Infinite sizes then Infinite
-- finite, so just count 'em. inefficient but works.
else Finite (length (List.nub lang))
eval (Star r) = (size, lang)
where (rsize, rlang) = eval r
size | rsize == 0 = 1
| rsize == 1 && List.nub rlang == [""] = 1
| otherwise = Infinite
lang = [""] ++ ((++) <$> [x | x <- rlang, x /= ""] <*> lang)
size :: Regex -> Size
size = fst . eval
NB. Besides the utter wrong-headedness of the `product` call, the generated string-sets may not be exhaustive for infinite languages, and the original version (I have since edited it) was wrong in several cases for Star (if the argument was nullable or empty).
You're correct, and I don't see any good way to avoid this that doesn't involve enumerating the actual language (at least when the language is finite).
It turns out to be not that hard to just compute the language of the regex, if it is finite, and otherwise note that it is infinite:
import Prelude hiding (null)
import Data.Set (Set, toList, fromList, empty, singleton, isSubsetOf, unions, null)
data Regex = Class [Char] -- character class
| Seq [Regex] -- sequence, ABC
| Choice [Regex] -- choice, A|B|C
| Star Regex -- zero or more, A*
deriving (Show)
-- The language of a regex is either finite or infinite.
-- We only care about the finite case.
data Lang = Finite (Set String) | Infinite deriving (Show, Eq)
zero = Finite empty
one = Finite (singleton "")
isEmpty (Finite s) = null s
isEmpty Infinite = False
cat :: Lang -> Lang -> Lang
cat x y | isEmpty x || isEmpty y = zero
cat (Finite s) (Finite t) = Finite $ fromList [x ++ y | x <- toList s, y <- toList t]
cat _ _ = Infinite
subsingleton :: Lang -> Bool
subsingleton Infinite = False
subsingleton (Finite s) = isSubsetOf s (fromList [""])
eval :: Regex -> Lang
eval (Class chars) = Finite $ fromList [[c] | c <- chars]
eval (Seq rs) = foldr cat one $ map eval rs
eval (Choice rs) | any (== Infinite) langs = Infinite
| otherwise = Finite $ unions [s | Finite s <- langs]
where langs = map eval rs
eval (Star r) | subsingleton (eval r) = one
| otherwise = Infinite
https://regex-generate.github.io/regenerate/ (I'm one of the authors) enumerates all the matching (and non-matching) strings, which incidentally answers the question, but doesn't terminate in the infinite case.
I feel like it might be possible with dataflow analysis. Stepping through the regex maintaining a liveness set or something like that. Sort of like computing exemplar inputs, but with repetition as permitted exemplars. Honestly probably end up re-encoding the regex in some other format, perhaps with 'optimizations applied.'
I think that's just the ordinal corresponding to lexicographic order on the words in the language, so yeah that should work. I wonder how easy it is to calculate...
normally you would use an ordinal number [1] to label individual elements of an infinite set while using a cardinal number [2] to measure the size of the set.
i believe the cardinality of a set of words from a finite alphabet (with more than one member) is equivalent to the cardinality of the real numbers. this means that the cardinality of .* is c.
unfortunately, i don't think that cardinality gets us very far when trying to differentiate the "complexity" of expressions like [ab]* from ([ab]*c[de]*)*[x-z]*. probably some other metric should be used (maybe something like kolmogorov complexity).
I wouldn't say that's their 'normal' usage, I mean sure you can use them like that but fundamentally ordinal numbers are equivalence classes of ordered sets in the same way that cardinal numbers are equivalence classes of sets.
As you've rightly noted the latter equivalence class gets us nothing so throwing away the ordering is a bit of a waste. Of all mathematical concepts 'size' is easily the most subjective so picking one that is interesting is better than trying to be 'correct'.
In particular a*b* is exactly equivalent to ω^2, since a^n b^m < a^x b^y iff n < x or n=x and m<y. This gives an order preserving isomorphism between words of the form a^n b^m and tuples (n,m) with lexicographic ordering.
Unless I'm very mistaken adding onto the end of a string doesn't affect lexicographic order, so that's effectively [ab]*. The ordinality of [ab] is simple, it's 2, but the kleene star is a bit of an odd one, it's not quite exponentiation.
To reason about the kleene star it's a bit simpler to consider something like R^*n, where you repeat up to n times. Obviously R^*0 = 1 and R^*S(n) can be built from R^*n by picking an element of R^*n and appending either nothing or an element of R, here the element of R^*n determines most of the order and 'nothing' orders in front. For technical reasons the corresponding ordinal is (1+R) R^*n, which is backwards from how you'd expect it and how you'd normally define exponentiation.
The kleene star can be recovered by taking the limit, identifying R^*n with it's image in R^*S(n). Which also doesn't quite work as nicely as you'd hope (normally the image is just a downward closed subset, it's not in this case).
I think [ab]* is equivalent to something like the rational part of the Cantor set. Not sure if there's a simpler way to describe it, it's nowhere near as simple as 2^ω, which is just ω.
Hmm, turns out this fails to be an ordinal because regular languages aren't well-ordered (except if you reverse the lexicographic order, maybe?). They are what is called an order type, and it looks like it should be possible to identify them with subsets of the rationals, if you wanted to.
Perhaps reversing the lexicographic order makes more sense, in that case longer tuples simply order last so R^* = 1 + R + R^2 + ..., the limit here is much easier since R^*n = 1 + R + ... + R^n is downwards closed as a subset of R^*S(n).
Then again in that scenario [ab]* is simply ω because it is effectively the same as just writing an integer in binary, so it is less interesting in a way.
I would say [ab]* has order type omega. That's because the set of strings it represents is: {epsilon, ab, abab, ababab, abababab, ...} which looks a lot like omega. (epsilon = empty string)
What this exercise really is is finding a canonical way to order a regular language (the set of strings a regexp matches). For example, a*b* could be {epsilon, a, aa, aaa, aaa, aaaa, ..., b, bb, bbb, bbbb, bbbbb, bbbbbb, ..., ab, aab, aaab, aaaab, ..., abb, aabb, ..., ...}
which looks a lot like omega ^ 2 (not 2 * omega like I said before). However, you could also re-arrange the set to look like omega: {epsilon, a, b, aa, bb, ab, aaa, bbb, aab, abb, bbb, ...} (strings of length 1, length 2, length 3, etc)
I propose the following: for any two strings in the regular language, the one that comes first is the one whose kleene-star repetition counts come first lexicographically. More concretely, for the language a*b*, aaaab represents a repetition count of (4, 1) which and bbb represents (0, 3). (0, 3) comes before (4, 1) lexicographically, so bbb comes before aaaab. This admits the following ordering: {epsilon, b, bb, bbb, bbbb, ..., a, abb, abbb, abbbb, ..., aa, aab, aabb, aabbb, ..., ...} which is omega ^ 2 which "feels" right to me. Another rule is for the regular language (X|Y) and two strings x from X and y from Y, x should always come before y in our ordered set representation.
Hold on, what about nested kleene-stars? (a*)* is just a*, but (a*b*)* is distinct from a*b*. However, the "kleene-star counts" analysis from above breaks down because there are now multiple ways to parse strings like aab. I don't really know how to classify these regular languages as ordinals yet.
I don't really see any useful applications of this, but it's still fun to think about. The game I'm playing is thinking about a random ordinal and trying to come up with a regular language that, under my ordering rule above, looks like that ordinal. Let's try 2 * omega (which looks like this: {0, 1, 2, 3, 4, 5, ..., omega, omega + 1, omega + 2, omega + 3, ...} e.g. 2 copies of omega "concatenated"):