Regular Expression
pyformlang.regular_expression
This module deals with regular expressions.
By default, this module does not use the standard way to write regular expressions. Please read the documentation of Regex for more information.
Available Classes
Regex
A regular expression
PythonRegex
A regular expression closer to Python format
MisformedRegexError
An error occurring when the input regex is incorrect
- exception pyformlang.regular_expression.MisformedRegexError(message: str, regex: str)
Error for misformed regex
- class pyformlang.regular_expression.PythonRegex(python_regex)
Represents a regular expression as used in Python.
It adds the following features to the basic regex:
Set of characters with []
Negated set of characters with [^…]
positive closure +
. for all printable characters
? for optional character/group
Repetition of characters with {m} and {n,m}
Shortcuts: \d, \s, \w
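The features listed above can be read as shorthand for the three textbook operators (concatenation, union and Kleene star). The sketch below shows one possible way to expand a few of them into strings written in the textbook syntax. The helper names (`char_set`, `positive_closure`, `optional`, `repeat`) are hypothetical and are not part of pyformlang; how PythonRegex performs this reduction internally is not shown here.

```python
# Illustrative only: hypothetical helpers, NOT part of pyformlang.
# They build strings in the textbook syntax:
# space = concatenation, | = union, * = Kleene star, $ = epsilon.

def char_set(chars):
    """[cd] becomes the union of the listed characters."""
    return "(" + "|".join(chars) + ")"

def positive_closure(expr):
    """e+ becomes e followed by e*."""
    return f"({expr} ({expr})*)"

def optional(expr):
    """e? becomes e or the empty word."""
    return f"({expr}|$)"

def repeat(expr, low, high):
    """e{low,high} becomes the union of e repeated k times, low <= k <= high."""
    options = []
    for count in range(low, high + 1):
        if count == 0:
            options.append("$")  # zero repetitions is the empty word
        else:
            options.append("(" + " ".join([expr] * count) + ")")
    return "(" + "|".join(options) + ")"

print(char_set("cd"))
print(positive_closure("a"))
print(optional("a"))
print(repeat("a", 2, 3))
```

Each alternative is wrapped in parentheses so that the result does not rely on any assumption about operator precedence.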
- Parameters:
python_regex (str) – The regex, given as a string or as a compiled regex (re.compile(…))
- Raises:
MisformedRegexError – If the regular expression is misformed.
Examples
Python regular expressions wrapper
>>> from pyformlang.regular_expression import PythonRegex
>>> p_regex = PythonRegex("a+[cd]")
>>> p_regex.accepts(["a", "a", "d"])
True
As the alphabet is composed of single characters, one could also write
>>> p_regex.accepts("aad")
True
>>> p_regex.accepts(["d"])
False
- class pyformlang.regular_expression.Regex(regex)
Represents a regular expression
Pyformlang implements the textbook operators, which deviate slightly from the operators in Python. For a representation closer to Python's, please use PythonRegex.
The concatenation can be represented either by a space or a dot (.)
The union is represented either by | or +
The Kleene star is represented by *
The epsilon symbol can either be “epsilon” or $
Parentheses can also be used. All symbols except the space, ., |, +, *, (, ), epsilon and $ can be part of the alphabet. All other common regex operators (such as []) are syntactic sugar that can be reduced to the operators above. Another main difference is that the alphabet is not restricted to single characters, as is the case in Python. For example, “python” is a single symbol in Pyformlang, whereas it is the concatenation of six symbols in a Python regular expression.
All special characters except epsilon can be escaped with a backslash (a double backslash in Python string literals).
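To make the notion of multi-character symbols concrete, here is a minimal sketch of how a string in the textbook syntax could be split into operators and symbols. The `tokenize` function is hypothetical and written only for illustration; it is not pyformlang's parser, it ignores backslash escapes, and it treats the word “epsilon” like any other symbol.

```python
import re

# Hypothetical tokenizer, for illustration only (not pyformlang's parser).
# An operator or parenthesis is a single character; a symbol is any maximal
# run of characters that are neither whitespace nor operators.
TOKEN_PATTERN = re.compile(r"[().|+*$]|[^\s().|+*$]+")

def tokenize(regex_str):
    """Split a textbook-style regex string into operator and symbol tokens."""
    return TOKEN_PATTERN.findall(regex_str)

print(tokenize("python"))    # one symbol, not six characters
print(tokenize("abc|d"))     # two symbols joined by a union
print(tokenize("(a|b)* c"))  # the space separates the star from the symbol c
```

The space itself is dropped by this sketch; a real parser would still need to turn adjacency into concatenation.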
- Parameters:
regex (str) – The regex represented as a string
- Raises:
MisformedRegexError – If the regular expression is misformed.
Examples
>>> regex = Regex("abc|d")
Check if the symbol “abc” is accepted
>>> regex.accepts(["abc"])
True
Check if the word composed of the symbols “a”, “b” and “c” is accepted
>>> regex.accepts(["a", "b", "c"])
False
Check if the symbol “d” is accepted
>>> regex.accepts(["d"])
True
>>> regex1 = Regex("a b")
>>> regex_concat = regex.concatenate(regex1)
>>> regex_concat.accepts(["d", "a", "b"])
True
>>> print(regex_concat.get_tree_str())
Operator(Concatenation)
 Operator(Union)
  Symbol(abc)
  Symbol(d)
 Operator(Concatenation)
  Symbol(a)
  Symbol(b)
Give the equivalent finite-state automaton
>>> regex_concat.to_epsilon_nfa()
- accepts(word: Iterable[str]) → bool
Check if a word matches (completely) the regex
- Parameters:
word (iterable of str) – The word to check
- Returns:
is_accepted – Whether the word is recognized or not
- Return type:
bool
Examples
>>> regex = Regex("abc|d")
Check if the symbol “abc” is accepted
>>> regex.accepts(["abc"])
True
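The word “completely” matters: the whole word has to be consumed, not just a part of it. As an analogy only, the closest counterpart in Python's standard `re` module is `fullmatch`, as opposed to `search`. The sketch below uses `re` rather than pyformlang, and the function names `accepts_completely` and `contains_match` are chosen for this illustration.

```python
import re

pattern = re.compile(r"a+[cd]")

def accepts_completely(word):
    """Return True only if the whole word matches the pattern."""
    return pattern.fullmatch(word) is not None

def contains_match(word):
    """Return True if any part of the word matches the pattern."""
    return pattern.search(word) is not None

print(accepts_completely("aad"))   # the whole word matches
print(accepts_completely("aadx"))  # the trailing "x" is left over
print(contains_match("aadx"))      # a part of the word matches
```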
- concatenate(other: Regex) → Regex
Concatenates a regular expression with another one
- Equivalent to:
>>> regex0 + regex1
- Parameters:
other (Regex) – The other regex
- Returns:
regex – The concatenation of the two regexes
- Return type:
Regex
Examples
>>> regex0 = Regex("a b")
>>> regex1 = Regex("c")
>>> regex_concat = regex0.concatenate(regex1)
>>> regex_concat.accepts(["a", "b"])
False
>>> regex_concat.accepts(["a", "b", "c"])
True
Or equivalently:
>>> regex_concat = regex0 + regex1
>>> regex_concat.accepts(["a", "b", "c"])
True
- classmethod from_python_regex(regex)
Creates a regex from a string written in Python's regex syntax.
Careful: not everything is implemented; check the PythonRegex class documentation for more details.
It is equivalent to calling the PythonRegex constructor directly.
- Parameters:
regex (str) – The regex given as a string or a compiled regex
- Returns:
python_regex – The regex
- Return type:
Examples
>>> Regex.from_python_regex("a+[cd]")
- from_string(regex_str: str)
Construct a regex from a string. For internal usage.
Equivalent to the constructor of Regex
- Parameters:
regex_str (str) – The string representation of the regex
- Returns:
regex – The regex
- Return type:
Examples
>>> regex.from_string("a b c")
which is equivalent to:
>>> Regex("a b c")
- get_number_operators() → int
Gives the number of operators in the regex
- Returns:
n_operators – The number of operators in the regex
- Return type:
int
Examples
>>> regex = Regex("a|b*")
>>> regex.get_number_operators()
2
The two operators are “|” and “*”.
- get_number_symbols() → int
Gives the number of symbols in the regex
- Returns:
n_symbols – The number of symbols in the regex
- Return type:
int
Examples
>>> regex = Regex("a|b*")
>>> regex.get_number_symbols()
2
The two symbols are “a” and “b”.
- get_tree_str(depth: int = 0) → str
Get a string representation of the tree behind the regex
- Parameters:
depth (int) – The current depth, 0 by default
- Returns:
representation – The tree representation
- Return type:
str
Examples
>>> regex = Regex("abc|d*")
>>> print(regex.get_tree_str())
Operator(Union)
 Symbol(abc)
 Operator(Kleene Star)
  Symbol(d)
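The operator count, the symbol count and the tree string all come from walking the same expression tree. The sketch below models that tree with nested tuples to show how the three results relate. The tuple encoding, the function names and the one-space indentation per depth level are assumptions made for this illustration; they are not pyformlang's internal representation.

```python
# Hypothetical tree encoding, for illustration only:
#   leaf: ("Symbol", value)
#   node: (operator_name, child, ...)
TREE = ("Union", ("Symbol", "abc"), ("Kleene Star", ("Symbol", "d")))

def tree_str(node, depth=0):
    """Render one node per line, indented by its depth in the tree."""
    label, *children = node
    if label == "Symbol":
        return " " * depth + f"Symbol({children[0]})\n"
    text = " " * depth + f"Operator({label})\n"
    for child in children:
        text += tree_str(child, depth + 1)
    return text

def count_operators(node):
    """Count the internal (operator) nodes of the tree."""
    label, *children = node
    if label == "Symbol":
        return 0
    return 1 + sum(count_operators(child) for child in children)

def count_symbols(node):
    """Count the leaf (symbol) nodes of the tree."""
    label, *children = node
    if label == "Symbol":
        return 1
    return sum(count_symbols(child) for child in children)

print(tree_str(TREE), end="")
print(count_operators(TREE), count_symbols(TREE))
```

The `depth` argument plays the same role as in the method above: it is the indentation level of the node being rendered, and callers normally leave it at 0.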
- kleene_star() → Regex
Makes the Kleene star of the current regex
- Returns:
regex – The Kleene star of the current regex
- Return type:
Regex
Examples
>>> regex = Regex("a")
>>> regex_kleene = regex.kleene_star()
>>> regex_kleene.accepts([])
True
>>> regex_kleene.accepts(["a", "a", "a"])
True
- to_cfg(starting_symbol='S') → CFG
Turns the regex into a context-free grammar
- Parameters:
starting_symbol (Variable, optional) – The starting symbol
- Returns:
cfg – An equivalent context-free grammar
- Return type:
CFG
Examples
>>> regex = Regex("(a|b)* c")
>>> my_cfg = regex.to_cfg()
>>> my_cfg.contains(["c"])
True
- to_epsilon_nfa()
Transforms the regular expression into an epsilon NFA
- Returns:
enfa – An epsilon NFA equivalent to the regex
- Return type:
Examples
>>> regex = Regex("abc|d") >>> regex.to_epsilon_nfa()
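To give an idea of what an epsilon NFA equivalent to "abc|d" can look like, the sketch below builds one by hand and simulates it. It is not the output of to_epsilon_nfa(): the state numbering, the `TRANSITIONS` table and the helper functions are assumptions made for this illustration, and the automaton produced by pyformlang may be structured differently.

```python
EPSILON = None  # marker used for epsilon transitions in this sketch

# Hand-built epsilon NFA for the textbook regex "abc|d".
# The alphabet contains the two symbols "abc" and "d".
TRANSITIONS = {
    (0, EPSILON): {1, 3},  # branch into the two alternatives
    (1, "abc"): {2},
    (3, "d"): {4},
    (2, EPSILON): {5},     # both alternatives join in state 5
    (4, EPSILON): {5},
}
START_STATE = 0
FINAL_STATES = {5}

def epsilon_closure(states):
    """Return all states reachable from `states` using epsilon moves only."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for target in TRANSITIONS.get((state, EPSILON), ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure

def accepts(word):
    """Simulate the automaton on a word given as an iterable of symbols."""
    current = epsilon_closure({START_STATE})
    for symbol in word:
        moved = set()
        for state in current:
            moved |= TRANSITIONS.get((state, symbol), set())
        current = epsilon_closure(moved)
    return bool(current & FINAL_STATES)

print(accepts(["abc"]))
print(accepts(["a", "b", "c"]))
print(accepts(["d"]))
```

Because "abc" is a single symbol, the word made of the three symbols "a", "b" and "c" finds no transition and is rejected.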
- union(other: Regex) → Regex
Makes the union with another regex
- Equivalent to:
>>> regex0 | regex1
- Parameters:
other (Regex) – The other regex
- Returns:
regex – The union of the two regexes
- Return type:
Regex
Examples
>>> regex0 = Regex("a b")
>>> regex1 = Regex("c")
>>> regex_union = regex0.union(regex1)
>>> regex_union.accepts(["a", "b"])
>>> regex_union.accepts(["c"])
Or equivalently:
>>> regex_union = regex0 | regex1
>>> regex_union.accepts(["a", "b"])
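A note on operator spelling: Python's `or` keyword cannot be overloaded. It simply returns its first truthy operand, so it can never build a union. The `|` operator, by contrast, dispatches to `__or__`. The sketch below shows the difference with a hypothetical `Lang` class standing in for a regex; it assumes, without showing pyformlang's code, that Regex exposes its union through `|` in the same way.

```python
class Lang:
    """Hypothetical stand-in for a regex: a finite set of accepted words."""

    def __init__(self, words):
        self.words = frozenset(words)

    def union(self, other):
        return Lang(self.words | other.words)

    def __or__(self, other):
        # Called for `self | other`; never called for `self or other`.
        return self.union(other)

lang0 = Lang({("a", "b")})
lang1 = Lang({("c",)})

with_pipe = lang0 | lang1      # builds the union through __or__
with_keyword = lang0 or lang1  # plain boolean logic: evaluates to lang0

print(("c",) in with_pipe.words)
print(with_keyword is lang0)
print(("c",) in with_keyword.words)
```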