Regular Expression

pyformlang.regular_expression

This module deals with regular expressions.

By default, this module does not use the standard way to write regular expressions. Please read the documentation of Regex for more information.

Available Classes

Regex

A regular expression

PythonRegex

A regular expression closer to Python's format

MisformedRegexError

An error occurring when the input regex is incorrect

exception pyformlang.regular_expression.MisformedRegexError(message: str, regex: str)[source]

Error for misformed regex

class pyformlang.regular_expression.PythonRegex(python_regex)[source]

Represents a regular expression as used in Python.

It adds the following features to the basic regex:

  • Set of characters with [] (no inverse with [^…])

  • positive closure +

  • . for all printable characters

  • ? for optional character/group

  • Shortcuts: \d, \s, \w
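
For comparison, the features listed above behave like their counterparts in Python's standard re module. The sketch below uses only re, not pyformlang, and assumes that accepting a word corresponds to a full match, as with re.fullmatch. The helper name fully_matches is made up for illustration.

```python
import re


def fully_matches(pattern: str, word: str) -> bool:
    """Return True only if the whole word matches the pattern."""
    return re.fullmatch(pattern, word) is not None


# Character set with [] and positive closure +
print(fully_matches(r"a+[cd]", "aad"))   # True
print(fully_matches(r"a+[cd]", "d"))     # False: + needs at least one "a"

# ? marks an optional character, \d is the digit shortcut
print(fully_matches(r"ab?\d", "a7"))     # True
print(fully_matches(r"ab?\d", "abb7"))   # False: only one optional "b"
```

Because a partial match is not enough, "aadx" would be rejected by the first pattern even though it starts with a matching prefix.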

Parameters

python_regex (str) – The regex, given as a string or as a compiled regex (re.compile(…))

Raises

MisformedRegexError – If the regular expression is misformed.

Examples

Python regular expressions wrapper

>>> from pyformlang.regular_expression import PythonRegex
>>> p_regex = PythonRegex("a+[cd]")
>>> p_regex.accepts(["a", "a", "d"])
True

As the alphabet is composed of single characters, one could also write

>>> p_regex.accepts("aad")
True
>>> p_regex.accepts(["d"])
False

class pyformlang.regular_expression.Regex(regex)[source]

Represents a regular expression

Pyformlang implements the textbook operators, which deviate slightly from the operators in Python. For a representation closer to Python's, please use PythonRegex.

  • The concatenation can be represented either by a space or a dot (.)

  • The union is represented either by | or +

  • The Kleene star is represented by *

  • The epsilon symbol can either be “epsilon” or $

It is also possible to use parentheses. All symbols except the space, ., |, +, *, (, ), epsilon and $ can be part of the alphabet. All other common regex operators (such as []) are syntactic sugar that can be reduced to the previous operators. Another main difference is that the alphabet is not restricted to single characters, as is the case in Python. For example, “python” is a single symbol in Pyformlang, whereas it is the concatenation of six symbols in Python's regular expressions.

All special characters except epsilon can be escaped with a backslash (a double backslash inside a string).
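
To make the symbol rule concrete, here is a minimal tokenizer sketch in plain Python. It is not pyformlang's parser: the function name tokenize and the SPECIAL set are assumptions made for illustration, and spaces are simply dropped since they only separate symbols.

```python
from typing import List

# Characters that end a symbol; "epsilon" is left as an ordinary word token
SPECIAL = {" ", ".", "|", "+", "*", "(", ")", "$"}


def tokenize(regex: str) -> List[str]:
    """Split a textbook-style regex into operators and multi-character symbols."""
    tokens = []
    current = ""  # symbol being accumulated
    i = 0
    while i < len(regex):
        char = regex[i]
        if char == "\\" and i + 1 < len(regex):
            # Escaped character: keep it as part of the current symbol
            current += regex[i + 1]
            i += 2
            continue
        if char in SPECIAL:
            if current:
                tokens.append(current)
                current = ""
            if char != " ":
                tokens.append(char)
        else:
            current += char
        i += 1
    if current:
        tokens.append(current)
    return tokens


print(tokenize("python"))     # ['python']
print(tokenize("abc|d"))      # ['abc', '|', 'd']
print(tokenize("(a|b)* c"))   # ['(', 'a', '|', 'b', ')', '*', 'c']
print(tokenize("a\\|b"))      # ['a|b']
```

The last call shows the double backslash needed in an ordinary string literal so that a single backslash reaches the regex.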

Parameters

regex (str) – The regex represented as a string

Raises

MisformedRegexError – If the regular expression is misformed.

Examples

>>> regex = Regex("abc|d")

Check if the symbol “abc” is accepted

>>> regex.accepts(["abc"])
True

Check if the word composed of the symbols “a”, “b” and “c” is accepted

>>> regex.accepts(["a", "b", "c"])
False

Check if the symbol “d” is accepted

>>> regex.accepts(["d"])
True
>>> regex1 = Regex("a b")
>>> regex_concat = regex.concatenate(regex1)
>>> regex_concat.accepts(["d", "a", "b"])
True
>>> print(regex_concat.get_tree_str())
Operator(Concatenation)
 Operator(Union)
  Symbol(abc)
  Symbol(d)
 Operator(Concatenation)
  Symbol(a)
  Symbol(b)

Give the equivalent finite-state automaton

>>> regex_concat.to_epsilon_nfa()

accepts(word: Iterable[str]) bool[source]

Check if a word matches (completely) the regex

Parameters

word (iterable of str) – The word to check

Returns

is_accepted – Whether the word is recognized or not

Return type

bool

Examples

>>> regex = Regex("abc|d")

Check if the symbol “abc” is accepted

>>> regex.accepts(["abc"])
True

concatenate(other: Regex) Regex[source]

Concatenates a regular expression with another one

Equivalent to:

>>> regex0 + regex1

Parameters

other (Regex) – The other regex

Returns

regex – The concatenation of the two regexes

Return type

Regex

Examples

>>> regex0 = Regex("a b")
>>> regex1 = Regex("c")
>>> regex_concat = regex0.concatenate(regex1)
>>> regex_concat.accepts(["a", "b"])
False
>>> regex_concat.accepts(["a", "b", "c"])
True

Or equivalently:

>>> regex_concat = regex0 + regex1
>>> regex_concat.accepts(["a", "b", "c"])
True

classmethod from_python_regex(regex)[source]

Creates a regex from a string written in Python's regex syntax.

Careful: not everything is implemented; check the PythonRegex class documentation for more details.

It is equivalent to calling the PythonRegex constructor directly.

Parameters

regex (str) – The regex, given as a string or as a compiled regex

Returns

python_regex – The regex

Return type

PythonRegex

Examples

>>> Regex.from_python_regex("a+[cd]")

from_string(regex_str: str)[source]

Construct a regex from a string. For internal usage.

Equivalent to the constructor of Regex

Parameters

regex_str (str) – The string representation of the regex

Returns

regex – The regex

Return type

Regex

Examples

>>> regex.from_string("a b c")

which is equivalent to:

>>> Regex("a b c")

get_number_operators() int[source]

Gives the number of operators in the regex

Returns

n_operators – The number of operators in the regex

Return type

int

Examples

>>> regex = Regex("a|b*")
>>> regex.get_number_operators()
2

The two operators are “|” and “*”.

get_number_symbols() int[source]

Gives the number of symbols in the regex

Returns

n_symbols – The number of symbols in the regex

Return type

int

Examples

>>> regex = Regex("a|b*")
>>> regex.get_number_symbols()
2

The two symbols are “a” and “b”.
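
Both counts can be read off the expression tree: operators are the inner nodes and symbols are the leaves. The sketch below illustrates this with a hypothetical Node class written in plain Python. It is not pyformlang's internal representation.

```python
class Node:
    """One node of a regex tree: an operator with children, or a leaf symbol."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def count_operators(node):
    """Count the inner nodes (operators) of the tree."""
    own = 1 if node.children else 0
    return own + sum(count_operators(child) for child in node.children)


def count_symbols(node):
    """Count the leaves (symbols) of the tree."""
    if not node.children:
        return 1
    return sum(count_symbols(child) for child in node.children)


# Tree for "a|b*": a union whose right branch is a Kleene star
tree = Node("Union", [Node("a"), Node("Kleene Star", [Node("b")])])
print(count_operators(tree))  # 2
print(count_symbols(tree))    # 2
```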

get_tree_str(depth: int = 0) str[source]

Get a string representation of the tree behind the regex

Parameters

depth (int) – The current depth, 0 by default

Returns

representation – The tree representation

Return type

str

Examples

>>> regex = Regex("abc|d*")
>>> print(regex.get_tree_str())
Operator(Union)
 Symbol(abc)
 Operator(Kleene Star)
  Symbol(d)

kleene_star() Regex[source]

Makes the Kleene star of the current regex

Returns

regex – The Kleene star of the current regex

Return type

Regex

Examples

>>> regex = Regex("a")
>>> regex_kleene = regex.kleene_star()
>>> regex_kleene.accepts([])
True
>>> regex_kleene.accepts(["a", "a", "a"])
True

to_cfg(starting_symbol='S') CFG[source]

Turns the regex into a context-free grammar

Parameters

starting_symbol (Variable, optional) – The starting symbol

Returns

cfg – An equivalent context-free grammar

Return type

CFG

Examples

>>> regex = Regex("(a|b)* c")
>>> my_cfg = regex.to_cfg()
>>> my_cfg.contains(["c"])
True
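
To see what an equivalent grammar can look like, the sketch below hand-writes one possible grammar for "(a|b)* c" and checks membership with a small breadth-first search. The grammar, the PRODUCTIONS table and the derives helper are all assumptions for illustration. The grammar returned by to_cfg may be structured differently.

```python
from collections import deque

# Hand-written grammar for "(a|b)* c" (an assumption, not to_cfg() output):
#   S -> A c
#   A -> a A | b A | epsilon
PRODUCTIONS = {
    "S": [["A", "c"]],
    "A": [["a", "A"], ["b", "A"], []],
}


def derives(word, start="S"):
    """Breadth-first search over leftmost derivations of `word`."""
    target = list(word)
    queue = deque([[start]])
    seen = {(start,)}
    while queue:
        form = queue.popleft()
        # Position of the leftmost variable, or None if the form is all terminals
        index = next((i for i, s in enumerate(form) if s in PRODUCTIONS), None)
        if index is None:
            if form == target:
                return True
            continue
        terminals = sum(s not in PRODUCTIONS for s in form)
        # Prune forms that already disagree with the word or are too long
        if form[:index] != target[:index] or terminals > len(target):
            continue
        for body in PRODUCTIONS[form[index]]:
            new_form = form[:index] + body + form[index + 1:]
            if tuple(new_form) not in seen:
                seen.add(tuple(new_form))
                queue.append(new_form)
    return False


print(derives(["c"]))            # True
print(derives(["a", "b", "c"]))  # True
print(derives(["a", "b"]))       # False: every word must end with "c"
```
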
to_epsilon_nfa()[source]

Transforms the regular expression into an epsilon NFA

Returns

enfa – An epsilon NFA equivalent to the regex

Return type

EpsilonNFA

Examples

>>> regex = Regex("abc|d")
>>> regex.to_epsilon_nfa()

union(other: Regex) Regex[source]

Makes the union with another regex

Equivalent to:

>>> regex0 | regex1

Parameters

other (Regex) – The other regex

Returns

regex – The union of the two regexes

Return type

Regex

Examples

>>> regex0 = Regex("a b")
>>> regex1 = Regex("c")
>>> regex_union = regex0.union(regex1)
>>> regex_union.accepts(["a", "b"])
True
>>> regex_union.accepts(["c"])
True

Or equivalently:

>>> regex_union = regex0 | regex1
>>> regex_union.accepts(["a", "b"])
True
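
A note on operator shortcuts: Python lets a class overload + and | through __add__ and __or__, but the keyword or cannot be overloaded and simply returns one of its operands. The toy class below illustrates the difference. Expr is a made-up stand-in, and the assumption is that Regex wires + and | to concatenate and union in the same way.

```python
class Expr:
    """Toy expression that records how it was combined (illustration only)."""

    def __init__(self, text):
        self.text = text

    def concatenate(self, other):
        return Expr(f"({self.text} {other.text})")

    def union(self, other):
        return Expr(f"({self.text}|{other.text})")

    __add__ = concatenate  # enables: left + right
    __or__ = union         # enables: left | right


left, right = Expr("a b"), Expr("c")
print((left + right).text)   # (a b c)
print((left | right).text)   # (a b|c)
print((left or right).text)  # a b   <- "or" just returned the left operand
```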