Getting Started

Installation

The easiest way to install zparse is from PyPI:

pip install zparse

Basic Usage

Using zparse starts with writing a grammar. Grammars are described in a python-like version of EBNF. Details on writing grammars can be found on the Grammars page. You can use raw multiline strings or store your grammars in files:

grammar = r'''
INT: ('0'-'9')+
start: '[' INT (',' INT)* ']'
'''
with open('grammar.peg') as f:
    grammar = f.read()

Once you have your grammar, pass it to the make_parser function:

ParserClass = zparse.make_parser(grammar)

make_parser returns a type which must then be instantiated with the code you want to parse. Call the parse method on the parser object to get the parse tree:

parser = ParserClass(code)
parse_tree = parser.parse()

Grammars

zparse accepts PEG grammars. zparse grammars contain four elements: rule definitions, token definitions, token declarations, and fragment definitions.

Fragment Definitions

Fragment definitions look like this:

_INT: ('0'-'9')+

The left-hand side of the definition contains the fragment's name. Fragment names are written in all caps and begin with an underscore. The right-hand side contains a regular expression. Regular expressions can contain eight elements:

  1. string literals ('abc', 'if')
  2. ranges ('a'-'z', '0'-'9')
  3. zero or more ('a'*, ('a'-'z')*)
  4. one or more ('a'+, ('a'-'z')+)
  5. optional ('a'?, ('a'-'z')?)
  6. concatenation ('a' 'b', 'a'* 'b'+)
  7. option ('a' | 'b')
  8. reference to another fragment (_FRAG+)
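
Putting these together, a few illustrative fragment definitions (the names here are hypothetical, not part of any standard grammar):

```
_DIGIT: '0'-'9'
_HEX: _DIGIT | 'a'-'f' | 'A'-'F'
_SIGN: ('+' | '-')?
_NUMBER: _SIGN _DIGIT+ ('.' _DIGIT*)?
```

_NUMBER uses concatenation, a reference to another fragment, one-or-more, optional, and zero-or-more in a single definition.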

Token Declarations

Token declarations are just a token name followed by a newline. Token names are written in all caps, but do not start with an underscore. Token declarations are useful when you want to use custom code to emit tokens. For example, the Python grammar contains INDENT and DEDENT tokens, but those don't correspond to regular expressions:

INDENT
DEDENT

Token Definitions

Token definitions look like this:

WHITE_SPACE: {self.ws}? (' ' | '\t' | '\n')+ @ignore

Token definitions can contain all the elements from fragments. They can also contain tags (like @ignore) and predicates (like {self.ws}), which are explained on the advanced features page. Unlike fragment definitions, token definitions cannot reference other token definitions.

Rule Definitions

Fragment definitions, token declarations, and token definitions all describe tokenization. Rule definitions describe parsing.

expr
  : INT
  | '(' expr ')'
  | expr '**' expr !right_assoc
  | '-' expr
  | expr ('*' | '/') expr
  | expr ('+' | '-') expr

Rule names must contain at least one lowercase letter and cannot start with an underscore. Rule definitions can contain references to other rules, to themselves, and to tokens, but cannot reference fragments directly. Rules can contain directives (!right_assoc), but only at the top level (so rule: a (b | c !right_assoc) | d would not be allowed). Any definition can extend past a single line, but subsequent lines must be indented.
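
Combining the pieces above, a complete grammar for a bracketed list of integers might look like this (the fragment, token, and rule names are illustrative):

```
_DIGIT: '0'-'9'
INT: _DIGIT+
WS: (' ' | '\t' | '\n')+ @ignore
start: '[' INT (',' INT)* ']'
```

The fragment _DIGIT describes tokenization only, the INT and WS token definitions produce the token stream (with whitespace dropped by @ignore), and the start rule consumes tokens.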

zparse API

The zparse API is split into three sections:

Tokenization

make_tokenizer(grammar: str) -> type

The make_tokenizer function creates a tokenizer class from a grammar. For example:

TokenizerClass = zparse.make_tokenizer(grammar)

Once the tokenizer class is created, we instantiate it with the code we want to tokenize:

tokenizer = TokenizerClass(code)

TokenizerClass.tokens(self) -> Generator[Token]

The tokens method is called on the tokenizer object to produce the token stream:

tokens = tokenizer.tokens()
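
zparse itself may not be installed, so as a conceptual stand-in, the sketch below shows the shape of this API: a class constructed with the source text whose tokens() method is a generator. The token kinds and regular expressions are illustrative, not zparse's own.

```python
import re
from typing import Generator, NamedTuple

class Token(NamedTuple):
    kind: str
    text: str

class ToyTokenizer:
    """Illustrative stand-in with the same call shape as a zparse tokenizer."""
    SPEC = [('INT', r'[0-9]+'), ('PUNCT', r'[\[\],]'), ('WS', r'\s+')]

    def __init__(self, code: str):
        self.code = code

    def tokens(self) -> Generator[Token, None, None]:
        pos = 0
        while pos < len(self.code):
            for kind, pattern in self.SPEC:
                m = re.compile(pattern).match(self.code, pos)
                if m:
                    if kind != 'WS':  # mimic an @ignore tag on whitespace
                        yield Token(kind, m.group())
                    pos = m.end()
                    break
            else:
                raise ValueError(f'no token matches at position {pos}')

tokens = list(ToyTokenizer('[1, 22]').tokens())
```

Because tokens() is a generator, tokens are produced lazily; wrapping the call in list() forces the whole stream at once.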

Token

The Token class represents a token in the token stream.

Token.kind: str

The token name (like 'INT' and 'WS').

Token.text: str

The text captured by the regular expression (like '34' and ' \t').

Token.line: int

The Token's line number. If the token spans over multiple lines, this field contains the starting line.

Token.column: int

The Token's column number.

Token.code: str

A reference to all the code that is being tokenized.
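
As an illustration of how line and column relate to the code field, the hypothetical helper below derives both from a character offset, assuming 1-based numbering (which the source does not specify):

```python
def line_and_column(code: str, offset: int) -> tuple[int, int]:
    """Hypothetical helper: derive 1-based line/column from an offset into code."""
    before = code[:offset]
    line = before.count('\n') + 1            # lines = newlines seen + 1
    column = offset - (before.rfind('\n') + 1) + 1  # distance from last newline
    return line, column
```

For a token that spans multiple lines, applying this to the token's starting offset matches the documented behavior of Token.line.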

Parsing

make_parser(grammar: str) -> type

The make_parser function creates a parser class from a grammar. For example:

ParserClass = zparse.make_parser(grammar)

Internally, this calls the make_tokenizer function. Once the parser class is created, we instantiate it with the code we want to parse:

parser = ParserClass(code)

ParserClass.parse(self) -> Node

The parse method is called on the parser object to find the parse tree:

parse_tree = parser.parse()

Node

The Node class describes the parse tree. Each node contains a str with the name of the rule it was created from. It also contains a list of all of its children.

Node.kind: str

The name of the rule this node was expanded from.

Node.children: list[Node | Token]

A list containing all of this node's terminal and nonterminal children in order.
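
As a sketch of working with this structure (the classes below are minimal stand-ins, not zparse's own), collecting the source text of a tree's leaves is a short recursion over Node.children:

```python
from dataclasses import dataclass, field

@dataclass
class Token:       # stand-in for zparse's Token
    kind: str
    text: str

@dataclass
class Node:        # stand-in for zparse's Node
    kind: str
    children: list = field(default_factory=list)

def leaf_text(node: Node) -> list[str]:
    """Collect the text of every Token under a node, left to right."""
    out = []
    for child in node.children:
        if isinstance(child, Node):
            out.extend(leaf_text(child))   # nonterminal child: recurse
        else:
            out.append(child.text)          # terminal child: a Token
    return out

tree = Node('start', [Token('PUNCT', '['),
                      Node('item', [Token('INT', '7')]),
                      Token('PUNCT', ']')])
```

Because children preserves order, the collected texts read back in source order.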

Errors

zparse defines three errors.

GrammarError

A GrammarError is thrown when a grammar is malformed. This includes syntax errors and logical errors (like recursion in fragment definitions and rules that reference fragments).

GrammarError.msg: str

This field contains a description of the error.

GrammarError.tokens: list[Token]

This field contains a list of the offending tokens. It may be empty if there is a syntax error and tokenization fails.

TokenError

A TokenError is thrown during parsing when none of the token definitions match the input.

TokenError.msg: str

This field contains a description of the error.

ParseError

A ParseError is thrown during parsing when the token stream does not match any of the grammar's rules.

ParseError.msg: str

This field contains a description of the error.

ParseError.tokens: list[Token]

This field contains the offending tokens. It usually contains a single token.

Advanced Features

Advanced Tokenization

Tags are created by passing the keyword argument base: type to the make_tokenizer function. The tokenizer class it outputs will subclass this base class, and the base class must itself subclass zparse.BaseTokenizer. Each tag in your grammar must be defined as a method on your base class. As an example, take the @ignore tag, which is already defined in zparse.BaseTokenizer. Its definition is simple:

def ignore(self, token: Token):
    return

When a token is matched in the input stream, instead of being emitted directly, it is passed to the corresponding method. In most languages, whitespace is ignored, so if we match a whitespace token, we want to throw it out. The @ignore tag accomplishes this: the ignore method takes in the token and emits nothing.

More sophisticated tag methods are not hard to imagine. For example, tokenization of Python code requires special examination of whitespace characters because indentation matters. In this case, the @handle_whitespace tag must analyze each whitespace token and emit INDENT and DEDENT tokens as needed using yield statements.
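
A self-contained sketch of that idea, independent of zparse, with simplified single-level-per-space indentation and hypothetical marker names:

```python
def indent_tokens(lines):
    """Yield 'INDENT'/'DEDENT' markers as leading-space depth changes.

    A simplification of what a @handle_whitespace tag method might do:
    compare each line's indentation against a stack of open levels.
    """
    stack = [0]
    for line in lines:
        depth = len(line) - len(line.lstrip(' '))
        if depth > stack[-1]:
            stack.append(depth)      # deeper: open a new level
            yield 'INDENT'
        while depth < stack[-1]:
            stack.pop()              # shallower: close levels until we match
            yield 'DEDENT'
    while stack[-1] > 0:             # close any levels still open at EOF
        stack.pop()
        yield 'DEDENT'
```

A real tag method would yield Token objects interleaved with the ordinary token stream (and would skip blank lines), but the stack discipline is the same.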
