xpath

This code defines a lexer for XPath expressions using the ply module, specifying the tokens and the regular expression patterns that recognize XPath operators, separators, and literal values. The lexer serves as the foundation for a parser that builds an abstract syntax tree (AST) from an XPath expression for further processing.

Cell 2

"""XPath lexing rules.

To understand how this module works, it is valuable to have a strong
understanding of the `ply <http://www.dabeaz.com/ply/>`_ module.
"""

from __future__ import unicode_literals

operator_names = {
    'or': 'OR_OP',
    'and': 'AND_OP',
    'div': 'DIV_OP',
    'mod': 'MOD_OP',
}

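# Full token list handed to ply; the keyword operators above are appended so
# that their token names are declared alongside the symbolic ones.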
tokens = [
        'PATH_SEP',
        'ABBREV_PATH_SEP',
        'ABBREV_STEP_SELF',
        'ABBREV_STEP_PARENT',
        'AXIS_SEP',
        'ABBREV_AXIS_AT',
        'OPEN_PAREN',
        'CLOSE_PAREN',
        'OPEN_BRACKET',
        'CLOSE_BRACKET',
        'UNION_OP',
        'EQUAL_OP',
        'REL_OP',
        'PLUS_OP',
        'MINUS_OP',
        'MULT_OP',
        'STAR_OP',
        'COMMA',
        'LITERAL',
        'FLOAT',
        'INTEGER',
        'NCNAME',
        'NODETYPE',
        'FUNCNAME',
        'AXISNAME',
        'COLON',
        'DOLLAR',
    ] + list(operator_names.values())

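# Simple tokens defined as plain string rules.  ply sorts string rules by
# decreasing pattern length, so '//' and '..' are tried before '/' and '.'.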
t_PATH_SEP = r'/'
t_ABBREV_PATH_SEP = r'//'
t_ABBREV_STEP_SELF = r'\.'
t_ABBREV_STEP_PARENT = r'\.\.'
t_AXIS_SEP = r'::'
t_ABBREV_AXIS_AT = r'@'
t_OPEN_PAREN = r'\('
t_CLOSE_PAREN = r'\)'
t_OPEN_BRACKET = r'\['
t_CLOSE_BRACKET = r'\]'
t_UNION_OP = r'\|'
t_EQUAL_OP = r'!?='
t_REL_OP = r'[<>]=?'
t_PLUS_OP = r'\+'
t_MINUS_OP = r'-'
t_COMMA = r','
t_COLON = r':'
t_DOLLAR = r'\$'
t_STAR_OP = r'\*'

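# Whitespace carries no meaning between XPath tokens and is skipped outright.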
t_ignore = ' \t\r\n'

# NOTE: some versions of python cannot compile regular expressions that
# contain unicode characters above U+FFFF, which are allowable in NCNames.
# These characters can be used in Python 2.6.4, but can NOT be used in 2.6.2
# (status in 2.6.3 is unknown).  The code below accounts for that and excludes
# the higher character range if Python can't handle it.

# Monster regex derived from:
#  http://www.w3.org/TR/REC-xml/#NT-NameStartChar
#  http://www.w3.org/TR/REC-xml/#NT-NameChar
# EXCEPT:
# Technically those productions allow ':'. NCName, on the other hand:
#  http://www.w3.org/TR/REC-xml-names/#NT-NCName
# explicitly excludes those names that have ':'. We implement this by
# simply removing ':' from our regexes.

# NameStartChar regex without characters above U+FFFF
NameStartChar = r'[A-Z]|_|[a-z]|[\xc0-\xd6]|[\xd8-\xf6]|[\xf8-\u02ff]|' + \
    r'[\u0370-\u037d]|[\u037f-\u1fff]|[\u200c-\u200d]|[\u2070-\u218f]|' + \
    r'[\u2c00-\u2fef]|[\u3001-\uD7FF]|[\uF900-\uFDCF]|[\uFDF0-\uFFFD]'
# complete NameStartChar regex
Full_NameStartChar = r'(' + NameStartChar + r'|[\U00010000-\U000EFFFF]' + r')'
# additional characters allowed in NCNames after the first character
NameChar_extras = r'[-.0-9\xb7\u0300-\u036f\u203f-\u2040]'

try:
    import re
    # test whether or not re can compile unicode characters above U+FFFF
    re.compile(r'[\U00010000-\U00010001]')
    # if that worked, then use the full ncname regex
    NameStartChar = Full_NameStartChar
except:
    # if compilation failed, leave NameStartChar regex as is, which does not
    # include the unicode character ranges above U+FFFF
    pass

NCNAME_REGEX = r'(' + NameStartChar + r')(' + \
                      NameStartChar + r'|' + NameChar_extras + r')*'

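# Node-type names from the XPath grammar.  They lex as ordinary NCNAMEs with
# the rules in this excerpt, so this set is presumably consulted elsewhere to
# reclassify them as NODETYPE tokens.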
NODE_TYPES = set(['comment', 'text', 'processing-instruction', 'node'])

t_NCNAME = NCNAME_REGEX

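# Tokens defined as functions are matched in definition order (and before any
# string rules), so FLOAT must precede INTEGER or '1.5' would lex as the
# INTEGER 1 followed by the FLOAT 0.5.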
def t_LITERAL(t):
    r""""[^"]*"|'[^']*'"""
    t.value = t.value[1:-1]
    return t

def t_FLOAT(t):
    r'\d+\.\d*|\.\d+'
    t.value = float(t.value)
    return t

def t_INTEGER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    raise TypeError("Unknown text '%s'" % (t.value,))
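
A minimal sketch of how these rules might be driven, assuming they live in a module named xpath_lexer_rules (a hypothetical name; several declared tokens, such as NODETYPE and the keyword operators, have no rules in this excerpt and would be produced elsewhere in the module):

import ply.lex as lex
import xpath_lexer_rules  # hypothetical module holding the rules above

lexer = lex.lex(module=xpath_lexer_rules)
lexer.input("//book[@id='42']/title")
for tok in lexer:
    print("%s %r" % (tok.type, tok.value))
# e.g. ABBREV_PATH_SEP '//', NCNAME 'book', OPEN_BRACKET '[', ABBREV_AXIS_AT '@', ...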

What the code could have been:

import re

# Unicode name start character regex without U+FFFF characters
NameStartChar = r'[A-Z]|_|[a-z]|[\u00C0-\u00D6]|[\u00D8-\u00F6]|[\u00F8-\u02FF]|[\u0370-\u037D]|[\u037F-\u1FFF]|[\u200C-\u200D]|[\u2070-\u218F]|[\u2C00-\u2FEF]|[\u3001-\uD7FF]|[\uF900-\uFDCF]|[\uFDF0-\uFFFD]'

# Complete name start character regex including U+FFFF characters
Full_NameStartChar = r'(' + NameStartChar + r'|[\U00010000-\U000EFFFF])'

# Additional characters allowed in NCNames after the first character
NameChar_extras = r'[-.0-9\u00B7\u0300-\u036F\u203F-\u2040]'

try:
    # Test whether or not re can compile unicode characters above U+FFFF
    re.compile(r'[\U00010000-\U00010001]')
    # If compilation is successful, use the full NCName regex
    NCNAME_REGEX = r'(' + Full_NameStartChar + r')(' + Full_NameStartChar + r'|' + NameChar_extras + r')*'
except:
    # If compilation fails, use the NCName regex without U+FFFF characters
    NCNAME_REGEX = r'(' + NameStartChar + r')(' + NameStartChar + r'|' + NameChar_extras + r')*'

NODE_TYPES = set(['comment', 'text', 'processing-instruction', 'node'])

# NCName token
t_NCNAME = NCNAME_REGEX

Overview of the Code

This code is a lexer definition for XPath using the ply module. It defines a set of tokens and their corresponding regular expression patterns.

Structure

The code consists of the following sections:

  1. Import statements: The code imports unicode_literals from __future__ so that all string literals in the module are treated as Unicode.
  2. Operator names: A dictionary operator_names is defined to map XPath operator names to their corresponding token names.
  3. Token definitions: A list tokens names every token the lexer can emit, including the operator tokens appended from operator_names.
  4. Regular expression patterns: Token patterns are supplied using ply's t_<TOKEN> convention, either as plain strings or as functions whose docstring holds the pattern (see the sketch after this list).
  5. Ignored characters: The string t_ignore lists the whitespace characters that the lexer skips between tokens.
  6. NCName handling: The NCName regex is assembled from the XML NameStartChar and NameChar productions, with a fallback that drops characters above U+FFFF on Python builds whose re module cannot compile them.
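
The sketch below is not part of the original module; it simply shows the two rule styles ply accepts side by side, a plain-string rule and a function rule whose docstring carries the pattern and whose body can convert the matched text.

import ply.lex as lex

tokens = ['PLUS_OP', 'INTEGER']

# String rule: the name after 't_' must appear in the tokens list.
t_PLUS_OP = r'\+'

# Function rule: the docstring is the pattern; the body may transform t.value.
def t_INTEGER(t):
    r'\d+'
    t.value = int(t.value)
    return t

t_ignore = ' \t'

def t_error(t):
    raise TypeError("Unknown text %r" % (t.value,))

lexer = lex.lex()
lexer.input('1 + 2')
print([(tok.type, tok.value) for tok in lexer])
# [('INTEGER', 1), ('PLUS_OP', '+'), ('INTEGER', 2)]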

Key Concepts

ply builds the lexer from module-level names: the tokens list declares the token types, t_<TOKEN> strings and functions supply the patterns, t_ignore names characters to skip, and t_error handles text that matches nothing. Function rules are tried in definition order and before string rules, while string rules are tried longest pattern first, which is what lets '//' and '..' win over '/' and '.'. The NCName pattern is built from the XML NameStartChar and NameChar productions with ':' excluded.

Example Use Case

This code would be used as a foundation for building a parser that can understand XPath expressions. The parser would use the definitions in this code to recognize the tokens in the XPath expression and construct an abstract syntax tree (AST) that can be used for further processing.
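
As an illustration only, here is a heavily simplified ply.yacc fragment in that spirit: a toy grammar that recognizes nothing more than absolute paths of NCName steps such as "/html/body/div" and folds them into a small AST tuple. The module name xpath_lexer_rules is hypothetical, this is not the original parser, and ply will warn about the many declared tokens the toy grammar never uses.

import ply.lex as lex
import ply.yacc as yacc

import xpath_lexer_rules              # hypothetical module holding the lexing rules above
from xpath_lexer_rules import tokens  # yacc looks for `tokens` in this namespace

def p_path_first(p):
    "path : PATH_SEP NCNAME"
    p[0] = ('path', [p[2]])

def p_path_step(p):
    "path : path PATH_SEP NCNAME"
    p[0] = ('path', p[1][1] + [p[3]])

def p_error(p):
    raise SyntaxError("Unexpected token: %r" % (p,))

lexer = lex.lex(module=xpath_lexer_rules)
parser = yacc.yacc()
print(parser.parse("/html/body/div", lexer=lexer))
# ('path', ['html', 'body', 'div'])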