CMPSC 461 Project 1: Recursive Descent Parser

What This Project Does

A complete implementation of a recursive descent parser for a simple imperative programming language. The project demonstrates core compiler construction concepts: lexical analysis, recursive descent parsing with operator precedence, and abstract syntax tree (AST) construction.

The language supports variable assignment, if/else conditionals, for loops with a to bound, print statements, arithmetic expressions (+, -, *, /, %), comparisons (==, !=, <, >), and the logical operators and, or, and not.

Key Information

Field            Value
Course           CMPSC 461 - Programming Language Concepts
Semester         Fall 2024
Institution      Penn State
Language         Python 3.x
Test Coverage    7/7 test cases (100%)
Lines of Code    ~500

Project Structure

CMPSC_461_Project1/
├── Parser.py              # Main lexer and parser implementation
├── ASTNodeDefs.py         # AST node class definitions
├── checker.py             # Test cases with expected outputs
├── verify.py              # Test runner and validation
├── grammar.txt            # Formal grammar specification
├── tokens.txt             # Token definitions
└── Parser_Quiz_Responses.md  # Technical Q&A documentation

Core Components

1. Lexer (Tokenizer)

Location: Parser.py (lines 9-83)

Purpose: Converts raw source code into a stream of tokens

Implementation Details:

Token Types (28 total):

- Keywords: IF, ELSE, FOR, TO, PRINT, AND, OR, NOT
- Operators: PLUS, MINUS, MULTIPLY, DIVIDE, MODULO
- Comparisons: EQ, NEQ, GREATER, LESS
- Delimiters: LPAREN, RPAREN, COMMA, COLON
- Literals: NUMBER, IDENTIFIER
- Special: EOF, EQUALS

Key Feature - Keyword vs Identifier Disambiguation:

# Keywords listed BEFORE identifier pattern
('IF',         r'if'),      # Matches first
('IDENTIFIER', r'[A-Za-z_][A-Za-z0-9_]*'),  # Falls through
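
The full token table in Parser.py follows this pattern with a single compiled regex built from named groups (see the Python-specific techniques section below). The sketch here is a minimal, self-contained stand-in with an abbreviated token list; the helper names are illustrative, not the project's exact code.

import re

# Abbreviated token table (the real Parser.py defines the full set of types).
# Keyword patterns appear before IDENTIFIER so they win the alternation.
TOKEN_SPEC = [
    ('IF',         r'if'),
    ('PRINT',      r'print'),
    ('NUMBER',     r'\d+'),
    ('IDENTIFIER', r'[A-Za-z_][A-Za-z0-9_]*'),
    ('GREATER',    r'>'),
    ('EQUALS',     r'='),
    ('LPAREN',     r'\('),
    ('RPAREN',     r'\)'),
    ('COLON',      r':'),
    ('SKIP',       r'\s+'),        # whitespace, discarded
]

# One compiled regex with a named group per token type.
MASTER_RE = re.compile('|'.join(f'(?P<{name}>{pattern})' for name, pattern in TOKEN_SPEC))

def tokenize(code):
    tokens = []
    for match in MASTER_RE.finditer(code):
        kind, value = match.lastgroup, match.group()
        if kind == 'SKIP':
            continue               # drop whitespace
        if kind == 'NUMBER':
            value = int(value)
        tokens.append((kind, value))
    tokens.append(('EOF', None))
    return tokens

print(tokenize('if x > 10: print(x)'))
# [('IF', 'if'), ('IDENTIFIER', 'x'), ('GREATER', '>'), ('NUMBER', 10),
#  ('COLON', ':'), ('PRINT', 'print'), ('LPAREN', '('), ('IDENTIFIER', 'x'),
#  ('RPAREN', ')'), ('EOF', None)]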

2. Parser (Syntax Analyzer)

Location: Parser.py (lines 85-418)

Architecture: Recursive Descent with Operator Precedence Climbing

Parsing Functions (hierarchical by precedence):

parse()                    # Entry point: program → statements*
parse_statement()          # Statement dispatcher
├── Assignment             # x = expression
├── parse_if_stmt()       # if-else
├── parse_for_stmt()      # for loops
└── parse_print_stmt()    # print statements

Boolean Expression Hierarchy:
parse_boolean_expression() # OR (lowest precedence)
parse_boolean_term()       # AND
parse_boolean_factor()     # NOT
parse_comparison()         # ==, !=, <, >

Arithmetic Expression Hierarchy:
parse_expression()         # +, - (lowest precedence)
parse_term()              # *, /, %
parse_factor()            # unary +, -
parse_primary()           # literals, identifiers, (expr)

Operator Precedence (high to low):

  1. Parentheses ()
  2. Unary arithmetic operators (-, +)
  3. Multiplicative (*, /, %)
  4. Additive (+, -)
  5. Comparison (==, !=, <, >)
  6. Logical NOT (not)
  7. Logical AND (and)
  8. Logical OR (or)
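
For example, this precedence groups the (hypothetical) condition not x > 5 and y < 3 as

  (not (x > 5)) and (y < 3)

because the comparison binds tighter than 'not', which in turn binds tighter than 'and', exactly as the boolean_factor and boolean_term rules in the grammar below dictate.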

3. Abstract Syntax Tree (AST)

Location: ASTNodeDefs.py

Node Types:

ASTNode                    # Base class
├── Assignment             # x = expr
├── BinaryOperation        # left op right
├── LogicalOperation       # left AND/OR right
├── UnaryOperation         # op operand
├── PrintStatement         # print(args)
├── IfStatement           # if cond: then [else: else]
├── ForStatement          # for var = start to end: body
└── Block                 # statement sequence

AST Example:

# Source: x = 10 + 5
Assignment(
    identifier=('IDENTIFIER', 'x'),
    expression=BinaryOperation(
        left=('NUMBER', 10),
        operator=('PLUS', '+'),
        right=('NUMBER', 5)
    )
)
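
The node classes themselves are straightforward. Below is a trimmed sketch of what two of them can look like; class names, constructor arguments, and to_string() follow this README's description of ASTNodeDefs.py, but the bodies are simplified stand-ins rather than the project's exact code.

class ASTNode:
    """Base class shared by all node types."""
    pass

class BinaryOperation(ASTNode):
    def __init__(self, left, operator, right):
        self.left, self.operator, self.right = left, operator, right

    def to_string(self):
        return f"BinaryOperation({self.left}, {self.operator}, {self.right})"

class Assignment(ASTNode):
    def __init__(self, identifier, expression):
        self.identifier, self.expression = identifier, expression

    def to_string(self):
        return f"Assignment({self.identifier}, {self.expression.to_string()})"

# Rebuilding the x = 10 + 5 example by hand:
node = Assignment(identifier=('IDENTIFIER', 'x'),
                  expression=BinaryOperation(left=('NUMBER', 10),
                                             operator=('PLUS', '+'),
                                             right=('NUMBER', 5)))
print(node.to_string())
# Assignment(('IDENTIFIER', 'x'), BinaryOperation(('NUMBER', 10), ('PLUS', '+'), ('NUMBER', 5)))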

Language Grammar

Formal Specification (from grammar.txt):

program            ::= statement*
statement          ::= assign_stmt | if_stmt | for_stmt | print_stmt
assign_stmt        ::= IDENTIFIER '=' expression
if_stmt            ::= 'if' boolean_expression ':' block ('else' ':' block)?
for_stmt           ::= 'for' IDENTIFIER '=' expression 'to' expression ':' block
print_stmt         ::= 'print' '(' arg_list? ')'
block              ::= statement*
arg_list           ::= expression (',' expression)*

boolean_expression ::= boolean_term ('or' boolean_term)*
boolean_term       ::= boolean_factor ('and' boolean_factor)*
boolean_factor     ::= 'not' boolean_factor | comparison
comparison         ::= expression (('==' | '!=' | '<' | '>') expression)*

expression         ::= term (('+' | '-') term)*
term               ::= factor (('*' | '/' | '%') factor)*
factor             ::= ('+' | '-') factor | primary
primary            ::= NUMBER | IDENTIFIER | '(' expression ')'
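
Each rule above translates almost line-for-line into a parsing function. The sketch below shows the idea for print_stmt using a deliberately stripped-down parser; the expect/current_token/advance helpers, the stubbed parse_expression, and the tuple standing in for a PrintStatement node are illustrative simplifications, not the project's actual code.

class MiniParser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def current_token(self):
        return self.tokens[self.pos]

    def advance(self):
        self.pos += 1

    def expect(self, kind):
        tok = self.current_token()
        if tok[0] != kind:
            raise SyntaxError(f"expected {kind}, got {tok}")
        self.advance()
        return tok

    # print_stmt ::= 'print' '(' arg_list? ')'
    # arg_list   ::= expression (',' expression)*
    def parse_print_stmt(self):
        self.expect('PRINT')
        self.expect('LPAREN')
        args = []
        if self.current_token()[0] != 'RPAREN':          # arg_list is optional
            args.append(self.parse_expression())
            while self.current_token()[0] == 'COMMA':    # (',' expression)*
                self.advance()
                args.append(self.parse_expression())
        self.expect('RPAREN')
        return ('PrintStatement', args)                  # tuple stands in for the node

    def parse_expression(self):                          # stub: consume one primary
        tok = self.current_token()
        self.advance()
        return tok

tokens = [('PRINT', 'print'), ('LPAREN', '('), ('IDENTIFIER', 'x'),
          ('COMMA', ','), ('NUMBER', 2), ('RPAREN', ')'), ('EOF', None)]
print(MiniParser(tokens).parse_print_stmt())
# ('PrintStatement', [('IDENTIFIER', 'x'), ('NUMBER', 2)])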

Implementation Techniques

1. Operator Precedence

Method: Function hierarchy where lower precedence = higher level function

Example: 3 + 4 * 5

parse_expression()  # handles +, calls parse_term() for right side
  ├── 3
  └── parse_term()  # handles *, returns (4*5) first
        └── (4 * 5)
Result: 3 + (4 * 5)  # Multiplication grouped first
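
A self-contained sketch of those two levels is shown below, using plain tuples in place of BinaryOperation nodes and a simple cursor over a hard-coded token list; the function names mirror the hierarchy above, everything else is simplified.

tokens = [('NUMBER', 3), ('PLUS', '+'), ('NUMBER', 4),
          ('MULTIPLY', '*'), ('NUMBER', 5), ('EOF', None)]
pos = 0                                    # cursor into the token list

def peek():
    return tokens[pos]

def advance():
    global pos
    tok = tokens[pos]
    pos += 1
    return tok

def parse_primary():                       # NUMBER only, for brevity
    return advance()

def parse_term():                          # handles *, /, % (factor collapsed into primary here)
    left = parse_primary()
    while peek()[0] in ('MULTIPLY', 'DIVIDE', 'MODULO'):
        op = advance()
        left = (left, op, parse_primary())
    return left

def parse_expression():                    # handles +, -
    left = parse_term()
    while peek()[0] in ('PLUS', 'MINUS'):
        op = advance()
        left = (left, op, parse_term())
    return left

print(parse_expression())
# (('NUMBER', 3), ('PLUS', '+'), (('NUMBER', 4), ('MULTIPLY', '*'), ('NUMBER', 5)))
# i.e. 3 + (4 * 5): the deeper function builds the multiplication first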

2. Left-Associativity

Method: Iteration with while loops (not recursion)

Example: 10 - 5 - 2

left = parse_term()                       # get 10
while current_token()[0] in ('PLUS', 'MINUS'):
    op = current_token()
    advance()                             # consume the operator
    right = parse_term()                  # get 5
    left = BinaryOperation(left, op, right)   # left = (10 - 5)
    # loop continues, builds ((10 - 5) - 2)

3. Unary vs Binary Operator Disambiguation

Context-based determination:

Example: a - -b

parse_expression():   # sees 'a', then '-' → BINARY
    parse_factor():   # sees '-' before 'b' → UNARY
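
A standalone sketch of that decision follows, with the term level collapsed for brevity and tuples standing in for AST nodes; the names follow the README's conventions, the bodies are simplified.

tokens = [('IDENTIFIER', 'a'), ('MINUS', '-'), ('MINUS', '-'),
          ('IDENTIFIER', 'b'), ('EOF', None)]
pos = 0

def peek():
    return tokens[pos]

def advance():
    global pos
    tok = tokens[pos]
    pos += 1
    return tok

def parse_factor():
    # A '-' seen where an operand is expected is UNARY:
    # factor ::= ('+' | '-') factor | primary
    if peek()[0] in ('PLUS', 'MINUS'):
        op = advance()
        return ('unary', op, parse_factor())
    return advance()                       # primary: NUMBER or IDENTIFIER

def parse_expression():
    # A '-' seen after a completed operand is BINARY.
    left = parse_factor()
    while peek()[0] in ('PLUS', 'MINUS'):
        op = advance()
        left = ('binary', left, op, parse_factor())
    return left

print(parse_expression())
# ('binary', ('IDENTIFIER', 'a'), ('MINUS', '-'),
#  ('unary', ('MINUS', '-'), ('IDENTIFIER', 'b')))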

4. Optional Grammar Elements

Method: Lookahead without consuming

Example: Optional else block

else_block = None  # default
if current_token()[0] == 'ELSE':
    advance()      # only consume if present
    else_block = parse_block()

5. Infinite Recursion Prevention

Guarantees:

  1. Every recursive call consumes at least one token (advance())
  2. Token list is finite (ends with EOF)
  3. Base cases (numbers/identifiers) don't recurse
  4. Each call processes fewer remaining tokens → must terminate

Test Suite

Coverage: 7 comprehensive test cases (100% pass rate)

Test Cases

  1. Simple if-else - Basic conditional logic
  2. For loop with if - Nested control flow, modulo operator
  3. Logical AND - Multiple conditions
  4. Logical OR - Alternative conditions with unary minus
  5. Logical NOT - Negation operator
  6. Nested if-else - Multi-level conditionals
  7. Complex logical expressions - Parentheses, mixed operators

Test Example

Input:

for i = 1 to 10:
  if i % 2 == 0:
    print(i)

Expected AST:

ForStatement(
    iterator=('IDENTIFIER', 'i'),
    start=('NUMBER', 1),
    end=('NUMBER', 10),
    body=Block([
        IfStatement(
            condition=BinaryOperation(
                BinaryOperation(
                    ('IDENTIFIER', 'i'),
                    ('MODULO', '%'),
                    ('NUMBER', 2)
                ),
                ('EQ', '=='),
                ('NUMBER', 0)
            ),
            then=Block([PrintStatement([('IDENTIFIER', 'i')])]),
            else_block=None
        )
    ])
)

Technical Insights

Strengths

  1. Clean Separation of Concerns

    • Lexer, Parser, and AST are independent modules
    • Easy to extend or modify individual components
  2. Proper Precedence Handling

    • No ambiguity in expression parsing
    • Mathematical and logical operators correctly prioritized
  3. Robust Token Management

    • Position tracking enables precise error reporting
    • Lookahead support for optional elements
  4. Type-Safe AST

    • Structured node classes enable type checking
    • Easy to traverse for interpretation/compilation

Limitations & Improvement Opportunities

From Parser_Quiz_Responses.md:

  1. Error Handling

    • Current: Panic mode (immediate termination on first error)
    • Better: Error accumulation with synchronization points
    • Would allow reporting multiple syntax errors at once
  2. Token Metadata

    • Current: Simple (type, value) tuples
    • Better: Include line/column numbers for debugging
    • Example: ('NUMBER', 10, line=3, col=5)
  3. AST Data Structure

    • Current: Stores token tuples ('IDENTIFIER', 'x')
    • Better: Store values directly 'x'
    • Cleaner for downstream processing
  4. Extensibility

    • Current: AST traversal requires manual node type checking
    • Better: Visitor pattern for operations (type checking, optimization, code generation); a rough sketch follows this list
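
As a rough sketch of that last point, a visitor base class can dispatch on the node's class name once, so each new pass only adds visit_* methods instead of repeating manual type checks. The code below is hypothetical and not part of the project; it imports the node classes from ASTNodeDefs.py and assumes their attribute names match the constructor arguments shown earlier.

from ASTNodeDefs import Assignment, BinaryOperation

class NodeVisitor:
    """Dispatch visit(node) to visit_<ClassName>(node)."""
    def visit(self, node):
        method = getattr(self, 'visit_' + type(node).__name__, self.generic_visit)
        return method(node)

    def generic_visit(self, node):
        raise NotImplementedError(f"no visitor for {type(node).__name__}")

class ConstantFolder(NodeVisitor):
    """Example pass: fold additions of two NUMBER literals."""
    def visit_Assignment(self, node):
        node.expression = self.visit(node.expression)
        return node

    def visit_BinaryOperation(self, node):
        # Operands here are bare (type, value) token tuples, as in the AST example above.
        left, op, right = node.left, node.operator, node.right
        if op[0] == 'PLUS' and left[0] == 'NUMBER' and right[0] == 'NUMBER':
            return ('NUMBER', left[1] + right[1])      # replace the subtree with a literal
        return node

# Folding the x = 10 + 5 example into x = 15 (assuming matching attribute names):
ast = Assignment(identifier=('IDENTIFIER', 'x'),
                 expression=BinaryOperation(left=('NUMBER', 10),
                                            operator=('PLUS', '+'),
                                            right=('NUMBER', 5)))
print(ConstantFolder().visit(ast).expression)          # expected: ('NUMBER', 15)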

Key Learning Outcomes

Compiler Design Concepts

  1. Two-Phase Design

    • Lexical analysis (tokens) kept separate from syntax analysis (structure)
    • Enables independent testing and optimization
  2. Recursive Descent Parsing

    • Each grammar rule maps to a function
    • Natural implementation of context-free grammars
    • Easy to understand and debug
  3. Precedence Climbing

    • Function call depth determines operator precedence
    • Lower precedence = higher in call stack
    • Multiplicative operators deeper than additive
  4. Context-Free Grammar Implementation

    • BNF grammar directly translates to code structure
    • Demonstrates formal language theory in practice

Python-Specific Techniques

  1. Regex for Tokenization

    • Named groups for token classification
    • Single compiled regex for efficiency
  2. Type Hints

    • Union[ASTNode, Tuple[str, Any]] for mixed types
    • Better IDE support and documentation (see the sketch after this list)
  3. Object-Oriented AST

    • Class hierarchy mirrors grammar structure
    • to_string() methods for debugging
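
A small illustrative helper shows how the Union type mentioned above and the to_string() convention read together in practice; the describe function is hypothetical, not part of the project.

from typing import Any, Tuple, Union

from ASTNodeDefs import ASTNode

# Expressions in this parser are either full AST nodes or bare (type, value) tuples.
Expr = Union[ASTNode, Tuple[str, Any]]

def describe(expr: Expr) -> str:
    """Render either form for debugging output."""
    if isinstance(expr, ASTNode):
        return expr.to_string()            # node classes provide to_string()
    kind, value = expr                     # plain token tuple
    return f"{kind}({value!r})"

print(describe(('NUMBER', 10)))            # NUMBER(10)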

Usage

Running the Parser

# Run all tests
python verify.py

# Use programmatically
from Parser import Lexer, Parser

code = """
x = 10
if x > 5:
    print(x)
"""

lexer = Lexer(code)
tokens = lexer.tokenize()
parser = Parser(tokens)
ast = parser.parse()

for node in ast:
    print(node.to_string())

Example Programs

Factorial-style loop:

result = 1
for i = 1 to 5:
    result = result * i
print(result)

Complex conditionals:

x = 15
if (x > 10 and x < 20) or x == 30:
    print(1)
else:
    print(0)

Technologies

Technology        Purpose
Python 3.x        Implementation language
re module         Regular expressions for lexing
Type hints        Static typing support
unittest-style    Test harness (verify.py)

Performance Characteristics

Both phases make a single left-to-right pass over the input: the lexer applies one compiled regular expression to the source, and the parser consumes each token exactly once with single-token lookahead, so lexing and parsing scale roughly linearly with program length.

Future Extensions

Potential enhancements mentioned in quiz responses:

  1. Additional Language Features

    • While loops
    • Functions/procedures
    • Arrays and indexing
    • String literals
  2. Advanced Error Recovery

    • Error synchronization at statement boundaries
    • Continue parsing after errors
    • Accumulate all errors before stopping
  3. Semantic Analysis

    • Type checking visitor
    • Variable scope analysis
    • Constant folding optimization
  4. Code Generation

    • Compile to bytecode
    • Interpreter implementation
    • LLVM IR generation

Academic Context

This project demonstrates fundamental compiler construction concepts taught in CMPSC 461 (Programming Language Concepts): lexical analysis, recursive descent parsing, operator precedence handling, and abstract syntax tree construction.

The implementation passes the full test suite and is clearly documented, serving as both a learning exercise and a reference implementation.


Status: Complete (100% test coverage, all 7 test cases passing)
Grade Outcome: Project successfully completed with comprehensive implementation