CMPSC 461 Project 1: Recursive Descent Parser
What This Project Does
A complete implementation of a recursive descent parser for a simple imperative programming language. The project demonstrates core compiler construction concepts:
- Lexical Analysis (Tokenization) - Breaking source code into meaningful tokens
- Syntax Analysis (Parsing) - Building an Abstract Syntax Tree (AST) from tokens
- Grammar Implementation - Enforcing language rules through recursive functions
The language supports:
- Variable assignments (`x = 10`)
- Conditional statements (`if`-`else`)
- For loops (`for i = 1 to 10`)
- Print statements (`print(x, y)`)
- Arithmetic expressions (`+`, `-`, `*`, `/`, `%`)
- Logical operations (`and`, `or`, `not`)
- Comparison operators (`==`, `!=`, `<`, `>`)
Key Information
| Field | Value |
|---|---|
| Course | CMPSC 461 - Programming Language Concepts |
| Semester | Fall 2024 |
| Institution | Penn State |
| Language | Python 3.x |
| Test Coverage | 7/7 test cases (100%) |
| Lines of Code | ~500 lines |
Project Structure
CMPSC_461_Project1/
├── Parser.py # Main lexer and parser implementation
├── ASTNodeDefs.py # AST node class definitions
├── checker.py # Test cases with expected outputs
├── verify.py # Test runner and validation
├── grammar.txt # Formal grammar specification
├── tokens.txt # Token definitions
└── Parser_Quiz_Responses.md # Technical Q&A documentation
Core Components
1. Lexer (Tokenizer)
Location: Parser.py (lines 9-83)
Purpose: Converts raw source code into a stream of tokens
Implementation Details:
- Uses Python regex for pattern matching
- Token priority determined by ordering in `token_specs`
- Keywords matched before identifiers to prevent misclassification
- Automatic whitespace handling
- Number conversion to integers
Token Types (28 total):
- Keywords: IF, ELSE, FOR, TO, PRINT, AND, OR, NOT
- Operators: PLUS, MINUS, MULTIPLY, DIVIDE, MODULO
- Comparisons: EQ, NEQ, GREATER, LESS
- Delimiters: LPAREN, RPAREN, COMMA, COLON
- Literals: NUMBER, IDENTIFIER
- Special: EOF, EQUALS
Key Feature - Keyword vs Identifier Disambiguation:
# Keywords listed BEFORE identifier pattern
('IF', r'if'), # Matches first
('IDENTIFIER', r'[A-Za-z_][A-Za-z0-9_]*'), # Falls through
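The snippet below is a self-contained sketch of this single-regex approach, using the token names from this README; the exact patterns and structure in Parser.py may differ (for example, the `\b` word boundaries here are just one way to keep keywords from swallowing the start of identifiers such as `iffy`).

```python
import re

# Sketch of a single compiled regex built from ordered token specs.
# Keywords are listed before IDENTIFIER, and '==' before '=', so ordering
# resolves ambiguity exactly as described above.
token_specs = [
    ('IF',         r'if\b'),
    ('ELSE',       r'else\b'),
    ('FOR',        r'for\b'),
    ('TO',         r'to\b'),
    ('PRINT',      r'print\b'),
    ('AND',        r'and\b'),
    ('OR',         r'or\b'),
    ('NOT',        r'not\b'),
    ('NUMBER',     r'\d+'),
    ('IDENTIFIER', r'[A-Za-z_][A-Za-z0-9_]*'),   # listed after the keywords on purpose
    ('EQ',         r'=='),
    ('NEQ',        r'!='),
    ('EQUALS',     r'='),                        # after '==' so assignment doesn't shadow it
    ('LESS',       r'<'),
    ('GREATER',    r'>'),
    ('PLUS',       r'\+'),
    ('MINUS',      r'-'),
    ('MULTIPLY',   r'\*'),
    ('DIVIDE',     r'/'),
    ('MODULO',     r'%'),
    ('LPAREN',     r'\('),
    ('RPAREN',     r'\)'),
    ('COMMA',      r','),
    ('COLON',      r':'),
    ('SKIP',       r'\s+'),                      # whitespace handled automatically
    ('MISMATCH',   r'.'),                        # anything else is a lexing error
]

master_pattern = re.compile('|'.join(f'(?P<{name}>{regex})' for name, regex in token_specs))

def tokenize(code):
    tokens = []
    for match in master_pattern.finditer(code):
        kind, text = match.lastgroup, match.group()
        if kind == 'SKIP':
            continue
        if kind == 'MISMATCH':
            raise SyntaxError(f'Unexpected character {text!r}')
        value = int(text) if kind == 'NUMBER' else text   # numbers converted to integers
        tokens.append((kind, value))
    tokens.append(('EOF', None))
    return tokens
```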
2. Parser (Syntax Analyzer)
Location: Parser.py (lines 85-418)
Architecture: Recursive Descent with Operator Precedence Climbing
Parsing Functions (hierarchical by precedence):
parse() # Entry point: program → statements*
parse_statement() # Statement dispatcher
├── Assignment # x = expression
├── parse_if_stmt() # if-else
├── parse_for_stmt() # for loops
└── parse_print_stmt() # print statements
Boolean Expression Hierarchy:
parse_boolean_expression() # OR (lowest precedence)
parse_boolean_term() # AND
parse_boolean_factor() # NOT
parse_comparison() # ==, !=, <, >
Arithmetic Expression Hierarchy:
parse_expression() # +, - (lowest precedence)
parse_term() # *, /, %
parse_factor() # unary +, -
parse_primary() # literals, identifiers, (expr)
Operator Precedence (high to low):
- Parentheses
() - Unary operators (
-,+,not) - Multiplicative (
*,/,%) - Additive (
+,-) - Comparison (
==,!=,<,>) - Logical AND (
and) - Logical OR (
or)
3. Abstract Syntax Tree (AST)
Location: ASTNodeDefs.py
Node Types:
ASTNode # Base class
├── Assignment # x = expr
├── BinaryOperation # left op right
├── LogicalOperation # left AND/OR right
├── UnaryOperation # op operand
├── PrintStatement # print(args)
├── IfStatement # if cond: then [else: else]
├── ForStatement # for var = start to end: body
└── Block # statement sequence
AST Example:
```
# Source: x = 10 + 5
Assignment(
    identifier=('IDENTIFIER', 'x'),
    expression=BinaryOperation(
        left=('NUMBER', 10),
        operator=('PLUS', '+'),
        right=('NUMBER', 5)
    )
)
```
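For illustration, here is a minimal sketch of what two of these ASTNodeDefs.py-style classes could look like; the real module covers every node type listed above, and its exact fields and methods may differ.

```python
class ASTNode:
    """Base class for every node in the tree."""
    def to_string(self):
        raise NotImplementedError

def _fmt(child):
    # Children are either nested ASTNodes or raw (type, value) token tuples.
    return child.to_string() if isinstance(child, ASTNode) else repr(child)

class BinaryOperation(ASTNode):
    def __init__(self, left, operator, right):
        self.left, self.operator, self.right = left, operator, right

    def to_string(self):
        return f"BinaryOperation({_fmt(self.left)}, {self.operator}, {_fmt(self.right)})"

class Assignment(ASTNode):
    def __init__(self, identifier, expression):
        self.identifier = identifier      # e.g. ('IDENTIFIER', 'x')
        self.expression = expression      # ASTNode or token tuple

    def to_string(self):
        return f"Assignment({self.identifier}, {_fmt(self.expression)})"
```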
Language Grammar
Formal Specification (from grammar.txt):
program ::= statement*
statement ::= assign_stmt | if_stmt | for_stmt | print_stmt
assign_stmt ::= IDENTIFIER '=' expression
if_stmt ::= 'if' boolean_expression ':' block ('else' ':' block)?
for_stmt ::= 'for' IDENTIFIER '=' expression 'to' expression ':' block
print_stmt ::= 'print' '(' arg_list? ')'
block ::= statement*
arg_list ::= expression (',' expression)*
boolean_expression ::= boolean_term ('or' boolean_term)*
boolean_term ::= boolean_factor ('and' boolean_factor)*
boolean_factor ::= 'not' boolean_factor | comparison
comparison ::= expression (('==' | '!=' | '<' | '>') expression)*
expression ::= term (('+' | '-') term)*
term ::= factor (('*' | '/' | '%') factor)*
factor ::= ('+' | '-') factor | primary
primary ::= NUMBER | IDENTIFIER | '(' expression ')'
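Each production translates almost mechanically into one parsing function. As a hedged sketch, written as Parser methods using the `current_token()`/`advance()` helpers named in this README (the real Parser.py may differ), here are the `expression` and `term` levels:

```python
# expression ::= term (('+' | '-') term)*      -- lower precedence, called first
def parse_expression(self):
    left = self.parse_term()
    while self.current_token()[0] in ('PLUS', 'MINUS'):
        op = self.current_token()
        self.advance()                          # consume the operator
        left = BinaryOperation(left, op, self.parse_term())
    return left

# term ::= factor (('*' | '/' | '%') factor)*  -- higher precedence, nested deeper
def parse_term(self):
    left = self.parse_factor()
    while self.current_token()[0] in ('MULTIPLY', 'DIVIDE', 'MODULO'):
        op = self.current_token()
        self.advance()
        left = BinaryOperation(left, op, self.parse_factor())
    return left
```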
Implementation Techniques
1. Operator Precedence
Method: Function hierarchy where lower precedence = higher level function
Example: 3 + 4 * 5
parse_expression()             # handles '+', calls parse_term() for each operand
├── 3
└── parse_term()               # handles '*', builds (4 * 5) first
    └── (4 * 5)

Result: 3 + (4 * 5)            # multiplication grouped first
2. Left-Associativity
Method: Iteration with while loops (not recursion)
Example: 10 - 5 - 2
```python
left = parse_term()                          # parses 10
while current_token()[0] in ('PLUS', 'MINUS'):
    op = current_token()
    advance()                                # consume the operator
    right = parse_term()                     # parses 5
    left = BinaryOperation(left, op, right)  # left = (10 - 5)
# Loop repeats, building ((10 - 5) - 2)
```
3. Unary vs Binary Operator Disambiguation
Context-based determination:
- Binary minus (`a - b`): encountered in `parse_expression()` after the left operand has been parsed
- Unary minus (`-x`): encountered in `parse_factor()` before any operand
Example: a - -b
parse_expression(): # sees 'a', then '-' → BINARY
parse_factor(): # sees '-' before 'b' → UNARY
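A sketch of the unary case, written in the same style (names follow this README and may not match Parser.py exactly):

```python
# factor ::= ('+' | '-') factor | primary
def parse_factor(self):
    token = self.current_token()
    if token[0] in ('PLUS', 'MINUS'):          # no left operand has been parsed yet → unary
        self.advance()
        operand = self.parse_factor()          # recursion allows chains like --x
        return UnaryOperation(token, operand)
    return self.parse_primary()                # NUMBER, IDENTIFIER, or '(' expression ')'
```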
4. Optional Grammar Elements
Method: Lookahead without consuming
Example: Optional else block
```python
else_block = None                      # default
if current_token()[0] == 'ELSE':
    advance()                          # only consume if present
    else_block = parse_block()
```
5. Infinite Recursion Prevention
Guarantees:
- Every recursive call consumes at least one token (via `advance()`)
- The token list is finite (ends with `EOF`)
- Base cases (numbers/identifiers) don't recurse
- Each call processes fewer remaining tokens → must terminate
Test Suite
Coverage: 7 comprehensive test cases (100% pass rate)
Test Cases
- Simple if-else - Basic conditional logic
- For loop with if - Nested control flow, modulo operator
- Logical AND - Multiple conditions
- Logical OR - Alternative conditions with unary minus
- Logical NOT - Negation operator
- Nested if-else - Multi-level conditionals
- Complex logical expressions - Parentheses, mixed operators
Test Example
Input:
```
for i = 1 to 10:
    if i % 2 == 0:
        print(i)
```
Expected AST:
```
ForStatement(
    iterator=('IDENTIFIER', 'i'),
    start=('NUMBER', 1),
    end=('NUMBER', 10),
    body=Block([
        IfStatement(
            condition=BinaryOperation(
                BinaryOperation(
                    ('IDENTIFIER', 'i'),
                    ('MODULO', '%'),
                    ('NUMBER', 2)
                ),
                ('EQ', '=='),
                ('NUMBER', 0)
            ),
            then=Block([PrintStatement([('IDENTIFIER', 'i')])]),
            else_block=None
        )
    ])
)
```
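The exact mechanics of verify.py are not reproduced in this README; a minimal way to recreate one such check, assuming the public API shown in the Usage section below, would be:

```python
# Hypothetical re-creation of a single check; the real verify.py/checker.py pairing
# may compare results differently.
from Parser import Lexer, Parser

def run_case(source, expected):
    """Parse `source` and compare the printed AST against the expected string."""
    tokens = Lexer(source).tokenize()
    ast = Parser(tokens).parse()
    actual = "\n".join(node.to_string() for node in ast)
    return "PASS" if actual.strip() == expected.strip() else "FAIL"
```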
Technical Insights
Strengths
- Clean Separation of Concerns
  - Lexer, Parser, and AST are independent modules
  - Easy to extend or modify individual components
- Proper Precedence Handling
  - No ambiguity in expression parsing
  - Mathematical and logical operators correctly prioritized
- Robust Token Management
  - Position tracking enables precise error reporting
  - Lookahead support for optional elements
- Type-Safe AST
  - Structured node classes enable type checking
  - Easy to traverse for interpretation/compilation
Limitations & Improvement Opportunities
From `Parser_Quiz_Responses.md`:
- Error Handling
  - Current: panic mode (immediate termination on the first error)
  - Better: error accumulation with synchronization points
  - Would allow reporting multiple syntax errors at once
- Token Metadata (illustrated after this list)
  - Current: simple `(type, value)` tuples
  - Better: include line/column numbers for debugging
  - Example: `('NUMBER', 10, line=3, col=5)`
- AST Data Structure
  - Current: stores token tuples, e.g. `('IDENTIFIER', 'x')`
  - Better: store values directly, e.g. `'x'`, which is cleaner for downstream processing
- Extensibility (see the visitor sketch after this list)
  - Current: AST traversal requires manual node type checking
  - Better: visitor pattern for operations (type checking, optimization, code generation)
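To illustrate the Token Metadata suggestion above, one possible richer token shape (not part of the current implementation) is a NamedTuple carrying position info:

```python
from typing import Any, NamedTuple

class Token(NamedTuple):
    type: str
    value: Any
    line: int
    col: int

# A position-tracking lexer could then emit tokens like:
tok = Token('NUMBER', 10, line=3, col=5)
# ...and the parser could report: "unexpected NUMBER at line 3, column 5"
```

And here is a minimal sketch of the visitor pattern suggested under Extensibility (field names such as `node.identifier` follow the AST examples above and may not match ASTNodeDefs.py exactly):

```python
class ASTVisitor:
    """Dispatch on the node's class name: one visit_* method per node type."""
    def visit(self, node):
        method = getattr(self, f'visit_{type(node).__name__}', self.generic_visit)
        return method(node)

    def generic_visit(self, node):
        raise NotImplementedError(f'No visit_{type(node).__name__} method')

class AssignmentLister(ASTVisitor):
    """Example pass: report every variable that gets assigned."""
    def visit_Assignment(self, node):
        print('assigns:', node.identifier)   # e.g. ('IDENTIFIER', 'x')
```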
Key Learning Outcomes
Compiler Design Concepts
- Two-Phase Design
  - Lexical analysis (tokenization) kept separate from syntax analysis (tree construction)
  - Enables independent testing and optimization
- Recursive Descent Parsing
  - Each grammar rule maps to a function
  - Natural implementation of context-free grammars
  - Easy to understand and debug
- Precedence Climbing
  - Function call depth determines operator precedence
  - Lower precedence = higher in the call stack
  - Multiplicative operators parsed deeper than additive ones
- Context-Free Grammar Implementation
  - BNF grammar translates directly to code structure
  - Demonstrates formal language theory in practice
Python-Specific Techniques
- Regex for Tokenization
  - Named groups for token classification
  - Single compiled regex for efficiency
- Type Hints (see the sketch after this list)
  - `Union[ASTNode, Tuple[str, Any]]` for mixed node/token return types
  - Better IDE support and documentation
- Object-Oriented AST
  - Class hierarchy mirrors grammar structure
  - `to_string()` methods for debugging
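A small, hypothetical example of the type-hint style described above (the import path is assumed from the project structure):

```python
from typing import Any, Tuple, Union
from ASTNodeDefs import ASTNode   # assumed import, based on the project structure

# A parse result is either a full AST node or, for leaf values, a raw (type, value) token tuple.
ParseResult = Union[ASTNode, Tuple[str, Any]]

def describe(result: ParseResult) -> str:
    return result.to_string() if isinstance(result, ASTNode) else f"token {result}"
```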
Usage
Running the Parser
```bash
# Run all tests
python verify.py
```

```python
# Use programmatically
from Parser import Lexer, Parser

code = """
x = 10
if x > 5:
    print(x)
"""

lexer = Lexer(code)
tokens = lexer.tokenize()
parser = Parser(tokens)
ast = parser.parse()

for node in ast:
    print(node.to_string())
```
Example Programs
Factorial-style loop:
```
result = 1
for i = 1 to 5:
    result = result * i
print(result)
```
Complex conditionals:
```
x = 15
if (x > 10 and x < 20) or x == 30:
    print(1)
else:
    print(0)
```
Related Documentation
- `Parser_Quiz_Responses.md` - Deep-dive Q&A on implementation details
- `grammar.txt` - Formal BNF grammar specification
- `tokens.txt` - Complete token reference
Technologies
| Technology | Purpose |
|---|---|
| Python 3.x | Implementation language |
| `re` module | Regular expressions for lexing |
| Type hints | Static typing support |
| unittest-style | Test harness (verify.py) |
Performance Characteristics
- Time Complexity: O(n), where n = number of tokens
  - Single pass through the token stream
  - Each token is consumed exactly once
- Space Complexity: O(d), where d = maximum nesting depth
  - Recursive call stack depth
  - AST tree height
Future Extensions
Potential enhancements mentioned in quiz responses:
- Additional Language Features
  - While loops
  - Functions/procedures
  - Arrays and indexing
  - String literals
- Advanced Error Recovery (see the sketch after this list)
  - Error synchronization at statement boundaries
  - Continue parsing after errors
  - Accumulate all errors before stopping
- Semantic Analysis
  - Type checking visitor
  - Variable scope analysis
  - Constant folding optimization
- Code Generation
  - Compile to bytecode
  - Interpreter implementation
  - LLVM IR generation
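As a rough illustration of the Advanced Error Recovery idea above, a synchronization loop over statement boundaries might look like the following; nothing like this exists in the current parser, and it assumes parse errors surface as Python `SyntaxError` exceptions:

```python
# Hypothetical recovery loop, written as a Parser method in the README's style.
SYNC_TOKENS = {'IF', 'FOR', 'PRINT', 'IDENTIFIER', 'EOF'}

def parse_with_recovery(self):
    statements, errors = [], []
    while self.current_token()[0] != 'EOF':
        try:
            statements.append(self.parse_statement())
        except SyntaxError as err:
            errors.append(err)
            if self.current_token()[0] != 'EOF':
                self.advance()                        # skip the offending token
            while self.current_token()[0] not in SYNC_TOKENS:
                self.advance()                        # resync at the next statement start
    return statements, errors
```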
Academic Context
This project demonstrates mastery of fundamental compiler construction concepts taught in CMPSC 461 (Programming Language Concepts):
- Formal language theory
- Context-free grammars
- Lexical and syntax analysis
- Parse tree and AST construction
- Recursive descent parsing technique
The implementation has full test coverage and clear documentation, serving as both a learning exercise and a reference implementation.
Status: Complete (100% test coverage, all 7 test cases passing)
Grade Outcome: Project successfully completed with comprehensive implementation