CMPSC 461 Project 1: Recursive Descent Parser

What This Project Does

A complete implementation of a recursive descent parser for a simple imperative programming language. The project demonstrates core compiler construction concepts: lexical analysis, recursive descent parsing with operator precedence, and abstract syntax tree (AST) construction.

The language supports variable assignment, if/else conditionals, for loops with a to bound, print statements, arithmetic expressions (+, -, *, /, %), comparisons (==, !=, <, >), and the logical operators and, or, and not.

Key Information

Field            Value
Course           CMPSC 461 - Programming Language Concepts
Semester         Fall 2024
Institution      Penn State
Language         Python 3.x
Test Coverage    7/7 test cases (100%)
Lines of Code    ~500

Project Structure

CMPSC_461_Project1/
├── Parser.py              # Main lexer and parser implementation
├── ASTNodeDefs.py         # AST node class definitions
├── checker.py             # Test cases with expected outputs
├── verify.py              # Test runner and validation
├── grammar.txt            # Formal grammar specification
├── tokens.txt             # Token definitions
└── Parser_Quiz_Responses.md  # Technical Q&A documentation

Core Components

1. Lexer (Tokenizer)

Location: Parser.py (lines 9-83)

Purpose: Converts raw source code into a stream of tokens

Implementation Details:

Token Types (28 total):

- Keywords: IF, ELSE, FOR, TO, PRINT, AND, OR, NOT
- Operators: PLUS, MINUS, MULTIPLY, DIVIDE, MODULO
- Comparisons: EQ, NEQ, GREATER, LESS
- Delimiters: LPAREN, RPAREN, COMMA, COLON
- Literals: NUMBER, IDENTIFIER
- Special: EOF, EQUALS

Key Feature - Keyword vs Identifier Disambiguation:

# Keywords listed BEFORE identifier pattern
('IF',         r'if'),      # Matches first
('IDENTIFIER', r'[A-Za-z_][A-Za-z0-9_]*'),  # Falls through
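
The full token table in Parser.py follows this pattern with a single compiled regex built from named groups (see the Python-specific techniques section below). The sketch here is a minimal, self-contained stand-in with an abbreviated token list; the helper names are illustrative, not the project's exact code.

import re

# Abbreviated token table (the real Parser.py defines the full set of types).
# Keyword patterns appear before IDENTIFIER so they win the alternation.
TOKEN_SPEC = [
    ('IF',         r'if'),
    ('PRINT',      r'print'),
    ('NUMBER',     r'\d+'),
    ('IDENTIFIER', r'[A-Za-z_][A-Za-z0-9_]*'),
    ('GREATER',    r'>'),
    ('EQUALS',     r'='),
    ('LPAREN',     r'\('),
    ('RPAREN',     r'\)'),
    ('COLON',      r':'),
    ('SKIP',       r'\s+'),        # whitespace, discarded
]

# One compiled regex with a named group per token type.
MASTER_RE = re.compile('|'.join(f'(?P<{name}>{pattern})' for name, pattern in TOKEN_SPEC))

def tokenize(code):
    tokens = []
    for match in MASTER_RE.finditer(code):
        kind, value = match.lastgroup, match.group()
        if kind == 'SKIP':
            continue               # drop whitespace
        if kind == 'NUMBER':
            value = int(value)
        tokens.append((kind, value))
    tokens.append(('EOF', None))
    return tokens

print(tokenize('if x > 10: print(x)'))
# [('IF', 'if'), ('IDENTIFIER', 'x'), ('GREATER', '>'), ('NUMBER', 10),
#  ('COLON', ':'), ('PRINT', 'print'), ('LPAREN', '('), ('IDENTIFIER', 'x'),
#  ('RPAREN', ')'), ('EOF', None)]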

2. Parser (Syntax Analyzer)

Location: Parser.py (lines 85-418)

Architecture: Recursive Descent with Operator Precedence Climbing

Parsing Functions (hierarchical by precedence):

parse()                    # Entry point: program → statements*
parse_statement()          # Statement dispatcher
├── Assignment             # x = expression
├── parse_if_stmt()       # if-else
├── parse_for_stmt()      # for loops
└── parse_print_stmt()    # print statements

Boolean Expression Hierarchy:
parse_boolean_expression() # OR (lowest precedence)
parse_boolean_term()       # AND
parse_boolean_factor()     # NOT
parse_comparison()         # ==, !=, <, >

Arithmetic Expression Hierarchy:
parse_expression()         # +, - (lowest precedence)
parse_term()              # *, /, %
parse_factor()            # unary +, -
parse_primary()           # literals, identifiers, (expr)

Operator Precedence (high to low):

  1. Parentheses ()
  2. Unary arithmetic operators (-, +)
  3. Multiplicative (*, /, %)
  4. Additive (+, -)
  5. Comparison (==, !=, <, >)
  6. Logical NOT (not)
  7. Logical AND (and)
  8. Logical OR (or)
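
For example, this precedence groups the (hypothetical) condition not x > 5 and y < 3 as

  (not (x > 5)) and (y < 3)

because the comparison binds tighter than 'not', which in turn binds tighter than 'and', exactly as the boolean_factor and boolean_term rules in the grammar below dictate.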

3. Abstract Syntax Tree (AST)

Location: ASTNodeDefs.py

Node Types:

ASTNode                    # Base class
├── Assignment             # x = expr
├── BinaryOperation        # left op right
├── LogicalOperation       # left AND/OR right
├── UnaryOperation         # op operand
├── PrintStatement         # print(args)
├── IfStatement           # if cond: then [else: else]
├── ForStatement          # for var = start to end: body
└── Block                 # statement sequence

AST Example:

# Source: x = 10 + 5
Assignment(
    identifier=('IDENTIFIER', 'x'),
    expression=BinaryOperation(
        left=('NUMBER', 10),
        operator=('PLUS', '+'),
        right=('NUMBER', 5)
    )
)
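
The node classes themselves are straightforward. Below is a trimmed sketch of what two of them can look like; class names, constructor arguments, and to_string() follow this README's description of ASTNodeDefs.py, but the bodies are simplified stand-ins rather than the project's exact code.

class ASTNode:
    """Base class shared by all node types."""
    pass

class BinaryOperation(ASTNode):
    def __init__(self, left, operator, right):
        self.left, self.operator, self.right = left, operator, right

    def to_string(self):
        return f"BinaryOperation({self.left}, {self.operator}, {self.right})"

class Assignment(ASTNode):
    def __init__(self, identifier, expression):
        self.identifier, self.expression = identifier, expression

    def to_string(self):
        return f"Assignment({self.identifier}, {self.expression.to_string()})"

# Rebuilding the x = 10 + 5 example by hand:
node = Assignment(identifier=('IDENTIFIER', 'x'),
                  expression=BinaryOperation(left=('NUMBER', 10),
                                             operator=('PLUS', '+'),
                                             right=('NUMBER', 5)))
print(node.to_string())
# Assignment(('IDENTIFIER', 'x'), BinaryOperation(('NUMBER', 10), ('PLUS', '+'), ('NUMBER', 5)))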

Language Grammar

Formal Specification (from grammar.txt):

program            ::= statement*
statement          ::= assign_stmt | if_stmt | for_stmt | print_stmt
assign_stmt        ::= IDENTIFIER '=' expression
if_stmt            ::= 'if' boolean_expression ':' block ('else' ':' block)?
for_stmt           ::= 'for' IDENTIFIER '=' expression 'to' expression ':' block
print_stmt         ::= 'print' '(' arg_list? ')'
block              ::= statement*
arg_list           ::= expression (',' expression)*

boolean_expression ::= boolean_term ('or' boolean_term)*
boolean_term       ::= boolean_factor ('and' boolean_factor)*
boolean_factor     ::= 'not' boolean_factor | comparison
comparison         ::= expression (('==' | '!=' | '<' | '>') expression)*

expression         ::= term (('+' | '-') term)*
term               ::= factor (('*' | '/' | '%') factor)*
factor             ::= ('+' | '-') factor | primary
primary            ::= NUMBER | IDENTIFIER | '(' expression ')'
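
Each rule above translates almost line-for-line into a parsing function. The sketch below shows the idea for print_stmt using a deliberately stripped-down parser; the expect/current_token/advance helpers, the stubbed parse_expression, and the tuple standing in for a PrintStatement node are illustrative simplifications, not the project's actual code.

class MiniParser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def current_token(self):
        return self.tokens[self.pos]

    def advance(self):
        self.pos += 1

    def expect(self, kind):
        tok = self.current_token()
        if tok[0] != kind:
            raise SyntaxError(f"expected {kind}, got {tok}")
        self.advance()
        return tok

    # print_stmt ::= 'print' '(' arg_list? ')'
    # arg_list   ::= expression (',' expression)*
    def parse_print_stmt(self):
        self.expect('PRINT')
        self.expect('LPAREN')
        args = []
        if self.current_token()[0] != 'RPAREN':          # arg_list is optional
            args.append(self.parse_expression())
            while self.current_token()[0] == 'COMMA':    # (',' expression)*
                self.advance()
                args.append(self.parse_expression())
        self.expect('RPAREN')
        return ('PrintStatement', args)                  # tuple stands in for the node

    def parse_expression(self):                          # stub: consume one primary
        tok = self.current_token()
        self.advance()
        return tok

tokens = [('PRINT', 'print'), ('LPAREN', '('), ('IDENTIFIER', 'x'),
          ('COMMA', ','), ('NUMBER', 2), ('RPAREN', ')'), ('EOF', None)]
print(MiniParser(tokens).parse_print_stmt())
# ('PrintStatement', [('IDENTIFIER', 'x'), ('NUMBER', 2)])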

Implementation Techniques

1. Operator Precedence

Method: Function hierarchy where lower precedence = higher level function

Example: 3 + 4 * 5

parse_expression()  # handles +, calls parse_term() for right side
  ├── 3
  └── parse_term()  # handles *, returns (4*5) first
        └── (4 * 5)
Result: 3 + (4 * 5)  # Multiplication grouped first
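
A self-contained sketch of those two levels is shown below, using plain tuples in place of BinaryOperation nodes and a simple cursor over a hard-coded token list; the function names mirror the hierarchy above, everything else is simplified.

tokens = [('NUMBER', 3), ('PLUS', '+'), ('NUMBER', 4),
          ('MULTIPLY', '*'), ('NUMBER', 5), ('EOF', None)]
pos = 0                                    # cursor into the token list

def peek():
    return tokens[pos]

def advance():
    global pos
    tok = tokens[pos]
    pos += 1
    return tok

def parse_primary():                       # NUMBER only, for brevity
    return advance()

def parse_term():                          # handles *, /, % (factor collapsed into primary here)
    left = parse_primary()
    while peek()[0] in ('MULTIPLY', 'DIVIDE', 'MODULO'):
        op = advance()
        left = (left, op, parse_primary())
    return left

def parse_expression():                    # handles +, -
    left = parse_term()
    while peek()[0] in ('PLUS', 'MINUS'):
        op = advance()
        left = (left, op, parse_term())
    return left

print(parse_expression())
# (('NUMBER', 3), ('PLUS', '+'), (('NUMBER', 4), ('MULTIPLY', '*'), ('NUMBER', 5)))
# i.e. 3 + (4 * 5): the deeper function builds the multiplication first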

2. Left-Associativity

Method: Iteration with while loops (not recursion)

Example: 10 - 5 - 2

left = parse_term()                       # get 10
while current_token()[0] in ('PLUS', 'MINUS'):
    op = current_token()
    advance()                             # consume the operator
    right = parse_term()                  # get 5
    left = BinaryOperation(left, op, right)   # left = (10 - 5)
    # loop continues, builds ((10 - 5) - 2)

3. Unary vs Binary Operator Disambiguation

Context-based determination:

Example: a - -b

parse_expression():   # sees 'a', then '-' → BINARY
    parse_factor():   # sees '-' before 'b' → UNARY
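
A standalone sketch of that decision follows, with the term level collapsed for brevity and tuples standing in for AST nodes; the names follow the README's conventions, the bodies are simplified.

tokens = [('IDENTIFIER', 'a'), ('MINUS', '-'), ('MINUS', '-'),
          ('IDENTIFIER', 'b'), ('EOF', None)]
pos = 0

def peek():
    return tokens[pos]

def advance():
    global pos
    tok = tokens[pos]
    pos += 1
    return tok

def parse_factor():
    # A '-' seen where an operand is expected is UNARY:
    # factor ::= ('+' | '-') factor | primary
    if peek()[0] in ('PLUS', 'MINUS'):
        op = advance()
        return ('unary', op, parse_factor())
    return advance()                       # primary: NUMBER or IDENTIFIER

def parse_expression():
    # A '-' seen after a completed operand is BINARY.
    left = parse_factor()
    while peek()[0] in ('PLUS', 'MINUS'):
        op = advance()
        left = ('binary', left, op, parse_factor())
    return left

print(parse_expression())
# ('binary', ('IDENTIFIER', 'a'), ('MINUS', '-'),
#  ('unary', ('MINUS', '-'), ('IDENTIFIER', 'b')))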

4. Optional Grammar Elements

Method: Lookahead without consuming

Example: Optional else block

else_block = None  # default
if current_token()[0] == 'ELSE':
    advance()      # only consume if present
    else_block = parse_block()

5. Infinite Recursion Prevention

Guarantees:

  1. Every recursive call consumes at least one token (advance())
  2. Token list is finite (ends with EOF)
  3. Base cases (numbers/identifiers) don't recurse
  4. Each call processes fewer remaining tokens → must terminate

Test Suite

Coverage: 7 comprehensive test cases (100% pass rate)

Test Cases

  1. Simple if-else - Basic conditional logic
  2. For loop with if - Nested control flow, modulo operator
  3. Logical AND - Multiple conditions
  4. Logical OR - Alternative conditions with unary minus
  5. Logical NOT - Negation operator
  6. Nested if-else - Multi-level conditionals
  7. Complex logical expressions - Parentheses, mixed operators

Test Example

Input:

for i = 1 to 10:
  if i % 2 == 0:
    print(i)

Expected AST:

ForStatement(
    iterator=('IDENTIFIER', 'i'),
    start=('NUMBER', 1),
    end=('NUMBER', 10),
    body=Block([
        IfStatement(
            condition=BinaryOperation(
                BinaryOperation(
                    ('IDENTIFIER', 'i'),
                    ('MODULO', '%'),
                    ('NUMBER', 2)
                ),
                ('EQ', '=='),
                ('NUMBER', 0)
            ),
            then=Block([PrintStatement([('IDENTIFIER', 'i')])]),
            else_block=None
        )
    ])
)

Technical Insights

Strengths

  1. Clean Separation of Concerns

    • Lexer, Parser, and AST are independent modules
    • Easy to extend or modify individual components
  2. Proper Precedence Handling

    • No ambiguity in expression parsing
    • Mathematical and logical operators correctly prioritized
  3. Robust Token Management

    • Position tracking enables precise error reporting
    • Lookahead support for optional elements
  4. Type-Safe AST

    • Structured node classes enable type checking
    • Easy to traverse for interpretation/compilation

Limitations & Improvement Opportunities

From Parser_Quiz_Responses.md:

  1. Error Handling

    • Current: Panic mode (immediate termination on first error)
    • Better: Error accumulation with synchronization points
    • Would allow reporting multiple syntax errors at once
  2. Token Metadata

    • Current: Simple (type, value) tuples
    • Better: Include line/column numbers for debugging
    • Example: ('NUMBER', 10, line=3, col=5)
  3. AST Data Structure

    • Current: Stores token tuples ('IDENTIFIER', 'x')
    • Better: Store values directly 'x'
    • Cleaner for downstream processing
  4. Extensibility

    • Current: AST traversal requires manual node type checking
    • Better: Visitor pattern for operations (type checking, optimization, code generation); a rough sketch follows this list
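
As a rough sketch of that last point, a visitor base class can dispatch on the node's class name once, so each new pass only adds visit_* methods instead of repeating manual type checks. The code below is hypothetical and not part of the project; it imports the node classes from ASTNodeDefs.py and assumes their attribute names match the constructor arguments shown earlier.

from ASTNodeDefs import Assignment, BinaryOperation

class NodeVisitor:
    """Dispatch visit(node) to visit_<ClassName>(node)."""
    def visit(self, node):
        method = getattr(self, 'visit_' + type(node).__name__, self.generic_visit)
        return method(node)

    def generic_visit(self, node):
        raise NotImplementedError(f"no visitor for {type(node).__name__}")

class ConstantFolder(NodeVisitor):
    """Example pass: fold additions of two NUMBER literals."""
    def visit_Assignment(self, node):
        node.expression = self.visit(node.expression)
        return node

    def visit_BinaryOperation(self, node):
        # Operands here are bare (type, value) token tuples, as in the AST example above.
        left, op, right = node.left, node.operator, node.right
        if op[0] == 'PLUS' and left[0] == 'NUMBER' and right[0] == 'NUMBER':
            return ('NUMBER', left[1] + right[1])      # replace the subtree with a literal
        return node

# Folding the x = 10 + 5 example into x = 15 (assuming matching attribute names):
ast = Assignment(identifier=('IDENTIFIER', 'x'),
                 expression=BinaryOperation(left=('NUMBER', 10),
                                            operator=('PLUS', '+'),
                                            right=('NUMBER', 5)))
print(ConstantFolder().visit(ast).expression)          # expected: ('NUMBER', 15)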

Key Learning Outcomes

Compiler Design Concepts

  1. Two-Phase Design

    • Lexical analysis (tokens) kept separate from syntax analysis (structure)
    • Enables independent testing and optimization
  2. Recursive Descent Parsing

    • Each grammar rule maps to a function
    • Natural implementation of context-free grammars
    • Easy to understand and debug
  3. Precedence Climbing

    • Function call depth determines operator precedence
    • Lower precedence = higher in call stack
    • Multiplicative operators deeper than additive
  4. Context-Free Grammar Implementation

    • BNF grammar directly translates to code structure
    • Demonstrates formal language theory in practice

Python-Specific Techniques

  1. Regex for Tokenization

    • Named groups for token classification
    • Single compiled regex for efficiency
  2. Type Hints

    • Union[ASTNode, Tuple[str, Any]] for mixed types
    • Better IDE support and documentation (see the sketch after this list)
  3. Object-Oriented AST

    • Class hierarchy mirrors grammar structure
    • to_string() methods for debugging
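
A small illustrative helper shows how the Union type mentioned above and the to_string() convention read together in practice; the describe function is hypothetical, not part of the project.

from typing import Any, Tuple, Union

from ASTNodeDefs import ASTNode

# Expressions in this parser are either full AST nodes or bare (type, value) tuples.
Expr = Union[ASTNode, Tuple[str, Any]]

def describe(expr: Expr) -> str:
    """Render either form for debugging output."""
    if isinstance(expr, ASTNode):
        return expr.to_string()            # node classes provide to_string()
    kind, value = expr                     # plain token tuple
    return f"{kind}({value!r})"

print(describe(('NUMBER', 10)))            # NUMBER(10)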

Usage

Running the Parser

# Run all tests
python verify.py

# Use programmatically
from Parser import Lexer, Parser

code = """
x = 10
if x > 5:
    print(x)
"""

lexer = Lexer(code)
tokens = lexer.tokenize()
parser = Parser(tokens)
ast = parser.parse()

for node in ast:
    print(node.to_string())

Example Programs

Factorial-style loop:

result = 1
for i = 1 to 5:
    result = result * i
print(result)

Complex conditionals:

x = 15
if (x > 10 and x < 20) or x == 30:
    print(1)
else:
    print(0)

Technologies

Technology        Purpose
Python 3.x        Implementation language
re module         Regular expressions for lexing
Type hints        Static typing support
unittest-style    Test harness (verify.py)

Performance Characteristics

Both phases make a single left-to-right pass over the input: the lexer applies one compiled regular expression to the source, and the parser consumes each token exactly once with single-token lookahead, so lexing and parsing scale roughly linearly with program length.

Future Extensions

Potential enhancements mentioned in quiz responses:

  1. Additional Language Features

    • While loops
    • Functions/procedures
    • Arrays and indexing
    • String literals
  2. Advanced Error Recovery

    • Error synchronization at statement boundaries
    • Continue parsing after errors
    • Accumulate all errors before stopping
  3. Semantic Analysis

    • Type checking visitor
    • Variable scope analysis
    • Constant folding optimization
  4. Code Generation

    • Compile to bytecode
    • Interpreter implementation
    • LLVM IR generation

Academic Context

This project demonstrates fundamental compiler construction concepts taught in CMPSC 461 (Programming Language Concepts): lexical analysis, recursive descent parsing, operator precedence handling, and abstract syntax tree construction.

The implementation passes the full test suite and is clearly documented, serving as both a learning exercise and a reference implementation.


Status: Complete (100% test coverage, all 7 test cases passing)
Grade Outcome: Project successfully completed with comprehensive implementation