UFC Sell-Through Prediction Project
What This Project Does
A PySpark-based machine learning system that predicts UFC event ticket sell-through rates. The main metric is:
sell_through = tickets_sold / venue_capacity
For example, if a venue holds 20,000 people and 18,000 tickets sell, that's a 90% sell-through.
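As a minimal sketch of the metric above, the following plain-Python helper computes the ratio for a single event. The function name `sell_through` and the capacity check are illustrative assumptions, not code taken from the project.

```python
def sell_through(tickets_sold: int, venue_capacity: int) -> float:
    """Return tickets_sold / venue_capacity as a fraction (0.9 means 90%)."""
    if venue_capacity <= 0:
        raise ValueError("venue_capacity must be positive")
    return tickets_sold / venue_capacity


# The example from the text: 18,000 tickets sold in a 20,000-seat venue.
rate = sell_through(18_000, 20_000)
print(f"{rate:.0%}")  # prints "90%"
```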
Related Concepts: Software engineering | MapReduce & Spark | PySpark RDDs | Cloud computing | Regression Analysis
Key Information
| Field | Value |
|---|---|
| Course | DS/CMPSC 410 - Programming Models for Big Data |
| Semester | Fall 2025 |
| Institution | Penn State |
| Platform | ICDS Roar HPC Cluster |
| Primary Language | Python (PySpark) |
| Data Volume | ~2.8 million records (~430 MB) |
| Best Model R² | 0.4763 on the test set (47.6% of variance explained) |
Project Goals
The project demonstrates proficiency in:
- Processing large datasets with PySpark
- Using window functions for rolling calculations
- Training ML models with Spark MLlib
- Running jobs on the Roar cluster
- Working with graph data using GraphFrames
- Integrating multiple external data sources
Core Features
Data Pipeline
- ETL System - Spark-based extraction, transformation, and loading
- Multi-Source Integration - UFC stats, Wikipedia, betting odds, Google Trends, Reddit
- Parquet Storage - Columnar format for efficient analytics
Feature Engineering
- Window Functions - Rolling statistics (last 5 fights)
- Physical Features - Reach/height/age differentials
- Graph Features - PageRank, community detection
- External Features - Betting spreads, search trends, sentiment
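The rolling "last 5 fights" statistic can be sketched in plain Python to show the logic. In Spark, a feature like this is usually expressed with a window partitioned by fighter, ordered by date, and bounded with `rowsBetween(-5, -1)`. Excluding the current fight from the window is an assumption made here to avoid leaking the outcome being predicted; the function name `rolling_win_rate` and the tuple layout are hypothetical.

```python
from collections import defaultdict, deque


def rolling_win_rate(fights, window=5):
    """For each fight, compute the fighter's win rate over their previous
    `window` fights (the current fight is excluded).

    `fights` is an iterable of (fighter, date, won) tuples, where date is an
    ISO string so that string order matches chronological order.
    Returns a list of (fighter, date, rate); rate is None when the fighter
    has no earlier fights.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    out = []
    for fighter, date, won in sorted(fights, key=lambda f: (f[0], f[1])):
        past = history[fighter]
        rate = sum(past) / len(past) if past else None
        out.append((fighter, date, rate))
        past.append(1 if won else 0)  # only now does this fight join the history
    return out


# Hypothetical records for two fighters.
fights = [
    ("A", "2023-01-01", True),
    ("A", "2023-06-01", False),
    ("A", "2024-01-01", True),
    ("B", "2023-03-01", False),
]
for row in rolling_win_rate(fights):
    print(row)
```

The bounded `deque` plays the role of the window frame: once a fighter has more than `window` earlier fights, the oldest result drops out automatically.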
Machine Learning
- Gradient Boosted Trees - Primary regression model
- Random Forest - Alternative model
- Cross-Validation - 3-fold with hyperparameter tuning
- Time-Based Splits - Train on pre-2024, test on 2024+
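A time-based split keeps every test event later than every training event, so the model is never scored on events that precede its training data. A minimal plain-Python sketch, assuming each record carries a `year` field (a hypothetical name) and using the 2024 cutoff described above:

```python
def time_based_split(events, test_year=2024):
    """Split events into (train, test): train holds everything before
    `test_year`, test holds `test_year` and later."""
    train = [e for e in events if e["year"] < test_year]
    test = [e for e in events if e["year"] >= test_year]
    return train, test


# Hypothetical event records.
events = [
    {"event": "E1", "year": 2022},
    {"event": "E2", "year": 2023},
    {"event": "E3", "year": 2024},
    {"event": "E4", "year": 2025},
]
train, test = time_based_split(events)
print([e["event"] for e in train])  # ['E1', 'E2']
print([e["event"] for e in test])   # ['E3', 'E4']
```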
Optimization
- Card Optimizer - Greedy + 2-opt algorithm for fight card building
- Constraint Satisfaction - Weight class variety, fighter uniqueness
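One plausible reading of "greedy + 2-opt" for card building is sketched below; the project's actual optimizer is described in 08-Card-Optimizer and may differ. The sketch assumes fights are pairs of fighters in the same weight class, that a symmetric scoring function `pair_score` rates each matchup, and that "weight class variety" means a cap on fights per class. All names (`build_card`, `pair_score`, `max_per_class`) are hypothetical.

```python
from itertools import combinations


def build_card(fighters, pair_score, card_size, max_per_class=2):
    """Greedy matching followed by 2-opt style re-pairing.

    fighters: dict mapping fighter name -> weight class
    pair_score: symmetric function (name_a, name_b) -> float, higher is better
    Returns a list of (name_a, name_b) fights.
    """
    # Candidate fights: every pair of fighters in the same weight class.
    candidates = [
        (a, b) for a, b in combinations(sorted(fighters), 2)
        if fighters[a] == fighters[b]
    ]
    candidates.sort(key=lambda p: pair_score(*p), reverse=True)

    # Greedy phase: take the best remaining fight that keeps each fighter
    # unique and respects the per-weight-class cap.
    card, used, per_class = [], set(), {}
    for a, b in candidates:
        if len(card) == card_size:
            break
        wc = fighters[a]
        if a in used or b in used or per_class.get(wc, 0) >= max_per_class:
            continue
        card.append((a, b))
        used.update((a, b))
        per_class[wc] = per_class.get(wc, 0) + 1

    # 2-opt phase: for two same-class fights (a, b) and (c, d), try the
    # re-pairings (a, c)+(b, d) and (a, d)+(b, c); keep strict improvements.
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(card)), 2):
            (a, b), (c, d) = card[i], card[j]
            if fighters[a] != fighters[c]:
                continue
            current = pair_score(a, b) + pair_score(c, d)
            for new_i, new_j in (((a, c), (b, d)), ((a, d), (b, c))):
                if pair_score(*new_i) + pair_score(*new_j) > current:
                    card[i], card[j] = new_i, new_j
                    improved = True
                    break
            if improved:
                break
    return card


# Hypothetical pool: greedy grabs A vs B (10) and is left with C vs D (1);
# re-pairing into A vs C and B vs D scores 16 instead of 11.
fighters = {"A": "LW", "B": "LW", "C": "LW", "D": "LW"}
scores = {
    frozenset("AB"): 10, frozenset("CD"): 1,
    frozenset("AC"): 8, frozenset("BD"): 8,
    frozenset("AD"): 2, frozenset("BC"): 2,
}
score = lambda a, b: scores[frozenset((a, b))]
card = build_card(fighters, score, card_size=2)
print(card)
```

Because re-pairing only happens within one weight class and only reshuffles fighters already on the card, both constraints still hold after the 2-opt phase.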
Data Sources Summary
| Source | Records | Size | Description |
|---|---|---|---|
| UFC Stats | ~53,000 | ~11 MB | Events, fights, round stats |
| Betting Odds | ~84,000 | ~50 MB | Historical betting lines |
| Google Trends | ~2.3M | ~200 MB | Fighter search interest |
| Reddit | ~380,000 | ~150 MB | r/MMA sentiment |
| Graph Features | ~4,400 | ~20 MB | Network analytics |
| Total | ~2.8M | ~430 MB | |
Technology Stack
| Technology | Purpose |
|---|---|
| PySpark 3.4+ | Distributed data processing |
| Spark MLlib | Machine learning |
| GraphFrames | Graph analytics |
| pandas/NumPy | Local data manipulation |
| BeautifulSoup | Web scraping |
| pytrends | Google Trends API |
| TextBlob | Sentiment analysis |
| matplotlib/seaborn | Visualization |
Model Performance
Best Model (36 Features)
- Test RMSE: 0.3231
- Test MAE: 0.1801
- Test R²: 0.4763
Baseline Model (9 Features)
- Test RMSE: 0.3829
- Test MAE: 0.3460
- Test R²: 0.2644
Key Insight: The extended 36-feature set raises test R² from 0.2644 to 0.4763, a relative improvement of about 80% over the 9-feature baseline, which shows the value of the additional engineered features.
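The 80% figure is a relative change, not a difference in percentage points. A short sketch of the arithmetic, using only the metrics reported above (the helper name `relative_change` is illustrative):

```python
def relative_change(baseline: float, improved: float) -> float:
    """Return (improved - baseline) / baseline as a fraction."""
    return (improved - baseline) / baseline


r2_gain = relative_change(0.2644, 0.4763)      # higher is better
rmse_change = relative_change(0.3829, 0.3231)  # lower is better
mae_change = relative_change(0.3460, 0.1801)   # lower is better

# Roughly +80% for R², -16% for RMSE, and -48% for MAE.
print(f"R2: {r2_gain:+.1%}, RMSE: {rmse_change:+.1%}, MAE: {mae_change:+.1%}")
```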
Top Features by Importance
- is_ppv - Pay-per-view event
- has_title - Title fight on card
- num_title_fights - Number of title fights
- reddit_hype - Pre-event Reddit engagement
- avg_buzz_7d - 7-day pre-event search interest
- max_combined_pagerank - Star power metric
- is_vegas - Las Vegas location
- avg_betting_spread - Match competitiveness
- num_fights - Card size
- avg_win_rate - Fighter quality
Quick Start
```bash
# Install dependencies
pip install pyspark pandas numpy beautifulsoup4 requests pytrends textblob

# Run full pipeline
sbatch scripts/run_pipeline.slurm

# Or step by step:
python src/etl/ingest.py --data-dir ./data
spark-submit src/etl/spark_etl.py --data-dir ./data
spark-submit src/features/feature_engineering.py --data-dir ./data
spark-submit src/models/train_improved.py --data-dir ./data --test-year 2024
```
Related Documentation
- 01-Project-Structure - Directory layout
- 02-Data-Sources - Data formats and sources
- 03-ETL-Pipeline - Spark ETL implementation
- 04-Feature-Engineering - Feature creation
- 05-Graph-Analytics - Fighter network analysis
- 06-External-Data - Betting, trends, sentiment
- 07-Model-Training - ML models
- 08-Card-Optimizer - Fight card optimization
- 09-Visualization - Charts and plots
- 10-HPC-Deployment - Roar cluster setup