UFC Sell-Through Prediction Project

What This Project Does

This project is a PySpark-based machine learning system that predicts ticket sell-through rates for UFC events. The target metric is:

sell_through = tickets_sold / venue_capacity

For example, if a venue holds 20,000 people and 18,000 tickets sell, that's a 90% sell-through.
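
The metric can be sketched in plain Python as follows. The function name `sell_through` and the check for a non-positive capacity are illustrative assumptions; the project itself is PySpark-based, so its actual implementation may differ.

```python
def sell_through(tickets_sold: int, venue_capacity: int) -> float:
    """Return the fraction of venue capacity that was sold."""
    if venue_capacity <= 0:
        # Assumed guard: a zero or negative capacity has no meaningful rate.
        raise ValueError("venue_capacity must be positive")
    return tickets_sold / venue_capacity


# Example from the text: 18,000 tickets sold in a 20,000-seat venue.
rate = sell_through(18_000, 20_000)
print(f"{rate:.0%}")  # 90%
```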

Related Concepts: Software engineering | MapReduce & Spark | PySpark RDDs | Cloud computing | Regression analysis

Key Information

| Field            | Value                                          |
|------------------|------------------------------------------------|
| Course           | DS/CMPSC 410 - Programming Models for Big Data |
| Semester         | Fall 2025                                      |
| Institution      | Penn State                                     |
| Platform         | ICDS Roar HPC Cluster                          |
| Primary Language | Python (PySpark)                               |
| Data Volume      | ~2.8 million records (~430 MB)                 |
| Best Model R²    | 0.4763 (47.6% variance explained)              |

Project Goals

The project demonstrates proficiency in:

Core Features

Data Pipeline

Feature Engineering

Machine Learning

Optimization

Data Sources Summary

| Source         | Records  | Size    | Description                 |
|----------------|----------|---------|-----------------------------|
| UFC Stats      | ~53,000  | ~11 MB  | Events, fights, round stats |
| Betting Odds   | ~84,000  | ~50 MB  | Historical betting lines    |
| Google Trends  | ~2.3M    | ~200 MB | Fighter search interest     |
| Reddit         | ~380,000 | ~150 MB | r/MMA sentiment             |
| Graph Features | ~4,400   | ~20 MB  | Network analytics           |
| Total          | ~2.8M    | ~430 MB |                             |

Technology Stack

| Technology         | Purpose                     |
|--------------------|-----------------------------|
| PySpark 3.4+       | Distributed data processing |
| Spark MLlib        | Machine learning            |
| GraphFrames        | Graph analytics             |
| pandas/NumPy       | Local data manipulation     |
| BeautifulSoup      | Web scraping                |
| pytrends           | Google Trends API           |
| TextBlob           | Sentiment analysis          |
| matplotlib/seaborn | Visualization               |

Model Performance

Best Model (36 Features)

Baseline Model (9 Features)

Key Insight: The 36-feature model improves R² by 80% relative to the 9-feature baseline, which demonstrates the value of comprehensive feature engineering.
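
A relative improvement of this kind is conventionally computed as (extended - baseline) / baseline. The sketch below applies that formula and inverts it to show which baseline R² the reported 0.4763 and the stated 80% gain would imply. The implied baseline is back-calculated for illustration only and is not a reported result; the function names are hypothetical.

```python
def relative_improvement(baseline: float, extended: float) -> float:
    """Relative gain of `extended` over `baseline` (0.8 means +80%)."""
    return (extended - baseline) / baseline


def implied_baseline(extended: float, improvement: float) -> float:
    """Invert the formula: the baseline that would yield `improvement`."""
    return extended / (1.0 + improvement)


best_r2 = 0.4763     # reported best-model R² (36 features)
stated_gain = 0.80   # the "80% improvement" stated in the text

# Back-calculated from the two figures above, NOT a reported number.
baseline_r2 = implied_baseline(best_r2, stated_gain)
print(round(baseline_r2, 4))                                 # 0.2646
print(round(relative_improvement(baseline_r2, best_r2), 2))  # 0.8
```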

Top Features by Importance

  1. is_ppv - Pay-per-view event
  2. has_title - Title fight on card
  3. num_title_fights - Number of title fights
  4. reddit_hype - Pre-event Reddit engagement
  5. avg_buzz_7d - 7-day pre-event search interest
  6. max_combined_pagerank - Star power metric
  7. is_vegas - Las Vegas location
  8. avg_betting_spread - Match competitiveness
  9. num_fights - Card size
  10. avg_win_rate - Fighter quality
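
To make one of these features concrete, here is a minimal sketch of how a feature like `avg_buzz_7d` could be derived from daily search-interest values. It is an assumption-laden illustration, not the project's code: the window definition (the 7 days before the event, excluding the event day), the input shape, and the sample data are all hypothetical.

```python
from datetime import date, timedelta
from statistics import mean
from typing import Dict, Optional


def avg_buzz_7d(daily_interest: Dict[date, float], event_date: date) -> Optional[float]:
    """Mean search interest over the 7 days before `event_date`.

    Assumption: the window covers event_date-7 .. event_date-1 and
    excludes the event day, so no post-event signal leaks in.
    Days with no data are skipped; returns None if the window is empty.
    """
    window = [event_date - timedelta(days=offset) for offset in range(1, 8)]
    values = [daily_interest[day] for day in window if day in daily_interest]
    return mean(values) if values else None


# Hypothetical data: interest climbs from 10 to 70 in the week before the event.
event = date(2024, 6, 15)
interest = {event - timedelta(days=d): 10.0 * (8 - d) for d in range(1, 8)}
interest[event] = 100.0  # event-day spike, deliberately outside the window

print(avg_buzz_7d(interest, event))  # 40.0
```

Excluding the event day is a design choice that keeps the feature usable for prediction before the event takes place.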

Quick Start

# Install dependencies
pip install pyspark pandas numpy beautifulsoup4 requests pytrends textblob

# Run full pipeline
sbatch scripts/run_pipeline.slurm

# Or step by step:
python src/etl/ingest.py --data-dir ./data
spark-submit src/etl/spark_etl.py --data-dir ./data
spark-submit src/features/feature_engineering.py --data-dir ./data
spark-submit src/models/train_improved.py --data-dir ./data --test-year 2024
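
One plausible reading of the `--test-year 2024` flag is a time-based split: events before the test year train the model and events in the test year are held out. The sketch below illustrates that idea in plain Python. The split rule, the function name, and the record fields are assumptions for illustration and may not match what `train_improved.py` actually does.

```python
from datetime import date


def split_by_test_year(events, test_year):
    """Split event records into (train, test) by calendar year.

    Assumption: events before `test_year` go to training, events in
    `test_year` are held out, and any later events are dropped.
    """
    train = [e for e in events if e["event_date"].year < test_year]
    test = [e for e in events if e["event_date"].year == test_year]
    return train, test


# Hypothetical records; the field names are illustrative only.
events = [
    {"event_date": date(2022, 3, 5), "sell_through": 0.95},
    {"event_date": date(2023, 7, 8), "sell_through": 0.88},
    {"event_date": date(2024, 1, 20), "sell_through": 0.91},
    {"event_date": date(2024, 9, 14), "sell_through": 0.97},
]

train, test = split_by_test_year(events, test_year=2024)
print(len(train), len(test))  # 2 2
```

Splitting by time rather than at random avoids training on events that occur after the ones being predicted.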