#ufc-documentation #overview #index #data-science #pyspark

UFC Sell-Through Prediction Project

What This Project Does

This project is a PySpark-based machine learning system that predicts ticket sell-through rates for UFC events. The prediction target is defined as:

sell_through = tickets_sold / venue_capacity

For example, if a venue holds 20,000 people and 18,000 tickets sell, that's a 90% sell-through.
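As a small illustration of the formula, the ratio can be computed with a guard against unusable inputs. The function name `sell_through` and its validation rules are assumptions made for this sketch, not code taken from the project.

```python
def sell_through(tickets_sold: float, venue_capacity: float) -> float:
    """Return tickets_sold / venue_capacity as a fraction (0.9 means 90%)."""
    if venue_capacity <= 0:
        # Assumption: a non-positive capacity is treated as bad input.
        raise ValueError("venue_capacity must be positive")
    if tickets_sold < 0:
        # Assumption: negative ticket counts are treated as bad input.
        raise ValueError("tickets_sold must be non-negative")
    return tickets_sold / venue_capacity


# The worked example from the text: 18,000 tickets in a 20,000-seat venue.
rate = sell_through(18_000, 20_000)
print(f"{rate:.0%}")
```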

Key Information

Field	Value
Course	DS/CMPSC 410 - Programming Models for Big Data
Semester	Fall 2025
Institution	Penn State
Platform	ICDS Roar HPC Cluster
Primary Language	Python (PySpark)
Data Volume	~2.8 million records (~430 MB)
Best Model R²	0.4763 (47.6% variance explained)

Project Goals

The project demonstrates proficiency in:

Processing large datasets with PySpark
Using window functions for rolling calculations
Training ML models with Spark MLlib
Running jobs on the Roar cluster
Working with graph data using GraphFrames
Integrating multiple external data sources

Core Features

Data Pipeline

ETL System - Spark-based extraction, transformation, and loading
Multi-Source Integration - UFC stats, Wikipedia, betting odds, Google Trends, Reddit
Parquet Storage - Columnar format for efficient analytics

Feature Engineering

Window Functions - Rolling statistics (last 5 fights)
Physical Features - Reach/height/age differentials
Graph Features - PageRank, community detection
External Features - Betting spreads, search trends, sentiment
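The rolling "last 5 fights" statistics can be pictured with a plain-Python sketch. In Spark this kind of calculation is usually expressed with a window partitioned by fighter, ordered by date, and bounded with something like `rowsBetween(-5, -1)`; the version below only illustrates the logic. The function name, the tuple layout, and the choice to exclude the current fight from its own feature are assumptions, not details confirmed by the project.

```python
from collections import defaultdict, deque


def rolling_win_rate(fights, window=5):
    """Attach each fighter's win rate over their previous `window` fights.

    `fights` is a list of (fighter, date, won) tuples with ISO-format dates
    and won in {0, 1}. The current fight is excluded from its own feature
    (an assumption) so the outcome being described does not leak into it.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    rows = []
    # Process each fighter's bouts in chronological order.
    for fighter, date, won in sorted(fights, key=lambda f: (f[0], f[1])):
        past = history[fighter]
        rate = sum(past) / len(past) if past else None  # None: no prior fights
        rows.append((fighter, date, rate))
        past.append(won)  # the deque drops the oldest result beyond `window`
    return rows


# Hypothetical results for one fighter, oldest first.
results = [1, 1, 0, 1, 0, 1, 1]
fights = [("fighter_a", f"2020-0{i + 1}-01", won) for i, won in enumerate(results)]
for row in rolling_win_rate(fights):
    print(row)
```

The first row has no history, so its feature is `None`; how the project fills such gaps is not stated here.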

Machine Learning

Gradient Boosted Trees - Primary regression model
Random Forest - Alternative model
Cross-Validation - 3-fold with hyperparameter tuning
Time-Based Splits - Train on pre-2024, test on 2024+
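A time-based split keeps every test event strictly later than the training events, which avoids evaluating on events older than the training data. The sketch below shows the idea on a few made-up records; the field names and values are hypothetical, and the real pipeline would apply the same rule to a Spark DataFrame.

```python
def time_based_split(events, test_year=2024):
    """Split records into train (before test_year) and test (test_year onward)."""
    train = [e for e in events if e["year"] < test_year]
    test = [e for e in events if e["year"] >= test_year]
    return train, test


# Toy records for illustration only; these are not project data.
events = [
    {"event": "event_1", "year": 2022},
    {"event": "event_2", "year": 2023},
    {"event": "event_3", "year": 2024},
    {"event": "event_4", "year": 2025},
]

train, test = time_based_split(events, test_year=2024)
print(len(train), "train events,", len(test), "test events")
```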

Optimization

Card Optimizer - Greedy + 2-opt algorithm for fight card building
Constraint Satisfaction - Weight class variety, fighter uniqueness
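To make the greedy-plus-local-search idea concrete, here is a minimal sketch under stated assumptions. It builds a card greedily by bout score, then tries to improve it by exchanging two selected bouts for two unused ones, which is a 2-opt-style move adapted to set selection. The `Bout` fields, the per-class cap used to model weight class variety, and the additive score are all hypothetical; the project's actual objective and move definition may differ.

```python
from itertools import combinations
from typing import NamedTuple


class Bout(NamedTuple):
    fighter_a: str
    fighter_b: str
    weight_class: str
    score: float  # hypothetical appeal score for the bout


def is_feasible(card, max_per_class):
    """Check fighter uniqueness and the per-weight-class cap."""
    fighters = [f for b in card for f in (b.fighter_a, b.fighter_b)]
    if len(fighters) != len(set(fighters)):
        return False
    classes = [b.weight_class for b in card]
    return all(classes.count(c) <= max_per_class for c in set(classes))


def total_score(card):
    return sum(b.score for b in card)


def greedy_card(candidates, card_size, max_per_class):
    """Add the highest-scoring bouts that keep the card feasible."""
    card = []
    for bout in sorted(candidates, key=lambda b: b.score, reverse=True):
        if len(card) == card_size:
            break
        if is_feasible(card + [bout], max_per_class):
            card.append(bout)
    return card


def improve_by_pair_exchange(card, candidates, max_per_class):
    """Swap two selected bouts for two unused ones while the score improves."""
    improved = True
    while improved:
        improved = False
        unused = [b for b in candidates if b not in card]
        for out_pair in combinations(card, 2):
            kept = [b for b in card if b not in out_pair]
            for in_pair in combinations(unused, 2):
                trial = kept + list(in_pair)
                if (is_feasible(trial, max_per_class)
                        and total_score(trial) > total_score(card)):
                    card, improved = trial, True
                    break
            if improved:
                break
    return card


# Made-up candidates: the greedy pick of A vs B blocks two strong bouts.
candidates = [
    Bout("A", "B", "Lightweight", 10),
    Bout("A", "C", "Lightweight", 9),
    Bout("B", "D", "Welterweight", 9),
    Bout("E", "F", "Middleweight", 1),
]
card = greedy_card(candidates, card_size=2, max_per_class=1)
better = improve_by_pair_exchange(card, candidates, max_per_class=1)
print(total_score(card), "->", total_score(better))
```

In this toy case the greedy pass locks in the single best bout and is then forced to fill the card with a weak one, while the exchange step recovers the two bouts that were blocked. The exchange loop stops because each accepted move strictly increases the score.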

Data Sources Summary

Source	Records	Size	Description
UFC Stats	~53,000	~11 MB	Events, fights, round stats
Betting Odds	~84,000	~50 MB	Historical betting lines
Google Trends	~2.3M	~200 MB	Fighter search interest
Reddit	~380,000	~150 MB	r/MMA sentiment
Graph Features	~4,400	~20 MB	Network analytics
Total	~2.8M	~430 MB

Technology Stack

Technology	Purpose
PySpark 3.4+	Distributed data processing
Spark MLlib	Machine learning
GraphFrames	Graph analytics
pandas/NumPy	Local data manipulation
BeautifulSoup	Web scraping
pytrends	Google Trends API
TextBlob	Sentiment analysis
matplotlib/seaborn	Visualization

Model Performance

Best Model (36 Features)

Test RMSE: 0.3231
Test MAE: 0.1801
Test R²: 0.4763

Baseline Model (9 Features)

Test RMSE: 0.3829
Test MAE: 0.3460
Test R²: 0.2644

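Key Insight: Expanding from 9 to 36 features raises test R² from 0.2644 to 0.4763, a relative improvement of about 80%, and lowers test MAE from 0.3460 to 0.1801. This shows the value of comprehensive feature engineering, although the best model still leaves roughly half of the variance unexplained.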
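For reference, the three reported metrics and the relative R² gain can be reproduced with a few lines of plain Python. These are the standard textbook definitions, written as a sketch; the project itself presumably obtains the metrics from Spark MLlib evaluators, and the helper names below are assumptions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def relative_gain(new, old):
    """Relative change of `new` over `old`, as a fraction."""
    return (new - old) / old


# R² values reported above for the 36-feature and 9-feature models.
gain = relative_gain(0.4763, 0.2644)
print(f"Relative R² improvement: {gain:.1%}")
```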

Top Features by Importance

is_ppv - Pay-per-view event
has_title - Title fight on card
num_title_fights - Number of title fights
reddit_hype - Pre-event Reddit engagement
avg_buzz_7d - 7-day pre-event search interest
max_combined_pagerank - Star power metric
is_vegas - Las Vegas location
avg_betting_spread - Match competitiveness
num_fights - Card size
avg_win_rate - Fighter quality

Quick Start

# Install dependencies
pip install pyspark pandas numpy beautifulsoup4 requests pytrends textblob

# Run full pipeline
sbatch scripts/run_pipeline.slurm

# Or step by step:
python src/etl/ingest.py --data-dir ./data
spark-submit src/etl/spark_etl.py --data-dir ./data
spark-submit src/features/feature_engineering.py --data-dir ./data
spark-submit src/models/train_improved.py --data-dir ./data --test-year 2024

Documentation Index

01-Project-Structure - Directory layout
02-Data-Sources - Data formats and sources
03-ETL-Pipeline - Spark ETL implementation
04-Feature-Engineering - Feature creation
05-Graph-Analytics - Fighter network analysis
06-External-Data - Betting, trends, sentiment
07-Model-Training - ML models
08-Card-Optimizer - Fight card optimization
09-Visualization - Charts and plots
10-HPC-Deployment - Roar cluster setup