UFC Sell-Through Project - Project Structure
Root Directory
UFC_SellThrough_Project/
├── README.md # Project documentation
├── requirements.txt # Python dependencies
├── data/ # All data files
├── src/ # Source code
├── scripts/ # SLURM and shell scripts
└── notebooks/ # Jupyter notebooks
Data Directory (data/)
data/
├── raw/ # Original CSV files
│ ├── events.csv # UFC events (756 rows)
│ ├── fight_results.csv # Fight outcomes (~8,500 rows)
│ ├── fight_stats.csv # Round-by-round stats (~40,000 rows)
│ ├── fight_details.csv # Fight metadata
│ ├── fighter_details.csv # Fighter names (~4,500 rows)
│ ├── fighter_tott.csv # Fighter attributes (height, reach)
│ ├── README.md # Data documentation
│ └── scraper_tools/ # UFC stats scraper
│     ├── scrape_ufc_stats_library.py
│     ├── scrape_ufc_stats_config.yaml
│     ├── requirements.txt
│     └── LICENSE
│
├── external/ # External data sources
│ ├── attendance.csv # Wikipedia attendance data
│ ├── attendance_full.csv # Complete attendance scrape
│ ├── betting_odds.csv # Historical betting lines
│ ├── google_trends.csv # Fighter search interest
│ ├── fighter_buzz.csv # Pre-event buzz metrics
│ ├── event_sentiment.csv # Reddit sentiment aggregated
│ ├── reddit_comments.csv # Raw Reddit comments
│ └── ufc_attendance_sample.csv
│
├── processed/ # Cleaned Parquet files
│ ├── events/ # Cleaned events
│ ├── fighters/ # Cleaned fighter data
│ ├── fights/ # Cleaned fight records
│ └── fight_stats/ # Cleaned statistics
│
├── features/ # Feature tables
│ ├── event_features/ # Event-level features (Parquet)
│ ├── fighter_features/ # Fighter-level features (Parquet)
│ ├── matchup_features/ # Fight matchup features (Parquet)
│ └── graph_features/ # GraphFrames output (Parquet)
│
└── models/ # Trained models
    ├── gbt_model/ # Baseline GBT model
    ├── gbt_model_improved/ # Improved GBT model (36 features)
    ├── metrics.json # Baseline metrics
    ├── metrics_improved.json # Improved model metrics
    ├── predictions.csv # Test set predictions
    └── features_used.txt # Feature list
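A quick way to confirm that a checkout matches the layout above is to test for the expected subdirectories. The following is a minimal sketch, not part of the project: the helper name `missing_data_dirs` and the `EXPECTED_DATA_DIRS` list are hypothetical, and the directory names are simply copied from the tree.

```python
from pathlib import Path

# Subdirectories of data/ copied from the tree above.
EXPECTED_DATA_DIRS = [
    "raw",
    "raw/scraper_tools",
    "external",
    "processed/events",
    "processed/fighters",
    "processed/fights",
    "processed/fight_stats",
    "features/event_features",
    "features/fighter_features",
    "features/matchup_features",
    "features/graph_features",
    "models",
]


def missing_data_dirs(data_root):
    """Return the expected subdirectories that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED_DATA_DIRS if not (root / rel).is_dir()]


if __name__ == "__main__":
    # Hypothetical usage: run from the project root.
    for rel in missing_data_dirs("data"):
        print(f"missing: data/{rel}")
```

The `processed/`, `features/`, and `models/` directories are pipeline outputs, so they may legitimately be reported as missing before the first pipeline run.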
Source Code (src/)
ETL Module (src/etl/)
| File | Purpose | Lines |
|------|---------|-------|
| `__init__.py` | Package marker | - |
| `spark_etl.py` | Main Spark ETL pipeline | ~900 |
| `ingest.py` | Download base UFC data | ~200 |
| `scrape_attendance.py` | Wikipedia attendance scraper | ~350 |
| `scrape_all_attendance.py` | Batch attendance scraping | ~400 |
| `scrape_betting_odds.py` | Betting odds collector | ~450 |
| `fetch_google_trends.py` | Google Trends fetcher | ~300 |
| `scrape_reddit_sentiment.py` | Reddit sentiment analyzer | ~400 |
| `merge_attendance.py` | Merge attendance data | ~150 |
Features Module (src/features/)
| File | Purpose | Lines |
|------|---------|-------|
| `__init__.py` | Package marker | - |
| `feature_engineering.py` | Rolling stats, external features | ~600 |
Graph Module (src/graph/)
| File | Purpose | Lines |
|------|---------|-------|
| `__init__.py` | Package marker | - |
| `fighter_network.py` | GraphFrames analysis | ~500 |
Models Module (src/models/)
| File | Purpose | Lines |
|------|---------|-------|
| `__init__.py` | Package marker | - |
| `train.py` | Baseline GBT training | ~300 |
| `train_improved.py` | Improved model (36 features) | ~500 |
| `rolling_validation.py` | Time-series validation | ~200 |
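The table only says that `rolling_validation.py` performs time-series validation; its actual scheme is not shown here. As an illustration of the general idea, the sketch below generates expanding-window splits over chronologically sorted events, so each fold trains only on events that precede its test window. The function name `rolling_splits` and its parameters are hypothetical.

```python
def rolling_splits(n_events, n_folds, min_train):
    """Yield (train_indices, test_indices) for expanding-window validation.

    Assumes events are sorted chronologically, so index order is time order.
    """
    if n_folds < 1 or min_train < 1 or min_train >= n_events:
        raise ValueError("need n_folds >= 1 and 1 <= min_train < n_events")
    fold_size = (n_events - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("not enough events for the requested number of folds")
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        # The last fold absorbs any remainder so every late event is tested once.
        test_end = n_events if k == n_folds - 1 else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))


if __name__ == "__main__":
    for train, test in rolling_splits(n_events=10, n_folds=3, min_train=4):
        print(f"train on events 0..{train[-1]}, test on {test}")
```

Unlike a shuffled k-fold split, no test event is ever earlier than a training event, which avoids leaking future information into the model.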
Optimizer Module (src/optimizer/)
| File | Purpose | Lines |
|------|---------|-------|
| `__init__.py` | Package marker | - |
| `card_optimizer.py` | Fight card optimization | ~400 |
Visualization Module (src/visualization/)
| File | Purpose | Lines |
|------|---------|-------|
| `__init__.py` | Package marker | - |
| `create_plots.py` | Generate all visualizations | ~500 |
Scripts (scripts/)
| File | Purpose | Platform |
|------|---------|----------|
| `run_pipeline.slurm` | Full pipeline for Roar | SLURM |
| `run_pipeline_ufc.slurm` | UFC-specific pipeline | SLURM |
| `run_visualization.slurm` | Generate plots on cluster | SLURM |
| `get_results.sh` | Collect results from cluster | Bash |
| `create_samples.sh` | Create sample datasets | Bash |
| `create_samples.ps1` | Create samples (Windows) | PowerShell |
Notebooks (notebooks/)
| File | Purpose |
|------|---------|
| `spark_etl_colab.ipynb` | Google Colab version of ETL |
Key Configuration Files
requirements.txt
pyspark>=3.4.0
pandas>=2.0.0
numpy>=1.24.0
beautifulsoup4>=4.12.0
requests>=2.31.0
lxml>=4.9.0
pytrends>=4.9.0
textblob>=0.17.0
matplotlib>=3.7.0
seaborn>=0.12.0
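To see which of these dependencies are present in the current environment, the pins can be parsed and looked up with the standard library. This is a rough sketch: `parse_min_versions` and `installed_versions` are hypothetical helpers, the parser only understands the `name>=version` form used above, and it reports installed versions without comparing them against the minimums.

```python
from importlib import metadata

# Contents of requirements.txt as listed above.
REQUIREMENTS = """\
pyspark>=3.4.0
pandas>=2.0.0
numpy>=1.24.0
beautifulsoup4>=4.12.0
requests>=2.31.0
lxml>=4.9.0
pytrends>=4.9.0
textblob>=0.17.0
matplotlib>=3.7.0
seaborn>=0.12.0
"""


def parse_min_versions(text):
    """Map package name -> minimum version for 'name>=version' lines."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        name, sep, version = line.partition(">=")
        if not sep:
            raise ValueError(f"unsupported requirement line: {line!r}")
        pins[name.strip()] = version.strip()
    return pins


def installed_versions(pins):
    """Return {name: installed version or None} without importing the packages."""
    found = {}
    for name in pins:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    pins = parse_min_versions(REQUIREMENTS)
    for name, version in installed_versions(pins).items():
        print(f"{name}: requires >={pins[name]}, installed: {version}")
```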
scrape_ufc_stats_config.yaml
completed_events_all_url: http://ufcstats.com/statistics/events/completed?page=all
file_names:
  event_details: ufc_event_details.csv
  fight_details: ufc_fight_details.csv
  fight_results: ufc_fight_results.csv
  fight_stats: ufc_fight_stats.csv
  fighter_details: ufc_fighter_details.csv
  fighter_tott: ufc_fighter_tott.csv
columns:
  fight_stats:
    - EVENT
    - BOUT
    - ROUND
    - FIGHTER
    - KD
    - SIG.STR.
    - SIG.STR. %
    - TOTAL STR.
    - TD
    - TD %
    - SUB.ATT
    - REV.
    - CTRL
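The column list in the config can double as a schema check on scraped output. The sketch below compares a CSV header against the `fight_stats` columns using only the standard library. The helper `check_header` is hypothetical, the column names are copied from the config excerpt, and whether `data/raw/fight_stats.csv` carries exactly these headers is an assumption.

```python
import csv
import io

# Column names copied from the columns.fight_stats list in the config above.
FIGHT_STATS_COLUMNS = [
    "EVENT", "BOUT", "ROUND", "FIGHTER", "KD", "SIG.STR.", "SIG.STR. %",
    "TOTAL STR.", "TD", "TD %", "SUB.ATT", "REV.", "CTRL",
]


def check_header(csv_file, expected=FIGHT_STATS_COLUMNS):
    """Compare the first row of an open CSV file with the expected columns.

    Returns (missing, unexpected) as two lists of column names.
    """
    header = [name.strip() for name in next(csv.reader(csv_file), [])]
    missing = [col for col in expected if col not in header]
    unexpected = [col for col in header if col not in expected]
    return missing, unexpected


if __name__ == "__main__":
    # In-memory sample standing in for a real file opened with newline="".
    sample = io.StringIO("EVENT,BOUT,ROUND,FIGHTER,KD\n")
    missing, unexpected = check_header(sample)
    print("missing:", missing)
    print("unexpected:", unexpected)
```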
Data Flow
Raw CSVs → Spark ETL → Processed Parquet → Feature Engineering → ML Model
↓ ↑
External Data (Wikipedia, Betting, Trends, Reddit) ┘
↓
GraphFrames → Graph Features ─────────────────────┘
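The same flow can be written down as a dependency map and turned into a run order. This is a sketch under assumptions: the stage names are labels taken from the diagram rather than real module entry points, and the edge from Processed Parquet to GraphFrames is a reading of the diagram, not something the source states outright.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each stage maps to the set of stages it depends on.
DEPENDS_ON = {
    "spark_etl": {"raw_csvs"},
    "processed_parquet": {"spark_etl"},
    "graphframes": {"processed_parquet"},  # assumption: graph is built from cleaned fights
    "graph_features": {"graphframes"},
    "feature_engineering": {"processed_parquet", "external_data", "graph_features"},
    "ml_model": {"feature_engineering"},
}


def stage_order(depends_on=DEPENDS_ON):
    """Return one valid execution order in which every stage follows its inputs."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    print(" -> ".join(stage_order()))
```

Several orders can be valid (for example, external data may be collected at any point before feature engineering), so only the relative order of dependent stages is guaranteed.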
Output Structure
visualizations/
├── 1_sellthrough_distribution.png
├── 2_feature_importance.png
├── 3_sellthrough_over_time.png
├── 4_sellthrough_by_event_type.png
├── 5_sellthrough_by_location.png
├── 6_actual_vs_predicted.png
├── 7_model_comparison.png
├── 8_title_fight_impact.png
├── 9_residuals.png
└── 10_correlation_heatmap.png
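Because the plot names start with an unpadded number, a plain alphabetical sort places `10_correlation_heatmap.png` before `1_sellthrough_distribution.png`. If the plots need to be processed in numbered order, a numeric sort key helps. The helpers `plot_sort_key` and `ordered_plots` below are hypothetical and not part of `create_plots.py`.

```python
from pathlib import Path


def plot_sort_key(path):
    """Sort key that orders plots by their numeric prefix, then by name."""
    name = Path(path).name
    prefix = name.partition("_")[0]
    # Files without a numeric prefix are placed after all numbered plots.
    number = int(prefix) if prefix.isdigit() else float("inf")
    return (number, name)


def ordered_plots(directory="visualizations"):
    """Return the PNG files in a directory in numbered order."""
    return sorted(Path(directory).glob("*.png"), key=plot_sort_key)


if __name__ == "__main__":
    for plot in ordered_plots():
        print(plot.name)
```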