UFC Sell-Through Project - Project Structure

Root Directory

UFC_SellThrough_Project/
├── README.md                 # Project documentation
├── requirements.txt          # Python dependencies
├── data/                     # All data files
├── src/                      # Source code
├── scripts/                  # SLURM and shell scripts
└── notebooks/                # Jupyter notebooks

Data Directory (data/)

data/
├── raw/                      # Original CSV files
│   ├── events.csv            # UFC events (756 rows)
│   ├── fight_results.csv     # Fight outcomes (~8,500 rows)
│   ├── fight_stats.csv       # Round-by-round stats (~40,000 rows)
│   ├── fight_details.csv     # Fight metadata
│   ├── fighter_details.csv   # Fighter names (~4,500 rows)
│   ├── fighter_tott.csv      # Fighter attributes (height, reach)
│   ├── README.md             # Data documentation
│   └── scraper_tools/        # UFC stats scraper
│       ├── scrape_ufc_stats_library.py
│       ├── scrape_ufc_stats_config.yaml
│       ├── requirements.txt
│       └── LICENSE
│
├── external/                 # External data sources
│   ├── attendance.csv        # Wikipedia attendance data
│   ├── attendance_full.csv   # Complete attendance scrape
│   ├── betting_odds.csv      # Historical betting lines
│   ├── google_trends.csv     # Fighter search interest
│   ├── fighter_buzz.csv      # Pre-event buzz metrics
│   ├── event_sentiment.csv   # Reddit sentiment aggregated
│   ├── reddit_comments.csv   # Raw Reddit comments
│   └── ufc_attendance_sample.csv
│
├── processed/                # Cleaned Parquet files
│   ├── events/               # Cleaned events
│   ├── fighters/             # Cleaned fighter data
│   ├── fights/               # Cleaned fight records
│   └── fight_stats/          # Cleaned statistics
│
├── features/                 # Feature tables
│   ├── event_features/       # Event-level features (Parquet)
│   ├── fighter_features/     # Fighter-level features (Parquet)
│   ├── matchup_features/     # Fight matchup features (Parquet)
│   └── graph_features/       # GraphFrames output (Parquet)
│
└── models/                   # Trained models
    ├── gbt_model/            # Baseline GBT model
    ├── gbt_model_improved/   # Improved GBT model (36 features)
    ├── metrics.json          # Baseline metrics
    ├── metrics_improved.json # Improved model metrics
    ├── predictions.csv       # Test set predictions
    └── features_used.txt     # Feature list

Source Code (src/)

ETL Module (src/etl/)

File                        Purpose                       Lines
__init__.py                 Package marker                -
spark_etl.py                Main Spark ETL pipeline       ~900
ingest.py                   Download base UFC data        ~200
scrape_attendance.py        Wikipedia attendance scraper  ~350
scrape_all_attendance.py    Batch attendance scraping     ~400
scrape_betting_odds.py      Betting odds collector        ~450
fetch_google_trends.py      Google Trends fetcher         ~300
scrape_reddit_sentiment.py  Reddit sentiment analyzer     ~400
merge_attendance.py         Merge attendance data         ~150
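The raw per-round columns (see the `fight_stats` column list below) arrive as strings such as "45 of 112" for strike counts and "3:27" for control time, so the ETL step must parse them into numbers. A minimal pure-Python sketch of that parsing; in the pipeline this runs inside spark_etl.py, and the helper names here are illustrative, not the module's actual API:

```python
def parse_of(value: str) -> tuple[int, int]:
    """Split a 'landed of attempted' string like '45 of 112' into ints."""
    landed, attempted = value.split(" of ")
    return int(landed), int(attempted)

def parse_ctrl(value: str) -> int:
    """Convert a control-time string like '3:27' to total seconds."""
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + int(seconds)

# One round of significant strikes and control time
landed, attempted = parse_of("45 of 112")
ctrl_seconds = parse_ctrl("3:27")
```

In Spark the same logic would typically be expressed with `split`/`cast` column expressions rather than a Python UDF, to keep the transformation in the JVM.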

Features Module (src/features/)

File                    Purpose                           Lines
__init__.py             Package marker                    -
feature_engineering.py  Rolling stats, external features  ~600
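The key property of the rolling stats is that each fighter's feature at a given event uses only bouts that happened earlier, avoiding target leakage. A pure-Python sketch of one such feature (win rate over the last n fights); in feature_engineering.py this would be a Spark window function, and the function name here is hypothetical:

```python
from collections import defaultdict

def rolling_win_rate(fights, n=3):
    """For each (fighter, won) record in date order, emit the fighter's
    win rate over their previous n fights (None when no history exists),
    so the feature never looks at the current or future bouts."""
    history = defaultdict(list)
    out = []
    for fighter, won in fights:
        prev = history[fighter][-n:]
        out.append(sum(prev) / len(prev) if prev else None)
        history[fighter].append(won)
    return out

rates = rolling_win_rate([("A", 1), ("A", 1), ("A", 0), ("A", 1)], n=2)
# the first entry is None because fighter A has no prior fights
```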

Graph Module (src/graph/)

File                Purpose               Lines
__init__.py         Package marker        -
fighter_network.py  GraphFrames analysis  ~500
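fighter_network.py runs GraphFrames on the cluster, but the underlying idea (fighters as vertices, bouts as edges, centrality scores as features) can be sketched without Spark. A minimal degree-centrality example, assuming the graph is built from bout pairs; this is a stand-in for the GraphFrames output, not the module's actual code:

```python
from collections import Counter

def degree_centrality(bouts):
    """Count bouts per fighter, where bouts is a list of
    (fighter_a, fighter_b) pairs. A simple proxy for the
    degree feature a GraphFrames job would compute."""
    degree = Counter()
    for a, b in bouts:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)

deg = degree_centrality([("Jones", "Gustafsson"), ("Jones", "Cormier"),
                         ("Cormier", "Miocic")])
```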

Models Module (src/models/)

File                   Purpose                       Lines
__init__.py            Package marker                -
train.py               Baseline GBT training         ~300
train_improved.py      Improved model (36 features)  ~500
rolling_validation.py  Time-series validation        ~200
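Because events are ordered in time, rolling_validation.py cannot use random train/test splits; the split logic amounts to an expanding window over date-sorted events. A sketch of that scheme (parameter names are assumptions, not the module's API):

```python
def expanding_window_splits(n_events, initial_train, step):
    """Yield (train_idx, test_idx) pairs over events sorted by date:
    train on all events before a cutoff, test on the next `step`
    events, then advance the cutoff. The model never sees the future."""
    cutoff = initial_train
    while cutoff < n_events:
        train = list(range(cutoff))
        test = list(range(cutoff, min(cutoff + step, n_events)))
        yield train, test
        cutoff += step

splits = list(expanding_window_splits(10, initial_train=6, step=2))
# two folds: train 0-5 / test 6-7, then train 0-7 / test 8-9
```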

Optimizer Module (src/optimizer/)

File               Purpose                  Lines
__init__.py        Package marker           -
card_optimizer.py  Fight card optimization  ~400
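The exact objective card_optimizer.py optimizes is not documented here; as an illustration of the general idea, a greedy sketch that fills a fixed number of main-card slots with the bouts the model scores highest. Function and field names are hypothetical:

```python
def pick_main_card(candidates, slots=5):
    """Greedy card selection: candidates is a list of
    (bout_name, predicted_appeal) pairs; return the `slots`
    bouts with the highest predicted appeal."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return [name for name, _ in ranked[:slots]]

card = pick_main_card([("A vs B", 0.91), ("C vs D", 0.72),
                       ("E vs F", 0.85)], slots=2)
```

A real optimizer would likely add constraints (weight-class variety, one title fight per card), which turns this into a small integer program rather than a sort.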

Visualization Module (src/visualization/)

File             Purpose                      Lines
__init__.py      Package marker               -
create_plots.py  Generate all visualizations  ~500

Scripts (scripts/)

File                     Purpose                       Platform
run_pipeline.slurm       Full pipeline for Roar        SLURM
run_pipeline_ufc.slurm   UFC-specific pipeline         SLURM
run_visualization.slurm  Generate plots on cluster     SLURM
get_results.sh           Collect results from cluster  Bash
create_samples.sh        Create sample datasets        Bash
create_samples.ps1       Create samples (Windows)      PowerShell

Notebooks (notebooks/)

File                   Purpose
spark_etl_colab.ipynb  Google Colab version of ETL

Key Configuration Files

requirements.txt

pyspark>=3.4.0
pandas>=2.0.0
numpy>=1.24.0
beautifulsoup4>=4.12.0
requests>=2.31.0
lxml>=4.9.0
pytrends>=4.9.0
textblob>=0.17.0
matplotlib>=3.7.0
seaborn>=0.12.0

scrape_ufc_stats_config.yaml

completed_events_all_url: http://ufcstats.com/statistics/events/completed?page=all

file_names:
  event_details: ufc_event_details.csv
  fight_details: ufc_fight_details.csv
  fight_results: ufc_fight_results.csv
  fight_stats: ufc_fight_stats.csv
  fighter_details: ufc_fighter_details.csv
  fighter_tott: ufc_fighter_tott.csv

columns:
  fight_stats:
    - EVENT
    - BOUT
    - ROUND
    - FIGHTER
    - KD
    - SIG.STR.
    - SIG.STR. %
    - TOTAL STR.
    - TD
    - TD %
    - SUB.ATT
    - REV.
    - CTRL

Data Flow

Raw CSVs → Spark ETL → Processed Parquet → Feature Engineering → ML Model
    ↓                                              ↑
External Data (Wikipedia, Betting, Trends, Reddit) ┘
    ↓
GraphFrames → Graph Features ─────────────────────┘
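The target the ML model predicts is sell-through; assuming it is defined the usual way, as attendance divided by venue capacity (the source does not spell this out), the computation from the merged attendance data is a one-liner:

```python
def sell_through(attendance: int, capacity: int) -> float:
    """Sell-through ratio: fraction of venue capacity actually sold.
    Can exceed 1.0 if standing-room or production-released seats
    push attendance past listed capacity."""
    return attendance / capacity

ratio = sell_through(15_000, 18_000)  # a venue five-sixths full
```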

Output Structure

visualizations/
├── 1_sellthrough_distribution.png
├── 2_feature_importance.png
├── 3_sellthrough_over_time.png
├── 4_sellthrough_by_event_type.png
├── 5_sellthrough_by_location.png
├── 6_actual_vs_predicted.png
├── 7_model_comparison.png
├── 8_title_fight_impact.png
├── 9_residuals.png
└── 10_correlation_heatmap.png