UFC Sell-Through Project - Data Sources
Overview
The project integrates five data sources totaling ~2.8 million records:
| Source | Records | Size | Update Frequency |
|---|---|---|---|
| UFC Stats | ~53,000 | ~11 MB | Daily |
| Betting Odds | ~84,000 | ~50 MB | Per-event |
| Google Trends | ~2.3M | ~200 MB | Weekly |
| Reddit | ~380,000 | ~150 MB | Per-event |
| Graph Features | ~4,400 | ~20 MB | Derived |
Primary Data: UFCStats.com
Source
Downloaded from the Greco1899/scrape_ufc_stats repository, which scrapes UFCStats.com daily.
events.csv (~756 rows)
| Column | Type | Description |
|---|---|---|
| EVENT | string | Event name (e.g., "UFC 300") |
| URL | string | UFCStats.com event URL |
| DATE | string | Event date (various formats) |
| LOCATION | string | City, state/country |
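
Because DATE arrives as a string in more than one format, it usually needs normalizing before any join on event date. The sketch below tries a list of candidate formats in order. The formats themselves and the function name `parse_event_date` are assumptions for illustration, not taken from the project code.

```python
from datetime import date, datetime

# Candidate formats are assumptions; extend the tuple to match the actual data.
CANDIDATE_FORMATS = ("%B %d, %Y", "%b %d, %Y", "%Y-%m-%d", "%m/%d/%Y")


def parse_event_date(raw):
    """Return a date for the first matching format, or None if none match."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None
```

Returning None instead of raising keeps unparseable rows visible so they can be counted and inspected.
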
fight_results.csv (~8,500 rows)
| Column | Type | Description |
|---|---|---|
| EVENT | string | Event name |
| BOUT | string | "Fighter A vs. Fighter B" |
| OUTCOME | string | W/L/D/NC |
| WEIGHTCLASS | string | Weight class + bout type |
| METHOD | string | KO/TKO, Submission, Decision |
| ROUND | int | Ending round |
| TIME | string | Time in round |
| TIME FORMAT | string | Round format (e.g., "3 Rnd (5-5-5)") |
| REFEREE | string | Referee name |
| DETAILS | string | Finish details or judge scores |
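
BOUT packs both fighter names into one string, and ROUND plus TIME together give the point at which the fight ended. A minimal sketch of splitting the first and combining the other two follows. The helper names are hypothetical, and the five-minute round length is an assumption based on the "(5-5-5)" example in TIME FORMAT.

```python
def split_bout(bout):
    """Split "Fighter A vs. Fighter B" into a (fighter_a, fighter_b) tuple."""
    fighter_a, separator, fighter_b = bout.partition(" vs. ")
    if not separator:
        raise ValueError(f"unexpected BOUT format: {bout!r}")
    return fighter_a.strip(), fighter_b.strip()


def elapsed_seconds(round_number, time_in_round, round_minutes=5):
    """Total fight time in seconds, assuming every earlier round ran full length."""
    minutes, seconds = time_in_round.split(":")
    completed = (round_number - 1) * round_minutes * 60
    return completed + int(minutes) * 60 + int(seconds)
```
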
fight_stats.csv (~40,000 rows)
Round-by-round statistics:
| Column | Type | Description |
|---|---|---|
| EVENT | string | Event name |
| BOUT | string | Fight matchup |
| ROUND | int | Round number |
| FIGHTER | string | Fighter name |
| KD | int | Knockdowns |
| SIG.STR. | string | "Landed of Attempted" |
| SIG.STR. % | string | Accuracy percentage |
| TOTAL STR. | string | Total strikes |
| TD | string | Takedowns "Landed of Attempted" |
| TD % | string | Takedown accuracy |
| SUB.ATT | int | Submission attempts |
| REV. | int | Reversals |
| CTRL | string | Control time (MM:SS) |
| HEAD | string | Head strikes |
| BODY | string | Body strikes |
| LEG | string | Leg strikes |
| DISTANCE | string | Distance strikes |
| CLINCH | string | Clinch strikes |
| GROUND | string | Ground strikes |
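
Several of these columns are strings that encode numbers ("Landed of Attempted" pairs and MM:SS control time). A sketch of converting them is shown below. The function names are hypothetical, and treating any non-MM:SS value as missing is an assumption about how gaps appear in the data.

```python
def parse_landed_attempted(value):
    """Split a string like "12 of 30" into (landed, attempted) integers."""
    landed, _, attempted = value.partition(" of ")
    return int(landed), int(attempted)


def parse_control_time(value):
    """Convert "MM:SS" to total seconds; return None for anything else."""
    minutes, separator, seconds = value.partition(":")
    if not separator or not (minutes.isdigit() and seconds.isdigit()):
        return None
    return int(minutes) * 60 + int(seconds)
```

`parse_landed_attempted` raises ValueError on malformed input, which is usually preferable to silently producing zeros in a strike count.
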
fighter_details.csv (~4,500 rows)
| Column | Type | Description |
|---|---|---|
| FIRST | string | First name |
| LAST | string | Last name |
| NICKNAME | string | Ring name |
| URL | string | Fighter profile URL |
fighter_tott.csv (~4,500 rows)
"Tale of the Tape" physical attributes:
| Column | Type | Description |
|---|---|---|
| FIGHTER | string | Full name |
| HEIGHT | string | Height (e.g., "5' 11"") |
| WEIGHT | string | Weight in lbs |
| REACH | string | Reach in inches |
| STANCE | string | Orthodox/Southpaw/Switch |
| DOB | string | Date of birth |
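
HEIGHT is stored as a feet-and-inches string, so it needs converting before it can be used as a numeric feature. A minimal sketch, with `height_to_inches` as a hypothetical helper name and None as an assumed representation for missing values:

```python
import re

# Matches strings such as 5' 11" (feet, optional space, inches).
HEIGHT_PATTERN = re.compile(r"(\d+)'\s*(\d+)\"")


def height_to_inches(value):
    """Convert a height like 5' 11" to total inches, or None if it does not match."""
    match = HEIGHT_PATTERN.fullmatch(value.strip())
    if match is None:
        return None
    feet, inches = map(int, match.groups())
    return feet * 12 + inches
```
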
Attendance Data: Wikipedia
Source
Scraped from individual UFC event Wikipedia pages using BeautifulSoup.
attendance.csv (~177 rows)
| Column | Type | Description |
|---|---|---|
| event_name | string | Event name |
| event_date | date | Event date |
| venue | string | Venue name |
| location | string | City, state/country |
| attendance | int | Tickets sold |
| gate_revenue | float | Gate revenue in USD |
| venue_capacity | int | Maximum capacity |
| sell_through | float | attendance / venue_capacity |
| ppv_buys | int | PPV purchases (if available) |
| source | string | "Wikipedia" |
Data Quality Notes
- Some COVID-era events (2020-2021) have attendance of 0 or very low values
- UFC Apex events have a capacity of ~500 and no public attendance
- Some events are missing gate revenue data
- Sell-through can exceed 1.0 for oversold events
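
The notes above suggest guarding the sell-through calculation against missing capacity and flagging unusual rows rather than dropping them. One way this could look is sketched below. The function names and flag names are hypothetical, not part of the project code.

```python
def sell_through(attendance, venue_capacity):
    """Return attendance / venue_capacity, or None when capacity is missing or zero."""
    if not venue_capacity:
        return None
    return attendance / venue_capacity


def quality_flags(attendance, venue_capacity):
    """Flag rows matching the data quality notes (flag names are illustrative)."""
    ratio = sell_through(attendance, venue_capacity)
    return {
        "no_public_attendance": attendance == 0,
        "oversold": ratio is not None and ratio > 1.0,
    }
```
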
Betting Odds Data
Source
Historical betting lines from a Kaggle dataset or BestFightOdds.com.
betting_odds.csv (~84,000 rows)
| Column | Type | Description |
|---|---|---|
| fight_id | string | Unique fight identifier |
| event_name | string | Event name |
| fighter1_name | string | First fighter |
| fighter2_name | string | Second fighter |
| f1_open_odds | int | Fighter 1 opening odds (American) |
| f2_open_odds | int | Fighter 2 opening odds (American) |
| f1_close_odds | int | Fighter 1 closing odds (American) |
| f2_close_odds | int | Fighter 2 closing odds (American) |
| line_movement_f1 | int | Change in fighter 1's odds between open and close |
| implied_prob_f1 | float | Fighter 1 implied win probability |
| implied_prob_f2 | float | Fighter 2 implied win probability |
| odds_spread | float | Absolute difference between the two implied probabilities |
| is_competitive_matchup | bool | spread < 0.10 |
| has_heavy_favorite | bool | max(prob) > 0.70 |
| source | string | "kaggle" or "synthetic" |
American Odds Conversion
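
    def american_to_probability(odds):
        """Convert American odds to an implied win probability."""
        if odds > 0:
            # Underdog: +150 means a 100 stake wins 150
            return 100 / (odds + 100)
        # Favorite: -150 means a 150 stake wins 100
        return abs(odds) / (abs(odds) + 100)
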
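
To show how the conversion feeds the derived columns in the table, the sketch below computes the spread and the two boolean flags for one fight. It assumes closing odds are the input and that `odds_spread` is the absolute difference of the two implied probabilities. Neither assumption is confirmed by the source, and `matchup_features` is a hypothetical name.

```python
def american_to_probability(odds):
    """Convert American odds to an implied win probability."""
    if odds > 0:
        return 100 / (odds + 100)
    return abs(odds) / (abs(odds) + 100)


def matchup_features(f1_close_odds, f2_close_odds):
    """Derive per-fight odds features using the thresholds from the table."""
    p1 = american_to_probability(f1_close_odds)
    p2 = american_to_probability(f2_close_odds)
    spread = abs(p1 - p2)  # assumed definition of odds_spread
    return {
        "implied_prob_f1": p1,
        "implied_prob_f2": p2,
        "odds_spread": spread,
        "is_competitive_matchup": spread < 0.10,
        "has_heavy_favorite": max(p1, p2) > 0.70,
    }
```

The two implied probabilities generally sum to more than 1.0 because the bookmaker's margin is built into both prices.
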
Google Trends Data
Source
Fighter search interest, retrieved with pytrends (an unofficial Google Trends API client).
google_trends.csv (~2.3M rows)
| Column | Type | Description |
|---|---|---|
| date | date | Week starting date |
| fighter_name | string | Fighter name |
| search_interest | int | 0-100 relative interest |
fighter_buzz.csv (~4,400 rows)
Pre-aggregated buzz metrics:
| Column | Type | Description |
|---|---|---|
| fighter_name | string | Fighter name |
| event_date | date | Fight date |
| buzz_7d | float | 7-day pre-event interest |
| buzz_30d | float | 30-day pre-event interest |
| buzz_trend | float | 7d / 30d ratio |
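
The buzz columns can be read as window averages over the weekly series. The sketch below computes them for one fighter and one event. It assumes the aggregation is a mean over rows dated within the N days before the event. The source only names the windows, so the aggregation and the function names are illustrative.

```python
from datetime import date, timedelta
from statistics import mean


def window_mean(rows, event_date, days):
    """Mean interest over rows dated within `days` days before the event."""
    start = event_date - timedelta(days=days)
    values = [interest for day, interest in rows if start <= day < event_date]
    return mean(values) if values else None


def buzz_metrics(rows, event_date):
    """rows: list of (date, search_interest) tuples for a single fighter."""
    buzz_7d = window_mean(rows, event_date, 7)
    buzz_30d = window_mean(rows, event_date, 30)
    # Leave the trend undefined when either window is empty or the 30-day mean is 0.
    trend = buzz_7d / buzz_30d if buzz_7d is not None and buzz_30d else None
    return {"buzz_7d": buzz_7d, "buzz_30d": buzz_30d, "buzz_trend": trend}
```

A `buzz_trend` above 1.0 means interest in the final week ran higher than the 30-day average.
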
Synthetic Data Generation
When the API is unavailable, synthetic trends are generated:

    import random

    def generate_synthetic_trends(fighter_names, events_df):
        # Excerpt: the loops over fighters and weekly dates, which set
        # days_to_event (days until the fighter's event), are omitted.
        # Base interest varies by "star power"
        base_interest = random.randint(5, 40)
        search_interest = base_interest
        # Spike around events (7 days before)
        if 0 <= days_to_event <= 7:
            spike = random.uniform(1.5, 3.0)
            search_interest *= (1 + spike)
        # Post-event decay
        elif -14 <= days_to_event < 0:
            search_interest *= random.uniform(0.8, 1.0)

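
For reference, a self-contained version of the same spike-and-decay logic for a single fighter and event could look like the sketch below. The function name, the weekly loop, the `seed` parameter, and the clamp to the 0-100 range are all assumptions added for illustration. Only the random ranges and the day windows come from the excerpt above.

```python
import random
from datetime import date, timedelta


def synthetic_weekly_interest(event_date, weeks_before=8, weeks_after=2, seed=None):
    """Return [(week_start, interest)] around one event, clamped to 0-100."""
    rng = random.Random(seed)
    base_interest = rng.randint(5, 40)  # "star power" baseline
    rows = []
    for offset in range(-weeks_before, weeks_after + 1):
        week_start = event_date + timedelta(weeks=offset)
        days_to_event = (event_date - week_start).days
        interest = base_interest
        if 0 <= days_to_event <= 7:
            # Pre-event spike
            interest *= 1 + rng.uniform(1.5, 3.0)
        elif -14 <= days_to_event < 0:
            # Post-event decay
            interest *= rng.uniform(0.8, 1.0)
        rows.append((week_start, min(100, round(interest))))
    return rows
```

Passing a fixed `seed` makes the synthetic series reproducible, which helps when comparing model runs.
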
Reddit Sentiment Data
Source
r/MMA discussion threads via PRAW (Reddit API) or synthetic generation.
reddit_comments.csv (~380,000 rows)
| Column | Type | Description |
|---|---|---|
| event_name | string | Event name |
| thread_title | string | Reddit thread title |
| comment_text | string | Comment text (truncated) |
| score | int | Reddit score (upvotes - downvotes) |
| created_utc | datetime | Comment timestamp |
| polarity | float | Sentiment from -1.0 (negative) to 1.0 (positive) |
| subjectivity | float | 0.0 (objective) to 1.0 (subjective) |
event_sentiment.csv (~200 rows)
Aggregated per-event:
| Column | Type | Description |
|---|---|---|
| event_name | string | Event name |
| avg_polarity | float | Mean sentiment |
| avg_subjectivity | float | Mean subjectivity |
| comment_count | int | Total comments |
| positive_ratio | float | % positive comments |
| negative_ratio | float | % negative comments |
| hype_score | float | Engagement-weighted sentiment |
| sentiment_category | string | Positive/Neutral/Negative |
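
The per-event rollup can be pictured as a simple aggregation over comment rows. The sketch below is one possible reading. The source does not define how `hype_score` weights engagement or where the positive/negative cutoffs sit, so the score-based weighting, the `threshold` parameter, and the function name are all assumptions.

```python
from statistics import mean


def aggregate_event(comments, threshold=0.05):
    """comments: list of (polarity, subjectivity, score) tuples for one event."""
    if not comments:
        return None
    polarities = [polarity for polarity, _, _ in comments]
    # Assumed weighting: Reddit score, floored at 1 so every comment counts.
    weights = [max(score, 1) for _, _, score in comments]
    count = len(comments)
    weighted = sum(p * w for p, w in zip(polarities, weights))
    return {
        "avg_polarity": mean(polarities),
        "avg_subjectivity": mean(s for _, s, _ in comments),
        "comment_count": count,
        "positive_ratio": sum(p > threshold for p in polarities) / count,
        "negative_ratio": sum(p < -threshold for p in polarities) / count,
        "hype_score": weighted / sum(weights),
    }
```
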
Sentiment Analysis
Polarity is computed with TextBlob or with custom word lists:

    # Positive MMA words
    positive = ["hype", "excited", "amazing", "banger", "war", "knockout", "goat"]
    # Negative MMA words
    negative = ["boring", "trash", "robbery", "overrated", "ducking"]
    # Polarity in [-1, 1]; 0.0 when no listed word appears in the comment
    total = pos_count + neg_count
    polarity = (pos_count - neg_count) / total if total else 0.0

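
A runnable version of the word-list approach for a single comment might look like the sketch below. The tokenizer (lowercased alphabetic runs) and the function name are assumptions. The word lists and the formula are the ones given above.

```python
import re

POSITIVE = {"hype", "excited", "amazing", "banger", "war", "knockout", "goat"}
NEGATIVE = {"boring", "trash", "robbery", "overrated", "ducking"}


def word_list_polarity(text):
    """Return (pos - neg) / (pos + neg), or 0.0 when no listed word appears."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos_count = sum(token in POSITIVE for token in tokens)
    neg_count = sum(token in NEGATIVE for token in tokens)
    total = pos_count + neg_count
    return (pos_count - neg_count) / total if total else 0.0
```

Exact-token matching means inflected forms such as "knockouts" are not counted unless they are added to the lists.
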
Data Lineage

    UFCStats.com (daily) ──► raw/events.csv ──► processed/events/ ──► features/event_features/
                        └──► raw/fight_*.csv ──► processed/fights/ ──► features/fighter_features/
                        └──► raw/fighter_*.csv ──► processed/fighters/ ──────────────────┐
                                                                                         ▼
    Wikipedia ──────────► external/attendance.csv ───────────────────────────────► Model Training
                                                                                         ▲
    Google Trends ──────► external/google_trends.csv ──► external/fighter_buzz.csv ──────┤
                                                                                         │
    Reddit r/MMA ───────► external/reddit_comments.csv ──► external/event_sentiment.csv ─┤
                                                                                         │
    BestFightOdds ──────► external/betting_odds.csv ─────────────────────────────────────┘