UFC Sell-Through Project - Data Sources

Overview

The project integrates five data sources totaling ~2.8 million records:

| Source | Records | Size | Update Frequency |
|---|---|---|---|
| UFC Stats | ~53,000 | ~11 MB | Daily |
| Betting Odds | ~84,000 | ~50 MB | Per-event |
| Google Trends | ~2.3M | ~200 MB | Weekly |
| Reddit | ~380,000 | ~150 MB | Per-event |
| Graph Features | ~4,400 | ~20 MB | Derived |

UFC Stats Data

Source

Downloaded from the Greco1899/scrape_ufc_stats GitHub repository, which scrapes UFCStats.com daily.

events.csv (~756 rows)

| Column | Type | Description |
|---|---|---|
| EVENT | string | Event name (e.g., "UFC 300") |
| URL | string | UFCStats.com event URL |
| DATE | string | Event date (various formats) |
| LOCATION | string | City, state/country |
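
Because DATE is stored as a string in various formats, it generally needs normalizing before it can be joined against the other sources. The sketch below tries a list of candidate formats in order. The format list and the name `parse_event_date` are assumptions for illustration, not taken from the project code.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical candidate formats; adjust to whatever actually appears in events.csv.
CANDIDATE_FORMATS = ("%B %d, %Y", "%b %d, %Y", "%Y-%m-%d", "%m/%d/%Y")


def parse_event_date(raw: str) -> Optional[date]:
    """Return the first successful parse of `raw`, or None if no format matches."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next candidate format
    return None
```

Returning None instead of raising makes it easy to count and inspect the rows that no format covers.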

fight_results.csv (~8,500 rows)

| Column | Type | Description |
|---|---|---|
| EVENT | string | Event name |
| BOUT | string | "Fighter A vs. Fighter B" |
| OUTCOME | string | W/L/D/NC |
| WEIGHTCLASS | string | Weight class + bout type |
| METHOD | string | KO/TKO, Submission, Decision |
| ROUND | int | Ending round |
| TIME | string | Time in round |
| TIME FORMAT | string | Round format (e.g., "3 Rnd (5-5-5)") |
| REFEREE | string | Referee name |
| DETAILS | string | Finish details or judge scores |
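
Several of these string columns encode more than one value. As a rough sketch, BOUT can be split into two fighter names, and ROUND, TIME, and TIME FORMAT can be combined into total elapsed fight time. The function names are hypothetical, and the code assumes TIME looks like "M:SS", which the table above does not state explicitly.

```python
import re
from typing import List, Tuple


def split_bout(bout: str) -> Tuple[str, str]:
    """'Fighter A vs. Fighter B' -> ('Fighter A', 'Fighter B')."""
    first, second = bout.split(" vs. ", 1)
    return first.strip(), second.strip()


def round_minutes(time_format: str) -> List[int]:
    """'3 Rnd (5-5-5)' -> [5, 5, 5]; empty list if no parenthesized part is found."""
    match = re.search(r"\(([\d-]+)\)", time_format)
    return [int(part) for part in match.group(1).split("-")] if match else []


def elapsed_seconds(ending_round: int, time_in_round: str, time_format: str) -> int:
    """Total fight time: full length of earlier rounds plus time in the final round."""
    minutes, seconds = time_in_round.split(":")  # assumes "M:SS"
    prior = sum(round_minutes(time_format)[: ending_round - 1]) * 60
    return prior + int(minutes) * 60 + int(seconds)
```

For example, a finish at 1:30 of round 2 in a "3 Rnd (5-5-5)" bout gives 300 + 90 = 390 seconds.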

fight_stats.csv (~40,000 rows)

Round-by-round statistics:

| Column | Type | Description |
|---|---|---|
| EVENT | string | Event name |
| BOUT | string | Fight matchup |
| ROUND | int | Round number |
| FIGHTER | string | Fighter name |
| KD | int | Knockdowns |
| SIG.STR. | string | "Landed of Attempted" |
| SIG.STR. % | string | Accuracy percentage |
| TOTAL STR. | string | Total strikes |
| TD | string | Takedowns "Landed of Attempted" |
| TD % | string | Takedown accuracy |
| SUB.ATT | int | Submission attempts |
| REV. | int | Reversals |
| CTRL | string | Control time (MM:SS) |
| HEAD | string | Head strikes |
| BODY | string | Body strikes |
| LEG | string | Leg strikes |
| DISTANCE | string | Distance strikes |
| CLINCH | string | Clinch strikes |
| GROUND | string | Ground strikes |
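
The "Landed of Attempted" and MM:SS columns have to be converted to numbers before they can be aggregated. The following is a minimal sketch with hypothetical helper names; it returns None for any value that does not match the expected pattern, on the assumption that the raw data may contain placeholders for missing values.

```python
import re
from typing import Optional, Tuple

_LANDED_RE = re.compile(r"^\s*(\d+)\s+of\s+(\d+)\s*$")
_CLOCK_RE = re.compile(r"^\s*(\d+):(\d{2})\s*$")


def parse_landed_attempted(value: str) -> Optional[Tuple[int, int]]:
    """'17 of 42' -> (17, 42); None if the value has another shape."""
    m = _LANDED_RE.match(value)
    return (int(m.group(1)), int(m.group(2))) if m else None


def parse_control_seconds(value: str) -> Optional[int]:
    """'2:35' -> 155 seconds; None if the value has another shape."""
    m = _CLOCK_RE.match(value)
    return int(m.group(1)) * 60 + int(m.group(2)) if m else None
```

Accuracy can then be recomputed as landed / attempted (guarding against zero attempts) rather than parsed from the percentage strings.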

fighter_details.csv (~4,500 rows)

| Column | Type | Description |
|---|---|---|
| FIRST | string | First name |
| LAST | string | Last name |
| NICKNAME | string | Ring name |
| URL | string | Fighter profile URL |

fighter_tott.csv (~4,500 rows)

"Tale of the Tape" physical attributes:

| Column | Type | Description |
|---|---|---|
| FIGHTER | string | Full name |
| HEIGHT | string | Height (e.g., "5' 11"") |
| WEIGHT | string | Weight in lbs |
| REACH | string | Reach in inches |
| STANCE | string | Orthodox/Southpaw/Switch |
| DOB | string | Date of birth |
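
HEIGHT is stored in feet-and-inches notation, so it needs converting to a single number before it can be used as a feature. A small sketch follows; `height_to_inches` is a hypothetical helper, and unparseable values are assumed to map to None.

```python
import re
from typing import Optional

# Matches strings such as: 5' 11"
_HEIGHT_RE = re.compile(r"""^\s*(\d+)'\s*(\d+)"\s*$""")


def height_to_inches(value: str) -> Optional[int]:
    """Convert a feet-and-inches string to total inches, or None if it does not match."""
    m = _HEIGHT_RE.match(value)
    if m is None:
        return None
    feet, inches = int(m.group(1)), int(m.group(2))
    return feet * 12 + inches
```

With this helper, the example value from the table, 5' 11", becomes 71 inches.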

Attendance Data: Wikipedia

Source

Scraped from individual UFC event Wikipedia pages using BeautifulSoup.

attendance.csv (~177 rows)

| Column | Type | Description |
|---|---|---|
| event_name | string | Event name |
| event_date | date | Event date |
| venue | string | Venue name |
| location | string | City, state/country |
| attendance | int | Tickets sold |
| gate_revenue | float | Gate revenue in USD |
| venue_capacity | int | Maximum capacity |
| sell_through | float | attendance / venue_capacity |
| ppv_buys | int | PPV purchases (if available) |
| source | string | "Wikipedia" |
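
The sell_through column is the ratio of two other columns, either of which may be missing for a given event. A minimal sketch of the calculation is shown below; returning None for a missing or zero capacity is an assumption, not documented project behavior.

```python
from typing import Optional


def sell_through(attendance: Optional[int], venue_capacity: Optional[int]) -> Optional[float]:
    """attendance / venue_capacity, or None when either input is unusable."""
    if attendance is None or not venue_capacity:
        return None  # missing attendance, or missing/zero capacity
    return attendance / venue_capacity
```

Values above 1.0 are possible whenever the reported attendance exceeds the listed capacity, so the result is deliberately not clamped.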

Data Quality Notes

Betting Odds Data

Source

Historical betting lines from a Kaggle dataset or BestFightOdds.com.

betting_odds.csv (~84,000 rows)

| Column | Type | Description |
|---|---|---|
| fight_id | string | Unique fight identifier |
| event_name | string | Event name |
| fighter1_name | string | First fighter |
| fighter2_name | string | Second fighter |
| f1_open_odds | int | Fighter 1 opening odds (American) |
| f2_open_odds | int | Fighter 2 opening odds (American) |
| f1_close_odds | int | Fighter 1 closing odds (American) |
| f2_close_odds | int | Fighter 2 closing odds (American) |
| line_movement_f1 | int | Change between fighter 1's opening and closing odds |
| implied_prob_f1 | float | Fighter 1 implied win probability |
| implied_prob_f2 | float | Fighter 2 implied win probability |
| odds_spread | float | |
| is_competitive_matchup | bool | spread < 0.10 |
| has_heavy_favorite | bool | max(prob) > 0.70 |
| source | string | "kaggle" or "synthetic" |

American Odds Conversion

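def american_to_probability(odds):
    """Convert American odds to an implied win probability between 0 and 1."""
    if odds > 0:
        # Positive odds (typically the underdog), e.g. +150 -> 100 / 250 = 0.40
        return 100 / (odds + 100)
    else:
        # Negative odds (typically the favorite), e.g. -150 -> 150 / 250 = 0.60
        return abs(odds) / (abs(odds) + 100)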
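
To illustrate how the conversion feeds the derived columns, the sketch below computes the implied probabilities and the two boolean flags for a single fight. It rests on two assumptions that the table does not confirm: that closing odds are the input, and that odds_spread is the absolute difference between the two implied probabilities. `matchup_flags` is a hypothetical name.

```python
def american_to_probability(odds: int) -> float:
    """Convert American odds to an implied win probability between 0 and 1."""
    if odds > 0:
        return 100 / (odds + 100)
    return abs(odds) / (abs(odds) + 100)


def matchup_flags(f1_close_odds: int, f2_close_odds: int) -> dict:
    """Derive probability-based features for one fight (assumptions noted inline)."""
    p1 = american_to_probability(f1_close_odds)
    p2 = american_to_probability(f2_close_odds)
    spread = abs(p1 - p2)  # ASSUMPTION: spread = |p1 - p2|
    return {
        "implied_prob_f1": p1,
        "implied_prob_f2": p2,
        "odds_spread": spread,
        "is_competitive_matchup": spread < 0.10,
        "has_heavy_favorite": max(p1, p2) > 0.70,
    }


flags = matchup_flags(-150, 130)
total = flags["implied_prob_f1"] + flags["implied_prob_f2"]
```

For -150 / +130 the implied probabilities are 0.60 and about 0.435. They sum to more than 1 because the bookmaker's margin is built into the odds, so they should not be read as a normalized probability distribution without further adjustment.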

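Google Trends Data

Source

Fighter search interest via pytrends, an unofficial Python client for Google Trends.

google_trends.csv (~2.3M rows)

| Column | Type | Description |
|---|---|---|
| date | date | Week starting date |
| fighter_name | string | Fighter name |
| search_interest | int | 0-100 relative interest |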

fighter_buzz.csv (~4,400 rows)

Pre-aggregated buzz metrics:

| Column | Type | Description |
|---|---|---|
| fighter_name | string | Fighter name |
| event_date | date | Fight date |
| buzz_7d | float | 7-day pre-event interest |
| buzz_30d | float | 30-day pre-event interest |
| buzz_trend | float | 7d / 30d ratio |
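
As a rough illustration of how these metrics could be derived from the weekly series, the sketch below averages search_interest over the 7 and 30 days before an event and takes their ratio. The use of a mean, the half-open window ending the day before the event, and the function names are all assumptions.

```python
from datetime import date, timedelta
from statistics import mean
from typing import Dict, Optional


def window_mean(series: Dict[date, int], event_date: date, days: int) -> Optional[float]:
    """Mean interest over [event_date - days, event_date); None if no observations."""
    start = event_date - timedelta(days=days)
    values = [v for d, v in series.items() if start <= d < event_date]
    return mean(values) if values else None


def buzz_metrics(series: Dict[date, int], event_date: date) -> dict:
    """Compute buzz_7d, buzz_30d, and their ratio for one fighter and event."""
    buzz_7d = window_mean(series, event_date, 7)
    buzz_30d = window_mean(series, event_date, 30)
    # The ratio is undefined when either window is empty or the 30-day mean is zero.
    trend = buzz_7d / buzz_30d if buzz_7d is not None and buzz_30d else None
    return {"buzz_7d": buzz_7d, "buzz_30d": buzz_30d, "buzz_trend": trend}


weekly = {
    date(2024, 3, 17): 10,
    date(2024, 3, 24): 20,
    date(2024, 3, 31): 30,
    date(2024, 4, 7): 60,
}
metrics = buzz_metrics(weekly, date(2024, 4, 13))
```

A buzz_trend above 1 means interest in the final week ran above the 30-day average. Because the source data is weekly, the 7-day window usually contains a single observation.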

Synthetic Data Generation

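When the Google Trends API is unavailable, synthetic trends are generated:

import random

# Per-date core of generate_synthetic_trends(fighter_names, events_df);
# the surrounding loop over fighters and dates is not shown here, and the
# helper name below is illustrative.
def synthetic_search_interest(days_to_event):
    # Base interest varies by "star power"
    search_interest = random.randint(5, 40)

    # Spike around events (7 days before)
    if 0 <= days_to_event <= 7:
        spike = random.uniform(1.5, 3.0)
        search_interest *= (1 + spike)

    # Post-event decay (up to 14 days after)
    elif -14 <= days_to_event < 0:
        search_interest *= random.uniform(0.8, 1.0)

    # Keep the result on the 0-100 scale used by search_interest
    return min(100, round(search_interest))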

Reddit Sentiment Data

Source

r/MMA discussion threads via PRAW (Reddit API) or synthetic generation.

reddit_comments.csv (~380,000 rows)

| Column | Type | Description |
|---|---|---|
| event_name | string | Event name |
| thread_title | string | Reddit thread title |
| comment_text | string | Comment text (truncated) |
| score | int | Reddit score (upvotes - downvotes) |
| created_utc | datetime | Comment timestamp |
| polarity | float | Sentiment, -1.0 (negative) to 1.0 (positive) |
| subjectivity | float | 0.0 (objective) to 1.0 (subjective) |

event_sentiment.csv (~200 rows)

Aggregated per-event:

| Column | Type | Description |
|---|---|---|
| event_name | string | Event name |
| avg_polarity | float | Mean sentiment |
| avg_subjectivity | float | Mean subjectivity |
| comment_count | int | Total comments |
| positive_ratio | float | % positive comments |
| negative_ratio | float | % negative comments |
| hype_score | float | Engagement-weighted sentiment |
| sentiment_category | string | Positive/Neutral/Negative |
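
One plausible way to produce these per-event aggregates from comment-level rows is sketched below. Several details are assumptions, since the table does not define them: a comment counts as positive or negative by the sign of its polarity, ratios are expressed as fractions between 0 and 1, and hype_score is taken to be the score-weighted mean polarity. The thresholds behind sentiment_category are not documented, so that column is omitted.

```python
from statistics import mean
from typing import Dict, List, Tuple

Comment = Tuple[float, float, int]  # (polarity, subjectivity, score)


def aggregate_event(comments: List[Comment]) -> Dict[str, float]:
    """Aggregate comment-level sentiment for one event (assumptions noted inline)."""
    if not comments:
        raise ValueError("need at least one comment")
    n = len(comments)
    polarities = [p for p, _, _ in comments]
    subjectivities = [s for _, s, _ in comments]
    # ASSUMPTION: weight by Reddit score, floored at 1 so downvoted comments
    # still count and the weights can never sum to zero.
    weights = [max(score, 1) for _, _, score in comments]
    return {
        "avg_polarity": mean(polarities),
        "avg_subjectivity": mean(subjectivities),
        "comment_count": n,
        "positive_ratio": sum(p > 0 for p in polarities) / n,
        "negative_ratio": sum(p < 0 for p in polarities) / n,
        "hype_score": sum(p * w for p, w in zip(polarities, weights)) / sum(weights),
    }


summary = aggregate_event([(0.5, 0.6, 10), (-0.5, 0.4, 2), (0.0, 0.2, 1)])
```

In this example the plain average polarity is 0.0, but the weighted score is positive (4/13) because the positive comment has the highest Reddit score.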

Sentiment Analysis

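Sentiment is scored with TextBlob or with custom word lists:

# Positive MMA words
positive = ["hype", "excited", "amazing", "banger", "war", "knockout", "goat"]

# Negative MMA words
negative = ["boring", "trash", "robbery", "overrated", "ducking"]

# pos_count / neg_count: number of positive / negative list words found in a comment.
# Fall back to 0.0 (neutral) when no listed word appears, to avoid dividing by zero.
total = pos_count + neg_count
polarity = (pos_count - neg_count) / total if total else 0.0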
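
The word-list formula can be wrapped into a small function as follows. This is only a sketch: the tokenization rule (lowercase runs of letters and apostrophes) and the neutral 0.0 result for comments with no listed words are assumptions, and `word_list_polarity` is a hypothetical name.

```python
import re

POSITIVE = {"hype", "excited", "amazing", "banger", "war", "knockout", "goat"}
NEGATIVE = {"boring", "trash", "robbery", "overrated", "ducking"}


def word_list_polarity(text: str) -> float:
    """Score a comment in [-1.0, 1.0] from counts of listed words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos_count = sum(token in POSITIVE for token in tokens)
    neg_count = sum(token in NEGATIVE for token in tokens)
    total = pos_count + neg_count
    # No listed words at all: treat the comment as neutral.
    return (pos_count - neg_count) / total if total else 0.0
```

Matching on whole tokens avoids counting "war" inside words such as "warning", which a plain substring search would do.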

Data Lineage

UFCStats.com (daily) ──► raw/events.csv ──► processed/events/ ──► features/event_features/
                   └──► raw/fight_*.csv ──► processed/fights/ ──► features/fighter_features/
                   └──► raw/fighter_*.csv ──► processed/fighters/ ──────────┐
                                                                            ▼
Wikipedia ──────────► external/attendance.csv ────────────────────► Model Training
                                                                            ▲
Google Trends ──────► external/google_trends.csv ──► external/fighter_buzz.csv ──┤
                                                                            │
Reddit r/MMA ───────► external/reddit_comments.csv ──► external/event_sentiment.csv ─┤
                                                                            │
BestFightOdds ──────► external/betting_odds.csv ────────────────────────────┘