#stat #stat380 #math

Regression

Regression is the process of finding a model that best fits a set of data.
A regression model can be used for two purposes: prediction and inference.

Prediction

Predict the response as accurately as possible, such as the price of a car.

Inference

Understand the association between variables

Simple linear regression

A model for predicting a quantitative response (Y) based on a single predictor variable (X).

R^2

The proportion of total variation in the response (y) explained by the least-squares regression line. For example, how much of the variation in a Camry's price can be explained by its mileage alone.
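As a rough illustration, $R^2$ can be computed as $1 - RSS/TSS$, where RSS is the sum of squared residuals and TSS is the total sum of squared deviations of y from its mean. The Python sketch below assumes the fitted values come from a least-squares line; the function name `r_squared` and the toy numbers are made up for illustration and are not the Camry data.

```python
def r_squared(observed, predicted):
    """Proportion of variation in `observed` explained by `predicted`."""
    mean_y = sum(observed) / len(observed)
    # Variation left unexplained by the line.
    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    # Total variation of the response around its mean.
    tss = sum((y - mean_y) ** 2 for y in observed)
    return 1 - rss / tss


# Hypothetical toy data (assumption, not from the notes).
observed = [2.0, 4.0, 5.0, 8.0]
# Fitted values of the least-squares line y-hat = 1.9x at x = 1, 2, 3, 4.
predicted = [1.9, 3.8, 5.7, 7.6]

r2 = r_squared(observed, predicted)
print(round(r2, 3))  # 0.963
```

Here about 96% of the variation in the toy response is explained by the line.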

Dummy variables

A technique to convert categorical data into numerical indicators (0s and 1s) for a model.

| Model  | Is_camry | Is_tacoma |
|--------|----------|-----------|
| Camry  | 1        | 0         |
| Tacoma | 0        | 1         |
| Volt   | 0        | 0         |

Using combinations of 0s and 1s, we can represent categories mathematically in a regression model. Volt has a 0 in both columns, so it acts as the baseline category.
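The encoding in the table can be sketched in a few lines of Python. The helper name `dummy_encode` and its `baseline` parameter are hypothetical, chosen only for this illustration.

```python
def dummy_encode(values, baseline):
    """Return one dict of 0/1 indicators per value, omitting the baseline level."""
    # Keep each distinct level once, in order of first appearance.
    levels = [v for v in dict.fromkeys(values) if v != baseline]
    return [
        {f"Is_{level.lower()}": int(v == level) for level in levels}
        for v in values
    ]


models = ["Camry", "Tacoma", "Volt"]
rows = dummy_encode(models, baseline="Volt")
for model, row in zip(models, rows):
    print(model, row)
```

Three categories need only two indicator columns, because the baseline is identified by all indicators being 0.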

Multiple linear regression

This is when we have more than one predictor variable (X1, X2, …, Xp) to predict a quantitative response (Y).
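A prediction from such a model is the intercept plus each coefficient times its predictor. The Python sketch below shows only that arithmetic; the coefficient values are invented for illustration and are not fitted to any data in these notes.

```python
def predict(intercept, coefficients, predictors):
    """Return b0 + b1*x1 + ... + bp*xp for one observation."""
    return intercept + sum(b * x for b, x in zip(coefficients, predictors))


# Hypothetical coefficients (assumption): price in thousands,
# one coefficient per thousand miles and one for an Is_tacoma indicator.
intercept = 20.0
coefficients = [-0.1, 2.0]

car = [50, 1]  # 50 thousand miles, Is_tacoma = 1
y_hat = predict(intercept, coefficients, car)
print(round(y_hat, 2))  # 17.0
```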

Clustering

This is an unsupervised learning technique that involves finding subgroups in a dataset where there is no supervising output.

$\hat{Y}$

$\hat{Y}$ (or $\hat{y}$) is the predicted or estimated value of the response. For Simple Linear Regression (SLR), the estimated regression equation is $\hat{y}_i = a + b x_i$.
Prediction problems fall under the category of Supervised Learning.
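For reference, the least-squares values of the intercept a and slope b can be computed directly from the data with the standard formulas $b = S_{xy}/S_{xx}$ and $a = \bar{y} - b\bar{x}$. The Python sketch below uses a hypothetical helper `fit_slr` and made-up toy data.

```python
def fit_slr(x, y):
    """Return (a, b), the least-squares intercept and slope for y-hat = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical toy data (assumption, not from the notes).
x = [1, 2, 3, 4]
y = [3, 5, 4, 8]

a, b = fit_slr(x, y)
predictions = [a + b * xi for xi in x]  # the y-hat values
print(round(a, 3), round(b, 3))  # 1.5 1.4
```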

Steps to get MSE

$$ MSE = \frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2} $$

The necessary order of operations for calculation is:

  1. Obtain predictions ($\hat{y}$).

  2. Find the residuals ($y - \hat{y}$).

  3. Square the residuals.

  4. Sum the squared residuals (this is the Residual Sum of Squares, RSS).

  5. Divide the sum by n (or multiply by 1/n).

(Note: Steps 4 and 5 can be combined by finding the mean of the squared residuals).
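The five steps map onto code almost line for line. This is a minimal Python sketch with a hypothetical function name and toy numbers; step 1 is assumed to be done already, so the predictions are passed in.

```python
def mean_squared_error(observed, predicted):
    """Mean of the squared residuals between observed and predicted values."""
    # Step 2: residuals, observed minus predicted.
    residuals = [y - y_hat for y, y_hat in zip(observed, predicted)]
    # Step 3: square each residual.
    squared = [r ** 2 for r in residuals]
    # Step 4: sum the squared residuals (RSS).
    rss = sum(squared)
    # Step 5: divide by n.
    return rss / len(observed)


# Hypothetical toy values (assumption): step 1 has produced `predicted`.
observed = [3.0, 5.0, 4.0]
predicted = [2.0, 5.0, 6.0]

mse = mean_squared_error(observed, predicted)
print(round(mse, 4))  # (1 + 0 + 4) / 3 = 1.6667
```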

Practice

Since the notation $\hat{Y} = \hat{f}(X)$ is the core of prediction, let's focus on using an existing model to generate predictions ($\hat{y}_i$) and then evaluate those predictions using the Mean Squared Error (MSE).

We will use the Simple Linear Regression (SLR) example for Camry vehicles.

R Exercise: Predicting Price and Calculating MSE

We will use the estimated regression equation for Camry's price based on mileage: $$ \hat{y} = 21.312 - 0.133x_{\text{miles}} $$

The goal of this exercise is to calculate the MSE, which is defined as: $$ MSE = \frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2} $$

Your Task: Imagine you have a new set of 4 Camry vehicles. Use the equation above to find the predicted price ($\hat{y}_i$) for each car, and then calculate the MSE for this small dataset.

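| Car | Miles $x_i$ (thousands) | Observed price $y_i$ (thousands) | Predicted price $\hat{y}_i$ | Residual $y_i - \hat{y}_i$ | Squared residual |
|-----|-------------------------|----------------------------------|-----------------------------|----------------------------|------------------|
| 1   | 50                      | 14.5                             | ?                           | ?                          | ?                |
| 2   | 100                     | 6.1                              | ?                           | ?                          | ?                |
| 3   | 75                      | 11.0                             | ?                           | ?                          | ?                |
| 4   | 20                      | 18.9                             | ?                           | ?                          | ?                |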

Step 1: Calculate the predicted price, residual, and squared residual for Car 1.
Prediction: $\hat{y}_1 = 21.312 - 0.133(50) = 14.662$
Residual: $y_1 - \hat{y}_1 = 14.5 - 14.662 = -0.162$
Squared residual: $(-0.162)^2 = 0.026244$

Step 2: Calculate the predicted price, residual, and squared residual for Car 2.
Prediction: $\hat{y}_2 = 21.312 - 0.133(100) = 8.012$
Residual: $y_2 - \hat{y}_2 = 6.1 - 8.012 = -1.912$
Squared residual: $(-1.912)^2 = 3.655744$

Step 3: Calculate the predicted price, residual, and squared residual for Car 3.
Prediction: $\hat{y}_3 = 21.312 - 0.133(75) = 11.337$
Residual: $y_3 - \hat{y}_3 = 11.0 - 11.337 = -0.337$
Squared residual: $(-0.337)^2 = 0.113569$

Step 4: Calculate the predicted price, residual, and squared residual for Car 4.
Prediction: $\hat{y}_4 = 21.312 - 0.133(20) = 18.652$
Residual: $y_4 - \hat{y}_4 = 18.9 - 18.652 = 0.248$
Squared residual: $(0.248)^2 = 0.061504$

Step 5: Calculate the MSE

$$ MSE = \frac{0.026244 + 3.655744 + 0.113569 + 0.061504}{4} = \frac{3.857061}{4} \approx 0.9643 $$
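A quick way to check the hand calculation is to run the same steps in code. The exercise is framed for R, but the arithmetic is identical; this is a minimal Python sketch, and the names `predict_price` and `cars` are assumptions made for the illustration.

```python
def predict_price(miles_thousands):
    """Estimated Camry price (thousands) from the equation in the exercise."""
    return 21.312 - 0.133 * miles_thousands


# (miles in thousands, observed price in thousands) for the 4 cars in the table.
cars = [(50, 14.5), (100, 6.1), (75, 11.0), (20, 18.9)]

squared_residuals = []
for car_number, (miles, observed) in enumerate(cars, start=1):
    predicted = predict_price(miles)
    residual = observed - predicted
    squared_residuals.append(residual ** 2)
    print(f"Car {car_number}: predicted={predicted:.3f}, "
          f"residual={residual:.3f}, squared={residual ** 2:.6f}")

mse = sum(squared_residuals) / len(squared_residuals)
print(f"MSE = {mse:.4f}")  # MSE = 0.9643
```

Each residual must use that car's own observed price; Car 2 contributes by far the largest squared residual, so it dominates the MSE.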