Least squares regression: Part 1

Introduction to least squares regression.

By Cecina Babich Morrow in statistics R

July 2, 2024

Inspiration for this post

As evidenced by the depressing three-year lag time since my last blog post on this website, it is well past time to get back into gear and start posting once more. I am coming to the end of the first year of my PhD in Computational Statistics and Data Science, and I thought I would share some of what I have learned in my courses here. Huge thanks to Dr. Song Liu for pretty much all of the material presented here (mistakes are my own).

I’m starting out with that beloved classic, least squares regression.

Making good choices

We’ll start out by thinking a little bit about what we need in order to make “good” decisions.

How can we make Jamie Lee Curtis proud, statistically speaking?

In order to make decisions rationally, a decision-making framework must fulfill the following criteria: predictions should be precise and data-driven, while accounting for both the cost of making an error in the specific situation and the randomness of the underlying data. Statistical decision-making should (at least in theory) fulfill these criteria, and thus provides a useful and rational framework for guiding predictions about the world.

Today, we’ll focus on how we might make good choices in a regression setting, i.e. predicting a continuous output based on a set of inputs. We have the following setting, in mathy terms: for an output variable \(y \in \mathbb{R}\) and a set of \(d\) input variables \(\textbf{x}\in \mathbb{R}^d\), we have a dataset \(D\) consisting of \(n\) pairs of observed inputs and outputs \(\{(\textbf{x}_i, y_i)\}^n_{i=1}\).

Least squares regression

Least squares regression tackles this regression problem by finding a prediction function \(f(\textbf{x};\textbf{w})\), parameterized by weights \(\textbf{w}\), that minimizes the sum of squared differences (hence the name) between the observed outputs \(y_i\) and our predictions: \(\sum^n_{i=1}(y_i-f(\textbf{x}_i;\textbf{w}))^2\).

Why squares? By minimizing the sum of squared differences between prediction and reality, rather than something like the raw absolute distance between the two (see least absolute deviations), we penalize the times our predictions are very far from reality even more heavily.
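
As a quick, purely illustrative aside (a toy comparison, not part of the original material), the two penalties scale very differently with the size of the error, which is what drives this behavior:

# Toy comparison of absolute vs. squared penalties for errors of increasing size
errors <- c(0.1, 0.5, 1, 2, 5, 10)
data.frame(error = errors,
           absolute_penalty = abs(errors),
           squared_penalty = errors^2)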

Our prediction function is defined by \(f(\textbf{x};\textbf{w}) = \langle\textbf{w}_1,\textbf{x}\rangle+w_0\), where \(\textbf{w}:= [\textbf{w}_1, w_0]^\top\), and we choose parameters \(\textbf{w}\) such that \(\textbf{w}_{LS} := \text{argmin}_\textbf{w} \sum_{i=1}^{n}[y_i-f(\textbf{x}_i;\textbf{w})]^2\). Solving this equation yields \(\textbf{w}_{LS}=(\textbf{X}\textbf{X}^\top)^{-1}\textbf{X}\textbf{y}^\top\), where \(\textbf{X}:=\begin{bmatrix} \textbf{x}_1,&...,&\textbf{x}_n\\1,&...,&1\end{bmatrix}\). Why? Well, let’s prove it.

Theorem 1 (Least squares) \(\textbf{w}_{LS}=(\textbf{X}\textbf{X}^\top)^{-1}\textbf{X}\textbf{y}^\top\)

Let \(\textbf{X}:=\begin{bmatrix} \textbf{x}_1,&...,&\textbf{x}_n\\1,&...,&1\end{bmatrix}\), \(\textbf{y}:=[y_1,...,y_n]\). We have defined \(\textbf{w}_{LS}\) as follows: \(\textbf{w}_{LS} := \text{argmin}_\textbf{w} \sum_{i=1}^{n}[y_i-f(\textbf{x}_i;\textbf{w})]^2\).

\[\begin{align*} \textbf{w}_{LS} &:= \text{argmin}_\textbf{w} \sum_{i=1}^{n}[y_i-f(\textbf{x}_i;\textbf{w})]^2 \\ &= \text{argmin}_\textbf{w} ||\textbf{y}-\textbf{w}^\top\textbf{X}||^2 \\ &= \text{argmin}_\textbf{w} (\textbf{y}-\textbf{w}^\top\textbf{X})(\textbf{y}-\textbf{w}^\top\textbf{X})^\top \\ &= \text{argmin}_\textbf{w} (\textbf{y}-\textbf{w}^\top\textbf{X})(\textbf{y}^\top-(\textbf{w}^\top\textbf{X})^\top) \\ &= \text{argmin}_\textbf{w} (\textbf{y}-\textbf{w}^\top\textbf{X})(\textbf{y}^\top-\textbf{X}^\top\textbf{w}) \\ &= \text{argmin}_\textbf{w} (\textbf{y}\textbf{y}^\top - 2\textbf{w}^\top\textbf{X}\textbf{y}^\top+\textbf{w}^\top\textbf{X}\textbf{X}^\top\textbf{w}) \end{align*}\]

(In the last step, we used the fact that \(\textbf{y}\textbf{X}^\top\textbf{w}\) is a scalar, so it equals its own transpose \(\textbf{w}^\top\textbf{X}\textbf{y}^\top\).) We differentiate with respect to \(\textbf{w}\) and set the result equal to zero, using the identities \(\frac{\partial}{\partial \textbf{w}}\textbf{w}^\top\textbf{a} = \textbf{a}\) and \(\frac{\partial}{\partial \textbf{w}}\textbf{w}^\top\textbf{A}\textbf{w} = 2\textbf{A}\textbf{w}\) for symmetric \(\textbf{A}\):

\[\begin{align*} 0 &= \frac{\partial}{\partial \textbf{w}}(\textbf{y}\textbf{y}^\top - 2\textbf{w}^\top\textbf{X}\textbf{y}^\top+\textbf{w}^\top\textbf{X}\textbf{X}^\top\textbf{w}) \\ 0 &= -2\textbf{X}\textbf{y}^\top+2\textbf{X}\textbf{X}^\top\textbf{w}_{LS} \\ 2\textbf{X}\textbf{y}^\top &= 2\textbf{X}\textbf{X}^\top\textbf{w}_{LS} \end{align*}\]

So \(\textbf{w}_{LS}=(\textbf{X}\textbf{X}^\top)^{-1}\textbf{X}\textbf{y}^\top\), provided \(\textbf{X}\textbf{X}^\top\) is invertible.

In this result, we can see that we can only solve for \(\textbf{w}_{LS}\) if \(\textbf{X}\textbf{X}^\top\) is invertible. In situations where the sample size \(n\) is less than the number of inputs \(d\), \(\textbf{X}\textbf{X}^\top\) is rank deficient, and thus non-invertible, so the least squares solution cannot be calculated:

If \(n < d\):

\[\begin{align*} \text{rank}(\textbf{XX}^\top) &= \text{rank}(\textbf{X}) \\ &\leq \text{min}(d+1, n) \\ &= n \end{align*}\]

The matrix \(\textbf{X}\textbf{X}^\top \in \mathbb{R}^{(d+1) \times (d+1)}\) and \(n < d+1\), so \(\textbf{X}\textbf{X}^\top\) is rank deficient and thus non-invertible.
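
We can also see this numerically with a quick sketch (simulated data, purely illustrative): with \(n = 3\) observations and \(d = 5\) inputs, \(\textbf{X}\textbf{X}^\top\) is a \(6 \times 6\) matrix of rank at most 3:

# Purely illustrative: n = 3 observations, d = 5 inputs
set.seed(1)
X <- rbind(matrix(rnorm(5 * 3), nrow = 5), # 5 inputs for each of 3 observations
           rep(1, 3))                      # intercept row
qr(X %*% t(X))$rank     # rank is at most n = 3, even though the matrix is 6 x 6
try(solve(X %*% t(X)))  # solve() errors because the matrix is singular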

R implementation

While R obviously has built-in ways to perform least squares regression, let’s write our own anyway:

# Function to calculate the coefficients w_LS
least_squares_solver <- function(input_variables, output_variable) {
  # Set up variables
  y <- output_variable
  X <- t(as.matrix(input_variables)) # convert each x_i into column vector
  # Add intercept
  X <- rbind(X, rep(1, ncol(X)))
  
  # Compute w_LS
  # Using what we just proved above
  w_LS <- solve(X %*% t(X)) %*% X %*% as.matrix(y)
  return(w_LS)
}
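
A small aside on the implementation (my own note, not something that matters for this example): forming the explicit inverse with solve(X %*% t(X)) is fine here, but it is generally more numerically stable to ask solve() for the solution of the linear system \(\textbf{X}\textbf{X}^\top\textbf{w} = \textbf{X}\textbf{y}^\top\) directly, i.e. the line inside the function could instead be:

# Equivalent result, but avoids forming the explicit inverse
w_LS <- solve(X %*% t(X), X %*% as.matrix(y))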

We’ll also want a function to use the \(\textbf{w}_{LS}\) we find to make predictions on a new dataset:

# Function to apply coefficients to a new set of data
least_squares_predict <- function(input_variables, coefficients) {
  # Set up variables
  X <- t(as.matrix(input_variables)) # convert each x_i into column vector
  # Add intercept
  X <- rbind(X, rep(1, ncol(X)))
  
  preds <- t(as.matrix(coefficients)) %*% X
  return(preds)
}

Finally, we’ll write a function to calculate the sum of squared errors to check how well our predictions are doing:

# Function to calculate sum of squared errors
sum_squared_errors <- function(prediction, actual) {
  squared_error <- (actual - prediction)^2
  return(sum(squared_error))
}

Let’s test it out on some data. We will use a prostate cancer dataset:

library(readr)

# Import data
prostate_data <- read_csv("prostate_data.csv")
## Rows: 97 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (9): lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45, lpsa
## lgl (1): train
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Split data into training and test data
train_data <- prostate_data[prostate_data$train == TRUE,]
test_data <- prostate_data[!prostate_data$train,]

Now, let’s use our function to make a model predicting lpsa using all features:

# Model using all features -----------------------------------------------------
# Fit linear least squares on the training data
train_least_squares <- least_squares_solver(train_data[,1:8], train_data$lpsa)

# Apply the model to the testing data
preds <- least_squares_predict(test_data[,1:8], train_least_squares)
# Calculate the sum of squared errors on the held-out test set
test_error <- sum_squared_errors(preds, test_data$lpsa)
test_error
## [1] 15.63822

Let’s compare to the results of the built-in lm function from R to check that we’re on the right track:

# Compare to lm results --------------------------------------------------------
lm_model <- lm(lpsa ~ ., data = train_data[,1:9])
summary(lm_model)
## 
## Call:
## lm(formula = lpsa ~ ., data = train_data[, 1:9])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.64870 -0.34147 -0.05424  0.44941  1.48675 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.429170   1.553588   0.276  0.78334    
## lcavol       0.576543   0.107438   5.366 1.47e-06 ***
## lweight      0.614020   0.223216   2.751  0.00792 ** 
## age         -0.019001   0.013612  -1.396  0.16806    
## lbph         0.144848   0.070457   2.056  0.04431 *  
## svi          0.737209   0.298555   2.469  0.01651 *  
## lcp         -0.206324   0.110516  -1.867  0.06697 .  
## gleason     -0.029503   0.201136  -0.147  0.88389    
## pgg45        0.009465   0.005447   1.738  0.08755 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7123 on 58 degrees of freedom
## Multiple R-squared:  0.6944,	Adjusted R-squared:  0.6522 
## F-statistic: 16.47 on 8 and 58 DF,  p-value: 2.042e-12
lm_preds <- predict(lm_model, test_data)
# Check if our predictions are the same
all.equal(unname(lm_preds),
          as.vector(preds))
## [1] TRUE

We have the same predictions on the test set!
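
We can also sanity-check the coefficients themselves. Our \(\textbf{w}_{LS}\) stores the intercept in the last position, while lm reports it first, so we reorder before comparing (agreement should be up to numerical precision, since lm uses a QR decomposition rather than the normal equations we derived):

# Reorder our coefficients (intercept last) to match lm's ordering (intercept first)
our_coefs <- c(train_least_squares[9, 1], train_least_squares[1:8, 1])
all.equal(unname(coef(lm_model)), unname(our_coefs))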

Accounting for randomness

So, are we making good choices yet? Considering our criteria for rational decision-making, least squares regression is clearly precise (motivated by mathematical principles), is computed based on observed data, and accounts for the cost of error by measuring that cost in terms of squared difference between prediction and reality. So all that is missing now is to make sure we are accounting for the randomness of the underlying data.

To do this, we can motivate least squares regression in a probabilistic way by assuming that our output variables \(y_i\) are normally distributed with mean \(f(\textbf{x}_i;\textbf{w})\) and variance \(\sigma^2\), and that the observations \((\textbf{x}_i, y_i)\) are independent and identically distributed. Then \(p(y_1, ..., y_n | \textbf{x}_1, ..., \textbf{x}_n, \textbf{w}, \sigma) = \prod_{i=1}^n N_{y_i}(f(\textbf{x}_i; \textbf{w}), \sigma^2)\).

To tune the parameters \(\textbf{w}\) and \(\sigma\), we can use a technique known as maximum likelihood estimation (MLE). MLE maximizes the probability density function evaluated at the dataset \(D\) with respect to our unknown parameters. Intuitively, the goal of MLE is to find parameters that make it most likely that we would observe our actual data.

By performing MLE to determine the parameter \(\textbf{w}_{ML}\), we find that \(\textbf{w}_{ML} = \text{argmin}_\textbf{w} \sum_{i=1}^{n}[y_i-f(\textbf{x}_i;\textbf{w})]^2 = \textbf{w}_{LS}\), so our least squares parameter is also the maximum likelihood estimator. We can also perform MLE to find the uncertainty of our prediction function, \(\sigma^2_{ML} = \frac{1}{n}\sum_{i=1}^{n}[y_i - f(\textbf{x}_i; \textbf{w}_{ML})]^2\), thereby accounting for randomness in our underlying data.
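
To see why (a quick sketch of the MLE calculation under the Gaussian model above), take the negative log of the likelihood:

\[\begin{align*} -\log p(y_1, ..., y_n | \textbf{x}_1, ..., \textbf{x}_n, \textbf{w}, \sigma) &= \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}[y_i - f(\textbf{x}_i;\textbf{w})]^2 \end{align*}\]

Only the second term depends on \(\textbf{w}\), so maximizing the likelihood over \(\textbf{w}\) is exactly minimizing \(\sum_{i=1}^{n}[y_i - f(\textbf{x}_i;\textbf{w})]^2\); setting the derivative with respect to \(\sigma^2\) to zero then gives the \(\sigma^2_{ML}\) above.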
