Coordinate Descent for Logistic Regression
A walk-through of two coordinate descent variants on cross-entropy loss for a binary UCI wine task, compared against scikit-learn's logistic regression baseline. Coordinate descent is a fundamental optimization tool in machine learning because it scales to high-dimensional problems, lets us update one parameter at a time, and underpins many modern large-scale training routines.
Introduction
This post showcases two different approaches to implementing coordinate descent on a model defined with cross-entropy loss. Coordinate descent can struggle with non-smooth multivariate functions: the algorithm may get stuck at a non-stationary point where no single-coordinate move decreases the objective. For instance (a standard illustrative example, not part of this experiment), for \(f(x_{1}, x_{2}) = |x_{1} - x_{2}| + 0.1(x_{1} + x_{2})\), changing either coordinate alone at the origin can only increase \(f\), so coordinate descent stalls there even though \(f\) is unbounded below along \(x_{1} = x_{2} \to -\infty\); the smooth cross-entropy loss used here avoids this issue. Despite this limitation, coordinate descent has gained significant interest for large-scale problems, where it has proven competitive with other algorithms for training support vector machines and performing non-negative matrix factorization.
In this experiment I implement coordinate descent using the following approaches:
- Optimize along a randomly chosen coordinate (random coordinate descent)
- Optimize along the coordinate with the largest gradient magnitude (best coordinate descent)
To benchmark the solutions, I also trained a standard logistic regression model with scikit-learn. It is not tuned or validated; it serves only to provide a reference solution and final loss. The model is trained on the wine.data dataset from the UCI Machine Learning Repository.
Procedure
I use the cross-entropy loss \(L(w)\) in the coordinate descent method. Let \(\hat{y}\) be the predicted probability of label 1, and let \(y\) be the actual label. The cross-entropy loss and its gradient are then defined below:
\[ L(w) = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_{i} \log \hat{y}_{i} + (1 - y_{i}) \log(1 - \hat{y}_{i}) \right] \]
The gradient with respect to each weight \(w_{j}\) is then:
\[ \frac{\partial L(w)}{\partial w_{j}} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_{i} - y_{i})\,x_{ij} \]
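For completeness, the compact form above follows from the chain rule: with \(\hat{y}_{i} = \sigma(w^{\top}x_{i})\) and \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\), the sigmoid derivative cancels the denominators of the log terms:
\[ \frac{\partial L(w)}{\partial w_{j}} = -\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_{i}}{\hat{y}_{i}} - \frac{1 - y_{i}}{1 - \hat{y}_{i}}\right)\hat{y}_{i}(1 - \hat{y}_{i})\,x_{ij} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_{i} - y_{i})\,x_{ij} \]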
The loss function \(L(\cdot)\) should be differentiable and preferably convex. The procedure of the coordinate descent method is outlined below; a consolidated code sketch follows the list:
- Initialize weights to zero
- Get predictions from sigmoid function
- Calculate loss and gradients
- Find the coordinate \(i\) with the largest gradient in absolute value (best coordinate descent), or choose a coordinate at random (random coordinate descent)
- Update weight \(i\) via: \(w_{i} = w_{i} - \eta \frac{\partial L(w)}{\partial w_{i}}\)
- Repeat steps until convergence
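As a preview, here is a minimal sketch that consolidates these steps into a single function. The name coordinate_descent, its signature, and the rule argument are my own, and the sketch relies on the sigmoid and cross_entropy_loss helpers defined later in the post.
import numpy as np
from random import randint

def coordinate_descent(X, y, eta=0.01, num_iter=100000, rule='best'):
    """Minimize the cross-entropy loss by updating one weight per iteration."""
    n, d = X.shape
    w = np.zeros((d, 1))
    history = []
    for _ in range(num_iter):
        y_pred = sigmoid(X @ w).flatten()  # predictions for all samples
        grad = (y_pred - y) @ X            # gradient (the 1/n factor is absorbed into eta)
        j = np.argmax(np.abs(grad)) if rule == 'best' else randint(0, d - 1)
        w[j] = w[j] - eta * grad[j]        # update only the chosen coordinate
        history.append(cross_entropy_loss(y, y_pred))
    return w, history
The full, step-by-step code used for the experiments below follows the same logic.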
Data Preparation
import numpy as np
from tqdm import tqdm
from random import randint
import matplotlib.pyplot as plt
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
data = np.loadtxt('./data/wine.data', delimiter=',')
X, y = data[:, 1:], data[:, 0]

# transform the problem into a binary classification task (keep classes 1 and 2)
idxs = [i for i in range(len(y)) if y[i] == 1 or y[i] == 2]
X, y = X[idxs], y[idxs]

# normalize data (mean-center, then scale by the feature range)
X = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# append a column of ones for the bias term
X = np.hstack((X, np.ones((len(X), 1))))

# transform target variable: class 1 -> 0, class 2 -> 1
y = np.array([0 if label == 1 else 1 for label in y])
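Before fitting anything, a quick sanity check (my own addition, reusing the variable names above) confirms the shapes and the label encoding:
print(X.shape)               # expect (130, 14): classes 1 and 2 of the wine data, 13 features plus the bias column
print(np.bincount(y))        # number of samples per class after relabelling to {0, 1}
print(X[:, :-1].min(), X[:, :-1].max())  # range-scaled features lie within (-1, 1)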
Now that the data is prepared, we can use scikit-learn's implementation as our benchmark:
# scikit-learn implementation (our benchmark); the large C makes it effectively unregularized
reg = LogisticRegression(solver='sag', C=100000, max_iter=10000).fit(X, y)
L_star = log_loss(y, reg.predict_proba(X))
print("Loss L* = {:<16f}".format(L_star))
The loss comes out to \(L^{*} = 0.000393\), so we benchmark our coordinate descent implementations against this baseline \(L^{*} \approx 0.0004\). First, let's define a few helper functions:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def cross_entropy_loss(y, y_pred):
    loss = -(1 / len(y)) * np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))
    return loss

def cross_entropy_grad(y, y_pred, x):
    # gradient of L(w); the 1/n factor is omitted here and absorbed into the learning rate
    return np.dot((y_pred - y), x)[0]
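As a quick correctness check on these helpers (my own addition, not part of the original procedure), we can compare cross_entropy_grad against a central-difference estimate of the loss; because the helper omits the \(1/n\) factor, the numerical estimate is multiplied by len(y):
w0, j, eps = np.random.randn(14, 1) * 0.1, 0, 1e-5  # arbitrary test point and coordinate
w_plus, w_minus = w0.copy(), w0.copy()
w_plus[j] += eps
w_minus[j] -= eps
numeric = len(y) * (cross_entropy_loss(y, sigmoid(np.dot(w_plus.T, X.T)))
                    - cross_entropy_loss(y, sigmoid(np.dot(w_minus.T, X.T)))) / (2 * eps)
analytic = cross_entropy_grad(y, sigmoid(np.dot(w0.T, X.T)), X)[j]
print(numeric, analytic)  # the two values should agree to several decimal places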
Best Index
For each iteration, update only the weight \(w_{j}\) whose gradient magnitude \(\left|\frac{\partial L(w)}{\partial w_{j}}\right|\) is largest over all \(j\).
# initial weights (13 features + bias term)
w = np.zeros(14).reshape(14, 1)
eta = 0.01
loss = []
num_iter = 1005000  # just take a bunch of iterations

for t in tqdm(range(num_iter)):
    # predict step
    y_pred = sigmoid(np.dot(w.T, X.T))
    grad = cross_entropy_grad(y, y_pred, X)
    largest = np.argmax(np.abs(grad))  # index with the largest gradient magnitude
    loss.append(cross_entropy_loss(y, y_pred))
    # update the coordinate with the largest gradient magnitude
    w[largest] = w[largest] - eta * grad[largest]

y_pred = sigmoid(np.dot(w.T, X.T))
y_pred = np.array([1 if p >= 0.5 else 0 for p in y_pred.flatten()])
print("Accuracy for best coordinate descent {0}".format(accuracy_score(y, y_pred)))
print("Loss L = {0}".format(loss[-1]))
Accuracy for best coordinate descent 1.0
Loss L = 0.0004698390523920375
Random Index
For each iteration, update weight \(w_{j}\) for a uniformly random \(j \in \{0, \dots, 13\}\) (i.e., randomly choose a coordinate to update).
# initial weights for the random-coordinate variant
w_rand = np.zeros(14).reshape(14, 1)
eta = 0.01
loss_rand = []
num_iter = 1005000

for t in tqdm(range(num_iter)):
    # predict step
    y_pred = sigmoid(np.dot(w_rand.T, X.T))
    loss_rand.append(cross_entropy_loss(y, y_pred))
    grad = cross_entropy_grad(y, y_pred, X)
    j = randint(0, 13)  # pick a coordinate uniformly at random (13 features + bias)
    # update the randomly chosen coordinate
    w_rand[j] = w_rand[j] - eta * grad[j]

y_pred = sigmoid(np.dot(w_rand.T, X.T))
y_pred = np.array([1 if p >= 0.5 else 0 for p in y_pred.flatten()])
print("Accuracy for random coordinate descent {0}".format(accuracy_score(y, y_pred)))
Accuracy for random coordinate descent 1.0
plt.figure(figsize=(10,10))
plt.plot(loss, 'g-', label='Best Coordinate Descent')
plt.plot(loss_rand, 'r-', label='Random Feature Coordinate Descent')
plt.axhline(y=L_star, color='y', label='L*')
plt.title('iteration vs loss', fontsize=25)
plt.xlabel('iteration', fontsize=15)
plt.xlim(0,1000000)
plt.ylim((-0.01,0.04))
plt.ylabel('loss', fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.grid()
plt.legend()
plt.show()
Experimental Results
In this experiment I began by running an effectively unregularized logistic regression solver from scikit-learn. The final cross-entropy loss of that model is \(L^{*} \approx 0.0004\).
In my results, after roughly a million coordinate updates, the best coordinate descent method reaches a loss of \(L = 0.00047\), close to \(L^{*}\). It is important to note that both implementations of coordinate descent (random coordinate selection and best coordinate selection) approach the benchmark loss \(L^{*}\), but at different rates.
We conclude that coordinate descent converges faster when each update is applied to the coordinate with the largest gradient magnitude; the plot above shows its clear advantage over choosing coordinates to update at random.