cs188 - deep learning for computer vision
course notes from winter '24
Image Classification
As humans, we find it relatively easy to reason about what we see in an image. For a computer, however, all it sees is a matrix of numbers, which creates an inherent semantic gap between reality and its numerical representation.
- image classification
- a core computer vision task that assigns a semantic label to an input image
We can immediately identify some of the primary challenges of image classification:
- viewpoint variation: raw pixel values can change significantly just by varying viewing direction of the same object
- intraclass variation: there are oftentimes many variations of the same class of objects (e.g. multitude of chair designs)
- fine-grained categories: it is equally challenging to identify specific subclasses
- context: occlusion, non-canonical views, scene clutter
- domain changes: visual styles (e.g. painting, clip art, photo)
As you might imagine, image classification is an essential building block for accomplishing other downstream vision tasks like object detection, image captioning, semantic segmentation, and others.
- mnist
- a \(28 \times 28\) grayscale handwritten digit dataset with \(10\) classes (one for each digit), \(50k\) training images, and \(10k\) test images
- cifar-10
- a \(32 \times 32\) RGB image dataset with \(10\) classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck), \(50k\) training images (\(5k\) per class), \(10k\) test images (\(1k\) per class)
- places
- a variable-size image dataset covering \(365\) scene classes with \(1.8M\) training images on the standard set, \(8M\) training images on the challenge set, \(18.25k\) validation images (\(50\) per class), and \(328.5k\) test images (\(900\) per class)
Nearest Neighbors
- nearest neighbors
- memorize all the training images and at test time predict the label of the closest image using a predefined distance metric
The distance metric quantifies the similarity between two images. Some common metrics include the Manhattan distance (i.e. the \(\ell_1\)-norm of the pixel-wise difference) and the Euclidean distance (i.e. the \(\ell_2\)-norm).
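As a quick sketch, the two metrics can be computed directly with NumPy; the tiny \(4\)-pixel "images" below are made-up values purely for illustration:

```python
import numpy as np

# Hypothetical 4-pixel grayscale "images" for illustration.
a = np.array([10.0, 20.0, 30.0, 40.0])
b = np.array([12.0, 18.0, 33.0, 44.0])

# Manhattan (L1) distance: sum of absolute pixel differences.
l1 = np.sum(np.abs(a - b))          # |10-12| + |20-18| + |30-33| + |40-44| = 11
# Euclidean (L2) distance: square root of summed squared differences.
l2 = np.sqrt(np.sum((a - b) ** 2))  # sqrt(4 + 4 + 9 + 16) = sqrt(33)
```

Note that the two metrics can rank a pair of images differently: the \(\ell_1\) distance weighs many small differences the same as one large one, while the \(\ell_2\) distance penalizes large per-pixel differences more heavily.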
import numpy as np

class NearestNeighbor:
    def __init__(self):
        pass

    def train(self, x: np.ndarray, y: np.ndarray):
        """Train nearest neighbor classifier by memorizing training data.

        Args:
            x (np.ndarray): N x D matrix of input images
            y (np.ndarray): N corresponding labels
        """
        self.x_train = x
        self.y_train = y

    def predict(self, x: np.ndarray):
        """Make predictions based on the label of the nearest neighbor.

        Args:
            x (np.ndarray): M x D matrix of input images

        Returns:
            y_pred (np.ndarray): M predicted labels
        """
        # One prediction per *test* image (the test set need not match the
        # training set in size).
        y_pred = np.zeros(x.shape[0], dtype=self.y_train.dtype)
        for i in range(x.shape[0]):
            # L1 distance from test image i to every training image.
            dist = np.sum(np.abs(self.x_train - x[i, :]), axis=1)
            y_pred[i] = self.y_train[np.argmin(dist)]
        return y_pred
A key observation is that the training time is \(O(1)\), while inference time is \(O(n)\). This is quite undesirable. We almost always want the opposite (i.e. long training and fast inference).
We can also extend this logic to \(k\)-nearest neighbors, where we predict the majority vote among the labels of the \(k\) closest training images.
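A minimal sketch of this extension (the function name `knn_predict` and the \(\ell_1\) distance choice are illustrative assumptions, not fixed by the notes):

```python
import numpy as np

def knn_predict(x_train: np.ndarray, y_train: np.ndarray,
                x_test: np.ndarray, k: int = 3) -> np.ndarray:
    """Predict labels by majority vote among the k nearest training points (L1 distance)."""
    y_pred = np.zeros(x_test.shape[0], dtype=y_train.dtype)
    for i in range(x_test.shape[0]):
        dist = np.sum(np.abs(x_train - x_test[i]), axis=1)
        nearest = np.argsort(dist)[:k]                 # indices of the k closest training images
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        y_pred[i] = labels[np.argmax(counts)]          # majority vote (ties broken by label order)
    return y_pred

# Example: the three nearest neighbors of 1.5 all have label 0, outvoting the outlier.
x_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0, 0, 0, 1])
print(knn_predict(x_train, y_train, np.array([[1.5]]), k=3))  # [0]
```

Increasing \(k\) smooths the decision boundary and makes the classifier more robust to noisy labels, at the cost of blurring fine distinctions between classes.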
How do we select the best \(k\) and distance metric to use? To do so, we can follow a standard machine learning framework. We can split our dataset into a train set, validation set, and test set. We train our algorithm on the train set, tune for hyperparameters with the validation set, and save the test set for final evaluation. For smaller datasets, we can further improve this with \(k\)-fold cross validation, in which we split the data into folds, try each fold as validation, and average the results.
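The cross-validation loop above can be sketched as follows; the helper names (`knn_predict`, `cross_validate`, `num_folds`) are illustrative, and a minimal \(k\)-NN predictor stands in for whatever model is being tuned:

```python
import numpy as np

def knn_predict(x_train, y_train, x_val, k):
    """Minimal k-NN predictor (L1 distance, majority vote) used as the model here."""
    y_pred = np.zeros(x_val.shape[0], dtype=y_train.dtype)
    for i in range(x_val.shape[0]):
        dist = np.sum(np.abs(x_train - x_val[i]), axis=1)
        labels, counts = np.unique(y_train[np.argsort(dist)[:k]], return_counts=True)
        y_pred[i] = labels[np.argmax(counts)]
    return y_pred

def cross_validate(x, y, k_values, num_folds=5):
    """Average validation accuracy of each candidate k over the folds."""
    x_folds = np.array_split(x, num_folds)
    y_folds = np.array_split(y, num_folds)
    results = {}
    for k in k_values:
        accs = []
        for i in range(num_folds):
            # Hold out fold i for validation; train on the remaining folds.
            x_tr = np.concatenate(x_folds[:i] + x_folds[i + 1:])
            y_tr = np.concatenate(y_folds[:i] + y_folds[i + 1:])
            accs.append(np.mean(knn_predict(x_tr, y_tr, x_folds[i], k) == y_folds[i]))
        results[k] = float(np.mean(accs))
    return results
```

After picking the \(k\) (and distance metric) with the highest average validation accuracy, we retrain on the full train set and report performance once on the held-out test set.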
Linear Classifier
- linear classifier
- learn a set of weights \(W \in \mathbb{R}^{C \times D}\) and biases \(b \in \mathbb{R}^{C}\) such that \(f(x, W) = x W^T + b\) for \(x \in \mathbb{R}^{N \times D}\), where \(D\) is the input dimension and \(C\) is the number of output classes
In our case, we’d like our input \(x\) to be an image and the output to be the corresponding scene type, either living room, highway, or mountain. Recall that an RGB image can be represented by an \(H \times W \times 3\) array, where \(H\) is the image height and \(W\) is the image width. To use a linear classifier, we first flatten each of our \(N\) images into a \(1\)-dimensional vector of size \(H \cdot W \cdot 3\). It follows that the input dimension is \(D = H \cdot W \cdot 3\), and since there are \(3\) types of scenes, the output dimension is \(C = 3\).
Let’s suppose we have \(N = 1\), \(H = 5\), and \(W = 5\).
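Under these settings, \(D = 5 \cdot 5 \cdot 3 = 75\) and \(C = 3\). A minimal sketch of the forward pass with random values (the variable name `W_img` for the image width is chosen here just to avoid clashing with the weight matrix \(W\)):

```python
import numpy as np

N, H, W_img, C = 1, 5, 5, 3
D = H * W_img * 3                      # flattened input dimension: 75

rng = np.random.default_rng(0)
image = rng.random((N, H, W_img, 3))   # a single random RGB "image"
x = image.reshape(N, D)                # flatten to shape (N, D)

W = rng.standard_normal((C, D))        # weights: one row of D values per class
b = rng.standard_normal(C)             # one bias per class

scores = x @ W.T + b                   # f(x, W) = x W^T + b, shape (N, C)
print(scores.shape)                    # (1, 3)
```

Each of the \(C\) output scores is a weighted sum of all \(75\) pixel values plus a bias, so each row of \(W\) can be reshaped back into a \(5 \times 5 \times 3\) "template" for its class.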