CS221 Problem Workout Solutions
Week 2
Key Takeaways from this Week
The goal of ML is to learn a function f parameterized by w such that f_w(x) is very close to y.
Each algorithm is a triplet of design decisions:
1. Hypothesis class – How will I write down my prediction for y as a function of x?
Which parameters w do I need to learn?
2. Loss function – How do I measure how far my prediction is from the real y?
3. Optimization algorithm – What algorithm will I use to minimize my loss function?
Hypothesis class / Loss function / Optimization algorithm:
• y ∈ R: linear regression, f_w(x) := w · ϕ(x).
  Squared loss (f_w(x) − y)²: GD or SGD.
• y ∈ {−1, 1}: (binary) linear classification, f_w(x) := sign(w · ϕ(x)).
  0-1 loss 1[f_w(x) ≠ y]: cannot use GD or SGD.
  Hinge loss max{1 − (w · ϕ(x))y, 0}: GD or SGD.
  Logistic loss log(1 + e^{−(w·ϕ(x))y}): GD or SGD.
Dimension check. Above, w, ϕ(x) ∈ R^d, while y is a scalar.
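For concreteness, the sketch below (plain Python; the feature map, toy dataset, step size, and number of updates are illustrative choices of ours, not part of the course material) runs SGD on the hinge loss from the table above.

    import random

    # A minimal SGD sketch for the hinge loss max{1 - (w . phi(x)) y, 0}.
    def phi(x):
        return [1.0, x]                   # phi(x) in R^2: a bias feature plus x itself

    data = [(-2.0, -1), (-1.0, -1), (1.0, +1), (3.0, +1)]   # (x, y) pairs, y in {-1, +1}
    w = [0.0, 0.0]
    eta = 0.1                             # step size

    for t in range(100):
        x, y = random.choice(data)        # pick one training example (the "stochastic" part)
        f = phi(x)
        margin = y * sum(wi * fi for wi, fi in zip(w, f))
        if margin < 1:                    # hinge loss is active; its gradient is -y * phi(x)
            w = [wi + eta * y * fi for wi, fi in zip(w, f)]

    print(w)   # after these updates, sign(w . phi(x)) typically matches y on all four points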
1) Problem 1: Non-linear features
Consider the following two training datasets of (x, y) pairs:
• D1 = {(−1, +1), (0, −1), (1, +1)}.
• D2 = {(−1, −1), (0, +1), (1, −1)}.
Observe that neither dataset is linearly separable if we use ϕ(x) = x, so let’s fix that.
Define a two-dimensional feature function ϕ(x) such that:
• There exists a weight vector w1 that classifies D1 perfectly (meaning that w1 ·
ϕ(x) > 0 if x is labeled +1 and w1 · ϕ(x) < 0 if x is labeled −1); and
• There exists a weight vector w2 that classifies D2 perfectly.
Note that the weight vectors can be different for the two datasets, but the features
ϕ(x) must be the same.
Solution One option is ϕ(x) = [1, x²], with w1 = [−1, 2] and w2 = [1, −2].
Then in D1 :
• For x = −1, w1 · ϕ(x) = [−1, 2] · [1, 1] = 1 > 0
• For x = 0, w1 · ϕ(x) = [−1, 2] · [1, 0] = −1 < 0
• For x = 1, w1 · ϕ(x) = [−1, 2] · [1, 1] = 1 > 0
In D2 :
• For x = −1, w2 · ϕ(x) = [1, −2] · [1, 1] = −1 < 0
• For x = 0, w2 · ϕ(x) = [1, −2] · [1, 0] = 1 > 0
• For x = 1, w2 · ϕ(x) = [1, −2] · [1, 1] = −1 < 0
Note that there are many options that work, so long as -1 and 1 are separated from 0.
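The hand computation above can also be checked mechanically; here is a small Python sketch (the helper names are just illustrative).

    # Check that phi(x) = [1, x^2] with the weight vectors above separates D1 and D2.
    def phi(x):
        return [1, x ** 2]

    def classifies(w, data):
        # Every point must satisfy y * (w . phi(x)) > 0.
        return all(y * sum(wi * fi for wi, fi in zip(w, phi(x))) > 0 for x, y in data)

    D1 = [(-1, +1), (0, -1), (1, +1)]
    D2 = [(-1, -1), (0, +1), (1, -1)]

    print(classifies([-1, 2], D1))   # True
    print(classifies([1, -2], D2))   # True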
Some additional food for thought: Is every dataset linearly separable in some feature
space? In other words, given pairs (x1 , y1 ), . . . , (xn , yn ), can we find a feature extractor
ϕ such that we can perfectly classify (ϕ(x1 ), y1 ), . . . , (ϕ(xn ), yn ) for some linear model
w? If so, is this a good feature extractor to use?
Solution In theory, yes we can. If we assume that our inputs x1, . . . , xn are distinct,
then we can construct a feature map ϕ : xi ↦ yi for i = 1, . . . , n. By setting w⋆ = [1],
it's clear that
yi (w⋆ · ϕ(xi)) = yi · yi = 1 > 0,   i = 1, . . . , n,   (1)
so w⋆ correctly classifies all the points in the dataset.
Hopefully, it’s clear that this is a poor choice of feature map. For one, this feature
extractor is undefined for any points outside of the training set! But even more broadly,
this process is not at all generalizable. We are essentially just memorizing our dataset
instead of learning patterns and structures within the data that will allow us to accurately
predict new points in the future. While minimizing training loss is an important
part of the machine learning process (the aforementioned procedure gives you zero
training loss!), it does not guarantee you good performance in the future.
2) Problem 2: Backpropagation
Consider the following function
Loss(x, y, z, w) = 2(xy + max{w, z})
Run the backpropagation algorithm to compute the four gradients (each with respect
to one of the individual variables) at x = 3, y = −4, z = 2 and w = −1. Use the
following nodes: addition, multiplication, max, multiplication by a constant.
Solution When calculating the gradients, we run backpropagation from the root node
back to the leaf nodes. On the computation graph, the yellow values are the values of
each node computed during the forward pass, while the purple and green values are the
partial derivatives of the Loss with respect to each node, computed during the backward
pass.
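Concretely, the chain rule gives ∂Loss/∂x = 2y = −8, ∂Loss/∂y = 2x = 6, ∂Loss/∂z = 2 (since z > w, the max routes the gradient to z), and ∂Loss/∂w = 0. The short Python sketch below runs the same forward and backward pass by hand.

    # A minimal sketch of the forward and backward pass for
    # Loss(x, y, z, w) = 2 * (x*y + max(w, z)) at x = 3, y = -4, z = 2, w = -1.
    x, y, z, w = 3.0, -4.0, 2.0, -1.0

    # Forward pass: evaluate each node of the graph.
    p = x * y            # multiplication node: p = -12
    m = max(w, z)        # max node: m = 2
    s = p + m            # addition node: s = -10
    loss = 2 * s         # multiplication-by-constant node: loss = -20

    # Backward pass: propagate dLoss/d(node) from the root toward the leaves.
    dloss = 1.0
    ds = 2.0 * dloss                     # dLoss/ds = 2
    dp, dm = ds, ds                      # addition passes the gradient through unchanged
    dx = dp * y                          # dLoss/dx = 2y = -8
    dy = dp * x                          # dLoss/dy = 2x = 6
    dw = dm * (1.0 if w > z else 0.0)    # max routes the gradient to its larger input
    dz = dm * (1.0 if z > w else 0.0)    # here z > w, so dz = 2 and dw = 0

    print(dx, dy, dz, dw)                # -8.0 6.0 2.0 0.0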
3) Problem 3: K-means
Consider doing ordinary K-means clustering with K = 2 clusters on the following
set of 3 one-dimensional points:
{−2, 0, 10}. (2)
Recall that K-means can get stuck in local optima. Describe the precise conditions on
the initialization µ1 ∈ R and µ2 ∈ R such that running K-means will yield the global
optimum of the objective function. Notes:
• Assume that µ1 < µ2 .
• Assume that if in step 1 of K-means, no points are assigned to some cluster j,
then in step 2, that centroid µj is set to ∞.
• Hint: try running K-means from various initializations µ1 , µ2 to get some intu-
ition; for example, if we initialize µ1 = 1 and µ2 = 9, then we converge to µ1 = −1
and µ2 = 10.
Solution The objective is minimized for µ1 = −1 and µ2 = 10. First, note that if
all three points end up in one cluster, K-means definitely fails to recover the global
optimum. Therefore, −2 must be assigned to the first cluster, and 10 must be assigned
to the second cluster. The point 0 can be assigned to either: if 0 is assigned to cluster 1,
then we’re done. If it is assigned to cluster 2, then we have µ1 = −2, µ2 = 5; in the next
iteration, 0 will be assigned to cluster 1 since it’s closer. Therefore, the condition on
the initialization, written formally, is |−2 − µ1| < |−2 − µ2| and |10 − µ1| > |10 − µ2|.
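One way to sanity-check this condition is to simulate K-means directly on the three points; below is a minimal Python sketch (the function name and iteration count are arbitrary choices).

    # Minimal 1-D K-means sketch to sanity-check the condition above.
    points = [-2.0, 0.0, 10.0]

    def kmeans(mu1, mu2, iters=10):
        for _ in range(iters):
            # Step 1: assign each point to the nearest centroid (ties go to cluster 1).
            c1 = [x for x in points if abs(x - mu1) <= abs(x - mu2)]
            c2 = [x for x in points if abs(x - mu1) > abs(x - mu2)]
            # Step 2: recompute centroids; an empty cluster gets centroid infinity, per the problem.
            mu1 = sum(c1) / len(c1) if c1 else float("inf")
            mu2 = sum(c2) / len(c2) if c2 else float("inf")
        return mu1, mu2

    print(kmeans(1.0, 9.0))      # (-1.0, 10.0): the global optimum, matching the hint
    print(kmeans(-3.0, -2.5))    # stuck: every point joins one cluster, the other centroid goes to infinity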
4) [optional] Problem 4: Non-linear decision boundaries
Suppose we are performing classification where the input points are of the form (x1, x2) ∈
R^2. We can choose any subset of the following set of features:
𝓕 = {x1, x2, x1x2, x1², x2², 1/x1, 1/x2, 1, 1[x1 ≥ 0], 1[x2 ≥ 0]}   (3)
For each subset of features F ⊆ 𝓕, let D(F) be the set of all decision boundaries
corresponding to linear classifiers that use the features in F.
For each of the following sets of decision boundaries E, provide the minimal F such
that D(F ) ⊇ E. If no such F exists, write ‘none’.
For example, the set of features F = {x1², x2} allows decision boundaries that are
parabolas opening along the x2 axis and centered at the origin.
• E is all lines. [CA hint]
• E is all circles centered at the origin.
• E is all circles.
• E is all axis-aligned rectangles.
• E is all axis-aligned rectangles whose lower-right corner is at (0, 0).
Solution
• Lines: x1, x2, 1 (ax1 + bx2 + c = 0)
• Circles centered at the origin: x1², x2², 1 (x1² + x2² = r²)
• Circles centered anywhere in the plane: x1², x2², x1, x2, 1 ((x1 − a)² + (x2 − b)² = r²)
• Axis-aligned rectangles: not possible (we would need features of the form 1[x1 ≤ a])
• Axis-aligned rectangles with lower-right corner at (0, 0): not possible
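As a sanity check of the second bullet, the features x1², x2², 1 let a linear classifier realize any circle centered at the origin; here is a small Python sketch with an arbitrarily chosen radius.

    # A linear classifier over the features [x1^2, x2^2, 1] whose decision boundary
    # is the circle x1^2 + x2^2 = r^2 (r = 5 is an arbitrary choice).
    r = 5.0
    w = [1.0, 1.0, -r ** 2]       # w . phi(x) = x1^2 + x2^2 - r^2

    def phi(x1, x2):
        return [x1 ** 2, x2 ** 2, 1.0]

    def predict(x1, x2):
        return 1 if sum(wi * fi for wi, fi in zip(w, phi(x1, x2))) > 0 else -1

    print(predict(3.0, 3.0))      # -1: (3, 3) lies inside the circle
    print(predict(4.0, 4.0))      # +1: (4, 4) lies outside the circle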