W1D2.1: Linear Deep Learning

Section 1: Gradient Descent Algorithm

Inputs and outputs of deep learning

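y = f_w(x): the network takes an input x, which can be high-dimensional: the pixels of an image, text from an NLP task, or information about the structure of a molecule. The function f summarizes the whole architecture of the neural network. w denotes the tunable (trainable) parameters, which are adjusted to make the system perform better. The output y may be a vector, for instance one indicating whether the input is a cat or a dog, as in the example.
L = l(y, data): the loss evaluates the function above, i.e. how good the network's current output y is on a given reference dataset. It compares y with the dataset, which holds the correct answers for y; in the example these are the labels saying cat or dog. The more mistakes the network makes, the higher L is, meaning the function predicts poorly.
w* = argmin_w l(f_w(x), data): find the weights w that minimize the loss function.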
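
To make the three definitions concrete, here is a minimal sketch under assumed toy choices that are not from the lecture: a one-parameter "network" f_w(x) = w * x, a mean-squared-error loss, and a brute-force search over candidate weights standing in for argmin. The names `f`, `loss`, and `candidates` are hypothetical.

```python
import numpy as np

# Hypothetical toy "network" with a single trainable weight: f_w(x) = w * x
def f(w, x):
    return w * x

# Hypothetical loss l(y, data): mean squared error between output and targets
def loss(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0])
data = 2.0 * x  # reference answers, generated with a "true" weight of 2.0

# Brute-force argmin: evaluate the loss on a grid of candidate weights
candidates = np.linspace(0.0, 4.0, 41)
losses = np.array([loss(f(w, x), data) for w in candidates])
w_star = candidates[np.argmin(losses)]

print(w_star)
```

A grid search like this only works for one or two weights; with many weights the grid grows exponentially, which is why gradient-based search is used instead.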

How to optimize the function

Gradient descent: minimize a function by taking many small steps, each of which points downhill.

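Gradient descent solves the problem iteratively, step by step, to obtain the minimized loss function and the corresponding model parameters. Conversely, to find the maximum of a function we would iterate with gradient ascent.

An intuitive picture: suppose we are standing somewhere on a large mountain and do not know the way down, so we decide to take it one step at a time. At each position we compute the gradient there and take one step along the negative gradient, that is, downhill in the direction that is currently steepest. We then compute the gradient at the new position and again step in the steepest downhill direction. We keep walking this way until we believe we have reached the foot of the mountain. Walking like this, we may never reach the foot and instead end up in some local low point between peaks. So gradient descent is not guaranteed to find the global optimum; it may return a local optimum. If the loss function is convex, however, every local minimum is also a global minimum, so the solution gradient descent converges to is guaranteed to be globally optimal.

Source: https://www.cnblogs.com/pinard/p/5970503.html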
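
The local-versus-global point can be tried on a small example. The sketch below assumes a hypothetical non-convex one-dimensional loss g(w) = w^4 - 3w^2 + w (not from the course) and runs the same descent loop from two starting points; the step size and step count are arbitrary choices.

```python
# Hypothetical non-convex loss with two valleys (chosen only for illustration)
def g(w):
    return w**4 - 3 * w**2 + w

# Its derivative, computed by hand
def dg(w):
    return 4 * w**3 - 6 * w + 1

def descend(w, eta=0.01, n_steps=2000):
    # Repeatedly step against the derivative (downhill)
    for _ in range(n_steps):
        w = w - eta * dg(w)
    return w

w_right = descend(2.0)   # start on the right-hand slope
w_left = descend(-2.0)   # start on the left-hand slope

print(w_right, g(w_right))
print(w_left, g(w_left))
```

The two runs settle in different valleys (near w = 1.13 and w = -1.30), and only the left one is the global minimum: where gradient descent ends up depends on where it starts.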

The gradient of the objective function

The gradient of the loss function at a point w is the vector made up of the partial derivatives of the loss with respect to each weight, evaluated at w.
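
Written out as a formula, this definition reads as follows. The indexing w_1, ..., w_n for the individual weights is assumed notation, not taken from the lecture.

```latex
\nabla_w L(w) =
\left[
  \frac{\partial L}{\partial w_1},\;
  \frac{\partial L}{\partial w_2},\;
  \dots,\;
  \frac{\partial L}{\partial w_n}
\right]^{\top}
```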

Gradient Descent Algorithm

import numpy as np


def fun_z(x, y):
  """
  Implements function sin(x^2 + y^2)

  Args:
    x: (float, np.ndarray)
      Variable x
    y: (float, np.ndarray)
      Variable y

  Returns:
    z: (float, np.ndarray)
      sin(x^2 + y^2)
  """
  z = np.sin(x**2 + y**2)
  return z

def fun_dz(x, y):
  """
  Implements the gradient of sin(x^2 + y^2)

  Args:
    x: (float, np.ndarray)
      Variable x
    y: (float, np.ndarray)
      Variable y

  Returns:
    Tuple (dz_dx, dz_dy) of partial derivatives of sin(x^2 + y^2)
  """
  # Chain rule: d/dx sin(u) = cos(u) * du/dx, with u = x^2 + y^2
  dz_dx = 2*x*np.cos(x**2+y**2)
  dz_dy = 2*y*np.cos(x**2+y**2)
  return (dz_dx, dz_dy)

# ex1_plot is not defined in these notes; it is assumed to be the
# plotting helper provided by the course notebook.
ex1_plot(fun_z, fun_dz)
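
One way to sanity-check the hand-derived partial derivatives without the plotting helper is to compare them with central finite differences. This is a sketch, not part of the course exercise: `finite_difference`, the test point, and `eps` are assumptions, and the two functions are repeated in compact form so the snippet runs on its own.

```python
import numpy as np

def fun_z(x, y):
    return np.sin(x**2 + y**2)

def fun_dz(x, y):
    common = np.cos(x**2 + y**2)
    return 2 * x * common, 2 * y * common

def finite_difference(x, y, eps=1e-6):
    # Central differences: approximate each partial derivative by
    # nudging one variable at a time while holding the other fixed
    dz_dx = (fun_z(x + eps, y) - fun_z(x - eps, y)) / (2 * eps)
    dz_dy = (fun_z(x, y + eps) - fun_z(x, y - eps)) / (2 * eps)
    return dz_dx, dz_dy

x0, y0 = 0.5, -0.3  # arbitrary test point
analytic = fun_dz(x0, y0)
numeric = finite_difference(x0, y0)
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))

print(analytic, numeric, max_error)
```

If the analytic gradient were wrong (for example a missing factor of 2), `max_error` would be large instead of near zero.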

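The arrows show where the gradient vectors point. These vectors always point in the direction of steepest ascent, the direction that locally increases the function the most. Since we want the direction of steepest descent, the update step carries a minus sign in front of the gradient.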

η (eta) is the step size, also called the learning rate.
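
The update formula itself did not survive in these notes. The minus sign and the step size most likely refer to the standard gradient descent update, written here with an assumed iteration superscript (t):

```latex
w^{(t+1)} = w^{(t)} - \eta \, \nabla_w L\big(w^{(t)}\big)
```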

Make a small change in the weights that most rapidly improves task performance ↔ change each weight in proportion to the negative gradient of the loss.
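
Putting the pieces together, the following sketch runs gradient descent on sin(x^2 + y^2) using its analytic gradient. The starting point, the step size eta = 0.05, the number of steps, and the helper name `gradient_descent` are assumptions chosen for illustration, not values from the course.

```python
import numpy as np

def fun_z(x, y):
    return np.sin(x**2 + y**2)

def fun_dz(x, y):
    common = np.cos(x**2 + y**2)
    return 2 * x * common, 2 * y * common

def gradient_descent(x, y, eta=0.05, n_steps=100):
    # Track the function value so the descent can be inspected afterwards
    history = [fun_z(x, y)]
    for _ in range(n_steps):
        dz_dx, dz_dy = fun_dz(x, y)
        # Change each variable in proportion to the negative gradient
        x = x - eta * dz_dx
        y = y - eta * dz_dy
        history.append(fun_z(x, y))
    return x, y, history

x_final, y_final, history = gradient_descent(x=1.5, y=1.0)

print(history[0], history[-1])
print(x_final**2 + y_final**2)
```

From this starting point the iterates move outward until x^2 + y^2 reaches 3π/2, where the sine attains its minimum of -1. A much larger eta would overshoot this ring, which is why the step size matters.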