Batch Gradient Descent vs Mini-Batch Gradient Descent vs Stochastic Gradient Descent

Batch Gradient Descent

  • Each step of gradient descent uses all the training examples.
  • Advantage: Converges to the global optimum (for a convex cost) after enough iterations.
  • Disadvantage: Computationally expensive on large data sets; a single step may even be too costly to complete (see the sketch below).
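
A minimal sketch of one batch gradient-descent loop for linear regression with a mean-squared-error cost; the function name, learning rate, and iteration count are illustrative assumptions, not taken from these notes.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Batch GD: every update uses all m training examples."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        # Gradient of the MSE cost computed over the full training set
        grad = (X.T @ (X @ theta - y)) / m
        theta -= alpha * grad
    return theta
```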

Stochastic Gradient Descent

  • Each step uses one training example.
  • Learning rate α is typically held constant. We can slowly decrease α over time if we want θ to converge.
  • Advantage: Robust for large data sets.
  • Disadvantage: Unstable; it moves “around” the optimum rather than heading straight to it (as Batch does).
  • NOTE: Shuffling the training examples is really important, to avoid ending up at a local optimum (see the sketch after this list).
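
A corresponding sketch of stochastic gradient descent for the same linear-regression setup, including a per-epoch shuffle and an optional learning-rate decay; the epoch count and decay factor are assumptions for illustration.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=10, decay=0.99):
    """SGD: each update uses ONE training example."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        # Shuffle every epoch so updates are not biased by example order
        for i in np.random.permutation(m):
            grad = (X[i] @ theta - y[i]) * X[i]  # gradient from a single example
            theta -= alpha * grad
        alpha *= decay  # optionally decrease alpha over time so theta converges
    return theta
```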

Mini-batch Gradient Descent

  • Combines Batch with Stochastic: uses b examples (the batch size) in each iteration.
  • Converges more smoothly than Stochastic.
  • Additional parameter: batch_size (see the sketch below).
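
A matching mini-batch sketch, again for linear regression; batch_size=32 is just a common default chosen here for illustration.

```python
import numpy as np

def mini_batch_gradient_descent(X, y, alpha=0.01, batch_size=32, n_epochs=10):
    """Mini-batch GD: each update uses a batch of `batch_size` examples."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        idx = np.random.permutation(m)  # shuffle each epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            # Gradient averaged over the current mini-batch only
            grad = (Xb.T @ (Xb @ theta - yb)) / len(batch)
            theta -= alpha * grad
    return theta
```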