dgcnz / relaxed-equivariance-dynamics

Code for "Effect of equivariance on training dynamics"


Research other ways to evaluate training dynamics/convexity

dgcnz opened this issue · comments

Currently we are only looking at training dynamics through the Hessian-spectra framework of [Park et al. 2022].
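
For quick experimentation outside that framework's tooling, the core quantity is cheap to probe directly. Below is a minimal sketch (assuming a PyTorch `model`, `criterion`, and a batch of `inputs`/`targets`, all placeholder names) that estimates the top Hessian eigenvalue of the training loss by power iteration on Hessian-vector products; libraries such as PyHessian implement the same idea plus stochastic estimates of the full spectral density.

```python
import torch


def top_hessian_eigenvalue(model, criterion, inputs, targets, iters=50, tol=1e-4):
    """Estimate the largest Hessian eigenvalue of the loss w.r.t. the model
    parameters via power iteration on Hessian-vector products (double backprop)."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = criterion(model(inputs), targets)
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit-norm starting direction, stored per-parameter.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
    v = [vi / norm for vi in v]

    eigenvalue = None
    for _ in range(iters):
        # Hessian-vector product: differentiate (grad . v) w.r.t. the parameters.
        hv = torch.autograd.grad(
            sum((g * vi).sum() for g, vi in zip(grads, v)),
            params,
            retain_graph=True,
        )
        # Rayleigh quotient v^T H v (v has unit norm).
        estimate = sum((h * vi).sum() for h, vi in zip(hv, v)).item()
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]
        if eigenvalue is not None and abs(estimate - eigenvalue) < tol:
            break
        eigenvalue = estimate
    return eigenvalue
```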

However, we might also want to look at other ways in which the model architecture affects the loss landscape:

  • Large width affects the existence of local minima:

    "Other results show that if the network is wide enough, local minima where the loss is higher than the global minimum are rare (see Choromanska et al., 2015; Pascanu et al., 2014; Pennington & Bahri, 2017). "[Prince 2023]

  • Redundancy and multiple global minima (see the permutation sketch below the list):

    "We expect loss functions for deep networks to have a large family of equivalent global minima. In fully connected networks, the hidden units at each layer and their associated weights can be permuted without changing the output. In convolutional networks, permuting the channels and convolution kernels appropriately doesn’t change the output." [Prince 2023]

  • Route to the minimum (near-convexity; see the interpolation sketch below the list):

    "Goodfellow et al. (2015b) considered a straight line between the initial parameters and the final values. They show that the loss function along this line usually decreases monotonically (except for a small bump near the start sometimes). This phenomenon is observed for several different types of networks and activation functions (figure 20.5a). Of course, real optimization trajectories do not proceed in a straight line. However, Li et al. (2018b) find that they do lie in low-dimensional subspaces. They attribute this to the existence of large, nearly convex regions in the loss landscape that capture the trajectory early on and funnel it in a few important directions. Surprisingly, Li et al. (2018a) showed that networks still train well if optimization is constrained to lie in a random low-dimensional subspace" [Prince 2023]

  • Curvature of the loss surface (see the negative-eigenvalue sketch below the list):

    "Dauphin et al. (2014) searched for saddle points in a neural network loss function and similarly found a correlation between the loss and the number of negative eigenvalues (figure 20.8). Baldi & Hornik (1989) analyzed the error surface of a shallow network and found that there were no local minima but only saddle points. These results suggest that there are few or no bad local minima." [Prince 2023]`

Comments:

  • Not directly relevant, as overparameterization is not something we're controlling, but:

"Moreover, recent theory shows that there is a trade-off between the model’s Lipschitz constant and overparameterization; Bubeck & Sellke (2021) proved that in D dimensions, smooth interpolation requires D times more parameters than mere interpolation. They argue that current models for large datasets (e.g., ImageNet) aren’t overparameterized enough; increasing model capacity further may be key to improving performance." [Prince 2023]