The evolution of a deep neural network trained by gradient descent can be
described by its neural tangent kernel (NTK), as introduced in [20], where it
was proven that in the infinite-width limit the NTK converges to an explicit
limiting kernel and stays constant during training. The NTK was also
implicit in several other recent papers [6,13,14]. In the overparametrized
regime, a fully trained deep neural network is indeed equivalent to the kernel
regression predictor using the limiting NTK, and gradient descent achieves
zero training loss for a deep overparametrized neural network. However, it was
observed in [5] that there is a performance gap between kernel regression
using the limiting NTK and deep neural networks. This performance gap likely
originates from the change of the NTK during training due to finite-width
effects. The change of the NTK during training is central to describing
the generalization features of deep neural networks.
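As a point of reference for the discussion above (in our own notation, not taken from the paper, and assuming mean squared loss), the finite-width NTK of a network output f_t(x) with parameters theta(t), and the kernel regression predictor built from its infinite-width limit, can be written as

\[
K_t(x, x') = \big\langle \nabla_\theta f_t(x),\, \nabla_\theta f_t(x') \big\rangle,
\qquad
f_{\mathrm{ker}}(x) = K_\infty(x, X)\, K_\infty(X, X)^{-1} y,
\]

where X = (x_1, ..., x_n) are the training inputs, y the labels, and K_\infty the deterministic infinite-width limit of K_t; the performance gap mentioned above is attributed to K_t drifting away from K_\infty during training at finite width.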
In the current paper, we study the dynamics of the NTK for finite-width deep
fully-connected neural networks. We derive an infinite hierarchy of ordinary
differential equations, the neural tangent hierarchy (NTH), which captures the
gradient descent dynamics of the deep neural network. Moreover, under certain
conditions on the neural network width and the data set dimension, we prove
that the truncated hierarchy approximates the dynamics of the NTK to
arbitrary precision. This description makes it possible to directly study the
change of the NTK for deep neural networks, and sheds light on the observation
that deep neural networks outperform kernel regression using the corresponding
limiting NTK.
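To make the structure of the NTH concrete, here is a minimal sketch of the first levels of such a hierarchy (our notation, assuming mean squared loss on training pairs (x_\beta, y_\beta); the precise form in the paper may differ): the time derivative of each kernel is driven by the next, higher-order kernel,

\[
\partial_t f_t(x_\alpha) = -\frac{1}{n} \sum_{\beta=1}^{n} K^{(2)}_t(x_\alpha, x_\beta)\, \big( f_t(x_\beta) - y_\beta \big),
\]
\[
\partial_t K^{(r)}_t(x_{\alpha_1}, \dots, x_{\alpha_r}) = -\frac{1}{n} \sum_{\beta=1}^{n} K^{(r+1)}_t(x_{\alpha_1}, \dots, x_{\alpha_r}, x_\beta)\, \big( f_t(x_\beta) - y_\beta \big), \quad r \ge 2,
\]

where K^{(2)}_t is the NTK itself; truncating the hierarchy at a finite order r is what yields the approximate NTK dynamics referred to above.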
@article{huang2019dynamics,
added-at = {2019-09-19T20:09:58.000+0200},
author = {Huang, Jiaoyang and Yau, Horng-Tzer},
biburl = {https://www.bibsonomy.org/bibtex/29747fef82597b9a9161f91822c17d9f3/kirk86},
description = {[1909.08156] Dynamics of Deep Neural Networks and Neural Tangent Hierarchy},
interhash = {e4522a06f574f3ac51b9648ec9b989c1},
intrahash = {9747fef82597b9a9161f91822c17d9f3},
keywords = {debugging dynamic optimization},
note = {cite arxiv:1909.08156},
timestamp = {2019-09-19T20:09:58.000+0200},
title = {Dynamics of Deep Neural Networks and Neural Tangent Hierarchy},
url = {http://arxiv.org/abs/1909.08156},
year = 2019
}