Abstract

Recurrent neural networks are powerful models for processing sequential data, but they are generally plagued by vanishing and exploding gradient problems. Unitary recurrent neural networks (uRNNs), which use unitary recurrence matrices, have recently been proposed as a means to avoid these issues. However, in previous experiments, the recurrence matrices were restricted to be a product of parameterized unitary matrices, and an open question remains: when does such a parameterization fail to represent all unitary matrices, and how does this restricted representational capacity limit what can be learned? To address this question, we propose full-capacity uRNNs that optimize their recurrence matrix over all unitary matrices, leading to significantly improved performance over uRNNs that use a restricted-capacity recurrence matrix. Our contribution consists of two main components. First, we provide a theoretical argument to determine if a unitary parameterization has restricted capacity. Using this argument, we show that a recently proposed unitary parameterization has restricted capacity for hidden state dimension greater than 7. Second, we show how a complete, full-capacity unitary recurrence matrix can be optimized over the differentiable manifold of unitary matrices. The resulting multiplicative gradient step is very simple and does not require gradient clipping or learning rate adaptation. We confirm the utility of our claims by empirically evaluating our new full-capacity uRNNs on both synthetic and natural data, achieving superior performance compared to both LSTMs and the original restricted-capacity uRNNs.
1 Introduction
Deep feed-forward and recurrent neural networks have been shown to be remarkably effective in a wide variety of problems. A primary difficulty in training using gradient-based methods has been the so-called vanishing or exploding gradient problem, in which the instability of the gradients over multiple layers can impede learning [1, 2]. This problem is particularly acute for recurrent networks, since the repeated use of the recurrent weight matrix can magnify any instability.
This problem has been addressed in the past by various means, including gradient clipping [3], using orthogonal matrices for initialization of the recurrence matrix [4, 5], or by using pioneering architectures such as long short-term memory (LSTM) recurrent networks [6] or gated recurrent units [7]. Recently, several innovative architectures have been introduced to improve information flow in a network: residual networks, which directly pass information from previous layers up in a feed-forward network [8], and attention networks, which allow a recurrent network to access past activations [9]. The idea of using a unitary recurrent weight matrix was introduced so that the gradients are inherently stable and do not vanish or explode [10]. The resulting unitary recurrent neural network (uRNN) is complex-valued and uses a complex form of the rectified linear activation function. However, this idea was investigated using, as we show, a potentially restricted form of unitary matrices.
The two main components of our contribution can be summarized as follows:
1) We provide a theoretical argument to determine, for any particular parameterization of the unitary recurrence matrix, the smallest dimension N for which it cannot cover the entire set of unitary matrices. The argument relies on counting real-valued parameters and using Sard's theorem to show that the smooth map from these parameters to the unitary manifold is not onto. Using this argument, we show that a previously proposed parameterization [10] cannot represent all unitary matrices larger than 7×7; such a parameterization results in what we refer to as a restricted-capacity unitary recurrence matrix.
2) To overcome the limitations of restricted-capacity parameterizations, we propose a new method for stochastic gradient descent for training the unitary recurrence matrix, which constrains the gradient to lie on the differentiable manifold of unitary matrices. This approach allows us to directly optimize a complete, or full-capacity, unitary matrix. Neither restricted-capacity nor full-capacity unitary matrix optimization requires gradient clipping. Furthermore, full-capacity optimization still achieves good results without adaptation of the learning rate during training.
To test the limitations of a restricted-capacity representation and to confirm that our full-capacity uRNN does have practical implications, we test restricted-capacity and full-capacity uRNNs on both synthetic and natural data tasks. These tasks include synthetic system identification, long-term memorization, frame-to-frame prediction of speech spectra, and pixel-by-pixel classification of handwritten digits. Our proposed full-capacity uRNNs generally achieve equivalent or superior performance on synthetic and natural data compared to both LSTMs [6] and the original restricted-capacity uRNNs [10].
In the next section, we give an overview of unitary recurrent neural networks. Section 3 presents our first contribution: the theoretical argument to determine if any unitary parameterization has restricted capacity. Section 4 describes our second contribution, where we show how to optimize a full-capacity unitary matrix. We confirm our results with simulated and natural data in Section 5 and present our conclusions in Section 6.
2 Unitary recurrent neural networks
The uRNN proposed by Arjovsky et al. [10] consists of the following nonlinear dynamical system, with real- or complex-valued inputs x_t of dimension M, complex-valued hidden states h_t of dimension N, and real- or complex-valued outputs y_t of dimension L:

h_t = σ(W h_{t−1} + V x_t),
y_t = U h_t + c,     (1)
where y_t = Re{U h_t + c} if the outputs y_t are real-valued. The element-wise nonlinearity σ is

σ(z)_i = ((|z_i| + b_i) / |z_i|) z_i   if |z_i| + b_i > 0,   and 0 otherwise.     (2)
Note that this nonlinearity amounts to a soft-thresholding of the magnitude using the bias vector b. Hard-thresholding would instead set the output of σ to z_i whenever |z_i| + b_i > 0. The parameters of the uRNN are as follows: W ∈ U(N), unitary hidden state transition matrix; V ∈ C^{N×M}, input-to-hidden transformation; b ∈ R^N, nonlinearity bias; U ∈ C^{L×N}, hidden-to-output transformation; and c ∈ C^L, output bias.
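As an illustration, here is a minimal NumPy sketch of the soft-thresholding nonlinearity described above (the authors' implementation is in Theano; the function name mod_relu and the small epsilon guard are illustrative choices, not part of the original specification):

```python
import numpy as np

def mod_relu(z, b, eps=1e-8):
    """Soft-threshold the magnitude of each complex entry z_i by the bias b_i.

    Returns ((|z_i| + b_i) / |z_i|) * z_i when |z_i| + b_i > 0, and 0 otherwise,
    matching the soft-thresholding behavior described in the text.
    """
    mag = np.abs(z)
    scale = np.maximum(mag + b, 0.0) / (mag + eps)  # eps guards against division by zero
    return scale * z

# Example: a 4-dimensional complex pre-activation and a small negative bias.
z = np.array([0.3 + 0.4j, -0.05 + 0.02j, 1.0 + 0.0j, 0.0 - 0.2j])
b = np.full(4, -0.1)
print(mod_relu(z, b))
```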
Arjovsky et al. [10] propose the following parameterization of the unitary matrix W:

W_u(θ_u) = D_3 R_2 F^{−1} D_2 P R_1 F D_1,     (3)
where the D are diagonal unitary matrices, the R are Householder reflection matrices [11], F is a discrete Fourier transform (DFT) matrix, and P is a permutation matrix. The resulting matrix W_u is unitary because all of its component matrices are unitary. This decomposition is efficient because diagonal, reflection, and permutation matrices can be applied in O(N) operations, and DFTs can be computed in O(N log N) time using the fast Fourier transform (FFT). The parameter vector θ_u consists of 7N real-valued parameters: N parameters for each of the 3 diagonal matrices, where D_{i,i} = e^{jθ_i}, and 2N parameters for each of the 2 Householder reflection matrices, which are the real and imaginary parts of the complex reflection vectors u_i: R_i = I − 2 u_i u_i^H / ⟨u_i, u_i⟩.
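To make the construction concrete, the sketch below assembles a matrix of the form (3) from randomly drawn components and checks that it is unitary. This is an illustrative NumPy sketch, not the authors' code; the ordering of the factors follows (3) as written above, and all names are hypothetical.

```python
import numpy as np

def diag_unitary(theta):
    return np.diag(np.exp(1j * theta))

def householder(u):
    """Complex Householder reflection R = I - 2 u u^H / <u, u>."""
    u = u.reshape(-1, 1)
    return np.eye(len(u)) - 2.0 * (u @ u.conj().T) / np.vdot(u, u)

def restricted_unitary(N, rng):
    D1, D2, D3 = (diag_unitary(rng.uniform(-np.pi, np.pi, N)) for _ in range(3))
    R1 = householder(rng.standard_normal(N) + 1j * rng.standard_normal(N))
    R2 = householder(rng.standard_normal(N) + 1j * rng.standard_normal(N))
    F = np.fft.fft(np.eye(N)) / np.sqrt(N)        # unitary DFT matrix
    P = np.eye(N)[rng.permutation(N)]             # permutation matrix
    # Product in the order of (3): W_u = D3 R2 F^{-1} D2 P R1 F D1
    return D3 @ R2 @ np.linalg.inv(F) @ D2 @ P @ R1 @ F @ D1

rng = np.random.default_rng(0)
W = restricted_unitary(8, rng)
print(np.allclose(W.conj().T @ W, np.eye(8)))     # True: any product of unitaries is unitary
```

Note that the product is unitary regardless of the factor ordering, since each factor is itself unitary; the ordering only affects which unitary matrix is produced.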
3 Estimating the representation capacity of structured unitary matrices
In this section, we state and prove a theorem that can be used to determine when a particular unitary parameterization does not have the capacity to represent all unitary matrices. As an application of this theorem, we show that the parameterization (3) does not have the capacity to cover all N×N unitary matrices for N > 7. First, we establish an upper bound on the number of real-valued parameters required to represent any N×N unitary matrix. Then, we state and prove our theorem.
Lemma 3.1
The set of all N×N unitary matrices is a manifold of dimension N^2.
Proof: The set of all unitary matrices is the well-known unitary Lie group U(N) [12, §3.4]. A Lie group identifies group elements with points on a differentiable manifold [12, §2.2]. The dimension of the manifold is equal to the dimension of the Lie algebra u, which is a vector space that is the tangent space at the identity element [12, §4.5]. For U(N), the Lie algebra consists of all skew-Hermitian matrices A [12, §5.4]. A skew-Hermitian matrix is any A ∈ C^{N×N} such that A = −A^H, where (·)^H denotes the conjugate transpose. To determine the dimension of U(N), we can determine the dimension of u. Because of the skew-Hermitian constraint, the diagonal elements of A are purely imaginary, which corresponds to N real-valued parameters. Also, since A_{i,j} = −A*_{j,i}, the strictly upper triangular part of A determines the strictly lower triangular part and is parameterized by N(N−1)/2 complex numbers, which corresponds to an additional N^2 − N real parameters. Thus, U(N) is a manifold of dimension N^2.
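As a quick numerical sanity check of this dimension count (an illustration, not part of the proof), one can map N^2 real parameters to a skew-Hermitian matrix exactly as in the argument above and verify that its matrix exponential lands in U(N). The helper name and parameter layout below are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm

def skew_hermitian_from_params(p, N):
    """Map N^2 real parameters to a skew-Hermitian matrix A (A = -A^H)."""
    assert p.size == N * N
    A = np.zeros((N, N), dtype=complex)
    diag, rest = p[:N], p[N:]
    np.fill_diagonal(A, 1j * diag)                  # purely imaginary diagonal: N parameters
    iu = np.triu_indices(N, k=1)                    # strictly upper triangle
    re, im = np.split(rest, 2)                      # N(N-1)/2 complex entries: N^2 - N parameters
    A[iu] = re + 1j * im
    A[(iu[1], iu[0])] = -(re - 1j * im)             # enforce A_{j,i} = -conj(A_{i,j})
    return A

N = 5
p = np.random.default_rng(0).standard_normal(N * N)
W = expm(skew_hermitian_from_params(p, N))          # exponential of skew-Hermitian is unitary
print(np.allclose(W.conj().T @ W, np.eye(N)))       # True
```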
Theorem 3.2
If a family of N×N unitary matrices is parameterized by P real-valued parameters with P < N^2, then it cannot contain all N×N unitary matrices.
Proof: We consider a family of unitary matrices that is parameterized by P real-valued parameters through a smooth map g : P^(P) → U(N) from the space of parameters P^(P) to the set of all unitary matrices U(N). The parameter space P^(P) is a P-dimensional manifold, while U(N) is an N^2-dimensional manifold according to Lemma 3.1. Then, if P < N^2, Sard's theorem [13] implies that the image g(P^(P)) of g has measure zero in U(N), and in particular g is not onto. Since g is not onto, there must exist a unitary matrix W ∈ U(N) for which there is no parameter vector θ ∈ P^(P) such that W = g(θ). Thus, if P < N^2, the parameterization cannot represent all unitary matrices in U(N).
We now apply Theorem 3.2 to the parameterization (3). Note that the parameterization (3) has P = 7N real-valued parameters. Solving 7N < N^2 for N gives N > 7. Thus, the parameterization (3) cannot represent all unitary matrices for dimension N > 7.
4 Optimizing full-capacity unitary matrices on the Stiefel manifold
In this section, we show how to overcome the limitations of restricted-capacity parameterizations and directly optimize a full-capacity unitary matrix. We consider the Stiefel manifold of all N×N complex-valued matrices whose columns are N orthonormal vectors in C^N [14]. Mathematically, the Stiefel manifold is defined as

V_N(C^N) = { W ∈ C^{N×N} : W^H W = I_N }.     (4)
For any W ∈ V_N(C^N), any matrix Z in the tangent space T_W V_N(C^N) of the Stiefel manifold satisfies Z^H W − W^H Z = 0 [14]. The Stiefel manifold becomes a Riemannian manifold when its tangent space is equipped with an inner product. Tagare [14] suggests using the canonical inner product, given by

⟨Z_1, Z_2⟩_c = tr( Z_1^H (I − (1/2) W W^H) Z_2 ).     (5)
Under this canonical inner product on the tangent space, the gradient in the Stiefel manifold of the loss function f with respect to the matrix W is AW, where A = G^H W − W^H G is a skew-Hermitian matrix and G, with G_{i,j} = ∂f/∂W_{i,j}, is the usual gradient of the loss function f with respect to the matrix W [14]. Using these facts, Tagare [14] suggests a descent curve along the Stiefel manifold at training iteration k given by the matrix product of the Cayley transformation of A^(k) with the current solution W^(k):

Y^(k)(λ) = ( I + (λ/2) A^(k) )^{−1} ( I − (λ/2) A^(k) ) W^(k),     (6)
where λ is a learning rate and A^(k) = (G^(k))^H W^(k) − (W^(k))^H G^(k). Gradient descent proceeds by performing updates W^(k+1) = Y^(k)(λ). Tagare [14] suggests an Armijo-Wolfe search along the curve to adapt λ, but such a procedure would be expensive for neural network optimization, since it requires multiple evaluations of the forward model and gradients. We found that simply using a fixed learning rate λ often works well. Also, RMSprop-style scaling of the gradient G^(k) by a running average of the previous gradients' norms [15] before applying the multiplicative step (6) can improve convergence. The only additional substantial computation required beyond the forward and backward passes of the network is the N×N matrix inverse in (6).
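The following NumPy sketch illustrates one such multiplicative update with a fixed learning rate, assuming the gradient G is supplied by back-propagation (this mirrors the expressions above but is not the authors' Theano implementation). Because A is skew-Hermitian, the Cayley-style factor is unitary, so the updated W stays on the manifold up to numerical precision.

```python
import numpy as np

def stiefel_update(W, G, lr):
    """One full-capacity update of a unitary recurrence matrix W.

    W  : current N x N unitary matrix
    G  : gradient of the loss with respect to W (same shape as W)
    lr : fixed learning rate (lambda in the text)
    """
    N = W.shape[0]
    A = G.conj().T @ W - W.conj().T @ G              # skew-Hermitian, as in the text
    I = np.eye(N)
    # Cayley-style curve evaluated at step size lr, as in (6)
    return np.linalg.solve(I + (lr / 2.0) * A, I - (lr / 2.0) * A) @ W

# Toy check: take a step from the identity along a random gradient and
# confirm the result is still (numerically) unitary.
rng = np.random.default_rng(0)
N = 16
W = np.eye(N, dtype=complex)
G = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
W_new = stiefel_update(W, G, lr=1e-3)
print(np.allclose(W_new.conj().T @ W_new, np.eye(N)))   # True
```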
5 Experiments
All models are implemented in Theano [16], based on the implementation of restricted-capacity uRNNs by [10], available from https://github.com/amarshah/complex_RNN. All code to replicate our results is available from https://github.com/stwisdom/urnn. All models use RMSprop [15] for optimization, except that full-capacity uRNNs optimize their recurrence matrices with a fixed learning rate using the update step (6) and optional RMSprop-style gradient normalization.
5.1 Synthetic data
First, we compare the performance of full-capacity uRNNs to restricted-capacity uRNNs and LSTMs on two tasks with synthetic data. The first task is synthetic system identification, where a uRNN must learn the dynamics of a target uRNN given only samples of the target uRNNâs inputs and outputs. The second task is the copy memory problem, in which the network must recall a sequence of data after a long period of time.
System identification
For the task of system identification, we consider the problem of learning the dynamics of a nonlinear dynamical system that has the form (1), given a dataset of inputs and outputs of the system. We will draw a true system Wsys randomly from either a constrained set Wu of restricted-capacity unitary matrices using the parameterization Wu(θu) in (3) or from a wider set Wg of restricted-capacity unitary matrices that are guaranteed to lie outside Wu. We sample from Wg by taking a matrix product of two unitary matrices drawn from Wu.
We use a sequence length of T = 150, and we set the input dimension M and output dimension L both equal to the hidden state dimension N. The input-to-hidden transformation V and hidden-to-output transformation U are both set to the identity, the output bias c is set to 0, the initial state is set to 0, and the hidden bias b is drawn from a uniform distribution on the range [−0.11, −0.09]. The hidden bias has a mean of −0.1 to ensure stability of the system outputs. Inputs are generated by sampling T-length i.i.d. sequences of zero-mean circular complex-valued Gaussians of dimension N with diagonal unit covariance. The outputs are created by running the system (1) forward on the inputs.
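A hedged sketch of how such a dataset could be generated under the setup just described (V = U = I, c = 0, zero initial state, bias drawn near −0.1) is shown below. The helper names are illustrative, W_sys stands for the sampled true unitary recurrence matrix, and mod_relu mirrors the nonlinearity sketch from Section 2.

```python
import numpy as np

def mod_relu(z, b, eps=1e-8):
    mag = np.abs(z)
    return np.maximum(mag + b, 0.0) / (mag + eps) * z

def generate_sequences(W_sys, num_seqs, T=150, rng=None):
    """Run the system (1) forward with V = U = I and c = 0 on complex Gaussian inputs."""
    rng = rng or np.random.default_rng()
    N = W_sys.shape[0]
    b = rng.uniform(-0.11, -0.09, size=N)             # hidden bias with mean about -0.1
    X = (rng.standard_normal((num_seqs, T, N)) +
         1j * rng.standard_normal((num_seqs, T, N))) / np.sqrt(2)  # circular complex Gaussian inputs
    Y = np.empty_like(X)
    for s in range(num_seqs):
        h = np.zeros(N, dtype=complex)                # zero initial state
        for t in range(T):
            h = mod_relu(W_sys @ h + X[s, t], b)      # h_t = sigma(W h_{t-1} + x_t), since V = I
            Y[s, t] = h                               # y_t = h_t, since U = I and c = 0
    return X, Y
```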
We compare a restricted-capacity uRNN using the parameterization from (3) and a full-capacity uRNN using Stiefel manifold optimization with no gradient normalization, as described in Section 4. We choose hidden state dimensions N ∈ {4, 6, 7, 8, 16} to test the critical point of W_u(θ_u) in (3) predicted by our argument in Section 3. These dimensions are chosen to test below, at, and above the critical dimension of 7.
For all experiments, the numbers of training, validation, and test sequences are 20000, 1000, and 1000, respectively. Mean-squared error (MSE) is used as the loss function. The learning rate is 0.001 with a batch size of 50 for all experiments. Both models use the same matrix drawn from Wu as initialization. To isolate the effect of unitary recurrence matrix capacity, we only optimize W, setting all other parameters to their true oracle values. For each method, we report the best test loss over 100 epochs and over 6 random initializations of the optimization.
The results are shown in Table 1. "W_sys init." refers to the initialization of the true system unitary matrix W_sys, which is sampled from either the restricted-capacity set Wu or the wider set Wg.
Notice that for N < 7, the restricted-capacity uRNN achieves comparable or better performance than the full-capacity uRNN. At N = 7, the restricted-capacity and full-capacity uRNNs achieve relatively comparable performance, with the full-capacity uRNN achieving slightly lower error. For N > 7, the full-capacity uRNN always outperforms the restricted-capacity uRNN. This result confirms our theoretical argument that the restricted-capacity parameterization in (3) lacks the capacity to model all matrices in the unitary group for N > 7, and it indicates the advantage of using a full-capacity unitary recurrence matrix.
Copy memory problem
The experimental setup follows the copy memory problem from [10], which itself was based on the experiment from [6]. We consider alternative hidden state dimensions and extend the sequence lengths to T=1000 and T=2000, which are longer than the maximum length of T=750 considered in previous literature.
In this task, the data is a vector of length T + 20 and consists of elements from 10 categories. The vector begins with a sequence of 10 symbols sampled uniformly from categories 1 to 8. The next T − 1 elements of the vector are the ninth, "blank" category, followed by an element from the tenth category, the "delimiter". The remaining ten elements are "blank". The task is to output T + 10 blank characters followed by the sequence from the beginning of the vector. We use average cross entropy as the training loss function. The baseline solution outputs the blank category for T + 10 time steps and then guesses a random symbol uniformly from the first eight categories. This baseline has an expected average cross entropy of 10 log(8)/(T + 20).
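For concreteness, the sketch below constructs one input/target pair for this task and evaluates the baseline's expected average cross entropy. The integer encoding (0-7 for symbols, 8 for "blank", 9 for the "delimiter") is an illustrative convention, not necessarily the authors' encoding.

```python
import numpy as np

def copy_task_example(T, rng=None):
    """One input/target pair for the copy memory problem with delay T."""
    rng = rng or np.random.default_rng()
    BLANK, DELIM = 8, 9
    seq = rng.integers(0, 8, size=10)                     # 10 symbols from categories 0..7
    x = np.concatenate([seq,
                        np.full(T - 1, BLANK),
                        [DELIM],
                        np.full(10, BLANK)])              # input of length T + 20
    y = np.concatenate([np.full(T + 10, BLANK), seq])     # output blanks, then recall the sequence
    return x, y

# Baseline: predict 'blank' everywhere, then guess uniformly over the 8 symbol categories.
# Only the final 10 positions are wrong on average, each costing log(8) nats, so the
# average cross entropy over the T + 20 outputs is 10*log(8)/(T + 20).
T = 1000
print(10 * np.log(8) / (T + 20))   # approximately 0.020, consistent with the value discussed below
```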
The full-capacity uRNN uses a hidden state size of N = 128 with no gradient normalization. To match the number of parameters (approximately 22k), we use N = 470 for the restricted-capacity uRNN and N = 68 for the LSTM. The training set size is 100000 and the test set size is 10000. The results of the T = 1000 experiment are shown on the left half of Figure 1. The full-capacity uRNN converges to a solution with zero average cross entropy after about 2000 training iterations, whereas the restricted-capacity uRNN settles to the baseline solution of 0.020. The results of the T = 2000 experiment are shown on the right half of Figure 1. The full-capacity uRNN hovers around the baseline solution for about 5000 training iterations, after which it drops down to zero average cross entropy. The restricted-capacity uRNN again settles down to the baseline solution of 0.010. These results demonstrate that the full-capacity uRNN is very effective for problems requiring very long memory.
5.2 Speech data
We now apply restricted-capacity and full-capacity uRNNs to real-world speech data and compare their performance to LSTMs. The main task we consider is predicting the log-magnitude of future frames of a short-time Fourier transform (STFT). The STFT is a commonly used feature domain for speech enhancement and is defined as the Fourier transform of short windowed frames of the time series. In the STFT domain, a real-valued audio signal is represented as a complex-valued F×T matrix composed of T frames, each with F = N_win/2 + 1 frequency bins, where N_win is the duration of the time-domain frame in samples. Most speech processing algorithms use the log-magnitude of the complex STFT values and reconstruct the processed audio signal using the phase of the original observations.
The frame prediction task is as follows: given all the log-magnitudes of STFT frames up to time t, predict the log-magnitude of the STFT frame at time t + 1. We use the TIMIT dataset [17]. Following common practice [18], we use a training set with 3690 utterances from 462 speakers, a validation set of 400 utterances, and an evaluation set of 192 utterances. Training, validation, and evaluation sets have distinct speakers. Results are reported on the evaluation set using the network parameters that perform best on the validation set in terms of the loss function, over three training trials. All TIMIT audio is resampled to 8 kHz. The STFT uses a Hann analysis window of 256 samples (32 milliseconds) and a window hop of 128 samples (16 milliseconds).
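A sketch of how input/target frame pairs could be prepared under these settings (8 kHz audio, 256-sample Hann window, 128-sample hop) is given below. It uses scipy.signal.stft as a stand-in for the authors' feature pipeline, and the function name and epsilon floor are illustrative.

```python
import numpy as np
from scipy.signal import stft

def frame_prediction_pairs(audio, fs=8000, nwin=256, hop=128, eps=1e-8):
    """Log-magnitude STFT frames: inputs are frames 1..T-1, targets are frames 2..T."""
    _, _, Z = stft(audio, fs=fs, window='hann', nperseg=nwin, noverlap=nwin - hop)
    logmag = np.log(np.abs(Z) + eps).T        # shape (T, F), with F = nwin/2 + 1 = 129 bins
    return logmag[:-1], logmag[1:]            # predict frame t+1 from frames up to t

# Example with one second of noise standing in for a TIMIT utterance.
x = np.random.default_rng(0).standard_normal(8000)
inputs, targets = frame_prediction_pairs(x)
print(inputs.shape, targets.shape)
```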
The LSTM requires gradient clipping during optimization, while the restricted-capacity and full-capacity uRNNs do not. The hidden state dimensions N of the LSTM are chosen to match the number of parameters of the full-capacity uRNN. For the restricted-capacity uRNN, we run models that match either N or number of parameters. For the LSTM and restricted-capacity uRNNs, we use RMSprop [15] with a learning rate of 0.001, momentum 0.9, and averaging parameter 0.1. For the full-capacity uRNN, we also use RMSprop to optimize all network parameters, except for the recurrence matrix, for which we use stochastic gradient descent along the Stiefel manifold using the update (6) with a fixed learning rate of 0.001 and no gradient normalization.
Results are shown in Table 2, and Figure 2 shows example predictions from the three types of networks. Results in Table 2 are given in terms of the mean-squared error (MSE) loss function and several metrics computed on the time-domain signals, which are reconstructed from the predicted log-magnitude and the original phase of the STFT. These time-domain metrics are segmental signal-to-noise ratio (SegSNR), short-time objective intelligibility (STOI), and perceptual evaluation of speech quality (PESQ). SegSNR, computed using [19], uses a voice activity detector to avoid measuring SNR in silent frames. STOI is designed to correlate well with human intelligibility of speech and takes values between 0 and 1, with a higher score indicating higher intelligibility [20]. PESQ is the ITU-T standard for telephone voice quality testing [21, 22] and is a popular perceptual quality metric for speech enhancement [23]. PESQ ranges from 1 (bad quality) to 4.5 (no distortion).
Note that full-capacity uRNNs generally perform better than restricted-capacity uRNNs with the same number of parameters, and both types of uRNN significantly outperform LSTMs.
5.3 Pixel-by-pixel MNIST
As another challenging long-term memory task with natural data, we test the performance of LSTMs and uRNNs on pixel-by-pixel MNIST and permuted pixel-by-pixel MNIST, first proposed by [5] and used by [10] to test restricted-capacity uRNNs. For permuted pixel-by-pixel MNIST, the pixels are shuffled, thereby creating some non-local dependencies between pixels in an image. Since the MNIST images are 28×28 pixels, the resulting pixel-by-pixel sequences are T = 784 elements long. We use 5000 of the 60000 training examples as a validation set to perform early stopping with a patience of 5. The loss function is cross-entropy. Weights with the best validation loss are used to process the evaluation set. The full-capacity uRNN uses RMSprop-style gradient normalization.
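As an illustration of the input format (not the authors' exact data pipeline), each 28×28 image can be flattened into a length-784 scalar sequence, with a single fixed permutation shared across all images for the permuted variant; the helper below is a hypothetical sketch.

```python
import numpy as np

def to_pixel_sequences(images, permute=False, seed=0):
    """Flatten 28x28 images into length-784 sequences; optionally apply one fixed permutation."""
    seqs = images.reshape(len(images), 28 * 28).astype(np.float32)
    if permute:
        perm = np.random.default_rng(seed).permutation(28 * 28)  # same permutation for every image
        seqs = seqs[:, perm]
    return seqs[..., np.newaxis]   # shape (num_images, 784, 1): one pixel per time step

# Example with random stand-in images.
imgs = np.random.default_rng(1).integers(0, 256, size=(4, 28, 28))
print(to_pixel_sequences(imgs, permute=True).shape)   # (4, 784, 1)
```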
Learning curves are shown in Figure 3, and a summary of classification accuracies is given in Table 3. For the unpermuted task, the LSTM with N = 256 achieves the best evaluation accuracy of 98.2%. For the permuted task, the full-capacity uRNN with N = 512 achieves the best evaluation accuracy of 94.1%, which is state-of-the-art on this task. Both uRNNs outperform LSTMs on the permuted case, achieving their best performance after fewer training epochs and using an equal or smaller number of trainable parameters. This performance difference suggests that LSTMs are only able to model local dependencies, while uRNNs have superior long-term memory capabilities. Despite not representing all unitary matrices, the restricted-capacity uRNN with N = 512 still achieves an impressive test accuracy of 93.3% with only 1/16 of the trainable parameters, outperforming the full-capacity uRNN with N = 116 that matches its number of parameters. This result suggests that further exploration of the potential trade-off between hidden state dimension N and the capacity of unitary parameterizations is warranted.
6 Conclusion
Unitary recurrent matrices prove to be an effective means of addressing the vanishing and exploding gradient problems. We provided a theoretical argument to quantify the capacity of constrained unitary matrices. We also described a method for directly optimizing a full-capacity unitary matrix by constraining the gradient to lie on the differentiable manifold of unitary matrices. The effect of restricting the capacity of the unitary weight matrix was tested on system identification and memory tasks, in which full-capacity unitary recurrent neural networks (uRNNs) outperformed both restricted-capacity uRNNs from [10] and LSTMs. Full-capacity uRNNs also outperformed restricted-capacity uRNNs on log-magnitude STFT prediction of natural speech signals and on classification of permuted pixel-by-pixel images of handwritten digits, and both types of uRNN significantly outperformed LSTMs. In future work, we plan to explore more general forms of restricted-capacity unitary matrices, including constructions based on products of elementary unitary matrices such as Householder or Givens operators.
Acknowledgments: We thank an anonymous reviewer for suggesting improvements to our proof in Section 3 and Vamsi Potluru for helpful discussions. Scott Wisdom and Thomas Powers were funded by U.S. ONR contract number N00014-12-G-0078, delivery orders 13 and 24. Les Atlas was funded by U.S. ARO grant W911NF-15-1-0450.
References