Recurrent layers¶
Layers to construct recurrent networks. Recurrent layers can be used similarly to feed-forward layers except that the input shape is expected to be (batch_size, sequence_length, num_inputs). The CustomRecurrentLayer can also support more than one “feature” dimension (e.g. using convolutional connections), but for all other layers, dimensions trailing the third dimension are flattened.
The following recurrent layers are implemented:
CustomRecurrentLayer: A layer which implements a recurrent connection.
RecurrentLayer: Dense recurrent neural network (RNN) layer.
LSTMLayer: A long short-term memory (LSTM) layer.
GRULayer: Gated Recurrent Unit (GRU) layer.
For recurrent layers with gates we use a helper class to set up the parameters in each gate:
Gate: Simple class to hold the parameters for a gate connection.
Please refer to that class if you need to modify initial conditions of gates.
Recurrent layers and feed-forward layers can be combined in the same network by using a few reshape operations; please refer to the example below.
Examples¶
The following example demonstrates how recurrent layers can be easily mixed
with feed-forward layers using ReshapeLayer
and how to build a
network with variable batch size and number of time steps.
>>> from lasagne.layers import *
>>> num_inputs, num_units, num_classes = 10, 12, 5
>>> # By setting the first two dimensions as None, we are allowing them to vary
>>> # They correspond to batch size and sequence length, so we will be able to
>>> # feed in batches of varying size with sequences of varying length.
>>> l_inp = InputLayer((None, None, num_inputs))
>>> # We can retrieve symbolic references to the input variable's shape, which
>>> # we will later use in reshape layers.
>>> batchsize, seqlen, _ = l_inp.input_var.shape
>>> l_lstm = LSTMLayer(l_inp, num_units=num_units)
>>> # In order to connect a recurrent layer to a dense layer, we need to
>>> # flatten the first two dimensions (our "sample dimensions"); this will
>>> # cause each time step of each sequence to be processed independently
>>> l_shp = ReshapeLayer(l_lstm, (-1, num_units))
>>> l_dense = DenseLayer(l_shp, num_units=num_classes)
>>> # To reshape back to our original shape, we can use the symbolic shape
>>> # variables we retrieved above.
>>> l_out = ReshapeLayer(l_dense, (batchsize, seqlen, num_classes))
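Building on the network above, the following sketch (not part of the original example; the function name f and the test arrays are illustrative) shows one way to compile the network into a Theano function and evaluate it on batches of different sizes and sequence lengths:
>>> import numpy as np
>>> import theano
>>> # Symbolic expression for the network output, then a callable function.
>>> output = get_output(l_out)
>>> f = theano.function([l_inp.input_var], output)
>>> # Because batch size and sequence length were left as None, the same
>>> # compiled function accepts inputs of different shapes.
>>> out1 = f(np.zeros((4, 7, num_inputs), dtype=theano.config.floatX))
>>> out2 = f(np.zeros((2, 20, num_inputs), dtype=theano.config.floatX))
>>> out1.shape, out2.shape
((4, 7, 5), (2, 20, 5))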
class lasagne.layers.CustomRecurrentLayer(incoming, input_to_hidden, hidden_to_hidden, nonlinearity=lasagne.nonlinearities.rectify, hid_init=lasagne.init.Constant(0.), backwards=False, learn_init=False, gradient_steps=-1, grad_clipping=0, unroll_scan=False, precompute_input=True, mask_input=None, only_return_final=False, **kwargs)[source]¶
A layer which implements a recurrent connection.
This layer allows you to specify custom input-to-hidden and hidden-to-hidden connections by instantiating lasagne.layers.Layer instances and passing them on initialization. Note that these connections can consist of multiple layers chained together. The output shapes of the provided input-to-hidden and hidden-to-hidden connections must be the same. If you are looking for a standard, densely-connected recurrent layer, please see RecurrentLayer. The output is computed by
\[h_t = \sigma(f_i(x_t) + f_h(h_{t-1}))\]
Parameters
incoming : a lasagne.layers.Layer instance or a tuple
The layer feeding into this layer, or the expected input shape.
input_to_hidden : lasagne.layers.Layer
lasagne.layers.Layer instance which connects input to the hidden state (\(f_i\)). This layer may be connected to a chain of layers, which must end in a lasagne.layers.InputLayer with the same input shape as incoming, except for the first dimension: when precompute_input == True (the default), it must be incoming.output_shape[0]*incoming.output_shape[1] or None; when precompute_input == False, it must be incoming.output_shape[0] or None.
hidden_to_hidden : lasagne.layers.Layer
Layer which connects the previous hidden state to the new state (\(f_h\)). This layer may be connected to a chain of layers, which must end in a lasagne.layers.InputLayer with the same input shape as hidden_to_hidden’s output shape.
nonlinearity : callable or None
Nonlinearity to apply when computing new state (\(\sigma\)). If None is provided, no nonlinearity will be applied.
hid_init : callable, np.ndarray, theano.shared or Layer
Initializer for initial hidden state (\(h_0\)).
backwards : bool
If True, process the sequence backwards and then reverse the output again such that the output from the layer is always from \(x_1\) to \(x_n\).
learn_init : bool
If True, initial hidden values are learned.
gradient_steps : int
Number of timesteps to include in the backpropagated gradient. If -1, backpropagate through the entire sequence.
grad_clipping : float
If nonzero, the gradient messages are clipped to the given value during the backward pass. See [R58] (p. 6) for further explanation.
unroll_scan : bool
If True the recursion is unrolled instead of using scan. For some graphs this gives a significant speed up but it might also consume more memory. When unroll_scan is True, backpropagation always includes the full sequence, so gradient_steps must be set to -1 and the input sequence length must be known at compile time (i.e., cannot be given as None).
precompute_input : bool
If True, precompute the input-to-hidden transformation before iterating through the sequence. This can result in a speedup at the expense of an increase in memory usage.
mask_input : lasagne.layers.Layer
Layer which allows for a sequence mask to be input, for when sequences are of variable length. Default None, which means no mask will be supplied (i.e. all sequences are of the same length).
only_return_final : bool
If True, only return the final sequential output (e.g. for tasks where a single target value for the entire sequence is desired). In this case, Theano makes an optimization which saves memory.
References
[R58] Graves, Alex: “Generating sequences with recurrent neural networks.” arXiv preprint arXiv:1308.0850 (2013).
Examples
The following example constructs a simple CustomRecurrentLayer which has dense input-to-hidden and hidden-to-hidden connections.
>>> import lasagne
>>> n_batch, n_steps, n_in = (2, 3, 4)
>>> n_hid = 5
>>> l_in = lasagne.layers.InputLayer((n_batch, n_steps, n_in))
>>> l_in_hid = lasagne.layers.DenseLayer(
...     lasagne.layers.InputLayer((None, n_in)), n_hid)
>>> l_hid_hid = lasagne.layers.DenseLayer(
...     lasagne.layers.InputLayer((None, n_hid)), n_hid)
>>> l_rec = lasagne.layers.CustomRecurrentLayer(l_in, l_in_hid, l_hid_hid)
The CustomRecurrentLayer can also support “convolutional recurrence”, as is demonstrated below.
>>> n_batch, n_steps, n_channels, width, height = (2, 3, 4, 5, 6)
>>> n_out_filters = 7
>>> filter_shape = (3, 3)
>>> l_in = lasagne.layers.InputLayer(
...     (n_batch, n_steps, n_channels, width, height))
>>> l_in_to_hid = lasagne.layers.Conv2DLayer(
...     lasagne.layers.InputLayer((None, n_channels, width, height)),
...     n_out_filters, filter_shape, pad='same')
>>> l_hid_to_hid = lasagne.layers.Conv2DLayer(
...     lasagne.layers.InputLayer(l_in_to_hid.output_shape),
...     n_out_filters, filter_shape, pad='same')
>>> l_rec = lasagne.layers.CustomRecurrentLayer(
...     l_in, l_in_to_hid, l_hid_to_hid)
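As a small extension of this example (reusing l_rec from above, not part of the original docstring), the symbolic output and the inferred output shape can be retrieved in the usual way; the shape below follows from pad='same' preserving width and height:
>>> hid = lasagne.layers.get_output(l_rec)  # one hidden "image" per time step
>>> l_rec.output_shape
(2, 3, 7, 5, 6)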
get_output_for(inputs, **kwargs)[source]¶
Compute this layer’s output function given a symbolic input variable.
Parameters
inputs : list of theano.TensorType
inputs[0] should always be the symbolic input variable. When this layer has a mask input (i.e. was instantiated with mask_input != None, indicating that the lengths of sequences in each batch vary), inputs should have length 2, where inputs[1] is the mask. The mask should be supplied as a Theano variable denoting whether each time step in each sequence in the batch is part of the sequence or not. mask should be a matrix of shape (n_batch, n_time_steps) where mask[i, j] = 1 when j <= (length of sequence i) and mask[i, j] = 0 when j > (length of sequence i). When the hidden state of this layer is to be pre-filled (i.e. hid_init was set to a Layer instance), inputs should have length at least 2, and inputs[-1] is the hidden state to prefill with.
Returns
layer_output : theano.TensorType
Symbolic output variable.
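To make the mask convention concrete, the following sketch (layer sizes and variable names are illustrative, not part of the docstring) builds a 0/1 mask in NumPy for a batch of two padded sequences and wires it in through mask_input:
>>> import numpy as np
>>> import lasagne
>>> n_batch, n_steps, n_in, n_hid = 2, 5, 4, 3
>>> l_in = lasagne.layers.InputLayer((None, n_steps, n_in))
>>> l_mask = lasagne.layers.InputLayer((None, n_steps))
>>> l_in_hid = lasagne.layers.DenseLayer(
...     lasagne.layers.InputLayer((None, n_in)), n_hid)
>>> l_hid_hid = lasagne.layers.DenseLayer(
...     lasagne.layers.InputLayer((None, n_hid)), n_hid)
>>> l_rec = lasagne.layers.CustomRecurrentLayer(
...     l_in, l_in_hid, l_hid_hid, mask_input=l_mask)
>>> # Sequences of true length 3 and 5, padded to n_steps = 5:
>>> lengths = np.array([3, 5])
>>> mask = (np.arange(n_steps)[None, :] < lengths[:, None]).astype('float32')
>>> # mask[i, j] == 1 for valid steps of sequence i, 0 for padding
When compiling a Theano function, the NumPy mask is then fed for l_mask.input_var alongside the data for l_in.input_var.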
get_output_shape_for(input_shapes)[source]¶
Computes the output shape of this layer, given a list of input shapes.
Parameters
input_shapes : list of tuple
A list of tuples, with each tuple representing the shape of one of the inputs (in the correct order). These tuples should have as many elements as there are input dimensions, and the elements should be integers or None.
Returns
tuple
A tuple representing the shape of the output of this layer. The tuple has as many elements as there are output dimensions, and the elements are all either integers or None.
Notes
This method must be overridden when implementing a new Layer class with multiple inputs. By default it raises NotImplementedError.
get_params(**tags)[source]¶
Returns a list of Theano shared variables or expressions that parameterize the layer.
By default, all shared variables that participate in the forward pass will be returned (in the order they were registered in the Layer’s constructor via add_param()). The list can optionally be filtered by specifying tags as keyword arguments. For example, trainable=True will only return trainable parameters, and regularizable=True will only return parameters that can be regularized (e.g., by L2 decay).
If any of the layer’s parameters was set to a Theano expression instead of a shared variable, unwrap_shared controls whether to return the shared variables involved in that expression (unwrap_shared=True, the default), or the expression itself (unwrap_shared=False). In either case, tag filtering applies to the expressions, considering all variables within an expression to be tagged the same.
Parameters
unwrap_shared : bool (default: True)
Affects only parameters that were set to a Theano expression. If True, the function returns the shared variables contained in the expression, otherwise the Theano expression itself.
**tags (optional)
tags can be specified to filter the list. Specifying tag1=True will limit the list to parameters that are tagged with tag1. Specifying tag1=False will limit the list to parameters that are not tagged with tag1. Commonly used tags are regularizable and trainable.
Returns
list of Theano shared variables or expressions
A list of variables that parameterize the layer
Notes
For layers without any parameters, this will return an empty list.
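A short sketch of the tag filtering described above, assuming the default learn_init=False (so the initial hidden state is registered but not trainable); the layer sizes and variable names are illustrative:
>>> import lasagne
>>> l_in = lasagne.layers.InputLayer((None, None, 10))
>>> l_rec = lasagne.layers.RecurrentLayer(l_in, num_units=8)
>>> trainable = l_rec.get_params(trainable=True)          # weights and biases
>>> regularizable = l_rec.get_params(regularizable=True)  # weight matrices only
>>> everything = l_rec.get_params()  # also includes the non-trainable hid_init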
class lasagne.layers.RecurrentLayer(incoming, num_units, W_in_to_hid=lasagne.init.Uniform(), W_hid_to_hid=lasagne.init.Uniform(), b=lasagne.init.Constant(0.), nonlinearity=lasagne.nonlinearities.rectify, hid_init=lasagne.init.Constant(0.), backwards=False, learn_init=False, gradient_steps=-1, grad_clipping=0, unroll_scan=False, precompute_input=True, mask_input=None, only_return_final=False, **kwargs)[source]¶
Dense recurrent neural network (RNN) layer
A “vanilla” RNN layer, which has dense input-to-hidden and hidden-to-hidden connections. The output is computed as
\[h_t = \sigma(x_t W_x + h_{t-1} W_h + b)\]
Parameters
incoming : a lasagne.layers.Layer instance or a tuple
The layer feeding into this layer, or the expected input shape.
num_units : int
Number of hidden units in the layer.
W_in_to_hid : Theano shared variable, numpy array or callable
Initializer for input-to-hidden weight matrix (\(W_x\)).
W_hid_to_hid : Theano shared variable, numpy array or callable
Initializer for hidden-to-hidden weight matrix (\(W_h\)).
b : Theano shared variable, numpy array, callable or None
Initializer for bias vector (\(b\)). If None is provided there will be no bias.
nonlinearity : callable or None
Nonlinearity to apply when computing new state (\(\sigma\)). If None is provided, no nonlinearity will be applied.
hid_init : callable, np.ndarray, theano.shared or Layer
Initializer for initial hidden state (\(h_0\)).
backwards : bool
If True, process the sequence backwards and then reverse the output again such that the output from the layer is always from \(x_1\) to \(x_n\).
learn_init : bool
If True, initial hidden values are learned.
gradient_steps : int
Number of timesteps to include in the backpropagated gradient. If -1, backpropagate through the entire sequence.
grad_clipping : float
If nonzero, the gradient messages are clipped to the given value during the backward pass. See [R59] (p. 6) for further explanation.
unroll_scan : bool
If True the recursion is unrolled instead of using scan. For some graphs this gives a significant speed up but it might also consume more memory. When unroll_scan is True, backpropagation always includes the full sequence, so gradient_steps must be set to -1 and the input sequence length must be known at compile time (i.e., cannot be given as None).
precompute_input : bool
If True, precompute the input-to-hidden transformation before iterating through the sequence. This can result in a speedup at the expense of an increase in memory usage.
mask_input : lasagne.layers.Layer
Layer which allows for a sequence mask to be input, for when sequences are of variable length. Default None, which means no mask will be supplied (i.e. all sequences are of the same length).
only_return_final : bool
If True, only return the final sequential output (e.g. for tasks where a single target value for the entire sequence is desired). In this case, Theano makes an optimization which saves memory.
References
[R59] Graves, Alex: “Generating sequences with recurrent neural networks.” arXiv preprint arXiv:1308.0850 (2013).
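A minimal usage sketch (the sizes and initializers below are arbitrary choices, not defaults from the docstring):
>>> import lasagne
>>> n_batch, n_steps, n_in, n_hid = 2, 3, 4, 5
>>> l_in = lasagne.layers.InputLayer((n_batch, n_steps, n_in))
>>> l_rnn = lasagne.layers.RecurrentLayer(
...     l_in, num_units=n_hid,
...     nonlinearity=lasagne.nonlinearities.tanh,  # default is rectify
...     W_hid_to_hid=lasagne.init.Orthogonal(),
...     grad_clipping=100)
>>> l_rnn.output_shape
(2, 3, 5)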
class lasagne.layers.LSTMLayer(incoming, num_units, ingate=lasagne.layers.Gate(), forgetgate=lasagne.layers.Gate(), cell=lasagne.layers.Gate(W_cell=None, nonlinearity=lasagne.nonlinearities.tanh), outgate=lasagne.layers.Gate(), nonlinearity=lasagne.nonlinearities.tanh, cell_init=lasagne.init.Constant(0.), hid_init=lasagne.init.Constant(0.), backwards=False, learn_init=False, peepholes=True, gradient_steps=-1, grad_clipping=0, unroll_scan=False, precompute_input=True, mask_input=None, only_return_final=False, **kwargs)[source]¶
A long short-term memory (LSTM) layer.
Includes optional “peephole connections” and a forget gate. Based on the definition in [R60], which is the current common definition. The output is computed by
\[\begin{split}i_t &= \sigma_i(x_t W_{xi} + h_{t-1} W_{hi} + w_{ci} \odot c_{t-1} + b_i)\\
f_t &= \sigma_f(x_t W_{xf} + h_{t-1} W_{hf} + w_{cf} \odot c_{t-1} + b_f)\\
c_t &= f_t \odot c_{t - 1} + i_t \odot \sigma_c(x_t W_{xc} + h_{t-1} W_{hc} + b_c)\\
o_t &= \sigma_o(x_t W_{xo} + h_{t-1} W_{ho} + w_{co} \odot c_t + b_o)\\
h_t &= o_t \odot \sigma_h(c_t)\end{split}\]
Parameters
incoming : a lasagne.layers.Layer instance or a tuple
The layer feeding into this layer, or the expected input shape.
num_units : int
Number of hidden/cell units in the layer.
ingate : Gate
Parameters for the input gate (\(i_t\)): \(W_{xi}\), \(W_{hi}\), \(w_{ci}\), \(b_i\), and \(\sigma_i\).
forgetgate : Gate
Parameters for the forget gate (\(f_t\)): \(W_{xf}\), \(W_{hf}\), \(w_{cf}\), \(b_f\), and \(\sigma_f\).
cell : Gate
Parameters for the cell computation (\(c_t\)): \(W_{xc}\), \(W_{hc}\), \(b_c\), and \(\sigma_c\).
outgate : Gate
Parameters for the output gate (\(o_t\)): \(W_{xo}\), \(W_{ho}\), \(w_{co}\), \(b_o\), and \(\sigma_o\).
nonlinearity : callable or None
The nonlinearity that is applied to the output (\(\sigma_h\)). If None is provided, no nonlinearity will be applied.
cell_init : callable, np.ndarray, theano.shared or Layer
Initializer for initial cell state (\(c_0\)).
hid_init : callable, np.ndarray, theano.shared or Layer
Initializer for initial hidden state (\(h_0\)).
backwards : bool
If True, process the sequence backwards and then reverse the output again such that the output from the layer is always from \(x_1\) to \(x_n\).
learn_init : bool
If True, initial hidden values are learned.
peepholes : bool
If True, the LSTM uses peephole connections. When False, ingate.W_cell, forgetgate.W_cell and outgate.W_cell are ignored.
gradient_steps : int
Number of timesteps to include in the backpropagated gradient. If -1, backpropagate through the entire sequence.
grad_clipping : float
If nonzero, the gradient messages are clipped to the given value during the backward pass. See [R60] (p. 6) for further explanation.
unroll_scan : bool
If True the recursion is unrolled instead of using scan. For some graphs this gives a significant speed up but it might also consume more memory. When unroll_scan is True, backpropagation always includes the full sequence, so gradient_steps must be set to -1 and the input sequence length must be known at compile time (i.e., cannot be given as None).
precompute_input : bool
If True, precompute the input-to-hidden transformation before iterating through the sequence. This can result in a speedup at the expense of an increase in memory usage.
mask_input : lasagne.layers.Layer
Layer which allows for a sequence mask to be input, for when sequences are of variable length. Default None, which means no mask will be supplied (i.e. all sequences are of the same length).
only_return_final : bool
If True, only return the final sequential output (e.g. for tasks where a single target value for the entire sequence is desired). In this case, Theano makes an optimization which saves memory.
References
[R60] Graves, Alex: “Generating sequences with recurrent neural networks.” arXiv preprint arXiv:1308.0850 (2013).
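A brief, illustrative sketch (not from the docstring) of a sequence classifier that keeps only the final hidden state via only_return_final and raises the forget gate bias:
>>> import lasagne
>>> n_batch, n_steps, n_in, n_hid, n_classes = 2, 3, 4, 5, 6
>>> l_in = lasagne.layers.InputLayer((n_batch, n_steps, n_in))
>>> l_lstm = lasagne.layers.LSTMLayer(
...     l_in, num_units=n_hid,
...     forgetgate=lasagne.layers.Gate(b=lasagne.init.Constant(2.)),
...     only_return_final=True)
>>> l_lstm.output_shape  # final hidden state only
(2, 5)
>>> l_out = lasagne.layers.DenseLayer(
...     l_lstm, num_units=n_classes,
...     nonlinearity=lasagne.nonlinearities.softmax)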
get_output_for(inputs, **kwargs)[source]¶
Compute this layer’s output function given a symbolic input variable.
Parameters
inputs : list of theano.TensorType
inputs[0] should always be the symbolic input variable. When this layer has a mask input (i.e. was instantiated with mask_input != None, indicating that the lengths of sequences in each batch vary), inputs should have length 2, where inputs[1] is the mask. The mask should be supplied as a Theano variable denoting whether each time step in each sequence in the batch is part of the sequence or not. mask should be a matrix of shape (n_batch, n_time_steps) where mask[i, j] = 1 when j <= (length of sequence i) and mask[i, j] = 0 when j > (length of sequence i). When the hidden state of this layer is to be pre-filled (i.e. hid_init was set to a Layer instance), inputs should have length at least 2, and inputs[-1] is the hidden state to prefill with. When the cell state of this layer is to be pre-filled (i.e. cell_init was set to a Layer instance), inputs should have length at least 2, and inputs[-1] is the cell state to prefill with. When both the cell state and the hidden state are being pre-filled, inputs[-2] is the hidden state, while inputs[-1] is the cell state.
Returns
layer_output : theano.TensorType
Symbolic output variable.
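To make the pre-filling convention concrete, here is a sketch (names are illustrative, not from the docstring) in which both hid_init and cell_init are given as InputLayer instances so the initial states can be supplied at run time; get_output_for then receives them as inputs[-2] (hidden) and inputs[-1] (cell), as described above:
>>> import lasagne
>>> n_batch, n_steps, n_in, n_hid = 2, 3, 4, 5
>>> l_in = lasagne.layers.InputLayer((n_batch, n_steps, n_in))
>>> l_hid0 = lasagne.layers.InputLayer((n_batch, n_hid))   # h_0, fed at run time
>>> l_cell0 = lasagne.layers.InputLayer((n_batch, n_hid))  # c_0, fed at run time
>>> l_lstm = lasagne.layers.LSTMLayer(
...     l_in, num_units=n_hid, hid_init=l_hid0, cell_init=l_cell0)
>>> out = lasagne.layers.get_output(l_lstm)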
get_output_shape_for(input_shapes)[source]¶
Computes the output shape of this layer, given a list of input shapes.
Parameters
input_shapes : list of tuple
A list of tuples, with each tuple representing the shape of one of the inputs (in the correct order). These tuples should have as many elements as there are input dimensions, and the elements should be integers or None.
Returns
tuple
A tuple representing the shape of the output of this layer. The tuple has as many elements as there are output dimensions, and the elements are all either integers or None.
Notes
This method must be overridden when implementing a new Layer class with multiple inputs. By default it raises NotImplementedError.
class lasagne.layers.GRULayer(incoming, num_units, resetgate=lasagne.layers.Gate(W_cell=None), updategate=lasagne.layers.Gate(W_cell=None), hidden_update=lasagne.layers.Gate(W_cell=None, nonlinearity=lasagne.nonlinearities.tanh), hid_init=lasagne.init.Constant(0.), backwards=False, learn_init=False, gradient_steps=-1, grad_clipping=0, unroll_scan=False, precompute_input=True, mask_input=None, only_return_final=False, **kwargs)[source]¶
Gated Recurrent Unit (GRU) Layer
Implements the recurrent step proposed in [R61], which computes the output by
\[\begin{split}r_t &= \sigma_r(x_t W_{xr} + h_{t - 1} W_{hr} + b_r)\\
u_t &= \sigma_u(x_t W_{xu} + h_{t - 1} W_{hu} + b_u)\\
c_t &= \sigma_c(x_t W_{xc} + r_t \odot (h_{t - 1} W_{hc}) + b_c)\\
h_t &= (1 - u_t) \odot h_{t - 1} + u_t \odot c_t\end{split}\]
Parameters
incoming : a lasagne.layers.Layer instance or a tuple
The layer feeding into this layer, or the expected input shape.
num_units : int
Number of hidden units in the layer.
resetgate : Gate
Parameters for the reset gate (\(r_t\)): \(W_{xr}\), \(W_{hr}\), \(b_r\), and \(\sigma_r\).
updategate : Gate
Parameters for the update gate (\(u_t\)): \(W_{xu}\), \(W_{hu}\), \(b_u\), and \(\sigma_u\).
hidden_update : Gate
Parameters for the hidden update (\(c_t\)): \(W_{xc}\), \(W_{hc}\), \(b_c\), and \(\sigma_c\).
hid_init : callable, np.ndarray, theano.shared or Layer
Initializer for initial hidden state (\(h_0\)).
backwards : bool
If True, process the sequence backwards and then reverse the output again such that the output from the layer is always from \(x_1\) to \(x_n\).
learn_init : bool
If True, initial hidden values are learned.
gradient_steps : int
Number of timesteps to include in the backpropagated gradient. If -1, backpropagate through the entire sequence.
grad_clipping : float
If nonzero, the gradient messages are clipped to the given value during the backward pass. See [R63] (p. 6) for further explanation.
unroll_scan : bool
If True the recursion is unrolled instead of using scan. For some graphs this gives a significant speed up but it might also consume more memory. When unroll_scan is True, backpropagation always includes the full sequence, so gradient_steps must be set to -1 and the input sequence length must be known at compile time (i.e., cannot be given as None).
precompute_input : bool
If True, precompute the input-to-hidden transformation before iterating through the sequence. This can result in a speedup at the expense of an increase in memory usage.
mask_input : lasagne.layers.Layer
Layer which allows for a sequence mask to be input, for when sequences are of variable length. Default None, which means no mask will be supplied (i.e. all sequences are of the same length).
only_return_final : bool
If True, only return the final sequential output (e.g. for tasks where a single target value for the entire sequence is desired). In this case, Theano makes an optimization which saves memory.
Notes
An alternate update for the candidate hidden state is proposed in [R62]:
\[\begin{split}c_t &= \sigma_c(x_t W_{xc} + (r_t \odot h_{t - 1}) W_{hc} + b_c)\end{split}\]
We use the formulation from [R61] because it allows us to do all matrix operations in a single dot product.
References
[R61] Cho, Kyunghyun, et al.: On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014).
[R62] Chung, Junyoung, et al.: Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv preprint arXiv:1412.3555 (2014).
[R63] Graves, Alex: “Generating sequences with recurrent neural networks.” arXiv preprint arXiv:1308.0850 (2013).
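As an illustrative sketch (not part of the docstring), a forward and a backwards GRU over the same input can be concatenated along the feature axis to form a simple bidirectional layer:
>>> import lasagne
>>> n_batch, n_steps, n_in, n_hid = 2, 3, 4, 5
>>> l_in = lasagne.layers.InputLayer((n_batch, n_steps, n_in))
>>> l_fwd = lasagne.layers.GRULayer(l_in, num_units=n_hid)
>>> l_bwd = lasagne.layers.GRULayer(l_in, num_units=n_hid, backwards=True)
>>> l_bi = lasagne.layers.ConcatLayer([l_fwd, l_bwd], axis=2)
>>> l_bi.output_shape
(2, 3, 10)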
get_output_for(inputs, **kwargs)[source]¶
Compute this layer’s output function given a symbolic input variable.
Parameters
inputs : list of theano.TensorType
inputs[0] should always be the symbolic input variable. When this layer has a mask input (i.e. was instantiated with mask_input != None, indicating that the lengths of sequences in each batch vary), inputs should have length 2, where inputs[1] is the mask. The mask should be supplied as a Theano variable denoting whether each time step in each sequence in the batch is part of the sequence or not. mask should be a matrix of shape (n_batch, n_time_steps) where mask[i, j] = 1 when j <= (length of sequence i) and mask[i, j] = 0 when j > (length of sequence i). When the hidden state of this layer is to be pre-filled (i.e. hid_init was set to a Layer instance), inputs should have length at least 2, and inputs[-1] is the hidden state to prefill with.
Returns
layer_output : theano.TensorType
Symbolic output variable.
get_output_shape_for(input_shapes)[source]¶
Computes the output shape of this layer, given a list of input shapes.
Parameters
input_shapes : list of tuple
A list of tuples, with each tuple representing the shape of one of the inputs (in the correct order). These tuples should have as many elements as there are input dimensions, and the elements should be integers or None.
Returns
tuple
A tuple representing the shape of the output of this layer. The tuple has as many elements as there are output dimensions, and the elements are all either integers or None.
Notes
This method must be overridden when implementing a new Layer class with multiple inputs. By default it raises NotImplementedError.
class lasagne.layers.Gate(W_in=lasagne.init.Normal(0.1), W_hid=lasagne.init.Normal(0.1), W_cell=lasagne.init.Normal(0.1), b=lasagne.init.Constant(0.), nonlinearity=lasagne.nonlinearities.sigmoid)[source]¶
Simple class to hold the parameters for a gate connection. We define a gate loosely as something which computes the linear mix of two inputs, optionally computes an element-wise product with a third, adds a bias, and applies a nonlinearity.
Parameters
W_in : Theano shared variable, numpy array or callable
Initializer for input-to-gate weight matrix.
W_hid : Theano shared variable, numpy array or callable
Initializer for hidden-to-gate weight matrix.
W_cell : Theano shared variable, numpy array, callable, or None
Initializer for cell-to-gate weight vector. If None, no cell-to-gate weight vector will be stored.
b : Theano shared variable, numpy array or callable
Initializer for the gate's bias vector.
nonlinearity : callable or None
The nonlinearity that is applied to the gate's activation. If None is provided, no nonlinearity will be applied.
References
[R64] Gers, Felix A., Jürgen Schmidhuber, and Fred Cummins: “Learning to forget: Continual prediction with LSTM.” Neural Computation 12.10 (2000): 2451-2471.
Examples
For LSTMLayer the bias of the forget gate is often initialized to a large positive value to encourage the layer to initially remember the cell value; see e.g. [R64], page 15.
>>> import lasagne
>>> forget_gate = Gate(b=lasagne.init.Constant(5.0))
>>> l_lstm = LSTMLayer((10, 20, 30), num_units=10,
...                    forgetgate=forget_gate)
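Beyond the forget-gate example, a Gate can also carry custom initializers for other recurrent layers; the following sketch (arbitrary initializers and sizes, not part of the docstring) configures a GRU reset gate with orthogonal weights:
>>> import lasagne
>>> custom_resetgate = lasagne.layers.Gate(
...     W_in=lasagne.init.Orthogonal(),
...     W_hid=lasagne.init.Orthogonal(),
...     W_cell=None,  # GRU gates have no cell connection
...     b=lasagne.init.Constant(0.))
>>> l_in = lasagne.layers.InputLayer((None, 20, 30))
>>> l_gru = lasagne.layers.GRULayer(l_in, num_units=10,
...                                 resetgate=custom_resetgate)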