Recurrent layers

Layers to construct recurrent networks. Recurrent layers can be used similarly to feed-forward layers except that the input shape is expected to be (batch_size, sequence_length, num_inputs). The CustomRecurrentLayer can also support more than one “feature” dimension (e.g. using convolutional connections), but for all other layers, dimensions trailing the third dimension are flattened.

The following recurrent layers are implemented:

CustomRecurrentLayer

A layer which implements a recurrent connection.

RecurrentLayer

Dense recurrent neural network (RNN) layer

LSTMLayer

A long short-term memory (LSTM) layer.

GRULayer

Gated Recurrent Unit (GRU) Layer

For recurrent layers with gates we use a helper class to set up the parameters in each gate:

Gate

Simple class to hold the parameters for a gate connection.

Please refer to that class if you need to modify the initialization or nonlinearity of a gate.

Recurrent layers and feed-forward layers can be combined in the same network by using a few reshape operations; please refer to the example below.

Examples

The following example demonstrates how recurrent layers can be easily mixed with feed-forward layers using ReshapeLayer and how to build a network with variable batch size and number of time steps.

>>> from lasagne.layers import *
>>> num_inputs, num_units, num_classes = 10, 12, 5
>>> # By setting the first two dimensions as None, we are allowing them to vary
>>> # They correspond to batch size and sequence length, so we will be able to
>>> # feed in batches of varying size with sequences of varying length.
>>> l_inp = InputLayer((None, None, num_inputs))
>>> # We can retrieve symbolic references to the input variable's shape, which
>>> # we will later use in reshape layers.
>>> batchsize, seqlen, _ = l_inp.input_var.shape
>>> l_lstm = LSTMLayer(l_inp, num_units=num_units)
>>> # In order to connect a recurrent layer to a dense layer, we need to
>>> # flatten the first two dimensions (our "sample dimensions"); this will
>>> # cause each time step of each sequence to be processed independently
>>> l_shp = ReshapeLayer(l_lstm, (-1, num_units))
>>> l_dense = DenseLayer(l_shp, num_units=num_classes)
>>> # To reshape back to our original shape, we can use the symbolic shape
>>> # variables we retrieved above.
>>> l_out = ReshapeLayer(l_dense, (batchsize, seqlen, num_classes))
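
To compute anything with this network we still need a Theano expression for its output; a minimal sketch, assuming Theano is installed (get_output comes from the star import above):

>>> import theano
>>> # Symbolic output of shape (batchsize, seqlen, num_classes)
>>> output = get_output(l_out)
>>> # Compile a function mapping an input batch to the network output
>>> f = theano.function([l_inp.input_var], output)
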
class lasagne.layers.CustomRecurrentLayer(incoming, input_to_hidden, hidden_to_hidden, nonlinearity=lasagne.nonlinearities.rectify, hid_init=lasagne.init.Constant(0.), backwards=False, learn_init=False, gradient_steps=-1, grad_clipping=0, unroll_scan=False, precompute_input=True, mask_input=None, only_return_final=False, **kwargs)[source]

A layer which implements a recurrent connection.

This layer allows you to specify custom input-to-hidden and hidden-to-hidden connections by instantiating lasagne.layers.Layer instances and passing them on initialization. Note that these connections can consist of multiple layers chained together. The output shape for the provided input-to-hidden and hidden-to-hidden connections must be the same. If you are looking for a standard, densely-connected recurrent layer, please see RecurrentLayer. The output is computed by

\[h_t = \sigma(f_i(x_t) + f_h(h_{t-1}))\]
Parameters

incoming : a lasagne.layers.Layer instance or a tuple

The layer feeding into this layer, or the expected input shape.

input_to_hidden : lasagne.layers.Layer

lasagne.layers.Layer instance which connects input to the hidden state (\(f_i\)). This layer may be connected to a chain of layers, which must end in a lasagne.layers.InputLayer with the same input shape as incoming, except for the first dimension: When precompute_input == True (the default), it must be incoming.output_shape[0]*incoming.output_shape[1] or None; when precompute_input == False, it must be incoming.output_shape[0] or None.

hidden_to_hidden : lasagne.layers.Layer

Layer which connects the previous hidden state to the new state (\(f_h\)). This layer may be connected to a chain of layers, which must end in a lasagne.layers.InputLayer with the same input shape as hidden_to_hidden’s output shape.

nonlinearity : callable or None

Nonlinearity to apply when computing new state (\(\sigma\)). If None is provided, no nonlinearity will be applied.

hid_init : callable, np.ndarray, theano.shared or Layer

Initializer for initial hidden state (\(h_0\)).

backwards : bool

If True, process the sequence backwards and then reverse the output again such that the output from the layer is always from \(x_1\) to \(x_n\).

learn_init : bool

If True, initial hidden values are learned.

gradient_steps : int

Number of timesteps to include in the backpropagated gradient. If -1, backpropagate through the entire sequence.

grad_clipping : float

If nonzero, the gradient messages are clipped to the given value during the backward pass. See [R58] (p. 6) for further explanation.

unroll_scan : bool

If True the recursion is unrolled instead of using scan. For some graphs this gives a significant speed up but it might also consume more memory. When unroll_scan is True, backpropagation always includes the full sequence, so gradient_steps must be set to -1 and the input sequence length must be known at compile time (i.e., cannot be given as None).

precompute_input : bool

If True, precompute input_to_hid before iterating through the sequence. This can result in a speedup at the expense of an increase in memory usage.

mask_input : lasagne.layers.Layer

Layer which allows for a sequence mask to be input, for when sequences are of variable length. Default None, which means no mask will be supplied (i.e. all sequences are of the same length).

only_return_final : bool

If True, only return the final sequential output (e.g. for tasks where a single target value for the entire sequence is desired). In this case, Theano makes an optimization which saves memory.

References

[R58]

Graves, Alex: “Generating sequences with recurrent neural networks.” arXiv preprint arXiv:1308.0850 (2013).

Examples

The following example constructs a simple CustomRecurrentLayer which has dense input-to-hidden and hidden-to-hidden connections.

>>> import lasagne
>>> n_batch, n_steps, n_in = (2, 3, 4)
>>> n_hid = 5
>>> l_in = lasagne.layers.InputLayer((n_batch, n_steps, n_in))
>>> l_in_hid = lasagne.layers.DenseLayer(
...     lasagne.layers.InputLayer((None, n_in)), n_hid)
>>> l_hid_hid = lasagne.layers.DenseLayer(
...     lasagne.layers.InputLayer((None, n_hid)), n_hid)
>>> l_rec = lasagne.layers.CustomRecurrentLayer(l_in, l_in_hid, l_hid_hid)

The CustomRecurrentLayer can also support “convolutional recurrence”, as is demonstrated below.

>>> n_batch, n_steps, n_channels, width, height = (2, 3, 4, 5, 6)
>>> n_out_filters = 7
>>> filter_shape = (3, 3)
>>> l_in = lasagne.layers.InputLayer(
...     (n_batch, n_steps, n_channels, width, height))
>>> l_in_to_hid = lasagne.layers.Conv2DLayer(
...     lasagne.layers.InputLayer((None, n_channels, width, height)),
...     n_out_filters, filter_shape, pad='same')
>>> l_hid_to_hid = lasagne.layers.Conv2DLayer(
...     lasagne.layers.InputLayer(l_in_to_hid.output_shape),
...     n_out_filters, filter_shape, pad='same')
>>> l_rec = lasagne.layers.CustomRecurrentLayer(
...     l_in, l_in_to_hid, l_hid_to_hid)
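
The recurrent layer keeps the leading batch and sequence dimensions of its input and takes the remaining dimensions from the hidden-to-hidden connection, so with pad='same' as above the output shape should be (a sketch of the expected result):

>>> l_rec.output_shape
(2, 3, 7, 5, 6)
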
get_output_for(inputs, **kwargs)[source]

Compute this layer’s output function given a symbolic input variable.

Parameters

inputs : list of theano.TensorType

inputs[0] should always be the symbolic input variable. When this layer has a mask input (i.e. was instantiated with mask_input != None, indicating that the lengths of sequences in each batch vary), inputs should have length 2, where inputs[1] is the mask. The mask should be supplied as a Theano variable denoting whether each time step in each sequence in the batch is part of the sequence or not. mask should be a matrix of shape (n_batch, n_time_steps) where mask[i, j] = 1 when j <= (length of sequence i) and mask[i, j] = 0 when j > (length of sequence i). When the hidden state of this layer is to be pre-filled (i.e. was set to a Layer instance) inputs should have length at least 2, and inputs[-1] is the hidden state to prefill with.

Returns

layer_output : theano.TensorType

Symbolic output variable.
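
In practice, the mask described above is usually supplied through a separate InputLayer passed as mask_input when the recurrent layer is constructed, so that lasagne.layers.get_output wires it in automatically. A minimal sketch (shown with RecurrentLayer for brevity; the same pattern applies to CustomRecurrentLayer):

>>> import numpy as np
>>> import lasagne
>>> l_in = lasagne.layers.InputLayer((None, None, 4))
>>> l_mask = lasagne.layers.InputLayer((None, None))
>>> l_rnn = lasagne.layers.RecurrentLayer(l_in, num_units=5,
...                                       mask_input=l_mask)
>>> # Two padded sequences of lengths 2 and 3; mask[i, j] = 1 marks valid steps
>>> mask = np.array([[1, 1, 0], [1, 1, 1]], dtype='float32')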

get_output_shape_for(input_shapes)[source]

Computes the output shape of this layer, given a list of input shapes.

Parameters

input_shapes : list of tuple

A list of tuples, with each tuple representing the shape of one of the inputs (in the correct order). These tuples should have as many elements as there are input dimensions, and the elements should be integers or None.

Returns

tuple

A tuple representing the shape of the output of this layer. The tuple has as many elements as there are output dimensions, and the elements are all either integers or None.

Notes

This method must be overridden when implementing a new Layer class with multiple inputs. By default it raises NotImplementedError.

get_params(unwrap_shared=True, **tags)[source]

Returns a list of Theano shared variables or expressions that parameterize the layer.

By default, all shared variables that participate in the forward pass will be returned (in the order they were registered in the Layer’s constructor via add_param()). The list can optionally be filtered by specifying tags as keyword arguments. For example, trainable=True will only return trainable parameters, and regularizable=True will only return parameters that can be regularized (e.g., by L2 decay).

If any of the layer’s parameters was set to a Theano expression instead of a shared variable, unwrap_shared controls whether to return the shared variables involved in that expression (unwrap_shared=True, the default), or the expression itself (unwrap_shared=False). In either case, tag filtering applies to the expressions, considering all variables within an expression to be tagged the same.

Parameters

unwrap_shared : bool (default: True)

Affects only parameters that were set to a Theano expression. If True the function returns the shared variables contained in the expression, otherwise the Theano expression itself.

**tags (optional)

tags can be specified to filter the list. Specifying tag1=True will limit the list to parameters that are tagged with tag1. Specifying tag1=False will limit the list to parameters that are not tagged with tag1. Commonly used tags are regularizable and trainable.

Returns

list of Theano shared variables or expressions

A list of variables that parameterize the layer

Notes

For layers without any parameters, this will return an empty list.
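
For example, for the recurrent layer built in the examples above, the trainable and regularizable parameters could be retrieved as follows (a sketch; the exact contents of the lists depend on the connections used):

>>> trainable = l_rec.get_params(trainable=True)
>>> regularizable = l_rec.get_params(regularizable=True)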

class lasagne.layers.RecurrentLayer(incoming, num_units, W_in_to_hid=lasagne.init.Uniform(), W_hid_to_hid=lasagne.init.Uniform(), b=lasagne.init.Constant(0.), nonlinearity=lasagne.nonlinearities.rectify, hid_init=lasagne.init.Constant(0.), backwards=False, learn_init=False, gradient_steps=-1, grad_clipping=0, unroll_scan=False, precompute_input=True, mask_input=None, only_return_final=False, **kwargs)[source]

Dense recurrent neural network (RNN) layer

A “vanilla” RNN layer, which has dense input-to-hidden and hidden-to-hidden connections. The output is computed as

\[h_t = \sigma(x_t W_x + h_{t-1} W_h + b)\]
Parameters

incoming : a lasagne.layers.Layer instance or a tuple

The layer feeding into this layer, or the expected input shape.

num_units : int

Number of hidden units in the layer.

W_in_to_hid : Theano shared variable, numpy array or callable

Initializer for input-to-hidden weight matrix (\(W_x\)).

W_hid_to_hid : Theano shared variable, numpy array or callable

Initializer for hidden-to-hidden weight matrix (\(W_h\)).

b : Theano shared variable, numpy array, callable or None

Initializer for bias vector (\(b\)). If None is provided there will be no bias.

nonlinearity : callable or None

Nonlinearity to apply when computing new state (\(\sigma\)). If None is provided, no nonlinearity will be applied.

hid_init : callable, np.ndarray, theano.shared or Layer

Initializer for initial hidden state (\(h_0\)).

backwards : bool

If True, process the sequence backwards and then reverse the output again such that the output from the layer is always from \(x_1\) to \(x_n\).

learn_init : bool

If True, initial hidden values are learned.

gradient_steps : int

Number of timesteps to include in the backpropagated gradient. If -1, backpropagate through the entire sequence.

grad_clipping : float

If nonzero, the gradient messages are clipped to the given value during the backward pass. See [R59] (p. 6) for further explanation.

unroll_scan : bool

If True the recursion is unrolled instead of using scan. For some graphs this gives a significant speed up but it might also consume more memory. When unroll_scan is True, backpropagation always includes the full sequence, so gradient_steps must be set to -1 and the input sequence length must be known at compile time (i.e., cannot be given as None).

precompute_input : bool

If True, precompute input_to_hid before iterating through the sequence. This can result in a speedup at the expense of an increase in memory usage.

mask_input : lasagne.layers.Layer

Layer which allows for a sequence mask to be input, for when sequences are of variable length. Default None, which means no mask will be supplied (i.e. all sequences are of the same length).

only_return_final : bool

If True, only return the final sequential output (e.g. for tasks where a single target value for the entire sequence is desired). In this case, Theano makes an optimization which saves memory.

References

[R59]

Graves, Alex: “Generating sequences with recurrent neural networks.” arXiv preprint arXiv:1308.0850 (2013).
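
Examples

A minimal usage sketch for a vanilla RNN layer; the parameter values are illustrative, not recommendations:

>>> import lasagne
>>> l_in = lasagne.layers.InputLayer((None, None, 10))
>>> l_rnn = lasagne.layers.RecurrentLayer(
...     l_in, num_units=20, nonlinearity=lasagne.nonlinearities.tanh,
...     grad_clipping=100, learn_init=True)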

class lasagne.layers.LSTMLayer(incoming, num_units, ingate=lasagne.layers.Gate(), forgetgate=lasagne.layers.Gate(), cell=lasagne.layers.Gate(W_cell=None, nonlinearity=lasagne.nonlinearities.tanh), outgate=lasagne.layers.Gate(), nonlinearity=lasagne.nonlinearities.tanh, cell_init=lasagne.init.Constant(0.), hid_init=lasagne.init.Constant(0.), backwards=False, learn_init=False, peepholes=True, gradient_steps=-1, grad_clipping=0, unroll_scan=False, precompute_input=True, mask_input=None, only_return_final=False, **kwargs)[source]

A long short-term memory (LSTM) layer.

Includes optional “peephole connections” and a forget gate. Based on the definition in [R60], which is the current common definition. The output is computed by

\[\begin{split}i_t &= \sigma_i(x_t W_{xi} + h_{t-1} W_{hi} + w_{ci} \odot c_{t-1} + b_i)\\ f_t &= \sigma_f(x_t W_{xf} + h_{t-1} W_{hf} + w_{cf} \odot c_{t-1} + b_f)\\ c_t &= f_t \odot c_{t - 1} + i_t \odot \sigma_c(x_t W_{xc} + h_{t-1} W_{hc} + b_c)\\ o_t &= \sigma_o(x_t W_{xo} + h_{t-1} W_{ho} + w_{co} \odot c_t + b_o)\\ h_t &= o_t \odot \sigma_h(c_t)\end{split}\]
Parameters

incoming : a lasagne.layers.Layer instance or a tuple

The layer feeding into this layer, or the expected input shape.

num_units : int

Number of hidden/cell units in the layer.

ingate : Gate

Parameters for the input gate (\(i_t\)): \(W_{xi}\), \(W_{hi}\), \(w_{ci}\), \(b_i\), and \(\sigma_i\).

forgetgate : Gate

Parameters for the forget gate (\(f_t\)): \(W_{xf}\), \(W_{hf}\), \(w_{cf}\), \(b_f\), and \(\sigma_f\).

cell : Gate

Parameters for the cell computation (\(c_t\)): \(W_{xc}\), \(W_{hc}\), \(b_c\), and \(\sigma_c\).

outgate : Gate

Parameters for the output gate (\(o_t\)): \(W_{xo}\), \(W_{ho}\), \(w_{co}\), \(b_o\), and \(\sigma_o\).

nonlinearity : callable or None

The nonlinearity that is applied to the output (\(\sigma_h\)). If None is provided, no nonlinearity will be applied.

cell_init : callable, np.ndarray, theano.shared or Layer

Initializer for initial cell state (\(c_0\)).

hid_init : callable, np.ndarray, theano.shared or Layer

Initializer for initial hidden state (\(h_0\)).

backwards : bool

If True, process the sequence backwards and then reverse the output again such that the output from the layer is always from \(x_1\) to \(x_n\).

learn_init : bool

If True, initial hidden values are learned.

peepholes : bool

If True, the LSTM uses peephole connections. When False, ingate.W_cell, forgetgate.W_cell and outgate.W_cell are ignored.

gradient_steps : int

Number of timesteps to include in the backpropagated gradient. If -1, backpropagate through the entire sequence.

grad_clipping : float

If nonzero, the gradient messages are clipped to the given value during the backward pass. See [R60] (p. 6) for further explanation.

unroll_scan : bool

If True the recursion is unrolled instead of using scan. For some graphs this gives a significant speed up but it might also consume more memory. When unroll_scan is True, backpropagation always includes the full sequence, so gradient_steps must be set to -1 and the input sequence length must be known at compile time (i.e., cannot be given as None).

precompute_input : bool

If True, precompute input_to_hid before iterating through the sequence. This can result in a speedup at the expense of an increase in memory usage.

mask_input : lasagne.layers.Layer

Layer which allows for a sequence mask to be input, for when sequences are of variable length. Default None, which means no mask will be supplied (i.e. all sequences are of the same length).

only_return_final : bool

If True, only return the final sequential output (e.g. for tasks where a single target value for the entire sequence is desired). In this case, Theano makes an optimization which saves memory.

References

[R60]

Graves, Alex: “Generating sequences with recurrent neural networks.” arXiv preprint arXiv:1308.0850 (2013).
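
Examples

A minimal usage sketch; only_return_final is shown because it is commonly used when a single vector per sequence is needed (parameter values are illustrative):

>>> import lasagne
>>> l_in = lasagne.layers.InputLayer((None, None, 10))
>>> l_lstm = lasagne.layers.LSTMLayer(l_in, num_units=20, grad_clipping=100)
>>> # Keep only the last hidden state, e.g. to feed a classifier
>>> l_last = lasagne.layers.LSTMLayer(l_in, num_units=20,
...                                   only_return_final=True)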

get_output_for(inputs, **kwargs)[source]

Compute this layer’s output function given a symbolic input variable

Parameters

inputs : list of theano.TensorType

inputs[0] should always be the symbolic input variable. When this layer has a mask input (i.e. was instantiated with mask_input != None, indicating that the lengths of sequences in each batch vary), inputs should have length 2, where inputs[1] is the mask. The mask should be supplied as a Theano variable denoting whether each time step in each sequence in the batch is part of the sequence or not. mask should be a matrix of shape (n_batch, n_time_steps) where mask[i, j] = 1 when j <= (length of sequence i) and mask[i, j] = 0 when j > (length of sequence i). When the hidden state of this layer is to be pre-filled (i.e. was set to a Layer instance), inputs should have length at least 2, and inputs[-1] is the hidden state to prefill with. When the cell state of this layer is to be pre-filled (i.e. was set to a Layer instance), inputs should have length at least 2, and inputs[-1] is the cell state to prefill with. When both the cell state and the hidden state are being pre-filled, inputs[-2] is the hidden state, while inputs[-1] is the cell state. A construction sketch for pre-filled states is given below.

Returns

layer_output : theano.TensorType

Symbolic output variable.
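
As described above, the hidden and cell states can be pre-filled by passing Layer instances as hid_init and cell_init when the LSTM is constructed; a minimal sketch, assuming initial states of shape (batch_size, num_units):

>>> import lasagne
>>> num_units = 20
>>> l_in = lasagne.layers.InputLayer((None, None, 10))
>>> l_hid0 = lasagne.layers.InputLayer((None, num_units))
>>> l_cell0 = lasagne.layers.InputLayer((None, num_units))
>>> l_lstm = lasagne.layers.LSTMLayer(l_in, num_units,
...                                   hid_init=l_hid0, cell_init=l_cell0)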

get_output_shape_for(input_shapes)[source]

Computes the output shape of this layer, given a list of input shapes.

Parameters

input_shapes : list of tuple

A list of tuples, with each tuple representing the shape of one of the inputs (in the correct order). These tuples should have as many elements as there are input dimensions, and the elements should be integers or None.

Returns

tuple

A tuple representing the shape of the output of this layer. The tuple has as many elements as there are output dimensions, and the elements are all either integers or None.

Notes

This method must be overridden when implementing a new Layer class with multiple inputs. By default it raises NotImplementedError.

class lasagne.layers.GRULayer(incoming, num_units, resetgate=lasagne.layers.Gate(W_cell=None), updategate=lasagne.layers.Gate(W_cell=None), hidden_update=lasagne.layers.Gate(W_cell=None, nonlinearity=lasagne.nonlinearities.tanh), hid_init=lasagne.init.Constant(0.), backwards=False, learn_init=False, gradient_steps=-1, grad_clipping=0, unroll_scan=False, precompute_input=True, mask_input=None, only_return_final=False, **kwargs)[source]

Gated Recurrent Unit (GRU) Layer

Implements the recurrent step proposed in [R61], which computes the output by

\[\begin{split}r_t &= \sigma_r(x_t W_{xr} + h_{t - 1} W_{hr} + b_r)\\ u_t &= \sigma_u(x_t W_{xu} + h_{t - 1} W_{hu} + b_u)\\ c_t &= \sigma_c(x_t W_{xc} + r_t \odot (h_{t - 1} W_{hc}) + b_c)\\ h_t &= (1 - u_t) \odot h_{t - 1} + u_t \odot c_t\end{split}\]
Parameters

incoming : a lasagne.layers.Layer instance or a tuple

The layer feeding into this layer, or the expected input shape.

num_units : int

Number of hidden units in the layer.

resetgate : Gate

Parameters for the reset gate (\(r_t\)): \(W_{xr}\), \(W_{hr}\), \(b_r\), and \(\sigma_r\).

updategate : Gate

Parameters for the update gate (\(u_t\)): \(W_{xu}\), \(W_{hu}\), \(b_u\), and \(\sigma_u\).

hidden_update : Gate

Parameters for the hidden update (\(c_t\)): \(W_{xc}\), \(W_{hc}\), \(b_c\), and \(\sigma_c\).

hid_init : callable, np.ndarray, theano.shared or Layer

Initializer for initial hidden state (\(h_0\)).

backwards : bool

If True, process the sequence backwards and then reverse the output again such that the output from the layer is always from \(x_1\) to \(x_n\).

learn_init : bool

If True, initial hidden values are learned.

gradient_steps : int

Number of timesteps to include in the backpropagated gradient. If -1, backpropagate through the entire sequence.

grad_clipping : float

If nonzero, the gradient messages are clipped to the given value during the backward pass. See [R63] (p. 6) for further explanation.

unroll_scan : bool

If True the recursion is unrolled instead of using scan. For some graphs this gives a significant speed up but it might also consume more memory. When unroll_scan is True, backpropagation always includes the full sequence, so gradient_steps must be set to -1 and the input sequence length must be known at compile time (i.e., cannot be given as None).

precompute_input : bool

If True, precompute input_to_hid before iterating through the sequence. This can result in a speedup at the expense of an increase in memory usage.

mask_input : lasagne.layers.Layer

Layer which allows for a sequence mask to be input, for when sequences are of variable length. Default None, which means no mask will be supplied (i.e. all sequences are of the same length).

only_return_final : bool

If True, only return the final sequential output (e.g. for tasks where a single target value for the entire sequence is desired). In this case, Theano makes an optimization which saves memory.

Notes

An alternate update for the candidate hidden state is proposed in [R62]:

\[\begin{split}c_t &= \sigma_c(x_t W_{xc} + (r_t \odot h_{t - 1})W_{hc} + b_c)\\\end{split}\]

We use the formulation from [R61] because it allows us to do all matrix operations in a single dot product.

References

[R61]

Cho, Kyunghyun, et al: On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014).

[R62]

Chung, Junyoung, et al.: Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv preprint arXiv:1412.3555 (2014).

[R63]

Graves, Alex: “Generating sequences with recurrent neural networks.” arXiv preprint arXiv:1308.0850 (2013).
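
Examples

A minimal usage sketch, including a customized hidden update (parameter values are illustrative):

>>> import lasagne
>>> l_in = lasagne.layers.InputLayer((None, None, 10))
>>> hidden_update = lasagne.layers.Gate(
...     W_cell=None, nonlinearity=lasagne.nonlinearities.tanh)
>>> l_gru = lasagne.layers.GRULayer(l_in, num_units=20,
...                                 hidden_update=hidden_update,
...                                 grad_clipping=100)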

get_output_for(inputs, **kwargs)[source]

Compute this layer’s output function given a symbolic input variable

Parameters

inputs : list of theano.TensorType

inputs[0] should always be the symbolic input variable. When this layer has a mask input (i.e. was instantiated with mask_input != None, indicating that the lengths of sequences in each batch vary), inputs should have length 2, where inputs[1] is the mask. The mask should be supplied as a Theano variable denoting whether each time step in each sequence in the batch is part of the sequence or not. mask should be a matrix of shape (n_batch, n_time_steps) where mask[i, j] = 1 when j <= (length of sequence i) and mask[i, j] = 0 when j > (length of sequence i). When the hidden state of this layer is to be pre-filled (i.e. was set to a Layer instance) inputs should have length at least 2, and inputs[-1] is the hidden state to prefill with.

Returns

layer_output : theano.TensorType

Symbolic output variable.

get_output_shape_for(input_shapes)[source]

Computes the output shape of this layer, given a list of input shapes.

Parameters

input_shapes : list of tuple

A list of tuples, with each tuple representing the shape of one of the inputs (in the correct order). These tuples should have as many elements as there are input dimensions, and the elements should be integers or None.

Returns

tuple

A tuple representing the shape of the output of this layer. The tuple has as many elements as there are output dimensions, and the elements are all either integers or None.

Notes

This method must be overridden when implementing a new Layer class with multiple inputs. By default it raises NotImplementedError.

class lasagne.layers.Gate(W_in=lasagne.init.Normal(0.1), W_hid=lasagne.init.Normal(0.1), W_cell=lasagne.init.Normal(0.1), b=lasagne.init.Constant(0.), nonlinearity=lasagne.nonlinearities.sigmoid)[source]

Simple class to hold the parameters for a gate connection. We define a gate loosely as something which computes the linear mix of two inputs, optionally computes an element-wise product with a third, adds a bias, and applies a nonlinearity.

Parameters

W_in : Theano shared variable, numpy array or callable

Initializer for input-to-gate weight matrix.

W_hid : Theano shared variable, numpy array or callable

Initializer for hidden-to-gate weight matrix.

W_cell : Theano shared variable, numpy array, callable, or None

Initializer for cell-to-gate weight vector. If None, no cell-to-gate weight vector will be stored.

b : Theano shared variable, numpy array or callable

Initializer for input gate bias vector.

nonlinearity : callable or None

The nonlinearity that is applied to the input gate activation. If None is provided, no nonlinearity will be applied.

References

[R64]

Gers, Felix A., Jürgen Schmidhuber, and Fred Cummins. “Learning to forget: Continual prediction with LSTM.” Neural computation 12.10 (2000): 2451-2471.

Examples

For LSTMLayer the bias of the forget gate is often initialized to a large positive value to encourage the layer to initially remember the cell value; see e.g. [R64], page 15.

>>> import lasagne
>>> forget_gate = lasagne.layers.Gate(b=lasagne.init.Constant(5.0))
>>> l_lstm = lasagne.layers.LSTMLayer((10, 20, 30), num_units=10,
...                                   forgetgate=forget_gate)
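
The large forget gate bias can then be inspected on the created layer; this assumes the usual parameter attribute naming (b_forgetgate):

>>> b = l_lstm.b_forgetgate.get_value()
>>> b.shape, float(b[0])
((10,), 5.0)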