
MGL Manual

Table of Contents

[in package MGL]


1 The mgl ASDF System


2 Introduction


2.1 Overview

MGL is a Common Lisp machine learning library by Gábor Melis with some parts originally contributed by Ravenpack International. It mainly concentrates on various forms of neural networks (Boltzmann machines, feed-forward and recurrent backprop nets). Most of MGL is built on top of MGL-MAT, so it has BLAS and CUDA support.

In general, the focus is on power and performance, not on ease of use. Perhaps one day there will be a cookie-cutter interface with restricted functionality if a reasonable compromise is found between power and utility.


2.2 Links

Here is the official repository and the HTML documentation for the latest version.


2.3 Dependencies

MGL used to rely on LLA to interface to BLAS and LAPACK. That's mostly history by now, but configuration of foreign libraries is still done via LLA. See the README in LLA on how to set things up. Note that these days OpenBLAS is easier to set up and just as fast as ATLAS.

CL-CUDA and MGL-MAT are the two main dependencies and also the ones not yet in quicklisp, so just drop them into quicklisp/local-projects/. If there is no suitable GPU on the system or the CUDA SDK is not installed, MGL will simply fall back on using BLAS and Lisp code. Wrapping code in mgl-mat:with-cuda* is basically all that's needed to run on the GPU, and with mgl-mat:cuda-available-p one can check whether the GPU is really being used.
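
For example, a minimal sketch (train-my-model stands in for whatever user code is to be run):

(mgl-mat:with-cuda* ()
  ;; Within WITH-CUDA*, MGL-MAT uses the GPU if a suitable device and
  ;; the CUDA SDK are present; otherwise the same code runs on BLAS
  ;; and Lisp.
  (format t "CUDA available: ~S~%" (mgl-mat:cuda-available-p))
  (train-my-model))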


2.4 Code Organization

MGL consists of several packages dedicated to different tasks. For example, package mgl-resample is about Resampling, mgl-gd is about Gradient Descent, and so on. On one hand, having many packages makes it easier to cleanly separate API and implementation and to dive into a specific task. On the other hand, they can be a hassle, so the mgl package itself reexports every external symbol found in all the other packages that make up MGL and MGL-MAT (see MAT Manual), on which it heavily relies.

One exception to this rule is the bundled, but independent MGL-GNUPLOT library.

The built-in tests can be run with:

(ASDF:OOS 'ASDF:TEST-OP '#:MGL)

Note that most of the tests are rather stochastic and can fail once in a while.


2.5 Glossary

Ultimately machine learning is about creating models of some domain. The observations in the modelled domain are called instances (also known as examples or samples). Sets of instances are called datasets. Datasets are used when fitting a model or when making predictions. Sometimes the word predictions is too specific, and the results obtained from applying a model to some instances are simply called results.


3 Datasets

[in package MGL-DATASET]

An instance can often be any kind of object of the user's choice. It is typically represented by a set of numbers, which is called a feature vector, or by a structure holding the feature vector, the label, etc. A dataset is a sequence of such instances or a sampler (see Samplers) that produces instances.


3.1 Samplers

Some algorithms do not need random access to the entire dataset and can work with a stream of observations. Samplers are simple generators providing two functions: sample and finishedp.
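
For example, a minimal sketch that drains a sampler (function-sampler is described below; the generator here is arbitrary):

(let ((sampler (make-instance 'function-sampler
                              :generator (lambda () (random 10))
                              :max-n-samples 3)))
  (loop until (finishedp sampler)
        collect (sample sampler)))
;; => a list of three random digits, for example (7 2 5)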


3.1.1 Function Sampler


4 Resampling

[in package MGL-RESAMPLE]

The focus of this package is on resampling methods such as cross-validation and bagging which can be used for model evaluation, model selection, and also as a simple form of ensembling. Data partitioning and sampling functions are also provided because they tend to be used together with resampling.


4.1 Partitions

The following functions partition a dataset (currently only sequences are supported) into a number of partitions. For each element in the original dataset there is exactly one partition that contains it.
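
For example, a sketch using fracture (the signature shown, a fractions designator followed by the sequence, is assumed):

;; Split a sequence into two partitions of roughly equal size; every
;; element ends up in exactly one of them.
(fracture 2 '(0 1 2 3 4))
;; => something like ((0 1 2) (3 4))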


4.2 Cross-validation


4.3 Bagging


4.4 CV Bagging


4.5 Miscellaneous Operations


5 Core

[in package MGL-CORE]


5.1 Persistence


5.2 Batch Processing

Processing instances one by one during training or prediction can be slow. The models that support batch processing for greater efficiency are said to be striped.

Typically, during or after creating a model, one sets max-n-stripes on it to a positive integer. When a batch of instances is to be fed to the model, it is first broken into subbatches of length at most max-n-stripes. For each subbatch, set-input (FIXDOC) is called and a before method takes care of setting n-stripes to the actual number of instances in the subbatch. When max-n-stripes is set, internal data structures may be resized, which is an expensive operation. Setting n-stripes is a comparatively cheap operation, often implemented as matrix reshaping.

Note that for models made of different parts (for example, mgl-bp:bpn consists of mgl-bp:lumps), setting these values affects the constituent parts, but one should never change the number of stripes of the parts directly because that would lead to an internal inconsistency in the model.
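
A minimal sketch (fnn stands for any striped model, such as the one built in the fnn Tutorial later on):

;; Resize internal matrices once so that up to 50 instances can be
;; processed in parallel.
(setf (max-n-stripes fnn) 50)
;; A batch of 128 instances is then processed as subbatches of 50, 50
;; and 28; the before method of set-input sets n-stripes to the actual
;; subbatch size, which is just a cheap reshape.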


5.3 Executors


5.3.1 Parameterized Executor Cache


6 Monitoring

[in package MGL-CORE]

When training or applying a model, one often wants to track various statistics. For example, in the case of training a neural network with cross-entropy loss, these statistics could be the average cross-entropy loss itself, classification accuracy, or even the entire confusion matrix and sparsity levels in hidden layers. Also, there is the question of what to do with the measured values (log and forget, add to some counter or a list).

So there may be several phases of operation that we want to keep an eye on. Let's call these events. There can also be many fairly independent things to do in response to an event. Let's call these monitors. Some monitors are a composition of two operations: one that extracts some measurements and another that aggregates those measurements. Let's call these two measurers and counters, respectively.

For example, consider training a backpropagation neural network. We want to look at the state of the network just after the backward pass. mgl-bp:bp-learner has a monitors event hook corresponding to the moment after backpropagating the gradients. Suppose we are interested in how the training cost evolves:

(push (make-instance 'monitor
                     :measurer (lambda (instances bpn)
                                 (declare (ignore instances))
                                 (mgl-bp:cost bpn))
                     :counter (make-instance 'basic-counter))
      (monitors learner))

During training, this monitor will track the cost of training examples behind the scenes. If we want to print and reset this monitor periodically we can put another monitor on mgl-opt:iterative-optimizer's mgl-opt:on-n-instances-changed accessor:

(push (lambda (optimizer gradient-source n-instances)
        (declare (ignore optimizer))
        (when (zerop (mod n-instances 1000))
          (format t "n-instances: ~S~%" n-instances)
          (dolist (monitor (monitors gradient-source))
            (when (counter monitor)
              (format t "~A~%" (counter monitor))
              (reset-counter (counter monitor)))))
      (mgl-opt:on-n-instances-changed optimizer))

Note that the monitor we push can be anything as long as apply-monitor is implemented on it with the appropriate signature. Also note that the zerop + mod logic is fragile, so you will likely want to use mgl-opt:monitor-optimization-periodically instead of doing the above.
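
For example, a sketch equivalent in spirit to the above (log-training-progress is a hypothetical user function; see the fnn Tutorial for a complete example):

(monitor-optimization-periodically
 optimizer '((:fn log-training-progress :period 1000)))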

So that's the general idea. Concrete events are documented where they are signalled. Often there are task specific utilities that create a reasonable set of default monitors (see Classification Monitors).


6.1 Monitors


6.2 Measurers

measurer is a part of monitor objects: an embedded monitor that computes a specific quantity (e.g. classification accuracy) from the arguments of the event it is applied to (e.g. the model results). Measurers are often implemented by combining some kind of model-specific extractor with a generic measurer function.

All generic measurer functions return their results as multiple values matching the arguments of add-to-counter for a counter of a certain type (see Counters) so as to make them easily used in a monitor:

(multiple-value-call #'add-to-counter <some-counter>
                     <call-to-some-measurer>)

The counter class compatible with the measurer in this way is noted for each function.

For a list of measurer functions see Classification Measurers.


6.3 Counters


6.3.1 Attributes


6.3.2 Counter classes

In addition to the really basic ones here, also see Classification Counters.


7 Classification

[in package MGL-CORE]

To be able to measure classification related quantities, we need to define what the label of an instance is. Customization is possible by implementing a method for a specific type of instance, but these functions only ever appear as defaults that can be overridden.

The following two functions are basically the same as the previous two, but in batch mode: they return a sequence of label indices or distributions. These are called on results produced by models. Implement these for a model and the monitor maker functions below will automatically work. See FIXDOC: for bpn and boltzmann.


7.1 Classification Monitors

The following functions return a list of monitors. The monitors are for events of signature (instances model), such as those produced by monitor-model-results and its various model-specific variations. They are model-agnostic functions, extensible to new classifier types.

The monitor makers above can be extended to support new classifier types via the following generic functions.


7.2 Classification Measurers

The functions here compare some known good solution (also known as ground truth or target) to a prediction or approximation and return some measure of their [dis]similarity. They are model independent, hence one has to extract the ground truths and predictions first. Rarely used directly, they are mostly hidden behind Classification Monitors.


7.3 Classification Counters


7.3.1 Confusion Matrices


8 Features

[in package MGL-CORE]


8.1 Feature Selection

The following scoring functions all return a hash table (with test equal) that maps features to scores.


8.2 Feature Encoding

Features can rarely be fed directly to algorithms as is; they need to be transformed in some way. Suppose we have a simple language model that takes a single word as input and predicts the next word. However, both input and output are to be encoded as float vectors of length 1000. What we do is find the top 1000 words by some measure (see Feature Selection) and associate these words with the integers in [0..999] (this is encoding). By using, for example, one-hot encoding, we translate a word into a float vector when passing in the input. When the model outputs the probability distribution of the next word, we find the index of the maximum and the word associated with it (this is decoding).
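
As an illustration, here is a plain Lisp sketch of the encoding direction (the word->index hash table mapping the top 1000 words to [0..999] is assumed to exist):

(defun one-hot (word word->index &key (size 1000))
  ;; Return a float vector of length SIZE that is all zeros except for
  ;; a single 1 at the index associated with WORD.
  (let ((vector (make-array size :element-type 'single-float
                                 :initial-element 0.0)))
    (setf (aref vector (gethash word word->index)) 1.0)
    vector))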

Also see Bag of Words.


9 Gradient Based Optimization

[in package MGL-OPT]

We have a real-valued, differentiable function F and the task is to find the parameters that minimize its value. Optimization starts from a single point in the parameter space of F, and this single point is updated iteratively based on the gradient and value of F at or around the current point.

Note that while the stated problem is that of global optimization, for non-convex functions, most algorithms will tend to converge to a local optimum.

Currently, there are two optimization algorithms: Gradient Descent (with several variants) and Conjugate Gradient both of which are first order methods (they do not need second order gradients) but more can be added with the Extension API.


9.1 Iterative Optimizer

Now let's discuss a few handy utilities.


9.2 Cost Function

The function being minimized is often called the cost or the loss function.


9.3 Gradient Descent

[in package MGL-GD]

Gradient descent is a first-order optimization algorithm. Relying completely on first derivatives, it does not even evaluate the function to be minimized. Let's see how to minimize a numerical lisp function with respect to some of its parameters.

(cl:defpackage :mgl-example-sgd
  (:use #:common-lisp #:mgl))

(in-package :mgl-example-sgd)

;;; Create an object representing the sine function.
(defparameter *diff-fn-1*
  (make-instance 'mgl-diffun:diffun
                 :fn #'sin
                 ;; We are going to optimize its only parameter.
                 :weight-indices '(0)))

;;; Minimize SIN. Note that there is no dataset involved because all
;;; parameters are being optimized.
(minimize (make-instance 'sgd-optimizer :termination 1000)
          *diff-fn-1*
          :weights (make-mat 1))
;;; => A MAT with a single value of about -pi/2.

;;; Create a differentiable function for f(x,y)=(x-y)^2. X is a
;;; parameter whose values come from the DATASET argument passed to
;;; MINIMIZE. Y is a parameter to be optimized (a 'weight').
(defparameter *diff-fn-2*
  (make-instance 'mgl-diffun:diffun
                 :fn (lambda (x y)
                       (expt (- x y) 2))
                 :parameter-indices '(0)
                 :weight-indices '(1)))

;;; Find the Y that minimizes the distance from the instances
;;; generated by the sampler.
(minimize (make-instance 'sgd-optimizer :batch-size 10)
          *diff-fn-2*
          :weights (make-mat 1)
          :dataset (make-instance 'function-sampler
                                  :generator (lambda ()
                                               (list (+ 10
                                                        (gaussian-random-1))))
                                  :max-n-samples 1000))
;;; => A MAT with a single value of about 10, the expected value of
;;; the instances in the dataset.

;;; The dataset can be a SEQUENCE in which case we'd better set
;;; TERMINATION else optimization would never finish.
(minimize (make-instance 'sgd-optimizer :termination 1000)
          *diff-fn-2*
          :weights (make-mat 1)
          :dataset '((0) (1) (2) (3) (4) (5)))
;;; => A MAT with a single value of about 2.5.

We are going to see a number of accessors for optimizer parameters. In general, it's allowed to setf real slot accessors (as opposed to readers and writers) at any time during optimization, and it's also allowed to define a method on an optimizer subclass that computes the value in any way. For example, to decay the learning rate on a per-mini-batch basis:

(defmethod learning-rate ((optimizer my-sgd-optimizer))
  (* (slot-value optimizer 'learning-rate)
     (expt 0.998
           (/ (n-instances optimizer) 60000))))


9.3.1 Batch Based Optimizers

First let's see everything common to all batch based optimizers, then discuss SGD Optimizer, Adam Optimizer and Normalized Batch Optimizer. All batch based optimizers are iterative-optimizers, so see Iterative Optimizer too.


SGD Optimizer


Adam Optimizer


Normalized Batch Optimizer


9.3.2 Segmented GD Optimizer

segmented-gd-optimizer inherits from iterative-optimizer, so see Iterative Optimizer too.


9.3.3 Per-weight Optimization


9.3.4 Utilities


9.4 Conjugate Gradient

[in package MGL-CG]

Conjugate gradient is a first-order optimization algorithm. It's more advanced than gradient descent as it does line searches, which unfortunately also makes it unsuitable for non-deterministic functions. Let's see how to minimize a numerical lisp function with respect to some of its parameters.

;;; Create an object representing the sine function.
(defparameter *diff-fn-1*
  (make-instance 'mgl-diffun:diffun
                 :fn #'sin
                 ;; We are going to optimize its only parameter.
                 :weight-indices '(0)))

;;; Minimize SIN. Note that there is no dataset involved because all
;;; parameters are being optimized.
(minimize (make-instance 'cg-optimizer
                         :batch-size 1
                         :termination 1)
          *diff-fn-1*
          :weights (make-mat 1))
;;; => A MAT with a single value of about -pi/2.

;;; Create a differentiable function for f(x,y)=(x-y)^2. X is a
;;; parameter whose values come from the DATASET argument passed to
;;; MINIMIZE. Y is a parameter to be optimized (a 'weight').
(defparameter *diff-fn-2*
  (make-instance 'mgl-diffun:diffun
                 :fn (lambda (x y)
                       (expt (- x y) 2))
                 :parameter-indices '(0)
                 :weight-indices '(1)))

;;; Find the Y that minimizes the distance from the instances
;;; generated by the sampler.
(minimize (make-instance 'cg-optimizer :batch-size 10)
          *diff-fn-2*
          :weights (make-mat 1)
          :dataset (make-instance 'function-sampler
                                  :generator (lambda ()
                                               (list (+ 10
                                                        (gaussian-random-1))))
                                  :max-n-samples 1000))
;;; => A MAT with a single value of about 10, the expected value of
;;; the instances in the dataset.

;;; The dataset can be a SEQUENCE in which case we'd better set
;;; TERMINATION else optimization would never finish. Note how a
;;; single epoch suffices.
(minimize (make-instance 'cg-optimizer :termination 6)
          *diff-fn-2*
          :weights (make-mat 1)
          :dataset '((0) (1) (2) (3) (4) (5)))
;;; => A MAT with a single value of about 2.5.


9.5 Extension API


9.5.1 Implementing Optimizers

The following generic functions must be specialized for new optimizer types.

The rest are just utilities useful for implementing optimizers.


9.5.2 Implementing Gradient Sources

Weights can be stored in a multitude of ways. Optimizers need to update weights, so it is assumed that weights are stored in any number of mat objects called segments.
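
For example, segments can be enumerated with map-segments (also used in the tutorials elsewhere in this manual); a sketch that counts the total number of weights of a bpn:

(let ((n-weights 0))
  (map-segments (lambda (weights)
                  (incf n-weights (mat-size (nodes weights))))
                bpn)
  n-weights)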

The generic functions in this section must all be specialized for new gradient sources except where noted.


9.5.3 Implementing Gradient Sinks

Optimizers call accumulate-gradients* on gradient sources. One parameter of accumulate-gradients* is the sink. A gradient sink knows what accumulator matrix (if any) belongs to a segment. Sinks are defined entirely by map-gradient-sink.


10 Differentiable Functions

[in package MGL-DIFFUN]


11 Backpropagation Neural Networks

[in package MGL-BP]


11.1 Backprop Overview

Backpropagation Neural Networks are just functions with lots of parameters called weights and a layered structure when presented as a computational graph. The network is trained to minimize some kind of loss function whose value the network computes.

In this implementation, a bpn is assembled from several lumps (roughly corresponding to layers). Both feed-forward and recurrent neural nets are supported (fnn and rnn, respectively). bpns can contain not only lumps but other bpns, too. As we see, networks are composite objects and the abstract base class for composite and simple parts is called clump.

At this point, you may want to jump ahead to get a feel for how things work by reading the fnn Tutorial.


11.2 Clump API

These are mostly for extension purposes. About the only thing needed from here for normal operation is nodes when clamping inputs or extracting predictions.

A clump's nodes holds the result computed by the most recent forward pass. For ->input lumps, this is where input values shall be placed (see set-input). Currently, the matrix is always two-dimensional, but this restriction may go away in the future.
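
For example, after a forward pass, predictions can be read out of an output lump's nodes matrix (a sketch reusing names from the fnn Tutorial below):

(let ((output (nodes (find-clump 'output fnn))))
  ;; OUTPUT is a two-dimensional MAT with one row per stripe; row I
  ;; holds the prediction for the Ith instance in the batch.
  (mref output 0 1))
;; => roughly, the predicted probability of class 1 for the first instance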

In addition to the above, clumps also have to support size, n-stripes, max-n-stripes (and the setf methods of the latter two), which can be accomplished just by inheriting from bpn, fnn, rnn, or a lump.


11.3 bpns


11.3.1 Training

bpns are trained to minimize the loss function they compute. Before a bpn is passed to minimize (as its gradient-source argument), it must be wrapped in a bp-learner object. bp-learner has a monitors slot which is used, for example, by reset-optimization-monitors.

Without the bells and whistles, the basic shape of training is this:

(minimize optimizer (make-instance 'bp-learner :bpn bpn)
          :dataset dataset)


11.3.2 Monitoring


11.3.3 Feed-Forward Nets

fnn and rnn have a lot in common (see their common superclass, bpn). There is very limited functionality that's specific to fnns, so let's get it out of the way before we study a full example.


fnn Tutorial

Hopefully this example from example/digit-fnn.lisp illustrates the concepts involved. If it's too dense despite the comments, then read up on Datasets, Gradient Based Optimization and come back.

(cl:defpackage :mgl-example-digit-fnn
  (:use #:common-lisp #:mgl))

(in-package :mgl-example-digit-fnn)

;;; There are 10 possible digits used as inputs ...
(defparameter *n-inputs* 10)
;;; and we want to learn the rule that maps the input digit D to (MOD
;;; (1+ D) 3).
(defparameter *n-outputs* 3)

;;; We define a feed-forward net to be able to specialize how inputs
;;; are translated by adding a SET-INPUT method later.
(defclass digit-fnn (fnn)
  ())

;;; Build a DIGIT-FNN with a single hidden layer of rectified linear
;;; units and a softmax output.
(defun make-digit-fnn (&key (n-hiddens 5))
  (build-fnn (:class 'digit-fnn)
    (input (->input :size *n-inputs*))
    (hidden-activation (->activation input :size n-hiddens))
    (hidden (->relu hidden-activation))
    (output-activation (->activation hidden :size *n-outputs*))
    (output (->softmax-xe-loss output-activation))))

;;; This method is called with batches of 'instances' (input digits in
;;; this case) by MINIMIZE and also by MONITOR-BPN-RESULTS before
;;; performing a forward pass (i.e. computing the value of the
;;; function represented by the network). Its job is to encode the
;;; inputs by populating rows of the NODES matrix of the INPUT clump.
;;;
;;; Each input is encoded as a row of zeros with a single 1 at index
;;; determined by the input digit. This is called one-hot encoding.
;;; The TARGET could be encoded the same way, but instead we use the
;;; sparse option supported by TARGET of ->SOFTMAX-XE-LOSS.
(defmethod set-input (digits (fnn digit-fnn))
  (let* ((input (nodes (find-clump 'input fnn)))
         (output-lump (find-clump 'output fnn)))
    (fill! 0 input)
    (loop for i upfrom 0
          for digit in digits
          do (setf (mref input i digit) 1))
    (setf (target output-lump)
          (mapcar (lambda (digit)
                    (mod (1+ digit) *n-outputs*))
                  digits))))

;;; Train the network by minimizing the loss (cross-entropy here) with
;;; stochastic gradient descent.
(defun train-digit-fnn ()
  (let ((optimizer
          ;; First create the optimizer for MINIMIZE.
          (make-instance 'segmented-gd-optimizer
                         :segmenter
                         ;; We train each weight lump with the same
                         ;; parameters and, in fact, the same
                         ;; optimizer. But it need not be so, in
                         ;; general.
                         (constantly
                          (make-instance 'sgd-optimizer
                                         :learning-rate 1
                                         :momentum 0.9
                                         :batch-size 100))))
        (fnn (make-digit-fnn)))
    ;; The number of instances the FNN can work with in parallel. It's
    ;; usually equal to the batch size or a divisor of it.
    (setf (max-n-stripes fnn) 50)
    ;; Initialize all weights randomly.
    (map-segments (lambda (weights)
                    (gaussian-random! (nodes weights) :stddev 0.01))
                  fnn)
    ;; Arrange for training and test error to be logged.
    (monitor-optimization-periodically
     optimizer '((:fn log-test-error :period 10000)
                 (:fn reset-optimization-monitors :period 1000)))
    ;; Finally, start the optimization.
    (minimize optimizer
              ;; Dress FNN in a BP-LEARNER and attach monitors for the
              ;; cost to it. These monitors are going to be logged and
              ;; reset after every 1000 training instances by
              ;; RESET-OPTIMIZATION-MONITORS above.
              (make-instance 'bp-learner
                             :bpn fnn
                             :monitors (make-cost-monitors
                                        fnn :attributes `(:event "train")))
              ;; Training stops when the sampler runs out (after 10000
              ;; instances).
              :dataset (make-sampler 10000))))

;;; Return a sampler object that produces MAX-N-SAMPLES number of
;;; random inputs (numbers between 0 and 9).
(defun make-sampler (max-n-samples)
  (make-instance 'function-sampler :max-n-samples max-n-samples
                 :generator (lambda () (random *n-inputs*))))

;;; Log the test error. Also, describe the optimizer and the bpn at
;;; the beginning of training. Called periodically during training
;;; (see above).
(defun log-test-error (optimizer learner)
  (when (zerop (n-instances optimizer))
    (describe optimizer)
    (describe (bpn learner)))
  (log-padded
   (monitor-bpn-results (make-sampler 1000) (bpn learner)
                        (make-cost-monitors
                         (bpn learner) :attributes `(:event "pred.")))))

#|

;;; Transcript follows:
(repeatably ()
  (let ((*log-time* nil))
    (train-digit-fnn)))
.. training at n-instances: 0
.. train cost: 0.000e+0 (0)
.. #<SEGMENTED-GD-OPTIMIZER {100E112E93}>
..  SEGMENTED-GD-OPTIMIZER description:
..    N-INSTANCES = 0
..    OPTIMIZERS = (#<SGD-OPTIMIZER
..                    #<SEGMENT-SET
..                      (#<->WEIGHT # :SIZE 15 1/1 :NORM 0.04473>
..                       #<->WEIGHT # :SIZE 3 1/1 :NORM 0.01850>
..                       #<->WEIGHT # :SIZE 50 1/1 :NORM 0.07159>
..                       #<->WEIGHT # :SIZE 5 1/1 :NORM 0.03056>)
..                      {100E335B73}>
..                    {100E06DF83}>)
..    SEGMENTS = (#<->WEIGHT (HIDDEN OUTPUT-ACTIVATION) :SIZE
..                  15 1/1 :NORM 0.04473>
..                #<->WEIGHT (:BIAS OUTPUT-ACTIVATION) :SIZE
..                  3 1/1 :NORM 0.01850>
..                #<->WEIGHT (INPUT HIDDEN-ACTIVATION) :SIZE
..                  50 1/1 :NORM 0.07159>
..                #<->WEIGHT (:BIAS HIDDEN-ACTIVATION) :SIZE
..                  5 1/1 :NORM 0.03056>)
..  
.. #<SGD-OPTIMIZER {100E06DF83}>
..  GD-OPTIMIZER description:
..    N-INSTANCES = 0
..    SEGMENT-SET = #<SEGMENT-SET
..                    (#<->WEIGHT (HIDDEN OUTPUT-ACTIVATION) :SIZE
..                       15 1/1 :NORM 0.04473>
..                     #<->WEIGHT (:BIAS OUTPUT-ACTIVATION) :SIZE
..                       3 1/1 :NORM 0.01850>
..                     #<->WEIGHT (INPUT HIDDEN-ACTIVATION) :SIZE
..                       50 1/1 :NORM 0.07159>
..                     #<->WEIGHT (:BIAS HIDDEN-ACTIVATION) :SIZE
..                       5 1/1 :NORM 0.03056>)
..                    {100E335B73}>
..    LEARNING-RATE = 1.00000e+0
..    MOMENTUM = 9.00000e-1
..    MOMENTUM-TYPE = :NORMAL
..    WEIGHT-DECAY = 0.00000e+0
..    WEIGHT-PENALTY = 0.00000e+0
..    N-AFTER-UPATE-HOOK = 0
..    BATCH-SIZE = 100
..  
..  BATCH-GD-OPTIMIZER description:
..    N-BEFORE-UPATE-HOOK = 0
..  #<DIGIT-FNN {100E11A423}>
..   BPN description:
..     CLUMPS = #(#<->INPUT INPUT :SIZE 10 1/50 :NORM 0.00000>
..                #<->ACTIVATION
..                  (HIDDEN-ACTIVATION :ACTIVATION) :STRIPES 1/50
..                  :CLUMPS 4>
..                #<->RELU HIDDEN :SIZE 5 1/50 :NORM 0.00000>
..                #<->ACTIVATION
..                  (OUTPUT-ACTIVATION :ACTIVATION) :STRIPES 1/50
..                  :CLUMPS 4>
..                #<->SOFTMAX-XE-LOSS OUTPUT :SIZE 3 1/50 :NORM 0.00000>)
..     N-STRIPES = 1
..     MAX-N-STRIPES = 50
..   pred. cost: 1.100d+0 (1000.00)
.. training at n-instances: 1000
.. train cost: 1.093d+0 (1000.00)
.. training at n-instances: 2000
.. train cost: 5.886d-1 (1000.00)
.. training at n-instances: 3000
.. train cost: 3.574d-3 (1000.00)
.. training at n-instances: 4000
.. train cost: 1.601d-7 (1000.00)
.. training at n-instances: 5000
.. train cost: 1.973d-9 (1000.00)
.. training at n-instances: 6000
.. train cost: 4.882d-10 (1000.00)
.. training at n-instances: 7000
.. train cost: 2.771d-10 (1000.00)
.. training at n-instances: 8000
.. train cost: 2.283d-10 (1000.00)
.. training at n-instances: 9000
.. train cost: 2.123d-10 (1000.00)
.. training at n-instances: 10000
.. train cost: 2.263d-10 (1000.00)
.. pred. cost: 2.210d-10 (1000.00)
..
==> (#<->WEIGHT (:BIAS HIDDEN-ACTIVATION) :SIZE 5 1/1 :NORM 2.94294>
-->  #<->WEIGHT (INPUT HIDDEN-ACTIVATION) :SIZE 50 1/1 :NORM 11.48995>
-->  #<->WEIGHT (:BIAS OUTPUT-ACTIVATION) :SIZE 3 1/1 :NORM 3.39103>
-->  #<->WEIGHT (HIDDEN OUTPUT-ACTIVATION) :SIZE 15 1/1 :NORM 11.39339>)

|#


11.3.4 Recurrent Neural Nets


rnn Tutorial

Hopefully this example from example/sum-sign-rnn.lisp illustrates the concepts involved. Make sure you are comfortable with the fnn Tutorial before reading this.

(cl:defpackage :mgl-example-sum-sign-rnn
  (:use #:common-lisp #:mgl))

(in-package :mgl-example-sum-sign-rnn)

;;; There is a single input at each time step...
(defparameter *n-inputs* 1)
;;; and we want to learn the rule that outputs the sign of the sum of
;;; inputs so far in the sequence.
(defparameter *n-outputs* 3)

;;; Generate a training example that's a sequence of random length
;;; between 1 and LENGTH. Elements of the sequence are lists of two
;;; elements:
;;;
;;; 1. The input for the network (a single random number).
;;;
;;; 2. The sign of the sum of inputs so far encoded as 0, 1, 2 (for
;;;    negative, zero and positive values). To add a twist, the sum is
;;;    reset whenever a negative input is seen.
(defun make-sum-sign-instance (&key (length 10))
  (let ((length (max 1 (random length)))
        (sum 0))
    (loop for i below length
          collect (let ((x (1- (* 2 (random 2)))))
                    (incf sum x)
                    (when (< x 0)
                      (setq sum x))
                    (list x (cond ((minusp sum) 0)
                                  ((zerop sum) 1)
                                  (t 2)))))))

;;; Build an RNN with a single lstm hidden layer and softmax output.
;;; For each time step, a SUM-SIGN-FNN will be instantiated.
(defun make-sum-sign-rnn (&key (n-hiddens 1))
  (build-rnn ()
    (build-fnn (:class 'sum-sign-fnn)
      (input (->input :size 1))
      (h (->lstm input :name 'h :size n-hiddens))
      (prediction (->softmax-xe-loss (->activation h :name 'prediction
                                                   :size *n-outputs*))))))

;;; We define this class to be able to specialize how inputs are
;;; translated by adding a SET-INPUT method later.
(defclass sum-sign-fnn (fnn)
  ())

;;; We have a batch of instances from MAKE-SUM-SIGN-INSTANCE for the
;;; RNN. This function is invoked with elements of these instances
;;; belonging to the same time step (i.e. at the same index) and sets
;;; the input and target up.
(defmethod set-input (instances (fnn sum-sign-fnn))
  (let ((input-nodes (nodes (find-clump 'input fnn))))
    (setf (target (find-clump 'prediction fnn))
          (loop for stripe upfrom 0
                for instance in instances
                collect
                ;; Sequences in the batch are not of equal length. The
                ;; RNN sends a NIL our way if a sequence has run out.
                (when instance
                  (destructuring-bind (input target) instance
                    (setf (mref input-nodes stripe 0) input)
                    target))))))

;;; Train the network by minimizing the loss (cross-entropy here) with
;;; the Adam optimizer.
(defun train-sum-sign-rnn ()
  (let ((rnn (make-sum-sign-rnn)))
    (setf (max-n-stripes rnn) 50)
    ;; Initialize the weights in the usual sqrt(1 / fan-in) style.
    (map-segments (lambda (weights)
                    (let* ((fan-in (mat-dimension (nodes weights) 0))
                           (limit (sqrt (/ 6 fan-in))))
                      (uniform-random! (nodes weights)
                                       :limit (* 2 limit))
                      (.+! (- limit) (nodes weights))))
                  rnn)
    (minimize (monitor-optimization-periodically
               (make-instance 'adam-optimizer
                              :learning-rate 0.2
                              :mean-decay 0.9
                              :mean-decay-decay 0.9
                              :variance-decay 0.9
                              :batch-size 100)
               '((:fn log-test-error :period 30000)
                 (:fn reset-optimization-monitors :period 3000)))
              (make-instance 'bp-learner
                             :bpn rnn
                             :monitors (make-cost-monitors rnn))
              :dataset (make-sampler 30000))))

;;; Return a sampler object that produces MAX-N-SAMPLES number of
;;; random inputs.
(defun make-sampler (max-n-samples &key (length 10))
  (make-instance 'function-sampler :max-n-samples max-n-samples
                 :generator (lambda ()
                              (make-sum-sign-instance :length length))))

;;; Log the test error. Also, describe the optimizer and the bpn at
;;; the beginning of training. Called periodically during training
;;; (see above).
(defun log-test-error (optimizer learner)
  (when (zerop (n-instances optimizer))
    (describe optimizer)
    (describe (bpn learner)))
  (let ((rnn (bpn learner)))
    (log-padded
     (append
      (monitor-bpn-results (make-sampler 1000) rnn
                           (make-cost-monitors
                            rnn :attributes '(:event "pred.")))
      ;; Same result in a different way: monitor predictions for
      ;; sequences up to length 20, but don't unfold the RNN
      ;; unnecessarily to save memory.
      (let ((*warp-time* t))
        (monitor-bpn-results (make-sampler 1000 :length 20) rnn
                             ;; Just collect and reset the warp
                             ;; monitors after each batch of
                             ;; instances.
                             (make-cost-monitors
                              rnn :attributes '(:event "warped pred."))))))
    ;; Verify that no further unfoldings took place.
    (assert (<= (length (clumps rnn)) 10)))
  (log-mat-room))

#|

;;; Transcript follows:
(let (;; Backprop nets do not need double float. Using single floats
      ;; is faster and needs less memory.
      (*default-mat-ctype* :float)
      ;; Enable moving data in and out of GPU memory so that the RNN
      ;; can work with sequences so long that the unfolded network
      ;; wouldn't otherwise fit in the GPU.
      (*cuda-window-start-time* 1)
      (*log-time* nil))
  ;; Seed the random number generators.
  (repeatably ()
    ;; Enable CUDA if available.
    (with-cuda* ()
      (train-sum-sign-rnn))))
.. training at n-instances: 0
.. cost: 0.000e+0 (0)
.. #<ADAM-OPTIMIZER {1006CD5663}>
..  GD-OPTIMIZER description:
..    N-INSTANCES = 0
..    SEGMENT-SET = #<SEGMENT-SET
..                    (#<->WEIGHT (H #) :SIZE 1 1/1 :NORM 1.73685>
..                     #<->WEIGHT (H #) :SIZE 1 1/1 :NORM 0.31893>
..                     #<->WEIGHT (#1=# #2=# :PEEPHOLE) :SIZE
..                       1 1/1 :NORM 1.81610>
..                     #<->WEIGHT (H #2#) :SIZE 1 1/1 :NORM 0.21965>
..                     #<->WEIGHT (#1# #3=# :PEEPHOLE) :SIZE
..                       1 1/1 :NORM 1.74939>
..                     #<->WEIGHT (H #3#) :SIZE 1 1/1 :NORM 0.40377>
..                     #<->WEIGHT (H PREDICTION) :SIZE
..                       3 1/1 :NORM 2.15898>
..                     #<->WEIGHT (:BIAS PREDICTION) :SIZE
..                       3 1/1 :NORM 2.94470>
..                     #<->WEIGHT (#1# #4=# :PEEPHOLE) :SIZE
..                       1 1/1 :NORM 0.97601>
..                     #<->WEIGHT (INPUT #4#) :SIZE 1 1/1 :NORM 0.65261>
..                     #<->WEIGHT (:BIAS #4#) :SIZE 1 1/1 :NORM 0.37653>
..                     #<->WEIGHT (INPUT #1#) :SIZE 1 1/1 :NORM 0.92334>
..                     #<->WEIGHT (:BIAS #1#) :SIZE 1 1/1 :NORM 0.01609>
..                     #<->WEIGHT (INPUT #5=#) :SIZE 1 1/1 :NORM 1.09995>
..                     #<->WEIGHT (:BIAS #5#) :SIZE 1 1/1 :NORM 1.41244>
..                     #<->WEIGHT (INPUT #6=#) :SIZE 1 1/1 :NORM 0.40475>
..                     #<->WEIGHT (:BIAS #6#) :SIZE 1 1/1 :NORM 1.75358>)
..                    {1006CD8753}>
..    LEARNING-RATE = 2.00000e-1
..    MOMENTUM = NONE
..    MOMENTUM-TYPE = :NONE
..    WEIGHT-DECAY = 0.00000e+0
..    WEIGHT-PENALTY = 0.00000e+0
..    N-AFTER-UPATE-HOOK = 0
..    BATCH-SIZE = 100
..  
..  BATCH-GD-OPTIMIZER description:
..    N-BEFORE-UPATE-HOOK = 0
..  
..  ADAM-OPTIMIZER description:
..    MEAN-DECAY-RATE = 1.00000e-1
..    MEAN-DECAY-RATE-DECAY = 9.00000e-1
..    VARIANCE-DECAY-RATE = 1.00000e-1
..    VARIANCE-ADJUSTMENT = 1.00000d-7
..  #<RNN {10047C77E3}>
..   BPN description:
..     CLUMPS = #(#<SUM-SIGN-FNN :STRIPES 1/50 :CLUMPS 4>
..                #<SUM-SIGN-FNN :STRIPES 1/50 :CLUMPS 4>)
..     N-STRIPES = 1
..     MAX-N-STRIPES = 50
..   
..   RNN description:
..     MAX-LAG = 1
..   pred.        cost: 1.223e+0 (4455.00)
.. warped pred. cost: 1.228e+0 (9476.00)
.. Foreign memory usage:
.. foreign arrays: 162 (used bytes: 39,600)
.. CUDA memory usage:
.. device arrays: 114 (used bytes: 220,892, pooled bytes: 19,200)
.. host arrays: 162 (used bytes: 39,600)
.. host->device copies: 6,164, device->host copies: 4,490
.. training at n-instances: 3000
.. cost: 3.323e-1 (13726.00)
.. training at n-instances: 6000
.. cost: 3.735e-2 (13890.00)
.. training at n-instances: 9000
.. cost: 1.012e-2 (13872.00)
.. training at n-instances: 12000
.. cost: 3.026e-3 (13953.00)
.. training at n-instances: 15000
.. cost: 9.267e-4 (13948.00)
.. training at n-instances: 18000
.. cost: 2.865e-4 (13849.00)
.. training at n-instances: 21000
.. cost: 8.893e-5 (13758.00)
.. training at n-instances: 24000
.. cost: 2.770e-5 (13908.00)
.. training at n-instances: 27000
.. cost: 8.514e-6 (13570.00)
.. training at n-instances: 30000
.. cost: 2.705e-6 (13721.00)
.. pred.        cost: 1.426e-6 (4593.00)
.. warped pred. cost: 1.406e-6 (9717.00)
.. Foreign memory usage:
.. foreign arrays: 216 (used bytes: 52,800)
.. CUDA memory usage:
.. device arrays: 148 (used bytes: 224,428, pooled bytes: 19,200)
.. host arrays: 216 (used bytes: 52,800)
.. host->device copies: 465,818, device->host copies: 371,990
..
==> (#<->WEIGHT (H (H :OUTPUT)) :SIZE 1 1/1 :NORM 0.10624>
-->  #<->WEIGHT (H (H :CELL)) :SIZE 1 1/1 :NORM 0.94460>
-->  #<->WEIGHT ((H :CELL) (H :FORGET) :PEEPHOLE) :SIZE 1 1/1 :NORM 0.61312>
-->  #<->WEIGHT (H (H :FORGET)) :SIZE 1 1/1 :NORM 0.38093>
-->  #<->WEIGHT ((H :CELL) (H :INPUT) :PEEPHOLE) :SIZE 1 1/1 :NORM 1.17956>
-->  #<->WEIGHT (H (H :INPUT)) :SIZE 1 1/1 :NORM 0.88011>
-->  #<->WEIGHT (H PREDICTION) :SIZE 3 1/1 :NORM 49.93808>
-->  #<->WEIGHT (:BIAS PREDICTION) :SIZE 3 1/1 :NORM 10.98112>
-->  #<->WEIGHT ((H :CELL) (H :OUTPUT) :PEEPHOLE) :SIZE 1 1/1 :NORM 0.67996>
-->  #<->WEIGHT (INPUT (H :OUTPUT)) :SIZE 1 1/1 :NORM 0.65251>
-->  #<->WEIGHT (:BIAS (H :OUTPUT)) :SIZE 1 1/1 :NORM 10.23003>
-->  #<->WEIGHT (INPUT (H :CELL)) :SIZE 1 1/1 :NORM 5.98116>
-->  #<->WEIGHT (:BIAS (H :CELL)) :SIZE 1 1/1 :NORM 0.10681>
-->  #<->WEIGHT (INPUT (H :FORGET)) :SIZE 1 1/1 :NORM 4.46301>
-->  #<->WEIGHT (:BIAS (H :FORGET)) :SIZE 1 1/1 :NORM 1.57195>
-->  #<->WEIGHT (INPUT (H :INPUT)) :SIZE 1 1/1 :NORM 0.36401>
-->  #<->WEIGHT (:BIAS (H :INPUT)) :SIZE 1 1/1 :NORM 8.63833>)

|#


Time Warp

The unbounded memory usage of rnns with one bpn allocated per time step can become a problem. For training, where the gradients often have to be backpropagated from the last time step to the very beginning, this is hard to solve but with cuda-window-start-time the limit is no longer GPU memory.

For prediction on the other hand, one doesn't need to keep old steps around indefinitely: they can be discarded when future time steps will never reference them again.


11.4 Lumps


11.4.1 Lump Base Class


11.4.2 Inputs


Input Lump


Embedding Lump

This lump is like an input and a simple activation molded together in the name of efficiency.


11.4.3 Weight Lump


11.4.4 Activations


Activation Subnet

So we have some inputs. Usually the next step is to multiply the input vector with a weight matrix and add biases. This can be done directly with ->+, ->v*m and ->weight, but it's more convenient to use activation subnets to reduce the clutter.


Batch-Normalization


11.4.5 Activation Functions

Now we are moving on to the most important non-linearities to which activations are fed.


Sigmoid Lump


Tanh Lump


Scaled Tanh Lump


Relu Lump

We are somewhere around year 2007 by now.


Max Lump

We are in about year 2011.


Min Lump


Max-Channel Lump


11.4.6 Losses

Ultimately, we need to tell the network what to learn which means that the loss function to be minimized needs to be constructed as part of the network.


Loss Lump


Squared Difference Lump

In regression, the squared error loss is most common. The squared error loss can be constructed by combining ->squared-difference with a ->loss.
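
A hedged sketch (assuming ->squared-difference takes the two lumps to compare and that ->loss marks its argument as the cost to minimize):

(build-fnn ()
  (input (->input :size 10))
  (target (->input :size 1))
  (prediction (->activation input :size 1))
  ;; The cost is the squared difference between the prediction and the
  ;; clamped target.
  (loss (->loss (->squared-difference prediction target))))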


Softmax Cross-Entropy Loss Lump


11.4.7 Stochasticity


Dropout Lump


Gaussian Random Lump


Binary Sampling Lump


11.4.8 Arithmetic


Sum Lump


Vector-Matrix Multiplication Lump


Elementwise Addition Lump


Elementwise Multiplication Lump


Abs Lump


Exp Lump


Normalized Lump


11.4.9 Operations for rnns


LSTM Subnet


Sequence Barrier Lump


11.5 Utilities


12 Boltzmann Machines


13 Gaussian Processes


14 Natural Language Processing

[in package MGL-NLP]

This is nothing more than a couple of utilities for now, which may grow into a more serious toolset for NLP eventually.


14.1 Bag of Words