MGL Manual
Table of Contents
 1 mgl ASDF System Details
 2 Introduction
 3 Datasets
 4 Resampling
 5 Core
 6 Monitoring
 7 Classification
 8 Features
 9 Gradient Based Optimization
 10 Differentiable Functions
 11 Backpropagation Neural Networks
 12 Boltzmann Machines
 13 Gaussian Processes
 14 Natural Language Processing
[in package MGL]
1 mgl ASDF System Details
 Version: 0.1.0
 Description: MGL is a machine learning library for backpropagation neural networks, boltzmann machines, gaussian processes and more.
 Licence: MIT, see COPYING.
 Author: Gábor Melis mailto:mega@retes.hu
 Mailto: mega@retes.hu
 Homepage: http://melisgl.github.io/mgl
 Bug tracker: https://github.com/melisgl/mgl/issues
 Source control: GIT
2 Introduction
2.1 Overview
MGL is a Common Lisp machine learning library by Gábor
Melis with some parts originally contributed
by Ravenpack International. It mainly concentrates on various forms
of neural networks (boltzmann machines, feedforward and recurrent
backprop nets). Most of MGL is built on top of MGLMAT
so it has
BLAS and CUDA support.
In general, the focus is on power and performance not on ease of use. Perhaps one day there will be a cookie cutter interface with restricted functionality if a reasonable compromise is found between power and utility.
2.2 Links
Here is the official repository and the HTML documentation for the latest version.
2.3 Dependencies
MGL used to rely on LLA to
interface to BLAS and LAPACK. That's mostly history by now, but
configuration of foreign libraries is still done via LLA
. See the
README in LLA
on how to set things up. Note that these days OpenBLAS
is easier to set up and just as fast as ATLAS.
CLCUDA and
MGLMAT are the two main
dependencies and also the ones not yet in quicklisp, so just drop
them into quicklisp/localprojects/
. If there is no suitable GPU
on the system or the CUDA SDK is not installed, MGL will simply
fall back on using BLAS and Lisp code. Wrapping code in
MGLMAT:WITHCUDA*
is basically all that's needed to run on the GPU,
and with MGLMAT:CUDAAVAILABLEP
one can check whether the GPU is
really being used.
2.4 Code Organization
MGL consists of several packages dedicated to different tasks.
For example, package MGLRESAMPLE
is about Resampling and
MGLGD
is about Gradient Descent and so on. On one hand, having many
packages makes it easier to cleanly separate API and implementation
and also to explore into a specific task. At other times, they can
be a hassle, so the MGL
package itself reexports every external
symbol found in all the other packages that make up MGL and
MGLMAT (see MAT Manual) on which it heavily relies.
One exception to this rule is the bundled, but independent
MGLGNUPLOT
library.
The built in tests can be run with:
(ASDF:OOS 'ASDF:TESTOP '#:MGL)
Note, that most of the tests are rather stochastic and can fail once in a while.
2.5 Glossary
Ultimately machine learning is about creating models of some domain. The observations in the modelled domain are called instances (also known as examples or samples). Sets of instances are called datasets. Datasets are used when fitting a model or when making predictions. Sometimes the word predictions is too specific, and the results obtained from applying a model to some instances are simply called results.
3 Datasets
[in package MGLDATASET]
An instance can often be any kind of object of the user's choice.
It is typically represented by a set of numbers which is called a
feature vector or by a structure holding the feature vector, the
label, etc. A dataset is a SEQUENCE
of such instances or a
Samplers object that produces instances.
[function] MAPDATASET FN DATASET
Call
FN
with each instance inDATASET
. This is basically equivalent to iterating over the elements of a sequence or a sampler (see Samplers).
[function] MAPDATASETS FN DATASETS &KEY (IMPUTE
NIL
IMPUTEP
)Call
FN
with a list of instances, one from each dataset inDATASETS
. Return nothing. IfIMPUTE
is specified then iterate until the largest dataset is consumed imputingIMPUTE
for missing values. IfIMPUTE
is not specified then iterate until the smallest dataset runs out.(mapdatasets #'prin1 '((0 1 2) (:a :b))) .. (0 :A)(1 :B) (mapdatasets #'prin1 '((0 1 2) (:a :b)) :impute nil) .. (0 :A)(1 :B)(2 NIL)
It is of course allowed to mix sequences with samplers:
(mapdatasets #'prin1 (list '(0 1 2) (makesequencesampler '(:a :b) :maxnsamples 2))) .. (0 :A)(1 :B)
3.1 Samplers
Some algorithms do not need random access to the entire dataset and
can work with a stream observations. Samplers are simple generators
providing two functions: SAMPLE
and FINISHEDP
.
[genericfunction] SAMPLE SAMPLER
If
SAMPLER
has not run out of data (seeFINISHEDP
)SAMPLE
returns an object that represents a sample from the world to be experienced or, in other words, simply something the can be used as input for training or prediction. It is not allowed to callSAMPLE
ifSAMPLER
isFINISHEDP
.
[genericfunction] FINISHEDP SAMPLER
See if
SAMPLER
has run out of examples.
[function] LISTSAMPLES SAMPLER MAXSIZE
Return a list of samples of length at most
MAXSIZE
or less ifSAMPLER
runs out.
[function] MAKESEQUENCESAMPLER SEQ &KEY MAXNSAMPLES
Create a sampler that returns elements of
SEQ
in their original order. IfMAXNSAMPLES
is nonnil, then at mostMAXNSAMPLES
are sampled.
[function] MAKERANDOMSAMPLER SEQ &KEY MAXNSAMPLES (REORDER #'
MGLRESAMPLE:SHUFFLE
)Create a sampler that returns elements of
SEQ
in random order. IfMAXNSAMPLES
is nonnil, then at mostMAXNSAMPLES
are sampled. The first pass over a shuffled copy ofSEQ
, and this copy is reshuffled whenever the sampler reaches the end of it. Shuffling is performed by calling theREORDER
function.
[variable] *INFINITELYEMPTYDATASET* #<FUNCTIONSAMPLER "infinitely empty" >
This is the default dataset for
MGLOPT:MINIMIZE
. It's an infinite stream ofNIL
s.
3.1.1 Function Sampler

A sampler with a function in its
GENERATOR
that produces a stream of samples which may or may not be finite depending onMAXNSAMPLES
.FINISHEDP
returnsT
iffMAXNSAMPLES
is nonnil, and it's not greater than the number of samples generated (NSAMPLES
).(listsamples (makeinstance 'functionsampler :generator (lambda () (random 10)) :maxnsamples 5) 10) => (3 5 2 3 3)
[reader] GENERATOR FUNCTIONSAMPLER (:GENERATOR)
A generator function of no arguments that returns the next sample.
 [accessor] MAXNSAMPLES FUNCTIONSAMPLER (:MAXNSAMPLES =
NIL
)
[reader] NAME FUNCTIONSAMPLER (:NAME =
NIL
)An arbitrary object naming the sampler. Only used for printing the sampler object.
4 Resampling
[in package MGLRESAMPLE]
The focus of this package is on resampling methods such as crossvalidation and bagging which can be used for model evaluation, model selection, and also as a simple form of ensembling. Data partitioning and sampling functions are also provided because they tend to be used together with resampling.
4.1 Partitions
The following functions partition a dataset (currently only
SEQUENCE
s are supported) into a number of partitions. For each
element in the original dataset there is exactly one partition that
contains it.
[function] FRACTURE FRACTIONS SEQ &KEY WEIGHT
Partition
SEQ
into a number of subsequences.FRACTIONS
is either a positive integer or a list of nonnegative real numbers.WEIGHT
isNIL
or a function that returns a nonnegative real number when called with an element fromSEQ
. IfFRACTIONS
is a positive integer then return a list of that many subsequences with equal sum of weights bar rounding errors, else partitionSEQ
into subsequences, where the sum of weights of subsequence I is proportional to element I ofFRACTIONS
. IfWEIGHT
isNIL
, then it's element is assumed to have the same weight.To split into 5 sequences:
(fracture 5 '(0 1 2 3 4 5 6 7 8 9)) => ((0 1) (2 3) (4 5) (6 7) (8 9))
To split into two sequences whose lengths are proportional to 2 and 3:
(fracture '(2 3) '(0 1 2 3 4 5 6 7 8 9)) => ((0 1 2 3) (4 5 6 7 8 9))
[function] STRATIFY SEQ &KEY (KEY #'
IDENTITY
) (TEST #'EQL
)Return the list of strata of
SEQ
.SEQ
is a sequence of elements for which the functionKEY
returns the class they belong to. Such classes are opaque objects compared for equality withTEST
. A stratum is a sequence of elements with the same (underTEST
)KEY
.(stratify '(0 1 2 3 4 5 6 7 8 9) :key #'evenp) => ((0 2 4 6 8) (1 3 5 7 9))
[function] FRACTURESTRATIFIED FRACTIONS SEQ &KEY (KEY #'
IDENTITY
) (TEST #'EQL
) WEIGHTSimilar to
FRACTURE
, but also makes sure that keys are evenly distributed among the partitions (seeSTRATIFY
). It can be useful for classification tasks to partition the data set while keeping the distribution of classes the same.Note that the sets returned are not in random order. In fact, they are sorted internally by
KEY
.For example, to make two splits with approximately the same number of even and odd numbers:
(fracturestratified 2 '(0 1 2 3 4 5 6 7 8 9) :key #'evenp) => ((0 2 1 3) (4 6 8 5 7 9))
4.2 Crossvalidation
[function] CROSSVALIDATE DATA FN &KEY (NFOLDS 5) (FOLDS (
ALEXANDRIA:IOTA
NFOLDS
)) (SPLITFN #'SPLITFOLD/MOD
) PASSFOLDMap
FN
over theFOLDS
ofDATA
split withSPLITFN
and collect the results in a list. The simplest demonstration is:(crossvalidate '(0 1 2 3 4) (lambda (test training) (list test training)) :nfolds 5) => (((0) (1 2 3 4)) ((1) (0 2 3 4)) ((2) (0 1 3 4)) ((3) (0 1 2 4)) ((4) (0 1 2 3)))
Of course, in practice one would typically train a model and return the trained model and/or its score on
TEST
. Also, sometimes one may want to do only some of the folds and remember which ones they were:(crossvalidate '(0 1 2 3 4) (lambda (fold test training) (list :fold fold test training)) :folds '(2 3) :passfold t) => ((:fold 2 (2) (0 1 3 4)) (:fold 3 (3) (0 1 2 4)))
Finally, the way the data is split can be customized. By default
SPLITFOLD/MOD
is called with the argumentsDATA
, the fold (from amongFOLDS
) andNFOLDS
.SPLITFOLD/MOD
returns two values which are then passed on toFN
. One can useSPLITFOLD/CONT
orSPLITSTRATIFIED
or any other function that works with these arguments. The only real constraint is thatFN
has to take as many arguments (plus the fold argument ifPASSFOLD
) asSPLITFN
returns.
[function] SPLITFOLD/MOD SEQ FOLD NFOLDS
Partition
SEQ
into two sequences: one with elements ofSEQ
with indices whose remainder isFOLD
when divided withNFOLDS
, and a second one with the rest. The second one is the larger set. The order of elements remains stable. This function is suitable as theSPLITFN
argument ofCROSSVALIDATE
.
[function] SPLITFOLD/CONT SEQ FOLD NFOLDS
Imagine dividing
SEQ
intoNFOLDS
subsequences of the same size (bar rounding). Return the subsequence of indexFOLD
as the first value and the all the other subsequences concatenated into one as the second value. The order of elements remains stable. This function is suitable as theSPLITFN
argument ofCROSSVALIDATE
.
[function] SPLITSTRATIFIED SEQ FOLD NFOLDS &KEY (KEY #'
IDENTITY
) (TEST #'EQL
) WEIGHTSplit
SEQ
intoNFOLDS
partitions (as inFRACTURESTRATIFIED
). Return the partition of indexFOLD
as the first value, and the concatenation of the rest as the second value. This function is suitable as theSPLITFN
argument ofCROSSVALIDATE
(mostly likely as a closure withKEY
,TEST
,WEIGHT
bound).
4.3 Bagging
[function] BAG SEQ FN &KEY (RATIO 1) N WEIGHT (REPLACEMENT
T
) KEY (TEST #'EQL
) (RANDOMSTATE*RANDOMSTATE*
)Sample from
SEQ
withSAMPLEFROM
(passingRATIO
,WEIGHT
,REPLACEMENT
), orSAMPLESTRATIFIED
ifKEY
is notNIL
. CallFN
with the sample. IfN
isNIL
then keep repeating this untilFN
performs a nonlocal exit. ElseN
must be a nonnegative integer,N
iterations will be performed, the primary values returned byFN
collected into a list and returned. SeeSAMPLEFROM
andSAMPLESTRATIFIED
for examples.
[function] SAMPLEFROM RATIO SEQ &KEY WEIGHT REPLACEMENT (RANDOMSTATE
*RANDOMSTATE*
)Return a sequence constructed by sampling with or without
REPLACEMENT
fromSEQ
. The sum of weights in the result sequence will approximately be the sum of weights ofSEQ
timesRATIO
. IfWEIGHT
isNIL
then elements are assumed to have equal weights, elseWEIGHT
should return a nonnegative real number when called with an element ofSEQ
.To randomly select half of the elements:
(samplefrom 1/2 '(0 1 2 3 4 5)) => (5 3 2)
To randomly select some elements such that the sum of their weights constitute about half of the sum of weights across the whole sequence:
(samplefrom 1/2 '(0 1 2 3 4 5 6 7 8 9) :weight #'identity) => ;; sums to 28 that's near 45/2 (9 4 1 6 8)
To sample with replacement (that is, allowing the element to be sampled multiple times):
(samplefrom 1 '(0 1 2 3 4 5) :replacement t) => (1 1 5 1 4 4)
[function] SAMPLESTRATIFIED RATIO SEQ &KEY WEIGHT REPLACEMENT (KEY #'
IDENTITY
) (TEST #'EQL
) (RANDOMSTATE*RANDOMSTATE*
)Like
SAMPLEFROM
but makes sure that the weighted proportion of classes in the result is approximately the same as the proportion inSEQ
. SeeSTRATIFY
for the description ofKEY
andTEST
.
4.4 CV Bagging
[function] BAGCV DATA FN &KEY N (NFOLDS 5) (FOLDS (
ALEXANDRIA:IOTA
NFOLDS
)) (SPLITFN #'SPLITFOLD/MOD
) PASSFOLD (RANDOMSTATE*RANDOMSTATE*
)Perform crossvalidation on different shuffles of
DATA
N
times and collect the results. SinceCROSSVALIDATE
collects the return values ofFN
, the return value of this function is a list of lists ofFN
results. IfN
isNIL
, don't collect anything just keep doing repeated CVs untilFN
performs a nonlocal exit.The following example simply collects the test and training sets for 2fold CV repeated 3 times with shuffled data:
;;; This is nondeterministic. (bagcv '(0 1 2 3 4) #'list :n 3 :nfolds 2) => ((((2 3 4) (1 0)) ((1 0) (2 3 4))) (((2 1 0) (4 3)) ((4 3) (2 1 0))) (((1 0 3) (2 4)) ((2 4) (1 0 3))))
CV bagging is useful when a single CV is not producing stable results. As an ensemble method, CV bagging has the advantage over bagging that each example will occur the same number of times and after the first CV is complete there is a complete but less reliable estimate for each example which gets refined by further CVs.
4.5 Miscellaneous Operations
[function] SPREADSTRATA SEQ &KEY (KEY #'
IDENTITY
) (TEST #'EQL
)Return a sequence that's a reordering of
SEQ
such that elements belonging to different strata (underKEY
andTEST
, seeSTRATIFY
) are distributed evenly. The order of elements belonging to the same stratum is unchanged.For example, to make sure that even and odd numbers are distributed evenly:
(spreadstrata '(0 2 4 6 8 1 3 5 7 9) :key #'evenp) => (0 1 2 3 4 5 6 7 8 9)
Same thing with unbalanced classes:
(spreadstrata (vector 0 2 3 5 6 1 4) :key (lambda (x) (if (member x '(1 4)) t nil))) => #(0 1 2 3 4 5 6)
[function] ZIPEVENLY SEQS &KEY RESULTTYPE
Make a single sequence out of the sequences in
SEQS
so that in the returned sequence indices of elements belonging to the same source sequence are spread evenly across the whole range. The result is a list isRESULTTYPE
isLIST
, it's a vector ifRESULTTYPE
isVECTOR
. IfRESULTTYPE
isNIL
, then it's determined by the type of the first sequence inSEQS
.(zipevenly '((0 2 4) (1 3))) => (0 1 2 3 4)
5 Core
[in package MGLCORE]
5.1 Persistence
[function] LOADSTATE FILENAME OBJECT
Load weights of
OBJECT
fromFILENAME
. ReturnOBJECT
.
[function] SAVESTATE FILENAME OBJECT &KEY (IFEXISTS
:ERROR
) (ENSURET
)Save weights of
OBJECT
toFILENAME
. IfENSURE
, thenENSUREDIRECTORIESEXIST
is called onFILENAME
.IFEXISTS
is passed on toOPEN
. ReturnOBJECT
.
[function] READSTATE OBJECT STREAM
Read the weights of
OBJECT
from the bivalentSTREAM
where weights mean the learnt parameters. There is currently no sanity checking of data which will most certainly change in the future together with the serialization format. ReturnOBJECT
.
[function] WRITESTATE OBJECT STREAM
Write weight of
OBJECT
to the bivalentSTREAM
. ReturnOBJECT
.
[genericfunction] READSTATE* OBJECT STREAM CONTEXT
This is the extension point for
READSTATE
. It is guaranteed that primaryREADSTATE*
methods will be called only once for eachOBJECT
(under EQ).CONTEXT
is an opaque object and must be passed on to any recursiveREADSTATE*
calls.
[genericfunction] WRITESTATE* OBJECT STREAM CONTEXT
This is the extension point for
WRITESTATE
. It is guaranteed that primaryWRITESTATE*
methods will be called only once for eachOBJECT
(under EQ).CONTEXT
is an opaque object and must be passed on to any recursiveWRITESTATE*
calls.
5.2 Batch Processing
Processing instances one by one during training or prediction can be slow. The models that support batch processing for greater efficiency are said to be striped.
Typically, during or after creating a model, one sets MAXNSTRIPES
on it a positive integer. When a batch of instances is to be fed to
the model it is first broken into subbatches of length that's at
most MAXNSTRIPES
. For each subbatch, SETINPUT
(FIXDOC) is called
and a before method takes care of setting NSTRIPES
to the actual
number of instances in the subbatch. When MAXNSTRIPES
is set
internal data structures may be resized which is an expensive
operation. Setting NSTRIPES
is a comparatively cheap operation,
often implemented as matrix reshaping.
Note that for models made of different parts (for example,
MGLBP:BPN
consists of MGLBP:LUMP
s) , setting these
values affects the constituent parts, but one should never change
the number stripes of the parts directly because that would lead to
an internal inconsistency in the model.
[genericfunction] MAXNSTRIPES OBJECT
The number of stripes with which the
OBJECT
is capable of dealing simultaneously.
[genericfunction] SETMAXNSTRIPES MAXNSTRIPES OBJECT
Allocate the necessary stuff to allow for
MAXNSTRIPES
number of stripes to be worked with simultaneously inOBJECT
. This is called whenMAXNSTRIPES
isSETF
'ed.
[genericfunction] NSTRIPES OBJECT
The number of stripes currently present in
OBJECT
. This is at mostMAXNSTRIPES
.
[genericfunction] SETNSTRIPES NSTRIPES OBJECT
Set the number of stripes (out of
MAXNSTRIPES
) that are in use inOBJECT
. This is called whenNSTRIPES
isSETF
'ed.
[macro] WITHSTRIPES SPECS &BODY BODY
Bind start and optionally end indices belonging to stripes in striped objects.
(WITHSTRIPES ((STRIPE1 OBJECT1 START1 END1) (STRIPE2 OBJECT2 START2) ...) ...)
This is how one's supposed to find the index range corresponding to the Nth input in an input lump of a bpn:
(withstripes ((n inputlump start end)) (loop for i upfrom start below end do (setf (mref (nodes inputlump) i) 0d0)))
Note how the input lump is striped, but the matrix into which we are indexing (
NODES
) is not known toWITHSTRIPES
. In fact, for lumps the same stripe indices work withNODES
andMGLBP:DERIVATIVES
.
[genericfunction] STRIPESTART STRIPE OBJECT
Return the start index of
STRIPE
in some array or matrix ofOBJECT
.
[genericfunction] STRIPEEND STRIPE OBJECT
Return the end index (exclusive) of
STRIPE
in some array or matrix ofOBJECT
.
[genericfunction] SETINPUT INSTANCES MODEL
Set
INSTANCES
as inputs inMODEL
.SAMPLE
S is always aSEQUENCE
of instances even for models not capable of batch operation. It setsNSTRIPES
to (LENGTH
INSTANCES
) in a:BEFORE
method.
[function] MAPBATCHESFORMODEL FN DATASET MODEL
Call
FN
with batches of instances fromDATASET
suitable forMODEL
. The number of instances in a batch isMAXNSTRIPES
ofMODEL
or less if there are no more instances left.
[macro] DOBATCHESFORMODEL (BATCH (DATASET MODEL)) &BODY BODY
Convenience macro over
MAPBATCHESFORMODEL
.
5.3 Executors
[genericfunction] MAPOVEREXECUTORS FN INSTANCES PROTOTYPEEXECUTOR
Divide
INSTANCES
between executors that perform the same function asPROTOTYPEEXECUTOR
and callFN
with the instances and the executor for which the instances are.Some objects conflate function and call: the forward pass of a
MGLBP:BPN
computes output from inputs so it is like a function but it also doubles as a function call in the sense that the bpn (function) object changes state during the computation of the output. Hence not even the forward pass of a bpn is thread safe. There is also the restriction that all inputs must be of the same size.For example, if we have a function that builds bpn a for an input of a certain size, then we can create a factory that creates bpns for a particular call. The factory probably wants to keep the weights the same though. In Parameterized Executor Cache,
MAKEEXECUTORWITHPARAMETERS
is this factory.Parallelization of execution is another possibility
MAPOVEREXECUTORS
allows, but there is no prebuilt solution for it, yet.The default implementation simply calls
FN
withINSTANCES
andPROTOTYPEEXECUTOR
.
[macro] DOEXECUTORS (INSTANCES OBJECT) &BODY BODY
Convenience macro on top of
MAPOVEREXECUTORS
.
5.3.1 Parameterized Executor Cache
[class] PARAMETERIZEDEXECUTORCACHEMIXIN
Mix this into a model, implement
INSTANCETOEXECUTORPARAMETERS
andMAKEEXECUTORWITHPARAMETERS
andDOEXECUTORS
will be to able build executors suitable for different instances. The canonical example is using a BPN to compute the means and convariances of a gaussian process. Since each instance is made of a variable number of observations, the size of the input is not constant, thus we have a bpn (an executor) for each input dimension (the parameters).
[genericfunction] MAKEEXECUTORWITHPARAMETERS PARAMETERS CACHE
Create a new executor for
PARAMETERS
.CACHE
is aPARAMETERIZEDEXECUTORCACHEMIXIN
. In the BPN gaussian process example,PARAMETERS
would be a list of input dimensions.
[genericfunction] INSTANCETOEXECUTORPARAMETERS INSTANCE CACHE
Return the parameters for an executor able to handle
INSTANCE
. Called byMAPOVEREXECUTORS
onCACHE
(that's aPARAMETERIZEDEXECUTORCACHEMIXIN
). The returned parameters are keys in anEQUAL
parameters>executor hash table.
6 Monitoring
[in package MGLCORE]
When training or applying a model, one often wants to track various statistics. For example, in the case of training a neural network with crossentropy loss, these statistics could be the average crossentropy loss itself, classification accuracy, or even the entire confusion matrix and sparsity levels in hidden layers. Also, there is the question of what to do with the measured values (log and forget, add to some counter or a list).
So there may be several phases of operation when we want to keep an eye on. Let's call these events. There can also be many fairly independent things to do in response to an event. Let's call these monitors. Some monitors are a composition of two operations: one that extracts some measurements and another that aggregates those measurements. Let's call these two measurers and counters, respectively.
For example, consider training a backpropagation neural network. We
want to look at the state of of network just after the backward
pass. MGLBP:BPLEARNER
has a [MONITORS]
event hook corresponding to the moment after backpropagating the
gradients. Suppose we are interested in how the training cost
evolves:
(push (makeinstance 'monitor
:measurer (lambda (instances bpn)
(declare (ignore instances))
(mglbp:cost bpn))
:counter (makeinstance 'basiccounter))
(monitors learner))
During training, this monitor will track the cost of training
examples behind the scenes. If we want to print and reset this
monitor periodically we can put another monitor on
MGLOPT:ITERATIVEOPTIMIZER
's MGLOPT:ONNINSTANCESCHANGED
accessor:
(push (lambda (optimizer gradientsource ninstances)
(declare (ignore optimizer))
(when (zerop (mod ninstances 1000))
(format t "ninstances: ~S~%" ninstances)
(dolist (monitor (monitors gradientsource))
(when (counter monitor)
(format t "~A~%" (counter monitor))
(resetcounter (counter monitor)))))
(mglopt:onninstanceschanged optimizer))
Note that the monitor we push can be anything as long as
APPLYMONITOR
is implemented on it with the appropriate signature.
Also note that the ZEROP
+ MOD
logic is fragile, so you will likely
want to use MGLOPT:MONITOROPTIMIZATIONPERIODICALLY
instead of
doing the above.
So that's the general idea. Concrete events are documented where they are signalled. Often there are task specific utilities that create a reasonable set of default monitors (see Classification Monitors).
[function] APPLYMONITORS MONITORS &REST ARGUMENTS
Call
APPLYMONITOR
on each monitor inMONITORS
andARGUMENTS
. This is how an event is fired.
[genericfunction] APPLYMONITOR MONITOR &REST ARGUMENTS
Apply
MONITOR
toARGUMENTS
. This sound fairly generic, because it is.MONITOR
can be anything, even a simple function or symbol, in which case this is justCL:APPLY
. See Monitors for more.
[genericfunction] COUNTER MONITOR
Return an object representing the state of
MONITOR
orNIL
, if it doesn't have any (say because it's a simple logging function). Most monitors have counters into which they accumulate results until they are printed and reset. See Counters for more.
[function] MONITORMODELRESULTS FN DATASET MODEL MONITORS
Call
FN
with batches of instances fromDATASET
until it runs out (as inDOBATCHESFORMODEL
).FN
is supposed to applyMODEL
to the batch and return some kind of result (for neural networks, the result is the model state itself). ApplyMONITORS
to each batch and the result returned byFN
for that batch. Finally, return the list of counters ofMONITORS
.The purpose of this function is to collect various results and statistics (such as error measures) efficiently by applying the model only once, leaving extraction of quantities of interest from the model's results to
MONITORS
.See the model specific versions of this functions such as
MGLBP:MONITORBPNRESULTS
.
[genericfunction] MONITORS OBJECT
Return monitors associated with
OBJECT
. See various methods such as [MONITORS] for more documentation.
6.1 Monitors

A monitor that has another monitor called
MEASURER
embedded in it. When this monitor is applied, it applies the measurer and passes the returned values toADDTOCOUNTER
called on itsCOUNTER
slot. One may further specializeAPPLYMONITOR
to change that.This class is useful when the same event monitor is applied repeatedly over a period and its results must be aggregated such as when training statistics are being tracked or when predictions are begin made. Note that the monitor must be compatible with the event it handles. That is, the embedded
MEASURER
must be prepared to take the arguments that are documented to come with the event.
[reader] MEASURER MONITOR (:MEASURER)
This must be a monitor itself which only means that
APPLYMONITOR
is defined on it (but see Monitoring). The returned values are aggregated byCOUNTER
. See Measurers for a library of measurers.
[reader] COUNTER MONITOR (:COUNTER)
The
COUNTER
of a monitor carries out the aggregation of results returned byMEASURER
. The See Counters for a library of counters.
6.2 Measurers
MEASURER
is a part of MONITOR
objects, an embedded monitor that
computes a specific quantity (e.g. classification accuracy) from the
arguments of event it is applied to (e.g. the model results).
Measurers are often implemented by combining some kind of model
specific extractor with a generic measurer function.
All generic measurer functions return their results as multiple
values matching the arguments of ADDTOCOUNTER
for a counter of a
certain type (see Counters) so as to make them easily used in a
MONITOR
:
(multiplevaluecall #'addtocounter <somecounter>
<calltosomemeasurer>)
The counter class compatible with the measurer this way is noted for each function.
For a list of measurer functions see Classification Measurers.
6.3 Counters
[genericfunction] ADDTOCOUNTER COUNTER &REST ARGS
Add
ARGS
toCOUNTER
in some way. See specialized methods for type specific documentation. The kind of arguments to be supported is the what the measurer functions (see Measurers) intended to be paired with the counter return as multiple values.
[genericfunction] COUNTERVALUES COUNTER
Return any number of values representing the state of
COUNTER
. See specialized methods for type specific documentation.
[genericfunction] COUNTERRAWVALUES COUNTER
Return any number of values representing the state of
COUNTER
in such a way that passing the returned values as argumentsADDTOCOUNTER
on a fresh instance of the same type recreates the original state.
[genericfunction] RESETCOUNTER COUNTER
Restore state of
COUNTER
to what it was just after creation.
6.3.1 Attributes

This is a utility class that all counters subclass. The
ATTRIBUTES
plist can hold basically anything. Currently the attributes are only used when printing and they can be specified by the user. The monitor maker functions such as those in Classification Monitors also add attributes of their own to the counters they create.With the
:PREPENDATTRIBUTES
initarg when can easily add new attributes without clobbering the those in the:INITFORM
, (:TYPE
"rmse") in this case.(princ (makeinstance 'rmsecounter :prependattributes '(:event "pred." :dataset "test"))) ;; pred. test rmse: 0.000e+0 (0) => #<RMSECOUNTER pred. test rmse: 0.000e+0 (0)>
[accessor] ATTRIBUTES ATTRIBUTED (:ATTRIBUTES =
NIL
)A plist of attribute keys and values.
[method] NAME (ATTRIBUTED ATTRIBUTED)
Return a string assembled from the values of the
ATTRIBUTES
ofATTRIBUTED
. If there are multiple entries with the same key, then they are printed near together.Values may be padded according to an enclosing
WITHPADDEDATTRIBUTEPRINTING
.
[macro] WITHPADDEDATTRIBUTEPRINTING (ATTRIBUTEDS) &BODY BODY
Note the width of values for each attribute key which is the number of characters in the value's
PRINCTOSTRING
'ed representation. InBODY
, if attributes with they same key are printed they are forced to be at least this wide. This allows for nice, tablelike output:(let ((attributeds (list (makeinstance 'basiccounter :attributes '(:a 1 :b 23 :c 456)) (makeinstance 'basiccounter :attributes '(:a 123 :b 45 :c 6))))) (withpaddedattributeprinting (attributeds) (map nil (lambda (attributed) (format t "~A~%" attributed)) attributeds))) ;; 1 23 456: 0.000e+0 (0) ;; 123 45 6 : 0.000e+0 (0)
[function] LOGPADDED ATTRIBUTEDS
Log (see
LOGMSG
)ATTRIBUTEDS
nonescaped (as inPRINC
or ~A) with the output being as tablelike as possible.
6.3.2 Counter classes
In addition to the really basic ones here, also see Classification Counters.
[class] BASICCOUNTER ATTRIBUTED
A simple counter whose
ADDTOCOUNTER
takes two additional parameters: an increment to the internal sums of called theNUMERATOR
andDENOMINATOR
.COUNTERVALUES
returns two values:NUMERATOR
divided byDENOMINATOR
(or 0 ifDENOMINATOR
is 0) andDENOMINATOR
Here is an example the compute the mean of 5 things received in two batches:
(let ((counter (makeinstance 'basiccounter))) (addtocounter counter 6.5 3) (addtocounter counter 3.5 2) counter) => #<BASICCOUNTER 2.00000e+0 (5)>
[class] RMSECOUNTER BASICCOUNTER
A
BASICCOUNTER
with whose nominator accumulates the square of some statistics. It has the attribute:TYPE
"rmse".COUNTERVALUES
returns the square root of whatBASICCOUNTER
'sCOUNTERVALUES
would return.(let ((counter (makeinstance 'rmsecounter))) (addtocounter counter (+ (* 3 3) (* 4 4)) 2) counter) => #<RMSECOUNTER rmse: 3.53553e+0 (2)>
[class] CONCATCOUNTER ATTRIBUTED
A counter that simply concatenates sequences.
```cltranscript (let ((counter (makeinstance 'concatcounter))) (addtocounter counter '(1 2 3) #(4 5)) (addtocounter counter '(6 7)) (countervalues counter)) => (1 2 3 4 5 6 7) ````
[reader] CONCATENATIONTYPE CONCATCOUNTER (:CONCATENATIONTYPE = '
LIST
)A type designator suitable as the RESULTTYPE argument to
CONCATENATE
.
7 Classification
[in package MGLCORE]
To be able to measure classification related quantities, we need to define what the label of an instance is. Customization is possible by implementing a method for a specific type of instance, but these functions only ever appear as defaults that can be overridden.
[genericfunction] LABELINDEX INSTANCE
Return the label of
INSTANCE
as a nonnegative integer.
[genericfunction] LABELINDEXDISTRIBUTION INSTANCE
Return a one dimensional array of probabilities representing the distribution of labels. The probability of the label with
LABELINDEX
I
is element at indexI
of the returned arrray.
The following two functions are basically the same as the previous two, but in batch mode: they return a sequence of label indices or distributions. These are called on results produced by models. Implement these for a model and the monitor maker functions below will automatically work. See FIXDOC: for bpn and boltzmann.
[genericfunction] LABELINDICES RESULTS
Return a sequence of label indices for
RESULTS
produced by some model for a batch of instances. This is akin toLABELINDEX
.
[genericfunction] LABELINDEXDISTRIBUTIONS RESULT
Return a sequence of label index distributions for
RESULTS
produced by some model for a batch of instances. This is akin toLABELINDEXDISTRIBUTION
.
7.1 Classification Monitors
The following functions return a list monitors. The monitors are
for events of signature (INSTANCES
MODEL
) such as those produced by
MONITORMODELRESULTS
and its various model specific variations.
They are modelagnostic functions, extensible to new classifier
types.
[function] MAKECLASSIFICATIONACCURACYMONITORS MODEL &KEY OPERATIONMODE ATTRIBUTES (LABELINDEXFN #'
LABELINDEX
)Return a list of
MONITOR
objects associated withCLASSIFICATIONACCURACYCOUNTER
s.LABELINDEXFN
is a function likeLABELINDEX
. See that function for more.Implemented in terms of
MAKECLASSIFICATIONACCURACYMONITORS*
.
[function] MAKECROSSENTROPYMONITORS MODEL &KEY OPERATIONMODE ATTRIBUTES (LABELINDEXDISTRIBUTIONFN #'
LABELINDEXDISTRIBUTION
)Return a list of
MONITOR
objects associated withCROSSENTROPYCOUNTER
s.LABELINDEXDISTRIBUTIONFN
is a function likeLABELINDEXDISTRIBUTION
. See that function for more.Implemented in terms of
MAKECROSSENTROPYMONITORS*
.
[function] MAKELABELMONITORS MODEL &KEY OPERATIONMODE ATTRIBUTES (LABELINDEXFN #'
LABELINDEX
) (LABELINDEXDISTRIBUTIONFN #'LABELINDEXDISTRIBUTION
)Return classification accuracy and crossentropy monitors. See
MAKECLASSIFICATIONACCURACYMONITORS
andMAKECROSSENTROPYMONITORS
for a description of paramters.
The monitor makers above can be extended to support new classifier types via the following generic functions.
[genericfunction] MAKECLASSIFICATIONACCURACYMONITORS* MODEL OPERATIONMODE LABELINDEXFN ATTRIBUTES
Identical to
MAKECLASSIFICATIONACCURACYMONITORS
bar the keywords arguments. Specialize this to add to support for new model types. The default implementation also allows for some extensibility: ifLABELINDICES
is defined onMODEL
, then it will be used to extract label indices from model results.
[genericfunction] MAKECROSSENTROPYMONITORS* MODEL OPERATIONMODE LABELINDEXDISTRIBUTIONFN ATTRIBUTES
Identical to
MAKECROSSENTROPYMONITORS
bar the keywords arguments. Specialize this to add to support for new model types. The default implementation also allows for some extensibility: ifLABELINDEXDISTRIBUTIONS
is defined onMODEL
, then it will be used to extract label distributions from model results.
7.2 Classification Measurers
The functions here compare some known good solution (also known as ground truth or target) to a prediction or approximation and return some measure of their [dis][]similarity. They are model independent, hence one has to extract the ground truths and predictions first. Rarely used directly, they are mostly hidden behind Classification Monitors.
[function] MEASURECLASSIFICATIONACCURACY TRUTHS PREDICTIONS &KEY (TEST #'
EQL
) TRUTHKEY PREDICTIONKEY WEIGHTReturn the number of correct classifications and as the second value the number of instances (equal to length of
TRUTHS
in the nonweighted case).TRUTHS
(keyed byTRUTHKEY
) is a sequence of opaque class labels compared withTEST
to another sequence of classes labels inPREDICTIONS
(keyed byPREDICTIONKEY
). IfWEIGHT
is nonnil, then it is a function that returns the weight of an element ofTRUTHS
. Weighted cases add their weight to both counts (returned as the first and second values) instead of 1 as in the nonweighted case.Note how the returned values are suitable for
MULTIPLEVALUECALL
with #'ADDTOCOUNTER
and aCLASSIFICATIONACCURACYCOUNTER
.
[function] MEASURECROSSENTROPY TRUTHS PREDICTIONS &KEY TRUTHKEY PREDICTIONKEY (MINPREDICTIONPR 1.0d15)
Return the sum of the crossentropy between pairs of elements with the same index of
TRUTHS
andPREDICTIONS
.TRUTHKEY
is a function that's when applied to an element ofTRUTHS
returns a sequence representing some kind of discrete target distribution (P in the definition below).TRUTHKEY
may beNIL
which is equivalent to theIDENTITY
function.PREDICTIONKEY
is the same kind of key forPREDICTIONS
, but the sequence it returns represents a distribution that approximates (Q below) the true one.Crossentropy of the true and approximating distributions is defined as:
crossentropy(p,q) =  sum_i p(i) * log(q(i))
of which this function returns the sum over the pairs of elements of
TRUTHS
andPREDICTIONS
keyed byTRUTHKEY
andPREDICTIONKEY
.Due to the logarithm, if q(i) is close to zero, we run into numerical problems. To prevent this, all q(i) that are less than
MINPREDICTIONPR
are treated as if they wereMINPREDICTIONPR
.The second value returned is the sum of p(i) over all
TRUTHS
and allI
. This is normally equal to(LENGTH TRUTHS)
, since elements ofTRUTHS
represent a probability distribution, but this is not enforced which allows relative importance of elements to be controlled.The third value returned is a plist that maps each index occurring in the distribution sequences to a list of two elements:
sum_j p_j(i) * log(q_j(i))
and
sum_j p_j(i)
where
J
indexes intoTRUTHS
andPREDICTIONS
.(measurecrossentropy '((0 1 0)) '((0.1 0.7 0.2))) => 0.35667497 1 (2 (0.0 0) 1 (0.35667497 1) 0 (0.0 0))
Note how the returned values are suitable for
MULTIPLEVALUECALL
with #'ADDTOCOUNTER
and aCROSSENTROPYCOUNTER
.
[function] MEASUREROCAUC PREDICTIONS PRED &KEY (KEY #'
IDENTITY
) WEIGHTReturn the area under the ROC curve for
PREDICTIONS
representing predictions for a binary classification problem.PRED
is a predicate function for deciding whether a prediction belongs to the so called positive class.KEY
returns a number for each element which is the predictor's idea of how much that element is likely to belong to the class, although it's not necessarily a probability.If
WEIGHT
isNIL
, then all elements ofPREDICTIONS
count as 1 towards the unnormalized sum within AUC. ElseWEIGHT
must be a function likeKEY
, but it should return the importance (a positive real number) of elements. If the weight of an prediction is 2 then it's as if there were another identical copy of that prediction inPREDICTIONS
.The algorithm is based on algorithm 2 in the paper 'An introduction to ROC analysis' by Tom Fawcett.
ROC AUC is equal to the probability of a randomly chosen positive having higher
KEY
(score) than a randomly chosen negative element. With equal scores in mind, a more precise version is: AUC is the expectation of the above probability over all possible sequences sorted by scores.
[function] MEASURECONFUSION TRUTHS PREDICTIONS &KEY (TEST #'
EQL
) TRUTHKEY PREDICTIONKEY WEIGHTCreate a
CONFUSIONMATRIX
fromTRUTHS
andPREDICTIONS
.TRUTHS
(keyed byTRUTHKEY
) is a sequence of class labels compared withTEST
to another sequence of class labels inPREDICTIONS
(keyed byPREDICTIONKEY
). IfWEIGHT
is nonnil, then it is a function that returns the weight of an element ofTRUTHS
. Weighted cases add their weight to both counts (returned as the first and second values).Note how the returned confusion matrix can be added to another with
ADDTOCOUNTER
.
7.3 Classification Counters
[class] CLASSIFICATIONACCURACYCOUNTER BASICCOUNTER
A
BASICCOUNTER
with "acc." as its:TYPE
attribute and aPRINTOBJECT
method that prints percentages.
[class] CROSSENTROPYCOUNTER BASICCOUNTER
A
BASICCOUNTER
with "xent" as its:TYPE
attribute.
7.3.1 Confusion Matrices

A confusion matrix keeps count of classification results. The correct class is called
target' and the output of the classifier is called
prediction'.
[function] MAKECONFUSIONMATRIX &KEY (TEST #'
EQL
)Classes are compared with
TEST
.
[genericfunction] SORTCONFUSIONCLASSES MATRIX CLASSES
Return a list of
CLASSES
sorted for presentation purposes.
[genericfunction] CONFUSIONCLASSNAME MATRIX CLASS
Name of
CLASS
for presentation purposes.
 [genericfunction] CONFUSIONCOUNT MATRIX TARGET PREDICTION
[genericfunction] MAPCONFUSIONMATRIX FN MATRIX
Call
FN
withTARGET
,PREDICTION
,COUNT
paramaters for each cell in the confusion matrix. Cells with a zero count may be ommitted.
[genericfunction] CONFUSIONMATRIXCLASSES MATRIX
A list of all classes. The default is to collect classes from the counts. This can be overridden if, for instance, some classes are not present in the results.
[function] CONFUSIONMATRIXACCURACY MATRIX &KEY FILTER
Return the overall accuracy of the results in
MATRIX
. It's computed as the number of correctly classified cases (hits) divided by the name of cases. Return the number of hits and the number of cases as the second and third value. IfFILTER
function is given, then call it with the target and the prediction of the cell. Disregard cell for whichFILTER
returnsNIL
.Precision and recall can be easily computed by giving the right filter, although those are provided in separate convenience functions.
[function] CONFUSIONMATRIXPRECISION MATRIX PREDICTION
Return the accuracy over the cases when the classifier said
PREDICTION
.
[function] CONFUSIONMATRIXRECALL MATRIX TARGET
Return the accuracy over the cases when the correct class is
TARGET
.
[function] ADDCONFUSIONMATRIX MATRIX RESULTMATRIX
Add
MATRIX
intoRESULTMATRIX
.
8 Features
[in package MGLCORE]
8.1 Feature Selection
The following scoring functions all return an EQUAL
hash table
that maps features to scores.
[function] COUNTFEATURES DOCUMENTS MAPPER &KEY (KEY #'
IDENTITY
)Return scored features as an
EQUAL
hash table whose keys are features ofDOCUMENTS
and values are counts of occurrences of features.MAPPER
takes a function and a document and calls function with features of the document.(sort (alexandria:hashtablealist (countfeatures '(("hello" "world") ("this" "is" "our" "world")) (lambda (fn document) (map nil fn document)))) #'string< :key #'car) => (("hello" . 1) ("is" . 1) ("our" . 1) ("this" . 1) ("world" . 2))
[function] FEATURELLRS DOCUMENTS MAPPER CLASSFN &KEY (CLASSES (
ALLDOCUMENTCLASSES
DOCUMENTS
CLASSFN
))Return scored features as an
EQUAL
hash table whose keys are features ofDOCUMENTS
and values are their log likelihood ratios.MAPPER
takes a function and a document and calls function with features of the document.(sort (alexandria:hashtablealist (featurellrs '((:a "hello" "world") (:b "this" "is" "our" "world")) (lambda (fn document) (map nil fn (rest document))) #'first)) #'string< :key #'car) => (("hello" . 2.6032386) ("is" . 2.6032386) ("our" . 2.6032386) ("this" . 2.6032386) ("world" . 4.8428774e8))
[function] FEATUREDISAMBIGUITIES DOCUMENTS MAPPER CLASSFN &KEY (CLASSES (
ALLDOCUMENTCLASSES
DOCUMENTS
CLASSFN
))Return scored features as an
EQUAL
hash table whose keys are features ofDOCUMENTS
and values are their disambiguities.MAPPER
takes a function and a document and calls function with features of the document.From the paper 'Using Ambiguity Measure Feature Selection Algorithm for Support Vector Machine Classifier'.
8.2 Feature Encoding
Features can rarely be fed directly to algorithms as is, they need
to be transformed in some way. Suppose we have a simple language
model that takes a single word as input and predicts the next word.
However, both input and output is to be encoded as float vectors of
length 1000. What we do is find the top 1000 words by some
measure (see Feature Selection) and associate these words with
the integers in [0..999][] (this is ENCODE
ing). By using for
example onehot encoding, we
translate a word into a float vector when passing in the input. When
the model outputs the probability distribution of the next word, we
find the index of the max and find the word associated with it (this
is DECODE
ing)
[genericfunction] ENCODE ENCODER DECODED
Encode
DECODED
withENCODER
. This interface is generic enough to be almost meaningless. SeeENCODER/DECODER
for a simple,MGLNLP:BAGOFWORDSENCODER
for a slightly more involved example.If
ENCODER
is a function designator, then it's simplyFUNCALL
ed withDECODED
.
[genericfunction] DECODE DECODER ENCODED
Decode
ENCODED
withENCODER
. For anDECODER
/ENCODER
pair,(DECODE DECODER (ENCODE ENCODER OBJECT))
must be equal in some sense toOBJECT
.If
DECODER
is a function designator, then it's simplyFUNCALL
ed withENCODED
.

Implements O(1)
ENCODE
andDECODE
by having an internal decodedtoencoded and an encodedtodecodedEQUAL
hash table.ENCODER/DECODER
objects can be saved and loaded (see Persistence) as long as the elements in the hash tables have read/write consitency.(let ((indexer (makeindexer (alexandria:alisthashtable '(("I" . 3) ("me" . 2) ("mine" . 1))) 2))) (values (encode indexer "I") (encode indexer "me") (encode indexer "mine") (decode indexer 0) (decode indexer 1) (decode indexer 2))) => 0 => 1 => NIL => "I" => "me" => NIL
[function] MAKEINDEXER SCOREDFEATURES N &KEY (START 0) (CLASS '
ENCODER/DECODER
)Take the top
N
features fromSCOREDFEATURES
(see Feature Selection), assign indices to them starting fromSTART
. Return anENCODER/DECODER
(or anotherCLASS
) that converts between objects and indices.
Also see Bag of Words.
9 Gradient Based Optimization
[in package MGLOPT]
We have a real valued, differentiable function F and the task is to find the parameters that minimize its value. Optimization starts from a single point in the parameter space of F, and this single point is updated iteratively based on the gradient and value of F at or around the current point.
Note that while the stated problem is that of global optimization, for nonconvex functions, most algorithms will tend to converge to a local optimum.
Currently, there are two optimization algorithms: Gradient Descent (with several variants) and Conjugate Gradient both of which are first order methods (they do not need second order gradients) but more can be added with the Extension API.
[function] MINIMIZE OPTIMIZER GRADIENTSOURCE &KEY (WEIGHTS (
LISTSEGMENTS
GRADIENTSOURCE
)) (DATASET*INFINITELYEMPTYDATASET*
)Minimize the value of the real valued function represented by
GRADIENTSOURCE
by updating some of its parameters inWEIGHTS
(aMAT
or a sequence ofMAT
s). ReturnWEIGHTS
.DATASET
(see Datasets) is a set of unoptimized parameters of the same function. For example,WEIGHTS
may be the weights of a neural network whileDATASET
is the training set consisting of inputs suitable forSETINPUT
. The defaultDATASET
, (*INFINITELYEMPTYDATASET*
) is suitable for when all parameters are optimized, so there is nothing left to come from the environment.Optimization terminates if
DATASET
is a sampler and it runs out or when some other condition met (seeTERMINATION
, for example). IfDATASET
is aSEQUENCE
, then it is reused over and over again.Examples for various optimizers are provided in Gradient Descent and Conjugate Gradient.
9.1 Iterative Optimizer

An abstract base class of Gradient Descent and Conjugate Gradient based optimizers that iterate over instances until a termination condition is met.
[reader] NINSTANCES ITERATIVEOPTIMIZER (:NINSTANCES = 0)
The number of instances this optimizer has seen so far. Incremented automatically during optimization.
[accessor] TERMINATION ITERATIVEOPTIMIZER (:TERMINATION =
NIL
)If a number, it's the number of instances to train on in the sense of
NINSTANCES
. IfNINSTANCES
is equal or greater than this value optimization stops. IfTERMINATION
isNIL
, then optimization will continue. If it isT
, then optimization will stop. If it is a function of no arguments, then its return value is processed as if it was returned byTERMINATION
.
[accessor] ONOPTIMIZATIONSTARTED ITERATIVEOPTIMIZER (:ONOPTIMIZATIONSTARTED =
NIL
)An event hook with parameters
(OPTIMIZER GRADIENTSOURCE NINSTANCES)
. Called after initializations are performed (INITIALIZEOPTIMIZER, INITIALIZEGRADIENTSOURCE) but before optimization is started.
[accessor] ONOPTIMIZATIONFINISHED ITERATIVEOPTIMIZER (:ONOPTIMIZATIONFINISHED =
NIL
)An event hook with parameters
(OPTIMIZER GRADIENTSOURCE NINSTANCES)
. Called when optimization has finished.
[accessor] ONNINSTANCESCHANGED ITERATIVEOPTIMIZER (:ONNINSTANCESCHANGED =
NIL
)An event hook with parameters
(OPTIMIZER GRADIENTSOURCE NINSTANCES)
. Called when optimization of a batch of instances is done andNINSTANCES
is incremented.
Now let's discuss a few handy utilities.
[function] MONITOROPTIMIZATIONPERIODICALLY OPTIMIZER PERIODICFNS
For each periodic function in the list of
PERIODICFNS
, add a monitor toOPTIMIZER
'sONOPTIMIZATIONSTARTED
,ONOPTIMIZATIONFINISHED
andONNINSTANCESCHANGED
hooks. The monitors are simple functions that just call each periodic function with the event parameters (OPTIMIZER
GRADIENTSOURCE
NINSTANCES
). ReturnOPTIMIZER
.To log and reset the monitors of the gradient source after every 1000 instances seen by
OPTIMIZER
:(monitoroptimizationperiodically optimizer '((:fn logmytesterror :period 2000) (:fn resetoptimizationmonitors :period 1000 :lasteval 0)))
Note how we don't pass it's allowed to just pass the initargs for a
PERIODICFN
instead ofPERIODICFN
itself. The:LASTEVAL
0 bit preventsRESETOPTIMIZATIONMONITORS
from being called at the start of the optimization when the monitors are empty anyway.
[genericfunction] RESETOPTIMIZATIONMONITORS OPTIMIZER GRADIENTSOURCE
Report the state of
MONITORS
ofOPTIMIZER
andGRADIENTSOURCE
and reset their counters. SeeMONITOROPTIMIZATIONPERIODICALLY
for an example of how this is used.
[method] RESETOPTIMIZATIONMONITORS (OPTIMIZER ITERATIVEOPTIMIZER) GRADIENTSOURCE
Log the counters of the monitors of
OPTIMIZER
andGRADIENTSOURCE
and reset them.
[genericfunction] REPORTOPTIMIZATIONPARAMETERS OPTIMIZER GRADIENTSOURCE
A utility that's often called at the start of optimization (from
ONOPTIMIZATIONSTARTED
). The default implementation logs the description ofGRADIENTSOURCE
(as inDESCRIBE
) andOPTIMIZER
and callsLOGMATROOM
.
9.2 Cost Function
The function being minimized is often called the cost or the loss function.
[genericfunction] COST MODEL
Return the value of the cost function being minimized. Calling this only makes sense in the context of an ongoing optimization (see
MINIMIZE
). The cost is that of a batch of instances.
[function] MAKECOSTMONITORS MODEL &KEY OPERATIONMODE ATTRIBUTES
Return a list of
MONITOR
objects, each associated with oneBASICCOUNTER
with attribute:TYPE
"cost". Implemented in terms ofMAKECOSTMONITORS*
.
[genericfunction] MAKECOSTMONITORS* MODEL OPERATIONMODE ATTRIBUTES
Identical to
MAKECOSTMONITORS
bar the keywords arguments. Specialize this to add to support for new model types.
9.3 Gradient Descent
[in package MGLGD]
Gradient descent is a firstorder optimization algorithm. Relying completely on first derivatives, it does not even evaluate the function to be minimized. Let's see how to minimize a numerical lisp function with respect to some of its parameters.
(cl:defpackage :mglexamplesgd
(:use #:commonlisp #:mgl))
(inpackage :mglexamplesgd)
;;; Create an object representing the sine function.
(defparameter *difffn1*
(makeinstance 'mgldiffun:diffun
:fn #'sin
;; We are going to optimize its only parameter.
:weightindices '(0)))
;;; Minimize SIN. Note that there is no dataset involved because all
;;; parameters are being optimized.
(minimize (makeinstance 'sgdoptimizer :termination 1000)
*difffn1*
:weights (makemat 1))
;;; => A MAT with a single value of about pi/2.
;;; Create a differentiable function for f(x,y)=(xy)^2. X is a
;;; parameter whose values come from the DATASET argument passed to
;;; MINIMIZE. Y is a parameter to be optimized (a 'weight').
(defparameter *difffn2*
(makeinstance 'mgldiffun:diffun
:fn (lambda (x y)
(expt ( x y) 2))
:parameterindices '(0)
:weightindices '(1)))
;;; Find the Y that minimizes the distance from the instances
;;; generated by the sampler.
(minimize (makeinstance 'sgdoptimizer :batchsize 10)
*difffn2*
:weights (makemat 1)
:dataset (makeinstance 'functionsampler
:generator (lambda ()
(list (+ 10
(gaussianrandom1))))
:maxnsamples 1000))
;;; => A MAT with a single value of about 10, the expected value of
;;; the instances in the dataset.
;;; The dataset can be a SEQUENCE in which case we'd better set
;;; TERMINATION else optimization would never finish.
(minimize (makeinstance 'sgdoptimizer :termination 1000)
*difffn2*
:weights (makemat 1)
:dataset '((0) (1) (2) (3) (4) (5)))
;;; => A MAT with a single value of about 2.5.
We are going to see a number of accessors for optimizer paramaters.
In general, it's allowed to SETF
real slot accessors (as opposed to
readers and writers) at any time during optimization and so is
defining a method on an optimizer subclass that computes the value
in any way. For example, to decay the learning rate on a per
minibatch basis:
(defmethod learningrate ((optimizer mysgdoptimizer))
(* (slotvalue optimizer 'learningrate)
(expt 0.998
(/ (ninstances optimizer) 60000))))
9.3.1 Batch Based Optimizers
First let's see everything common to all batch based optimizers,
then discuss SGD Optimizer, Adam Optimizer and
Normalized Batch Optimizer. All batch based optimizers
are ITERATIVEOPTIMIZER
s, so see Iterative Optimizer
too.
[class] BATCHGDOPTIMIZER GDOPTIMIZER
Another abstract base class for gradient based optimizers tath updates all weights simultaneously after chewing through
BATCHSIZE
(0
1
2
) inputs. See subclassesSGDOPTIMIZER
,ADAMOPTIMIZER
andNORMALIZEDBATCHGDOPTIMIZER
.PERWEIGHTBATCHGDOPTIMIZER
may be a better choice when some weights can go unused for instance due to missing input values.
[accessor] BATCHSIZE GDOPTIMIZER (:BATCHSIZE = 1)
After having gone through
BATCHSIZE
number of inputs, weights are updated. WithBATCHSIZE
1, one gets Stochastics Gradient Descent. WithBATCHSIZE
equal to the number of instances in the dataset, one gets standard, 'batch' gradient descent. WithBATCHSIZE
between these two extremes, one gets the most practical 'minibatch' compromise.
[accessor] LEARNINGRATE GDOPTIMIZER (:LEARNINGRATE = 0.1)
This is the step size along the gradient. Decrease it if optimization diverges, increase it if it doesn't make progress.
[accessor] MOMENTUM GDOPTIMIZER (:MOMENTUM = 0)
A value in the [0, 1) interval.
MOMENTUM
times the previous weight change is added to the gradient. 0 means no momentum.
[reader] MOMENTUMTYPE GDOPTIMIZER (:MOMENTUMTYPE =
:NORMAL
)One of
:NORMAL
,:NESTEROV
or:NONE
. For pure optimization Nesterov's momentum may be better, but it may also increases chances of overfitting. Using:NONE
is equivalent to 0 momentum, but it also uses less memory. Note that with:NONE
,MOMENTUM
is ignored even it it is nonzero.
[accessor] WEIGHTDECAY GDOPTIMIZER (:WEIGHTDECAY = 0)
An L2 penalty. It discourages large weights, much like a zero mean gaussian prior.
WEIGHTDECAY
* WEIGHT is added to the gradient to penalize large weights. It's as if the function whose minimum is sought had WEIGHTDECAY*sum_i{0.5 * WEIGHT_i^2} added to it.
[accessor] WEIGHTPENALTY GDOPTIMIZER (:WEIGHTPENALTY = 0)
An L1 penalty. It encourages sparsity.
SIGN
(WEIGHT) *WEIGHTPENALTY
is added to the gradient pushing the weight towards negative infinity. It's as if the function whose minima is sought had WEIGHTPENALTY*sum_i{abs(WEIGHT_i)} added to it. Putting it on feature biases consitutes a sparsity constraint on the features.
[reader] USESEGMENTDERIVATIVESP GDOPTIMIZER (:USESEGMENTDERIVATIVESP =
NIL
)Save memory if both the gradient source (the model being optimized) and the optimizer support this feature. It works like this: the accumulator into which the gradient source is asked to place the derivatives of a segment will be
SEGMENTDERIVATIVES
of the segment. This allows the optimizer not to allocate an accumulator matrix into which the derivatives are summed.
[accessor] AFTERUPDATEHOOK GDOPTIMIZER (:AFTERUPDATEHOOK =
NIL
)A list of functions with no arguments called after each weight update.
[accessor] BEFOREUPDATEHOOK BATCHGDOPTIMIZER (:BEFOREUPDATEHOOK =
NIL
)A list of functions of no parameters. Each function is called just before a weight update takes place (after accumulated gradients have been divided the length of the batch). Convenient to hang some additional gradient accumulating code on.
SGD Optimizer
[class] SGDOPTIMIZER BATCHGDOPTIMIZER
With
BATCHSIZE
(0
1
2
) 1 this is Stochastic Gradient Descent. With higher batch sizes, one gets minibatch and Batch Gradient Descent.Assuming that
ACCUMULATOR
has the sum of gradients for a minibatch, the weight update looks like this:$$\Delta_w^{t+1} = momentum \Delta_w^t + \frac{accumulator}{batchsize} + l_2 w + l_1 sign(w)$$
$$w^{t+1} = w^{t}  learningrate \Delta_w$$
which is the same as the more traditional formulation:
$$\Delta_w^{t+1} = momentum * \Delta_w^{t} + learningrate \left(\frac{\frac{df}{dw}}{batchsize} + l_2 w + l_1 sign(w)\right)$$
$$w^{t+1} = w^{t}  \Delta_w$$
but the former works better when batch size, momentum or learning rate change during the course of optimization. The above is with normal momentum, Nesterov's momentum (see
MOMENTUMTYPE
) momentum is also available.See Batch Based Optimizers for the description of the various options common to all batch based optimizers.
Adam Optimizer
[class] ADAMOPTIMIZER BATCHGDOPTIMIZER
Adam is a firstorder stochasistic gradient descent optimizer. It maintains an internal estimation for the mean and raw variance of each derivative as exponential moving averages. The step it takes is basically
M/(sqrt(V)+E)
whereM
is the estimated mean,V
is the estimated variance, andE
is a small adjustment factor to prevent the gradient from blowing up. See version 5 of the paper for more.Note that using momentum is not supported with Adam. In fact, an error is signalled if it's not
:NONE
.See Batch Based Optimizers for the description of the various options common to all batch based optimizers.
[accessor] LEARNINGRATE ADAMOPTIMIZER (= 2.0e4)
Same thing as
LEARNINGRATE
but with the default suggested by the Adam paper.
[accessor] MEANDECAY ADAMOPTIMIZER (:MEANDECAY = 0.9)
A number between 0 and 1 that determines how fast the estimated mean of derivatives is updated. 0 basically gives you
RMSPROP
(ifVARIANCEDECAY
is not too large) or AdaGrad (ifVARIANCEDECAY
is close to 1 and the learning rate is annealed. This is $\beta_1$ in the paper.
[accessor] MEANDECAYDECAY ADAMOPTIMIZER (:MEANDECAYDECAY = ( 1 1.0d7))
A value that should be close to 1.
MEANDECAY
is multiplied by this value after each update. This is $\lambda$ in the paper.
[accessor] VARIANCEDECAY ADAMOPTIMIZER (:VARIANCEDECAY = 0.999)
A number between 0 and 1 that determines how fast the estimated variance of derivatives is updated. This is $\beta_2$ in the paper.
[accessor] VARIANCEADJUSTMENT ADAMOPTIMIZER (:VARIANCEADJUSTMENT = 1.0d7)
Within the bowels of adam, the estimated mean is divided by the square root of the estimated variance (per weight) which can lead to numerical problems if the denominator is near zero. To avoid this,
VARIANCEADJUSTMENT
, which should be a small positive number, is added to the denominator. This isepsilon
in the paper.
Normalized Batch Optimizer
[class] NORMALIZEDBATCHGDOPTIMIZER BATCHGDOPTIMIZER
Like
BATCHGDOPTIMIZER
but keeps count of how many times each weight was used in the batch and divides the accumulated gradient by this count instead of dividing byNINSTANCESINBATCH
. This only makes a difference if there are missing values in the learner that's being trained. The main feature that distuinguishes this class fromPERWEIGHTBATCHGDOPTIMIZER
is that batches end at same time for all weights.
[accessor] NWEIGHTUSESINBATCH NORMALIZEDBATCHGDOPTIMIZER
Number of uses of the weight in its current batch.
9.3.2 Segmented GD Optimizer
[class] SEGMENTEDGDOPTIMIZER BASEGDOPTIMIZER
An optimizer that delegates training of segments to other optimizers. Useful to delegate training of different segments to different optimizers (capable of working with segmentables) or simply to not train all segments.
[reader] SEGMENTER SEGMENTEDGDOPTIMIZER (:SEGMENTER)
When this optimizer is initialized it loops over the segment of the learner with
MAPSEGMENTS
.SEGMENTER
is a function that is called with each segment and returns an optimizer orNIL
. Several segments may be mapped to the same optimizer. After the segment>optimizer mappings are collected, each optimizer is initialized by INITIALIZEOPTIMIZER with the list of segments mapped to it.
SEGMENTEDGDOPTIMIZER
inherits from ITERATIVEOPTIMIZER
, so see
Iterative Optimizer too.
9.3.3 Perweight Optimization
[class] PERWEIGHTBATCHGDOPTIMIZER GDOPTIMIZER
This is much like Batch Based Optimizers but it is more clever about when to update weights. Basically every weight has its own batch independent from the batches of others. This has desirable properties. One can for example put two neural networks together without adding any connections between them and the learning will produce results equivalent to the separated case. Also, adding inputs with only missing values does not change anything.
Due to its very nonbatch nature, there is no CUDA implementation of this optimizer.
[accessor] NWEIGHTUSESINBATCH PERWEIGHTBATCHGDOPTIMIZER
Number of uses of the weight in its current batch.
9.3.4 Utilities
[function] CLIPL2NORM MATS L2UPPERBOUND &KEY CALLBACK
Scale
MATS
so that their $L_2$ norm does not exceedL2UPPERBOUND
.Compute the norm of of
MATS
as if they were a single vector. If the norm is greater thanL2UPPERBOUND
, then scale each matrix destructively by the norm divided byL2UPPERBOUND
and if nonNIL call the functionCALLBACK
with the scaling factor.
[function] ARRANGEFORCLIPPINGGRADIENTS BATCHGDOPTIMIZER L2UPPERBOUND &KEY CALLBACK
Make it so that the norm of the batch normalized gradients accumulated by
BATCHGDOPTIMIZER
is clipped toL2UPPERBOUND
before every update. SeeCLIPL2NORM
.
9.4 Conjugate Gradient
[in package MGLCG]
Conjugate gradient is a firstorder optimization algorithm. It's more advanced than gradient descent as it does line searches which unfortunately also makes it unsuitable for nondeterministic functions. Let's see how to minimize a numerical lisp function with respect to some of its parameters.
;;; Create an object representing the sine function.
(defparameter *difffn1*
(makeinstance 'mgldiffun:diffun
:fn #'sin
;; We are going to optimize its only parameter.
:weightindices '(0)))
;;; Minimize SIN. Note that there is no dataset involved because all
;;; parameters are being optimized.
(minimize (makeinstance 'cgoptimizer
:batchsize 1
:termination 1)
*difffn1*
:weights (makemat 1))
;;; => A MAT with a single value of about pi/2.
;;; Create a differentiable function for f(x,y)=(xy)^2. X is a
;;; parameter whose values come from the DATASET argument passed to
;;; MINIMIZE. Y is a parameter to be optimized (a 'weight').
(defparameter *difffn2*
(makeinstance 'mgldiffun:diffun
:fn (lambda (x y)
(expt ( x y) 2))
:parameterindices '(0)
:weightindices '(1)))
;;; Find the Y that minimizes the distance from the instances
;;; generated by the sampler.
(minimize (makeinstance 'cgoptimizer :batchsize 10)
*difffn2*
:weights (makemat 1)
:dataset (makeinstance 'functionsampler
:generator (lambda ()
(list (+ 10
(gaussianrandom1))))
:maxnsamples 1000))
;;; => A MAT with a single value of about 10, the expected value of
;;; the instances in the dataset.
;;; The dataset can be a SEQUENCE in which case we'd better set
;;; TERMINATION else optimization would never finish. Note how a
;;; single epoch suffices.
(minimize (makeinstance 'cgoptimizer :termination 6)
*difffn2*
:weights (makemat 1)
:dataset '((0) (1) (2) (3) (4) (5)))
;;; => A MAT with a single value of about 2.5.
[function] CG FN W &KEY (MAXNLINESEARCHES
*DEFAULTMAXNLINESEARCHES*
) (MAXNEVALUATIONSPERLINESEARCH*DEFAULTMAXNEVALUATIONSPERLINESEARCH*
) (MAXNEVALUATIONS*DEFAULTMAXNEVALUATIONS*
) (SIG*DEFAULTSIG*
) (RHO*DEFAULTRHO*
) (INT*DEFAULTINT*
) (EXT*DEFAULTEXT*
) (RATIO*DEFAULTRATIO*
) SPAREVECTORSCGOPTIMIZER
passes each batch of data to this function with itsCGARGS
passed on.Minimize a differentiable multivariate function with conjugate gradient. The PolakRibiere flavour of conjugate gradients is used to compute search directions, and a line search using quadratic and cubic polynomial approximations and the WolfePowell stopping criteria is used together with the slope ratio method for guessing initial step sizes. Additionally a bunch of checks are made to make sure that exploration is taking place and that extrapolation will not be unboundedly large.
FN
is a function of two parameters:WEIGHTS
(0
1
) andDERIVATIVES
.WEIGHTS
(0
1
) is aMAT
of the same size asW
that is where the search start from.DERIVATIVES
is also aMAT
of that size and it is whereFN
shall place the partial derivatives.FN
returns the value of the function that is being minimized.CG
performs a number of line searches and invokesFN
at each step. A line search invokesFN
at mostMAXNEVALUATIONSPERLINESEARCH
number of times and can succeed in improving the minimum by the sufficient margin or it can fail. Note, the even a failed line search may improve further and hence change the weights it's just that the improvement was deemed too small.CG
stops when either:two line searches fail in a row
MAXNLINESEARCHES
is reachedMAXNEVALUATIONS
is reached
CG
returns aMAT
that contains the best weights, the minimum, the number of line searches performed, the number of succesful line searches and the number of evaluations.When using
MAXNEVALUATIONS
remember that there is an extra evaluation ofFN
before the first line search.SPAREVECTORS
is a list of preallocatedMAT
s of the same size asW
. Passing 6 of them covers the current need of the algorithm and it will not cons up vectors of sizeW
at all.NOTE: If the function terminates within a few iterations, it could be an indication that the function values and derivatives are not consistent (ie, there may be a bug in the implementation of
FN
function).SIG
andRHO
are the constants controlling the WolfePowell conditions.SIG
is the maximum allowed absolute ratio between previous and new slopes (derivatives in the search direction), thus settingSIG
to low (positive) values forces higher precision in the linesearches.RHO
is the minimum allowed fraction of the expected (from the slope at the initial point in the linesearch). Constants must satisfy 0 <RHO
<SIG
< 1. Tuning ofSIG
(depending on the nature of the function to be optimized) may speed up the minimization; it is probably not worth playing much withRHO
.

Don't reevaluate within
INT
of the limit of the current bracket.

Extrapolate maximum
EXT
times the current stepsize.

SIG
andRHO
are the constants controlling the WolfePowell conditions.SIG
is the maximum allowed absolute ratio between previous and new slopes (derivatives in the search direction), thus settingSIG
to low (positive) values forces higher precision in the linesearches.
[variable] *DEFAULTRHO* 0.05
RHO
is the minimum allowed fraction of the expected (from the slope at the initial point in the linesearch). Constants must satisfy 0 <RHO
<SIG
< 1.

Maximum allowed slope ratio.
[class] CGOPTIMIZER ITERATIVEOPTIMIZER
Updates all weights simultaneously after chewing through
BATCHSIZE
(0
1
2
) inputs.
[accessor] BATCHSIZE CGOPTIMIZER (:BATCHSIZE)
After having gone through
BATCHSIZE
number of instances, weights are updated. Normally,CG
operates on all available data, but it may be useful to introduce some noise into the optimization to reduce overfitting by using smaller batch sizes. IfBATCHSIZE
is not set, it is initialized to the size of the dataset at the start of optimization.
 [accessor] CGARGS CGOPTIMIZER (:CGARGS = '
NIL
)
[accessor] ONCGBATCHDONE CGOPTIMIZER (:ONCGBATCHDONE =
NIL
)An event hook called when processing a conjugate gradient batch is done. The handlers on the hook are called with 8 arguments:
(optimizer gradientsource instances bestw bestf nlinesearches nsuccesfullinesearches nevaluations)
The latter 5 of which are the return values of the
CG
function.
[genericfunction] LOGCGBATCHDONE OPTIMIZER GRADIENTSOURCE INSTANCES BESTW BESTF NLINESEARCHES NSUCCESFULLINESEARCHES NEVALUATIONS
This is a function can be added to
ONCGBATCHDONE
. The default implementation simply logs the event arguments.
[reader] SEGMENTFILTER CGOPTIMIZER (:SEGMENTFILTER = (
CONSTANTLY
T
))A predicate function on segments that filters out uninteresting segments. Called from
INITIALIZEOPTIMIZER*
.
9.5 Extension API
9.5.1 Implementing Optimizers
The following generic functions must be specialized for new optimizer types.
[genericfunction] MINIMIZE* OPTIMIZER GRADIENTSOURCE WEIGHTS DATASET
Called by
MINIMIZE
afterINITIALIZEOPTIMIZER*
andINITIALIZEGRADIENTSOURCE*
, this generic function is the main extension point for writing optimizers.
[genericfunction] INITIALIZEOPTIMIZER* OPTIMIZER GRADIENTSOURCE WEIGHTS DATASET
Called automatically before training starts, this function sets up
OPTIMIZER
to be suitable for optimizingGRADIENTSOURCE
. It typically creates appropriately sized accumulators for the gradients.
[genericfunction] SEGMENTS OPTIMIZER
Several weight matrices known as segments can be optimized by a single optimizer. This function returns them as a list.
The rest are just useful for utilities for implementing optimizers.
[function] TERMINATEOPTIMIZATIONP NINSTANCES TERMINATION
Utility function for subclasses of
ITERATIVEOPTIMIZER
. It returns whether optimization is to be terminated based onNINSTANCES
andTERMINATION
that are values of the respective accessors ofITERATIVEOPTIMIZER
.
[function] SETNINSTANCES OPTIMIZER GRADIENTSOURCE NINSTANCES
Set
NINSTANCES
ofOPTIMIZER
and fireONNINSTANCESCHANGED
.ITERATIVEOPTIMIZER
subclasses must call this to incrementNINSTANCES
.

This is a utility class for optimizers that have a list of
SEGMENTS
and (the weights being optimized) is able to copy back and forth between those segments and a singleMAT
(the accumulator).
[macro] DOSEGMENTSET (SEGMENT &OPTIONAL START) SEGMENTSET &BODY BODY
Iterate over
SEGMENTS
inSEGMENTSET
. IfSTART
is specified, the it is bound to the start index ofSEGMENT
withinSEGMENTSET
. The start index is the sum of the sizes of previous segments.
[function] SEGMENTSET<MAT SEGMENTSET MAT
Copy the values of
MAT
to the weight matrices ofSEGMENTSET
as if they were concatenated into a singleMAT
.
[function] SEGMENTSET>MAT SEGMENTSET MAT
Copy the values of
SEGMENTSET
toMAT
as if they were concatenated into a singleMAT
.
9.5.2 Implementing Gradient Sources
Weights can be stored in a multitude of ways. Optimizers need to
update weights, so it is assumed that weights are stored in any
number of MAT
objects called segments.
The generic functions in this section must all be specialized for new gradient sources except where noted.
[genericfunction] MAPSEGMENTS FN GRADIENTSOURCE
Apply
FN
to each segment ofGRADIENTSOURCE
.
[genericfunction] MAPSEGMENTRUNS FN SEGMENT
Call
FN
with start and end of intervals of consecutive indices that are not missing inSEGMENT
. Called by optimizers that support partial updates. The default implementation assumes that all weights are present. This only needs to be specialized if one plans to use an optimizer that knows how to deal unused/missing weights such asMGLGD:NORMALIZEDBATCHGDOPTIMIZER
andOPTIMIZER
MGLGD:PERWEIGHTBATCHGDOPTIMIZER
.
[genericfunction] SEGMENTWEIGHTS SEGMENT
Return the weight matrix of
SEGMENT
. A segment doesn't need to be aMAT
object itself. For example, it may be aMGLBM:CHUNK
of a [MGLBM:BM] or aMGLBP:LUMP
of aMGLBP:BPN
whoseNODES
slot holds the weights.
[method] SEGMENTWEIGHTS (MAT MAT)
When the segment is really a
MAT
, then just return it.
[genericfunction] SEGMENTDERIVATIVES SEGMENT
Return the derivatives matrix of
SEGMENT
. A segment doesn't need to be aMAT
object itself. For example, it may be aMGLBM:CHUNK
of a [MGLBM:BM] or aMGLBP:LUMP
of aMGLBP:BPN
whose DERIVATIVES slot holds the gradient.
[function] LISTSEGMENTS GRADIENTSOURCE
A utility function that returns the list of segments from
MAPSEGMENTS
onGRADIENTSOURCE
.
[genericfunction] INITIALIZEGRADIENTSOURCE* OPTIMIZER GRADIENTSOURCE WEIGHTS DATASET
Called automatically before
MINIMIZE*
is called, this function may be specialized ifGRADIENTSOURCE
needs some kind of setup.
[method] INITIALIZEGRADIENTSOURCE* OPTIMIZER GRADIENTSOURCE WEIGHTS DATASET
The default method does nothing.
[genericfunction] ACCUMULATEGRADIENTS* GRADIENTSOURCE SINK BATCH MULTIPLIER VALUEP
Add
MULTIPLIER
times the sum of firstorder gradients to accumulators ofSINK
(normally accessed withDOGRADIENTSINK
) and ifVALUEP
, return the sum of values of the function being optimized for aBATCH
of instances.GRADIENTSOURCE
is the object representing the function being optimized,SINK
is gradient sink.Note the number of instances in
BATCH
may be larger than whatGRADIENTSOURCE
process in one go (in the sense of say,MAXNSTRIPES
), soDOBATCHESFORMODEL
or something like (GROUP
BATCH
MAXNSTRIPES
) can be handy.
9.5.3 Implementing Gradient Sinks
Optimizers call ACCUMULATEGRADIENTS*
on gradient sources. One
parameter of ACCUMULATEGRADIENTS*
is the SINK
. A gradient sink
knows what accumulator matrix (if any) belongs to a segment. Sinks
are defined entirely by MAPGRADIENTSINK
.
[genericfunction] MAPGRADIENTSINK FN SINK
Call
FN
of lambda list (SEGMENT
ACCUMULATOR
) on each segment and their corresponding accumulatorMAT
inSINK
.
[macro] DOGRADIENTSINK ((SEGMENT ACCUMULATOR) SINK) &BODY BODY
A convenience macro on top of
MAPGRADIENTSINK
.
10 Differentiable Functions
[in package MGLDIFFUN]

DIFFUN
dresses a lisp function (in itsFN
slot) as a gradient source (see Implementing Gradient Sources) which allows it to be used inMINIMIZE
. See the examples in Gradient Descent and Conjugate Gradient.
[reader] PARAMETERINDICES DIFFUN (:PARAMETERINDICES =
NIL
)The list of indices of parameters that we don't optimize. Values for these will come from the DATASET argument of
MINIMIZE
.
[reader] WEIGHTINDICES DIFFUN (:WEIGHTINDICES =
NIL
)The list of indices of parameters to be optimized, the values of which will come from the [WEIGHTS] argument of
MINIMIZE
.
11 Backpropagation Neural Networks
[in package MGLBP]
11.1 Backprop Overview
Backpropagation Neural Networks are just functions with lots of
parameters called weights and a layered structure when presented
as a computational
graph. The
network is trained to MINIMIZE
some kind of loss function whose
value the network computes.
In this implementation, a BPN
is assembled from several
LUMP
s (roughly corresponding to layers). Both feedforward and
recurrent neural nets are supported (FNN
and RNN
, respectively).
BPN
s can contain not only LUMP
s but other BPN
s, too. As we
see, networks are composite objects and the abstract base class for
composite and simple parts is called CLUMP
.

A
CLUMP
is aLUMP
or aBPN
. It represents a differentiable function. Arguments of clumps are given during instantiation. Some arguments are clumps themselves so they get permenantly wired together like this:(>v*m (>input :size 10 :name 'input) (>weight :dimensions '(10 20) :name 'weight) :name 'activation)
The above creates three clumps: the vectormatrix multiplication clumps called
ACTIVATION
which has a reference to its operands:INPUT
andWEIGHT
. Note that the example just defines a function, no actual computation has taken place, yet.This wiring of
CLUMP
s is how one builds feedforward nets (FNN
) or recurrent neural networks (RNN
) that areCLUMP
s themselves so one can build nets in a hiearchical style if desired. NoncompositeCLUMP
s are calledLUMP
(note the loss ofC
that stands for composite). The variousLUMP
subtypes correspond to different layer types (>SIGMOID
,>DROPOUT
,>RELU
,>TANH
, etc).
At this point, you may want to jump ahead to get a feel for how things work by reading the FNN Tutorial.
11.2 Clump API
These are mostly for extension purposes. About the only thing
needed from here for normal operation is NODES
when clamping inputs
or extracting predictions.
[genericfunction] STRIPEDP CLUMP
For efficiency, forward and backprop phases do their stuff in batch mode: passing a number of instances through the network in batches. Thus clumps must be able to store values of and gradients for each of these instances. However, some clumps produce the same result for each instance in a batch. These clumps are the weights, the parameters of the network.
STRIPEDP
returns true iffCLUMP
does not represent weights (i.e. it's not a>WEIGHT
).For striped clumps, their
NODES
andDERIVATIVES
areMAT
objects with a leading dimension (number of rows in the 2d case) equal to the number of instances in the batch. Nonstriped clumps have no restriction on their shape apart from what their usage dictates.
[genericfunction] NODES OBJECT
Returns a
MAT
object representing the state or result ofOBJECT
. The first dimension of the returned matrix is equal to the number of stripes.
CLUMP
s' NODES
holds the result computed by the most recent
FORWARD
. For >INPUT
lumps, this is where input values shall be
placed (see SETINPUT
). Currently, the matrix is always two
dimensional but this restriction may go away in the future.
[genericfunction] DERIVATIVES CLUMP
Return the
MAT
object representing the partial derivatives of the functionCLUMP
computes. The returned partial derivatives were accumulated by previousBACKWARD
calls.This matrix is shaped like the matrix returned by
NODES
.
[genericfunction] FORWARD CLUMP
Compute the values of the function represented by
CLUMP
for all stripes and place the results intoNODES
ofCLUMP
.
[genericfunction] BACKWARD CLUMP
Compute the partial derivatives of the function represented by
CLUMP
and add them toDERIVATIVES
of the corresponding argument clumps. TheDERIVATIVES
ofCLUMP
contains the sum of partial derivatives of all clumps by the corresponding output. This function is intended to be called after aFORWARD
pass.Take the
>SIGMOID
clump for example when the network is being applied to a batch of two instancesx1
andx2
.x1
andx2
are set in the>INPUT
lump X. The sigmoid computes1/(1+exp(x))
whereX
is its only argument clump.f(x) = 1/(1+exp(x))
When
BACKWARD
is called on the sigmoid lump, itsDERIVATIVES
is a 2x1MAT
object that contains the partial derivatives of the loss function:dL(x1)/df dL(x2)/df
Now the
BACKWARD
method of the sigmoid needs to adddL(x1)/dx1
anddL(x2)/dx2
toDERIVATIVES
ofX
. Now,dL(x1)/dx1 = dL(x1)/df * df(x1)/dx1
and the first term is what we have inDERIVATIVES
of the sigmoid so it only needs to calculate the second term.
In addition to the above, clumps also have to support SIZE
(0
1
),
NSTRIPES
, MAXNSTRIPES
(and the SETF
methods of the latter two)
which can be accomplished just by inheriting from BPN
, FNN
, RNN
, or
a LUMP
.
11.3 BPNs
[reader] NSTRIPES BPN (:NSTRIPES = 1)
The current number of instances the network has. This is automatically set to the number of instances passed to
SETINPUT
, so it rarely has to be manipulated directly although it can be set. When setNSTRIPES
of allCLUMPS
get set to the same value.
[reader] MAXNSTRIPES BPN (:MAXNSTRIPES =
NIL
)The maximum number of instances the network can operate on in parallel. Within
BUILDFNN
orBUILDRNN
, it defaults toMAXNSTRIPES
of that parent network, else it defaults to 1. When setMAXNSTRIPES
of allCLUMPS
get set to the same value.
[reader] CLUMPS BPN (:CLUMPS = (
MAKEARRAY
0:ELEMENTTYPE
'CLUMP
:ADJUSTABLE
T
:FILLPOINTER
T
))A topological sorted adjustable array with a fill pointer that holds the clumps that make up the network. Clumps are added to it by
ADDCLUMP
or, more often, automatically when within aBUILDFNN
orBUILDRNN
. Rarely needed,FINDCLUMP
takes care of most uses.
[function] FINDCLUMP NAME BPN &KEY (ERRORP
T
)Find the clump with
NAME
amongCLUMPS
ofBPN
. As always, names are compared withEQUAL
. If not found, then returnNIL
or signal and error depending onERRORP
.
[function] ADDCLUMP CLUMP BPN
Add
CLUMP
toBPN
.MAXNSTRIPES
ofCLUMP
gets set to that ofBPN
. It is an error to add a clump with a name already used by one of theCLUMPS
ofBPN
.
11.3.1 Training
BPN
s are trained to minimize the loss function they compute.
Before a BPN
is passed to MINIMIZE
(as its GRADIENTSOURCE
argument), it must be wrapped in a BPLEARNER
object. BPLEARNER
has
[MONITORS] slot which is used for example by
RESETOPTIMIZATIONMONITORS
.
Without the bells an whistles, the basic shape of training is this:
(minimize optimizer (makeinstance 'bplearner :bpn bpn)
:dataset dataset)
[reader] BPN BPLEARNER (:BPN)
The
BPN
for which thisBPLEARNER
provides the gradients.
11.3.2 Monitoring
[function] MONITORBPNRESULTS DATASET BPN MONITORS
For every batch (of size
MAXNSTRIPES
ofBPN
) of instances inDATASET
, set the batch as the next input withSETINPUT
, perform aFORWARD
pass and applyMONITORS
to theBPN
(withAPPLYMONITORS
). Finally, return the counters ofMONITORS
. This is built on top ofMONITORMODELRESULTS
.
[function] MAKESTEPMONITORMONITORS RNN &KEY (COUNTERVALUESFN #'
COUNTERRAWVALUES
) (MAKECOUNTER #'MAKESTEPMONITORMONITORCOUNTER
)Return a list of monitors, one for every monitor in
STEPMONITORS
ofRNN
. These monitors extract the results from their warp counterpairs withCOUNTERVALUESFN
and add them to their own counter that's created byMAKECOUNTER
. Wow. Ew. The idea is that one does something like this do monitor warped prediction:(let ((*warptime* t)) (setf (stepmonitors rnn) (makecostmonitors rnn :attributes '(:event "warped pred."))) (monitorbpnresults dataset rnn ;; Just collect and reset the warp ;; monitors after each batch of ;; instances. (makestepmonitormonitors rnn)))
[genericfunction] MAKESTEPMONITORMONITORCOUNTER STEPCOUNTER
In an
RNN
,STEPCOUNTER
aggregates results of all the time steps during the processing of instances in the current batch. Return a new counter into which results fromSTEPCOUNTER
can be accumulated when the processing of the batch is finished. The default implementation creates a copy ofSTEPCOUNTER
.
11.3.3 FeedForward Nets
FNN
and RNN
have a lot in common (see their common superclass, BPN
).
There is very limited functionality that's specific to FNN
s so let's
get them out of they way before we study a full example.
[macro] BUILDFNN (&KEY FNN (CLASS ''
FNN
) INITARGS MAXNSTRIPES NAME) &BODY CLUMPSSyntactic sugar to assemble
FNN
s fromCLUMPs
. LikeLET*
, it is a sequence of bindings (of symbols toCLUMPs
). The names of the clumps created default to the symbol of the binding. In case a clump is not bound to a symbol (because it was created in a nested expression), the local functionCLUMP
can be used to find the clump with the given name in the fnn being built. Example:(buildfnn () (features (>input :size nfeatures)) (biases (>weight :size nfeatures)) (weights (>weight :size (* nhiddens nfeatures))) (activations0 (>v*m :weights weights :x (clump 'features))) (activations (>+ :args (list biases activations0))) (output (>sigmoid :x activations)))
FNN Tutorial
Hopefully this example from example/digitfnn.lisp
illustrates
the concepts involved. If it's too dense despite the comments, then
read up on Datasets, Gradient Based Optimization and come back.
(cl:defpackage :mglexampledigitfnn
(:use #:commonlisp #:mgl))
(inpackage :mglexampledigitfnn)
;;; There are 10 possible digits used as inputs ...
(defparameter *ninputs* 10)
;;; and we want to learn the rule that maps the input digit D to (MOD
;;; (1+ D) 3).
(defparameter *noutputs* 3)
;;; We define a feedforward net to be able to specialize how inputs
;;; are translated by adding a SETINPUT method later.
(defclass digitfnn (fnn)
())
;;; Build a DIGITFNN with a single hidden layer of rectified linear
;;; units and a softmax output.
(defun makedigitfnn (&key (nhiddens 5))
(buildfnn (:class 'digitfnn)
(input (>input :size *ninputs*))
(hiddenactivation (>activation input :size nhiddens))
(hidden (>relu hiddenactivation))
(outputactivation (>activation hidden :size *noutputs*))
(output (>softmaxxeloss outputactivation))))
;;; This method is called with batches of 'instances' (input digits in
;;; this case) by MINIMIZE and also by MONITORBPNRESULTS before
;;; performing a forward pass (i.e. computing the value of the
;;; function represented by the network). Its job is to encode the
;;; inputs by populating rows of the NODES matrix of the INPUT clump.
;;;
;;; Each input is encoded as a row of zeros with a single 1 at index
;;; determined by the input digit. This is called onehot encoding.
;;; The TARGET could be encoded the same way, but instead we use the
;;; sparse option supported by TARGET of >SOFTMAXXELOSS.
(defmethod setinput (digits (fnn digitfnn))
(let* ((input (nodes (findclump 'input fnn)))
(outputlump (findclump 'output fnn)))
(fill! 0 input)
(loop for i upfrom 0
for digit in digits
do (setf (mref input i digit) 1))
(setf (target outputlump)
(mapcar (lambda (digit)
(mod (1+ digit) *noutputs*))
digits))))
;;; Train the network by minimizing the loss (crossentropy here) with
;;; stochastic gradient descent.
(defun traindigitfnn ()
(let ((optimizer
;; First create the optimizer for MINIMIZE.
(makeinstance 'segmentedgdoptimizer
:segmenter
;; We train each weight lump with the same
;; parameters and, in fact, the same
;; optimizer. But it need not be so, in
;; general.
(constantly
(makeinstance 'sgdoptimizer
:learningrate 1
:momentum 0.9
:batchsize 100))))
(fnn (makedigitfnn)))
;; The number of instances the FNN can work with in parallel. It's
;; usually equal to the batch size or is a its divisor.
(setf (maxnstripes fnn) 50)
;; Initialize all weights randomly.
(mapsegments (lambda (weights)
(gaussianrandom! (nodes weights) :stddev 0.01))
fnn)
;; Arrange for training and test error to be logged.
(monitoroptimizationperiodically
optimizer '((:fn logtesterror :period 10000)
(:fn resetoptimizationmonitors :period 1000)))
;; Finally, start the optimization.
(minimize optimizer
;; Dress FNN in a BPLEARNER and attach monitors for the
;; cost to it. These monitors are going to be logged and
;; reset after every 100 training instance by
;; RESETOPTIMIZATIONMONITORS above.
(makeinstance 'bplearner
:bpn fnn
:monitors (makecostmonitors
fnn :attributes `(:event "train")))
;; Training stops when the sampler runs out (after 10000
;; instances).
:dataset (makesampler 10000))))
;;; Return a sampler object that produces MAXNSAMPLES number of
;;; random inputs (numbers between 0 and 9).
(defun makesampler (maxnsamples)
(makeinstance 'functionsampler :maxnsamples maxnsamples
:generator (lambda () (random *ninputs*))))
;;; Log the test error. Also, describe the optimizer and the bpn at
;;; the beginning of training. Called periodically during training
;;; (see above).
(defun logtesterror (optimizer learner)
(when (zerop (ninstances optimizer))
(describe optimizer)
(describe (bpn learner)))
(logpadded
(monitorbpnresults (makesampler 1000) (bpn learner)
(makecostmonitors
(bpn learner) :attributes `(:event "pred.")))))
#
;;; Transcript follows:
(repeatably ()
(let ((*logtime* nil))
(traindigitfnn)))
.. training at ninstances: 0
.. train cost: 0.000e+0 (0)
.. #<SEGMENTEDGDOPTIMIZER {100E112E93}>
.. SEGMENTEDGDOPTIMIZER description:
.. NINSTANCES = 0
.. OPTIMIZERS = (#<SGDOPTIMIZER
.. #<SEGMENTSET
.. (#<>WEIGHT # :SIZE 15 1/1 :NORM 0.04473>
.. #<>WEIGHT # :SIZE 3 1/1 :NORM 0.01850>
.. #<>WEIGHT # :SIZE 50 1/1 :NORM 0.07159>
.. #<>WEIGHT # :SIZE 5 1/1 :NORM 0.03056>)
.. {100E335B73}>
.. {100E06DF83}>)
.. SEGMENTS = (#<>WEIGHT (HIDDEN OUTPUTACTIVATION) :SIZE
.. 15 1/1 :NORM 0.04473>
.. #<>WEIGHT (:BIAS OUTPUTACTIVATION) :SIZE
.. 3 1/1 :NORM 0.01850>
.. #<>WEIGHT (INPUT HIDDENACTIVATION) :SIZE
.. 50 1/1 :NORM 0.07159>
.. #<>WEIGHT (:BIAS HIDDENACTIVATION) :SIZE
.. 5 1/1 :NORM 0.03056>)
..
.. #<SGDOPTIMIZER {100E06DF83}>
.. GDOPTIMIZER description:
.. NINSTANCES = 0
.. SEGMENTSET = #<SEGMENTSET
.. (#<>WEIGHT (HIDDEN OUTPUTACTIVATION) :SIZE
.. 15 1/1 :NORM 0.04473>
.. #<>WEIGHT (:BIAS OUTPUTACTIVATION) :SIZE
.. 3 1/1 :NORM 0.01850>
.. #<>WEIGHT (INPUT HIDDENACTIVATION) :SIZE
.. 50 1/1 :NORM 0.07159>
.. #<>WEIGHT (:BIAS HIDDENACTIVATION) :SIZE
.. 5 1/1 :NORM 0.03056>)
.. {100E335B73}>
.. LEARNINGRATE = 1.00000e+0
.. MOMENTUM = 9.00000e1
.. MOMENTUMTYPE = :NORMAL
.. WEIGHTDECAY = 0.00000e+0
.. WEIGHTPENALTY = 0.00000e+0
.. NAFTERUPATEHOOK = 0
.. BATCHSIZE = 100
..
.. BATCHGDOPTIMIZER description:
.. NBEFOREUPATEHOOK = 0
.. #<DIGITFNN {100E11A423}>
.. BPN description:
.. CLUMPS = #(#<>INPUT INPUT :SIZE 10 1/50 :NORM 0.00000>
.. #<>ACTIVATION
.. (HIDDENACTIVATION :ACTIVATION) :STRIPES 1/50
.. :CLUMPS 4>
.. #<>RELU HIDDEN :SIZE 5 1/50 :NORM 0.00000>
.. #<>ACTIVATION
.. (OUTPUTACTIVATION :ACTIVATION) :STRIPES 1/50
.. :CLUMPS 4>
.. #<>SOFTMAXXELOSS OUTPUT :SIZE 3 1/50 :NORM 0.00000>)
.. NSTRIPES = 1
.. MAXNSTRIPES = 50
.. pred. cost: 1.100d+0 (1000.00)
.. training at ninstances: 1000
.. train cost: 1.093d+0 (1000.00)
.. training at ninstances: 2000
.. train cost: 5.886d1 (1000.00)
.. training at ninstances: 3000
.. train cost: 3.574d3 (1000.00)
.. training at ninstances: 4000
.. train cost: 1.601d7 (1000.00)
.. training at ninstances: 5000
.. train cost: 1.973d9 (1000.00)
.. training at ninstances: 6000
.. train cost: 4.882d10 (1000.00)
.. training at ninstances: 7000
.. train cost: 2.771d10 (1000.00)
.. training at ninstances: 8000
.. train cost: 2.283d10 (1000.00)
.. training at ninstances: 9000
.. train cost: 2.123d10 (1000.00)
.. training at ninstances: 10000
.. train cost: 2.263d10 (1000.00)
.. pred. cost: 2.210d10 (1000.00)
..
==> (#<>WEIGHT (:BIAS HIDDENACTIVATION) :SIZE 5 1/1 :NORM 2.94294>
> #<>WEIGHT (INPUT HIDDENACTIVATION) :SIZE 50 1/1 :NORM 11.48995>
> #<>WEIGHT (:BIAS OUTPUTACTIVATION) :SIZE 3 1/1 :NORM 3.39103>
> #<>WEIGHT (HIDDEN OUTPUTACTIVATION) :SIZE 15 1/1 :NORM 11.39339>)
#
11.3.4 Recurrent Neural Nets
RNN Tutorial
Hopefully this example from example/sumsignfnn.lisp
illustrates
the concepts involved. Make sure you are comfortable with
FNN Tutorial before reading this.
(cl:defpackage :mglexamplesumsignrnn
(:use #:commonlisp #:mgl))
(inpackage :mglexamplesumsignrnn)
;;; There is a single input at each time step...
(defparameter *ninputs* 1)
;;; and we want to learn the rule that outputs the sign of the sum of
;;; inputs so far in the sequence.
(defparameter *noutputs* 3)
;;; Generate a training example that's a sequence of random length
;;; between 1 and LENGTH. Elements of the sequence are lists of two
;;; elements:
;;;
;;; 1. The input for the network (a single random number).
;;;
;;; 2. The sign of the sum of inputs so far encoded as 0, 1, 2 (for
;;; negative, zero and positive values). To add a twist, the sum is
;;; reset whenever a negative input is seen.
(defun makesumsigninstance (&key (length 10))
(let ((length (max 1 (random length)))
(sum 0))
(loop for i below length
collect (let ((x (1 (* 2 (random 2)))))
(incf sum x)
(when (< x 0)
(setq sum x))
(list x (cond ((minusp sum) 0)
((zerop sum) 1)
(t 2)))))))
;;; Build an RNN with a single lstm hidden layer and softmax output.
;;; For each time step, a SUMSIGNFNN will be instantiated.
(defun makesumsignrnn (&key (nhiddens 1))
(buildrnn ()
(buildfnn (:class 'sumsignfnn)
(input (>input :size 1))
(h (>lstm input :name 'h :size nhiddens))
(prediction (>softmaxxeloss (>activation h :name 'prediction
:size *noutputs*))))))
;;; We define this class to be able to specialize how inputs are
;;; translated by adding a SETINPUT method later.
(defclass sumsignfnn (fnn)
())
;;; We have a batch of instances from MAKESUMSIGNINSTANCE for the
;;; RNN. This function is invoked with elements of these instances
;;; belonging to the same time step (i.e. at the same index) and sets
;;; the input and target up.
(defmethod setinput (instances (fnn sumsignfnn))
(let ((inputnodes (nodes (findclump 'input fnn))))
(setf (target (findclump 'prediction fnn))
(loop for stripe upfrom 0
for instance in instances
collect
;; Sequences in the batch are not of equal length. The
;; RNN sends a NIL our way if a sequence has run out.
(when instance
(destructuringbind (input target) instance
(setf (mref inputnodes stripe 0) input)
target))))))
;;; Train the network by minimizing the loss (crossentropy here) with
;;; the Adam optimizer.
(defun trainsumsignrnn ()
(let ((rnn (makesumsignrnn)))
(setf (maxnstripes rnn) 50)
;; Initialize the weights in the usual sqrt(1 / fanin) style.
(mapsegments (lambda (weights)
(let* ((fanin (matdimension (nodes weights) 0))
(limit (sqrt (/ 6 fanin))))
(uniformrandom! (nodes weights)
:limit (* 2 limit))
(.+! ( limit) (nodes weights))))
rnn)
(minimize (monitoroptimizationperiodically
(makeinstance 'adamoptimizer
:learningrate 0.2
:meandecay 0.9
:meandecaydecay 0.9
:variancedecay 0.9
:batchsize 100)
'((:fn logtesterror :period 30000)
(:fn resetoptimizationmonitors :period 3000)))
(makeinstance 'bplearner
:bpn rnn
:monitors (makecostmonitors rnn))
:dataset (makesampler 30000))))
;;; Return a sampler object that produces MAXNSAMPLES number of
;;; random inputs.
(defun makesampler (maxnsamples &key (length 10))
(makeinstance 'functionsampler :maxnsamples maxnsamples
:generator (lambda ()
(makesumsigninstance :length length))))
;;; Log the test error. Also, describe the optimizer and the bpn at
;;; the beginning of training. Called periodically during training
;;; (see above).
(defun logtesterror (optimizer learner)
(when (zerop (ninstances optimizer))
(describe optimizer)
(describe (bpn learner)))
(let ((rnn (bpn learner)))
(logpadded
(append
(monitorbpnresults (makesampler 1000) rnn
(makecostmonitors
rnn :attributes '(:event "pred.")))
;; Same result in a different way: monitor predictions for
;; sequences up to length 20, but don't unfold the RNN
;; unnecessarily to save memory.
(let ((*warptime* t))
(monitorbpnresults (makesampler 1000 :length 20) rnn
;; Just collect and reset the warp
;; monitors after each batch of
;; instances.
(makecostmonitors
rnn :attributes '(:event "warped pred."))))))
;; Verify that no further unfoldings took place.
(assert (<= (length (clumps rnn)) 10)))
(logmatroom))
#
;;; Transcript follows:
(let (;; Backprop nets do not need double float. Using single floats
;; is faster and needs less memory.
(*defaultmatctype* :float)
;; Enable moving data in and out of GPU memory so that the RNN
;; can work with sequences so long that the unfolded network
;; wouldn't otherwise fit in the GPU.
(*cudawindowstarttime* 1)
(*logtime* nil))
;; Seed the random number generators.
(repeatably ()
;; Enable CUDA if available.
(withcuda* ()
(trainsumsignrnn))))
.. training at ninstances: 0
.. cost: 0.000e+0 (0)
.. #<ADAMOPTIMIZER {1006CD5663}>
.. GDOPTIMIZER description:
.. NINSTANCES = 0
.. SEGMENTSET = #<SEGMENTSET
.. (#<>WEIGHT (H #) :SIZE 1 1/1 :NORM 1.73685>
.. #<>WEIGHT (H #) :SIZE 1 1/1 :NORM 0.31893>
.. #<>WEIGHT (#1=# #2=# :PEEPHOLE) :SIZE
.. 1 1/1 :NORM 1.81610>
.. #<>WEIGHT (H #2#) :SIZE 1 1/1 :NORM 0.21965>
.. #<>WEIGHT (#1# #3=# :PEEPHOLE) :SIZE
.. 1 1/1 :NORM 1.74939>
.. #<>WEIGHT (H #3#) :SIZE 1 1/1 :NORM 0.40377>
.. #<>WEIGHT (H PREDICTION) :SIZE
.. 3 1/1 :NORM 2.15898>
.. #<>WEIGHT (:BIAS PREDICTION) :SIZE
.. 3 1/1 :NORM 2.94470>
.. #<>WEIGHT (#1# #4=# :PEEPHOLE) :SIZE
.. 1 1/1 :NORM 0.97601>
.. #<>WEIGHT (INPUT #4#) :SIZE 1 1/1 :NORM 0.65261>
.. #<>WEIGHT (:BIAS #4#) :SIZE 1 1/1 :NORM 0.37653>
.. #<>WEIGHT (INPUT #1#) :SIZE 1 1/1 :NORM 0.92334>
.. #<>WEIGHT (:BIAS #1#) :SIZE 1 1/1 :NORM 0.01609>
.. #<>WEIGHT (INPUT #5=#) :SIZE 1 1/1 :NORM 1.09995>
.. #<>WEIGHT (:BIAS #5#) :SIZE 1 1/1 :NORM 1.41244>
.. #<>WEIGHT (INPUT #6=#) :SIZE 1 1/1 :NORM 0.40475>
.. #<>WEIGHT (:BIAS #6#) :SIZE 1 1/1 :NORM 1.75358>)
.. {1006CD8753}>
.. LEARNINGRATE = 2.00000e1
.. MOMENTUM = NONE
.. MOMENTUMTYPE = :NONE
.. WEIGHTDECAY = 0.00000e+0
.. WEIGHTPENALTY = 0.00000e+0
.. NAFTERUPATEHOOK = 0
.. BATCHSIZE = 100
..
.. BATCHGDOPTIMIZER description:
.. NBEFOREUPATEHOOK = 0
..
.. ADAMOPTIMIZER description:
.. MEANDECAYRATE = 1.00000e1
.. MEANDECAYRATEDECAY = 9.00000e1
.. VARIANCEDECAYRATE = 1.00000e1
.. VARIANCEADJUSTMENT = 1.00000d7
.. #<RNN {10047C77E3}>
.. BPN description:
.. CLUMPS = #(#<SUMSIGNFNN :STRIPES 1/50 :CLUMPS 4>
.. #<SUMSIGNFNN :STRIPES 1/50 :CLUMPS 4>)
.. NSTRIPES = 1
.. MAXNSTRIPES = 50
..
.. RNN description:
.. MAXLAG = 1
.. pred. cost: 1.223e+0 (4455.00)
.. warped pred. cost: 1.228e+0 (9476.00)
.. Foreign memory usage:
.. foreign arrays: 162 (used bytes: 39,600)
.. CUDA memory usage:
.. device arrays: 114 (used bytes: 220,892, pooled bytes: 19,200)
.. host arrays: 162 (used bytes: 39,600)
.. host>device copies: 6,164, device>host copies: 4,490
.. training at ninstances: 3000
.. cost: 3.323e1 (13726.00)
.. training at ninstances: 6000
.. cost: 3.735e2 (13890.00)
.. training at ninstances: 9000
.. cost: 1.012e2 (13872.00)
.. training at ninstances: 12000
.. cost: 3.026e3 (13953.00)
.. training at ninstances: 15000
.. cost: 9.267e4 (13948.00)
.. training at ninstances: 18000
.. cost: 2.865e4 (13849.00)
.. training at ninstances: 21000
.. cost: 8.893e5 (13758.00)
.. training at ninstances: 24000
.. cost: 2.770e5 (13908.00)
.. training at ninstances: 27000
.. cost: 8.514e6 (13570.00)
.. training at ninstances: 30000
.. cost: 2.705e6 (13721.00)
.. pred. cost: 1.426e6 (4593.00)
.. warped pred. cost: 1.406e6 (9717.00)
.. Foreign memory usage:
.. foreign arrays: 216 (used bytes: 52,800)
.. CUDA memory usage:
.. device arrays: 148 (used bytes: 224,428, pooled bytes: 19,200)
.. host arrays: 216 (used bytes: 52,800)
.. host>device copies: 465,818, device>host copies: 371,990
..
==> (#<>WEIGHT (H (H :OUTPUT)) :SIZE 1 1/1 :NORM 0.10624>
> #<>WEIGHT (H (H :CELL)) :SIZE 1 1/1 :NORM 0.94460>
> #<>WEIGHT ((H :CELL) (H :FORGET) :PEEPHOLE) :SIZE 1 1/1 :NORM 0.61312>
> #<>WEIGHT (H (H :FORGET)) :SIZE 1 1/1 :NORM 0.38093>
> #<>WEIGHT ((H :CELL) (H :INPUT) :PEEPHOLE) :SIZE 1 1/1 :NORM 1.17956>
> #<>WEIGHT (H (H :INPUT)) :SIZE 1 1/1 :NORM 0.88011>
> #<>WEIGHT (H PREDICTION) :SIZE 3 1/1 :NORM 49.93808>
> #<>WEIGHT (:BIAS PREDICTION) :SIZE 3 1/1 :NORM 10.98112>
> #<>WEIGHT ((H :CELL) (H :OUTPUT) :PEEPHOLE) :SIZE 1 1/1 :NORM 0.67996>
> #<>WEIGHT (INPUT (H :OUTPUT)) :SIZE 1 1/1 :NORM 0.65251>
> #<>WEIGHT (:BIAS (H :OUTPUT)) :SIZE 1 1/1 :NORM 10.23003>
> #<>WEIGHT (INPUT (H :CELL)) :SIZE 1 1/1 :NORM 5.98116>
> #<>WEIGHT (:BIAS (H :CELL)) :SIZE 1 1/1 :NORM 0.10681>
> #<>WEIGHT (INPUT (H :FORGET)) :SIZE 1 1/1 :NORM 4.46301>
> #<>WEIGHT (:BIAS (H :FORGET)) :SIZE 1 1/1 :NORM 1.57195>
> #<>WEIGHT (INPUT (H :INPUT)) :SIZE 1 1/1 :NORM 0.36401>
> #<>WEIGHT (:BIAS (H :INPUT)) :SIZE 1 1/1 :NORM 8.63833>)
#

A recurrent neural net (as opposed to a feedforward one. It is typically built with
BUILDRNN
that's no more than a shallow convenience macro.An
RNN
takes instances as inputs that are sequences of variable length. At each time step, the next unprocessed elements of these sequences are set as input until all input sequences in the batch run out. To be able to perform backpropagation, all intermediateLUMP
s must be kept around, so the recursive connections are transformed out by unfolding the network. Just how many lumps this means depends on the length of the sequences.When an
RNN
is created,MAXLAG + 1
BPN
s are instantiated so that all weights are present and one can start training it.
[reader] UNFOLDER RNN (:UNFOLDER)
The
UNFOLDER
of anRNN
is function of no arguments that builds and returns aBPN
. The unfolder is allowed to create networks with arbitrary topology even different ones for differentTIMESTEP
s with the help ofLAG
, or nestedRNN
s. Weights of the same name are shared between the folds. That is, if a>WEIGHT
lump were to be created and a weight lump of the same name already exists, then the existing lump will be added to theBPN
created byUNFOLDER
.
[reader] MAXLAG RNN (:MAXLAG = 1)
The networks built by
UNFOLDER
may contain new weights up to time stepMAXLAG
. Beyond that point, all weight lumps must be reappearances of weight lumps with the same name at previous time steps. Most recurrent networks reference only the state of lumps at the previous time step (with the functionLAG
), hence the default of 1. But it is possible to have connections to arbitrary time steps. The maximum connection lag must be specified when creating theRNN
.
[accessor] CUDAWINDOWSTARTTIME RNN (:CUDAWINDOWSTARTTIME =
*CUDAWINDOWSTARTTIME*
)Due to unfolding, the memory footprint of an
RNN
is almost linear in the number of time steps (i.e. the max sequence length). For prediction, this is addressed by Time Warp. For training, we cannot discard results of previous time steps because they are needed for backpropagation, but we can at least move them out of GPU memory if they are not going to be used for a while and copy them back before they are needed. Obviously, this is only relevant if CUDA is being used.If
CUDAWINDOWSTARTTIME
isNIL
, then this feature is turned off. Else, during training, atCUDAWINDOWSTARTTIME
or later time steps, matrices belonging to nonweight lumps may be forced out of GPU memory and later brought back as neeeded.This feature is implemented in terms of
MGLMAT:WITHSYNCINGCUDAFACETS
that uses CUDA host memory (also known as pagelocked or pinned memory) to do asynchronous copies concurrently with normal computation. The consequence of this is that it is now main memory usage that's unbounded which toghether with pagelocking makes it a potent weapon to bring a machine to a halt. You were warned.
[variable] *CUDAWINDOWSTARTTIME* NIL
The default for
CUDAWINDOWSTARTTIME
.
[macro] BUILDRNN (&KEY RNN (CLASS ''
RNN
) NAME INITARGS MAXNSTRIPES (MAXLAG 1)) &BODY BODYCreate an
RNN
withMAXNSTRIPES
andMAXLAG
whoseUNFOLDER
isBODY
wrapped in a lambda. Bind symbol given as theRNN
argument to theRNN
object so thatBODY
can see it.
[function] LAG NAME &KEY (LAG 1) RNN PATH
In
RNN
or if it'sNIL
theRNN
being extended with anotherBPN
(called unfolding), look up theCLUMP
withNAME
in theBPN
that'sLAG
number of time steps before theBPN
being added. If this function is called fromUNFOLDER
of anRNN
(which is what happens behind the scene in the body ofBUILDRNN
), then it returns an opaque object representing a lagged connection to a clump, else it returns theCLUMP
itself.FIXDOC:
PATH
[function] TIMESTEP &KEY (RNN
*RNN*
)Return the time step
RNN
is currently executing or being unfolded for. It is 0 when theRNN
is being unfolded for the first time.
[method] SETINPUT INSTANCES (RNN RNN)
RNN
s operate on batches of instances just likeFNN
s. But the instances here are like datasets: sequences or samplers and they are turned into sequences of batches of instances withMAPDATASETS
:IMPUTE
NIL
. The batch of instances at index 2 is clamped onto theBPN
at time step 2 withSETINPUT
.When the input sequences in the batch are not of the same length, already exhausted sequences will produce
NIL
(due to:IMPUTE
NIL
) above. When such aNIL
is clamped withSETINPUT
on aBPN
of theRNN
,SETINPUT
must set theIMPORTANCE
of the >ERROR lumps to 0 else training would operate on the noise left there by previous invocations.
Time Warp
The unbounded memory usage of RNN
s with one BPN
allocated per
time step can become a problem. For training, where the gradients
often have to be backpropagated from the last time step to the very
beginning, this is hard to solve but with CUDAWINDOWSTARTTIME
the
limit is no longer GPU memory.
For prediction on the other hand, one doesn't need to keep old steps around indefinitely: they can be discarded when future time steps will never reference them again.

Controls whether warping is enabled (see Time Warp). Don't enable it for training, as it would make backprop impossible.
[function] WARPEDTIME &KEY (RNN
*RNN*
) (TIME (TIMESTEP
:RNN
RNN
)) (LAG 0)Return the index of the
BPN
inCLUMPS
ofRNN
whose task it is to execute computation at( (TIMESTEP RNN) LAG)
. This is normally the same asTIMESTEP
(disregardingLAG
). That is,CLUMPS
can be indexed byTIMESTEP
to get theBPN
. However, when*WARPTIME*
is true, execution proceeds in a cycle as the structure of the network allows.Suppose we have a typical
RNN
that only ever references the previous time step so itsMAXLAG
is 1. ItsUNFOLDER
returnsBPN
s of identical structure bar a shift in their time lagged connections except for the very first, soWARPSTART
andWARPLENGTH
are both 1. If*WARPTIME*
isNIL
, then the mapping fromTIMESTEP
to theBPN
inCLUMPS
is straightforward:time:  0  1  2  3  4  5 ++++++ warped:  0  1  2  3  4  5 ++++++ bpn:  b0  b1  b2  b3  b4  b5
When
*WARPTIME*
is true, we reuse theB1
B2
bpns in a loop:time:  0  1  2  3  4  5 ++++++ warped:  0  1  2  1  2  1 ++++++ bpn:  b0  b1  b2  b1* b2  b1*
B1*
is the sameBPN
asB1
, but its connections created byLAG
go through warped time and end up referencingB2
. This way, memory consumption is independent of the number time steps needed to process a sequence or make predictions.To be able to pull this trick off
WARPSTART
andWARPLENGTH
must be specified when theRNN
is instantiated. In general, with*WARPTIME*
(+ WARPSTART (MAX 2 WARPLENGTH))
bpns are needed. The 2 comes from the fact that with cycle length 1 a bpn would need to takes its input from itself which is problematic because it hasNODES
for only one set of values.
[reader] WARPSTART RNN (:WARPSTART = 1)
The
TIMESTEP
from whichUNFOLDER
will createBPN
s that essentially repeat everyWARPLENGTH
steps.
[reader] WARPLENGTH RNN (:WARPLENGTH = 1)
An integer such that the
BPN
UNFOLDER
creates at time stepI
(where(<= WARPSTART I)
) is identical to theBPN
created at time step(+ WARPSTART (MOD ( I WARPSTART) WARPLENGTH))
except for a shift in its time lagged connections.
[accessor] STEPMONITORS RNN (:STEPMONITORS =
NIL
)During training, unfolded
BPN
s corresponding to previous time steps may be expensive to get at because they are no longer in GPU memory. This consideration also applies to making prediction with the additional caveat that with*WARPTIME*
true, previous states are discarded so it's not possible to gather statistics afterFORWARD
finished.Add monitor objects to this slot and they will be automatically applied to the
RNN
after each step whenFORWARD
ing theRNN
during training or prediction. To be able to easily switch between sets of monitors, in addition to a list of monitors this can be a symbol or a function, too. If it's a symbol, then its a designator for itsSYMBOLVALUE
. If it's a function, then it must have no arguments and it's a designator for its return value.
11.4 Lumps
11.4.1 Lump Base Class

A
LUMP
is a simple, layerlike component of a neural network. There are many kinds of lumps, each of which performs a specific operation or just stores inputs and weights. By convention, the names of lumps start with the prefix>
. Defined as classes, they also have a function of the same name as the class to create them easily. These maker functions typically have keyword arguments corresponding to initargs of the class, with some (mainly the input lumps) turned into normal positional arguments. So instead of having to do(makeinstance '>tanh :x someinput :name 'mytanh)
one can simply write
(>tanh someinput :name 'mytanh)
Lumps instantiated in any way within a
BUILDFNN
orBUILDRNN
are automatically added to the network being built.A lump has its own
NODES
andDERIVATIVES
matrices allocated for it in which the results of the forward and backward passes are stored. This is in contrast to aBPN
whoseNODES
andDERIVATIVES
are those of its last constituentCLUMP
.Since lumps almost always live within a
BPN
, theirNSTRIPES
andMAXNSTRIPES
are handled automagically behind the scenes.
[reader] DEFAULTVALUE LUMP (:DEFAULTVALUE = 0)
Upon creation or resize the lump's nodes get filled with this value.
[genericfunction] DEFAULTSIZE LUMP
Return a default for the
SIZE
ofLUMP
if one is not supplied at instantiation. The value is often computed based on the sizes of the inputs. This function is for implementing new lump types.

The values computed by the lump in the forward pass are stored here. It is an
NSTRIPES * SIZE
matrix that has storage allocated forMAXNSTRIPES * SIZE
elements for nonweight lumps.>WEIGHT
lumps have no stripes nor restrictions on their shape.
[reader] DERIVATIVES LUMP
The derivatives computed in the backward pass are stored here. This matrix is very much like
NODES
in shape and size.
11.4.2 Inputs
Input Lump

A lump that has no input lumps, does not change its values in the forward pass (except when
DROPOUT
is nonzero), and does not compute derivatives. Clamp inputs onNODES
of input lumps inSETINPUT
.For convenience,
>INPUT
can perform dropout itself although it defaults to no dropout.(>input :size 10 :name 'someinput) ==> #<>INPUT SOMEINPUT :SIZE 10 1/1 :NORM 0.00000>
[accessor] DROPOUT >INPUT (=
NIL
)See
DROPOUT
.
Embedding Lump
This lump is like an input and a simple activation molded together in the name of efficiency.

Select rows of
WEIGHTS
(0
1
), one row for each index inINPUTROWINDICES
. This lump is equivalent to adding an>INPUT
lump with a one hot encoding scheme and a>V*M
lump on top of it, but it is more efficient in execution and in memory usage, because it works with a sparse representation of the input.The
SIZE
(0
1
) of this lump is the number of columns ofWEIGHTS
(0
1
) which is determined automatically.(>embedding :weights (>weight :name 'embeddingweights :dimensions '(3 5)) :name 'embeddings) ==> #<>EMBEDDING EMBEDDINGS :SIZE 5 1/1 :NORM 0.00000>
[reader] WEIGHTS >EMBEDDING (:WEIGHTS)
A weight lump whose rows indexed by
INPUTROWINDICES
are copied to the output of this lump.
[reader] INPUTROWINDICES >EMBEDDING (:INPUTROWINDICES)
A sequence of batch size length of row indices. To be set in
SETINPUT
.
11.4.3 Weight Lump

A set of optimizable parameters of some kind. When a
BPN
is is trained (see Training) theNODES
of weight lumps will be changed. Weight lumps perform no computation.Weights can be created by specifying the total size or the dimensions:
(dimensions (>weight :size 10 :name 'w)) => (1 10) (dimensions (>weight :dimensions '(5 10) :name 'w)) => (5 10)
[reader] DIMENSIONS >WEIGHT (:DIMENSIONS)
NODES
andDERIVATIVES
of this lump will be allocated with these dimensions.
[macro] WITHWEIGHTSCOPIED (FROMBPN) &BODY BODY
In
BODY
>WEIGHT
will first look up if a weight lump of the same name exists inFROMBPN
and return that, or else create a weight lump normally. IfFROMBPN
isNIL
, then no weights are copied.
11.4.4 Activations
Activation Subnet
So we have some inputs. Usually the next step is to multiply the
input vector with a weight matrix and add biases. This can be done
directly with >+
, >V*M
and >WEIGHT
, but it's more convenient to
use activation subnets to reduce the clutter.

Activation subnetworks are built by the function
>ACTIVATION
and they have a number of lumps hidden inside them. Ultimately, this subnetwork computes a sum likesum_i x_i * W_i + sum_j y_j .* V_j + biases
wherex_i
are input lumps,W_i
are dense matrices representing connections, whileV_j
are peephole connection vectors that are mulitplied in an elementwise manner with their corresponding inputy_j
.
[function] >ACTIVATION INPUTS &KEY (NAME (
GENSYM
)) SIZE PEEPHOLES (ADDBIASPT
)Create a subnetwork of class
>ACTIVATION
that computes the over activation from dense connection from lumps inINPUTS
, and elementwise connection from lumps inPEEPHOLES
. Create new>WEIGHT
lumps as necessary.INPUTS
andPEEPHOLES
can be a single lump or a list of lumps. Finally, ifADDBIASP
, then add an elementwise bias too.SIZE
must be specified explicitly, because it is not possible to determine it unless there are peephole connections.(>activation (>input :size 10 :name 'input) :name 'h1 :size 4) ==> #<>ACTIVATION (H1 :ACTIVATION) :STRIPES 1/1 :CLUMPS 4>
This is the basic workhorse of neural networks which takes care of the linear transformation whose results and then fed to some nonlinearity (
>SIGMOID
,>TANH
, etc).The name of the subnetwork clump is
(,NAME :ACTIVATION)
. The bias weight lump (if any) is named(:BIAS ,NAME)
. Dense connection weight lumps are named are named after the input andNAME
:(,(NAME INPUT) ,NAME)
, while peepholes weight lumps are named(,(NAME INPUT) ,NAME :PEEPHOLE)
. This is useful to know if, for example, they are to be initialized differently.
BatchNormalization
[class] >BATCHNORMALIZED LUMP
This is an implementation of v3 of the Batch Normalization paper. The output of
>BATCHNORMALIZED
is its input normalized so that for all elements the mean across stripes is zero and the variance is 1. That is, the mean of the batch is subtracted from the inputs and they are rescaled by their sample stddev. Actually, after the normalization step the values are rescaled and shifted (but this time with learnt parameters) in order to keep the representational power of the model the same. The primary purpose of this lump is to speed up learning, but it also acts as a regularizer. See the paper for the details.To normalize the output of
LUMP
without no additional regularizer effect:(>batchnormalized lump :batchsize :usepopulation)
The above uses an exponential moving average to estimate the mean and variance of batches and these estimations are used at both training and test time. In contrast to this, the published version uses the sample mean and variance of the current batch at training time which injects noise into the process. The noise is higher for lower batch sizes and has a regularizing effect. This is the default behavior (equivalent to
:BATCHSIZE NIL
):(>batchnormalized lump)
For performance reasons one may wish to process a higher number of instances in a batch (in the sense of
NSTRIPES
) and get the regularization effect associated with a lower batch size. This is possible by setting:BATCHSIZE
to a divisor of the the number of stripes. Say, the number of stripes is 128, but we want as much regularization as we would get with 32:(>batchnormalized lump :batchsize 32)
The primary input of
>BATCHNORMALIZED
is often an>ACTIVATION
(0
1
) and its output is fed into an activation function (see Activation Functions).
[reader] BATCHNORMALIZATION >BATCHNORMALIZED (:NORMALIZATION)
The
>BATCHNORMALIZATION
of this lump. May be shared between multiple>BATCHNORMALIZED
lumps.Batch normalization is special in that it has state apart from the computed results (
NODES
) and its derivatives (DERIVATIVES
). This state is the estimated mean and variance of its inputs and they are encapsulated by>BATCHNORMALIZATION
.If
NORMALIZATION
is not given at instantiation, then a new>BATCHNORMALIZATION
object will be created automatically, passing:BATCHSIZE
,:VARIANCEADJUSTMENT
, and:POPULATIONDECAY
arguments on to>BATCHNORMALIZATION
. SeeBATCHSIZE
,VARIANCEADJUSTMENT
andPOPULATIONDECAY
. New scale and shift weight lumps will be created with names:`(,name :scale) `(,name :shift)
where
NAME
is theNAME
(0
1
) of this lump.This default behavior covers the usecase where the statistics kept by
>BATCHNORMALIZATION
are to be shared only between time steps of anRNN
.
[class] >BATCHNORMALIZATION >WEIGHT
The primary purpose of this class is to hold the estimated mean and variance of the inputs to be normalized and allow them to be shared between multiple
>BATCHNORMALIZED
lumps that carry out the computation. These estimations are saved and loaded bySAVESTATE
andLOADSTATE
.(>batchnormalization (>weight :name '(h1 :scale) :size 10) (>weight :name '(h1 :shift) :size 10) :name '(h1 :batchnormalization))
[reader] SCALE >BATCHNORMALIZATION (:SCALE)
A weight lump of the same size as
SHIFT
. This is $\gamma$ in the paper.
[reader] SHIFT >BATCHNORMALIZATION (:SHIFT)
A weight lump of the same size as
SCALE
. This is $\beta$ in the paper.
[reader] BATCHSIZE >BATCHNORMALIZATION (:BATCHSIZE =
NIL
)Normally all stripes participate in the batch. Lowering the number of stripes may increase the regularization effect, but it also makes the computation less efficient. By setting
BATCHSIZE
to a divisor ofNSTRIPES
one can decouple the concern of efficiency from that of regularization. The default value,NIL
, is equivalent toNSTRIPES
.BATCHSIZE
only affects training.With the special value
:USEPOPULATION
, instead of the mean and the variance of the current batch, use the population statistics for normalization. This effectively cancels the regularization effect, leaving only the faster learning.
[reader] VARIANCEADJUSTMENT >BATCHNORMALIZATION (:VARIANCEADJUSTMENT = 1.0e4)
A small positive real number that's added to the sample variance. This is $\epsilon$ in the paper.
[reader] POPULATIONDECAY >BATCHNORMALIZATION (:POPULATIONDECAY = 0.99)
While training, an exponential moving average of batch means and standard deviances (termed population statistics) is updated. When making predictions, normalization is performed using these statistics. These population statistics are persisted by
SAVESTATE
.
[function] >BATCHNORMALIZEDACTIVATION INPUTS &KEY (NAME (
GENSYM
)) SIZE PEEPHOLES BATCHSIZE VARIANCEADJUSTMENT POPULATIONDECAYA utility functions that creates and wraps an
>ACTIVATION
(0
1
) in>BATCHNORMALIZED
and with itsBATCHNORMALIZATION
the two weight lumps for the scale and shift parameters.(>BATCHNORMALIZEDACTIVATION INPUTS :NAME 'H1 :SIZE 10)
is equivalent to:(>batchnormalized (>activation inputs :name 'h1 :size 10 :addbiasp nil) :name '(h1 :batchnormalizedactivation))
Note how biases are turned off since normalization will cancel them anyway (but a shift is added which amounts to the same effect).
11.4.5 Activation Functions
Now we are moving on to the most important nonlinearities to which activations are fed.
Sigmoid Lump
[class] >SIGMOID >DROPOUT LUMP
Applies the
1/(1 + e^{x})
function elementwise to its inputs. This is one of the classic nonlinearities for neural networks.For convenience,
>SIGMOID
can perform dropout itself although it defaults to no dropout.(>sigmoid (>activation (>input :size 10) :size 5) :name 'this) ==> #<>SIGMOID THIS :SIZE 5 1/1 :NORM 0.00000>
The
SIZE
(0
1
) of this lump is the size of its input which is determined automatically.
[accessor] DROPOUT >SIGMOID (=
NIL
)See
DROPOUT
.
Tanh Lump

Applies the
TANH
function to its input in an elementwise manner. TheSIZE
(0
1
) of this lump is the size of its input which is determined automatically.
Scaled Tanh Lump

Pretty much like
TANH
but its input and output is scaled in such a way that the variance of its output is close to 1 if the variance of its input is close to 1 which is a nice property to combat vanishing gradients. The actual function is1.7159 * tanh(2/3 * x)
. TheSIZE
(0
1
) of this lump is the size of its input which is determined automatically.
Relu Lump
We are somewhere around year 2007 by now.

max(0,x)
activation function. Be careful, relu units can get stuck in the off state: if they move to far to negative territory it can be very difficult to get out of it. TheSIZE
(0
1
) of this lump is the size of its input which is determined automatically.
Max Lump
We are in about year 2011.

This is basically maxout without dropout (see http://arxiv.org/abs/1302.4389). It groups its inputs by
GROUPSIZE
, and outputs the maximum of each group. TheSIZE
(0
1
) of the output is automatically calculated, it is the size of the input divided byGROUPSIZE
.(>max (>input :size 120) :groupsize 3 :name 'mymax) ==> #<>MAX MYMAX :SIZE 40 1/1 :NORM 0.00000 :GROUPSIZE 3>
The advantage of
>MAX
over>RELU
is that flow gradient is never stopped so there is no problem of units getting stuck in off state.
[reader] GROUPSIZE >MAX (:GROUPSIZE)
The number of inputs in each group.
Min Lump
[reader] GROUPSIZE >MIN (:GROUPSIZE)
The number of inputs in each group.
MaxChannel Lump

Called LWTA (Local Winner Take All) or ChannelOut (see http://arxiv.org/abs/1312.1909) in the literature it is basically
>MAX
, but instead of producing one output per group, it just produces zeros for all unit but the one with the maximum value in the group. This allows the next layer to get some information about the path along which information flowed. TheSIZE
(0
1
) of this lump is the size of its input which is determined automatically.
[reader] GROUPSIZE >MAXCHANNEL (:GROUPSIZE)
The number of inputs in each group.
11.4.6 Losses
Ultimately, we need to tell the network what to learn which means that the loss function to be minimized needs to be constructed as part of the network.
Loss Lump

Calculate the loss for the instances in the batch. The main purpose of this lump is to provide a training signal.
An error lump is usually a leaf in the graph of lumps (i.e. there are no other lumps whose input is this one). The special thing about error lumps is that 1 (but see
IMPORTANCE
) is added automatically to their derivatives. Error lumps have exactly one node (per stripe) whose value is computed as the sum of nodes in their input lump.
[accessor] IMPORTANCE >LOSS (:IMPORTANCE =
NIL
)This is to support weighted instances. That is when not all training instances are equally important. If nonNIL, a 1d
MAT
with the importances of stripes of the batch. WhenIMPORTANCE
is given (typically inSETINPUT
), then instead of adding 1 to the derivatives of all stripes,IMPORTANCE
is added elemtwise.
Squared Difference Lump
In regression, the squared error loss is most common. The squared
error loss can be constructed by combining >SQUAREDDIFFERENCE
with
a >LOSS
.
[class] >SQUAREDDIFFERENCE LUMP
This lump takes two input lumps and calculates their squared difference
(x  y)^2
in an elementwise manner. TheSIZE
(0
1
) of this lump is automatically determined from the size of its inputs. This lump is often fed into>LOSS
that sums the squared differences and makes it part of the function to be minimized.(>loss (>squareddifference (>activation (>input :size 100) :size 10) (>input :name 'target :size 10)) :name 'squarederror) ==> #<>LOSS SQUAREDERROR :SIZE 1 1/1 :NORM 0.00000>
Currently this lump is not CUDAized, but it will copy data from the GPU if it needs to.
Softmax CrossEntropy Loss Lump
[class] >SOFTMAXXELOSS LUMP
A specialized lump that computes the softmax of its input in the forward pass and backpropagates a crossentropy loss. The advantage of doing these together is numerical stability. The total crossentropy is the sum of crossentropies per group of
GROUPSIZE
elements:$$XE(x) =  \sum_{i=1,g} t_i \ln(s_i)$$
where
g
is the number of classes (GROUPSIZE
),t_i
are the targets (i.e. the true probabilities of the class, often all zero but one),s_i
is the output of softmax calculated from inputX
:$$s_i = {softmax}(x_1, x_2, ..., x_g) = \frac{e^x_i}{\sum_{j=1,g} e^x_j}$$
In other words, in the forward phase this lump takes input
X
, computes its elementwiseEXP
, normalizes each group ofGROUPSIZE
elements to sum to 1 to get the softmax which is the result that goes intoNODES
. In the backward phase, there are two sources of gradients: the lumps that use the output of this lump as their input (currently not implemented and would result in an error) and an implicit crossentropy loss.One can get the crossentropy calculated in the most recent forward pass by calling
COST
on this lump.This is the most common loss function for classification. In fact, it is nearly ubiquitous. See the FNN Tutorial and the RNN Tutorial for how this loss and
SETINPUT
work together.
[reader] GROUPSIZE >SOFTMAXXELOSS (:GROUPSIZE)
The number of elements in a softmax group. This is the number of classes for classification. Often
GROUPSIZE
is equal toSIZE
(0
1
) (it is the default), but in general the only constraint is thatSIZE
(0
1
) is a multiple ofGROUPSIZE
.
[accessor] TARGET >SOFTMAXXELOSS (:TARGET =
NIL
)Set in
SETINPUT
, this is either aMAT
of the same size as the input lumpX
or if the target is very sparse, this can also be a sequence of batch size length that contains the index value pairs of nonzero entries:(;; first instance in batch has two nonzero targets (;; class 10 has 30% expected probability (10 . 0.3) ;; class 2 has 70% expected probability (2 . 0.7)) ;; second instance in batch puts 100% on class 7 7 ;; more instances in the batch follow ...)
Actually, in the rare case where
GROUPSIZE
is notSIZE
(0
1
) (i.e. there are several softmax normalization groups for every example), the length of the above target sequence isBATCHSIZE
(0
1
2
) * NGROUPS. Indices are always relative to the start of the group.If
GROUPSIZE
is large (for example, in neural language models with a huge number of words), using sparse targets can make things go much faster, because calculation of the derivative is no longer quadratic.Giving different weights to training instances is implicitly supported. While target values in a group should sum to 1, multiplying all target values with a weight
W
is equivalent to training thatW
times on the same example.
[function] ENSURESOFTMAXTARGETMATRIX SOFTMAXXELOSS N
Set
TARGET
ofSOFTMAXXELOSS
to aMAT
capable of holding the dense target values forN
stripes.
11.4.7 Stochasticity
Dropout Lump

The output of this lump is identical to its input, except it randomly zeroes out some of them during training which act as a very strong regularizer. See Geoffrey Hinton's 'Improving neural networks by preventing coadaptation of feature detectors'.
The
SIZE
(0
1
) of this lump is the size of its input which is determined automatically.
[accessor] DROPOUT >DROPOUT (:DROPOUT = 0.5)
If nonNIL, then in the forward pass zero out each node in this chunk with
DROPOUT
probability.
Gaussian Random Lump
[class] >GAUSSIANRANDOM LUMP
This lump has no input, it produces normally distributed independent random numbers with
MEAN
andVARIANCE
(orVARIANCEFORPREDICTION
). This is useful building block for noise based regularization methods.(>gaussianrandom :size 10 :name 'normal :mean 1 :variance 2) ==> #<>GAUSSIANRANDOM NORMAL :SIZE 10 1/1 :NORM 0.00000>
[accessor] MEAN >GAUSSIANRANDOM (:MEAN = 0)
The mean of the normal distribution.
[accessor] VARIANCE >GAUSSIANRANDOM (:VARIANCE = 1)
The variance of the normal distribution.
[accessor] VARIANCEFORPREDICTION >GAUSSIANRANDOM (:VARIANCEFORPREDICTION = 0)
If not
NIL
, then this value overridesVARIANCE
when not in training (i.e. when making predictions).
Binary Sampling Lump

Treating values of its input as probabilities, sample independent binomials. Turn true into 1 and false into 0. The
SIZE
(0
1
) of this lump is determined automatically from the size of its input.(>samplebinary (>input :size 10) :name 'binarizedinput) ==> #<>SAMPLEBINARY BINARIZEDINPUT :SIZE 10 1/1 :NORM 0.00000>
11.4.8 Arithmetic
Sum Lump
VectorMatrix Multiplication Lump

Perform
X * WEIGHTS
whereX
(the input) is of sizeM
andWEIGHTS
(0
1
) is a>WEIGHT
whose single stripe is taken to be of dimensionsM x N
stored in row major order.N
is the size of this lump. IfTRANSPOSEWEIGHTSP
thenWEIGHTS
(0
1
) isN x M
andX * WEIGHTS'
is computed.
[reader] TRANSPOSEWEIGHTSP >V*M (:TRANSPOSEWEIGHTSP =
NIL
)Determines whether the input is multiplied by
WEIGHTS
(0
1
) or its transpose.
Elementwise Addition Lump

Performs elementwise addition on its input lumps. The
SIZE
(0
1
) of this lump is automatically determined from the size of its inputs if there is at least one. If one of the inputs is a>WEIGHT
lump, then it is added to every stripe.(>+ (list (>input :size 10) (>weight :size 10 :name 'bias)) :name 'plus) ==> #<>+ PLUS :SIZE 10 1/1 :NORM 0.00000>
Elementwise Multiplication Lump

Performs elementwise multiplication on its two input lumps. The
SIZE
(0
1
) of this lump is automatically determined from the size of its inputs. Either input can be a>WEIGHT
lump.(>* (>input :size 10) (>weight :size 10 :name 'scale) :name 'mult) ==> #<>* MULT :SIZE 10 1/1 :NORM 0.00000>
Abs Lump
Exp Lump
Normalized Lump
11.4.9 Operations for RNNs
LSTM Subnet

LongShort Term Memory subnetworks are built by the function
>LSTM
and they have many lumps hidden inside them. These lumps are packaged into a subnetwork to reduce clutter.
[function] >LSTM INPUTS &KEY NAME CELLINIT OUTPUTINIT SIZE (ACTIVATIONFN '
>ACTIVATION
(0
1
)) (GATEFN '>SIGMOID
) (INPUTFN '>TANH
) (OUTPUTFN '>TANH
) (PEEPHOLEST
)Create an LSTM layer consisting of input, forget, output gates with which input, cell state and output are scaled. Lots of lumps are created, the final one representing to output of the LSTM has
NAME
. The rest of the lumps are named automatically based onNAME
. This function returns only the output lump (m
), but all created lumps are added automatically to theBPN
being built.There are many papers and tutorials on LSTMs. This version is well described in "Long ShortTerm Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling" (2014, Hasim Sak, Andrew Senior, Francoise Beaufays). Using the notation from that paper:
$$i_t = s(W_{ix} x_t + W_{im} m_{t1} + W_{ic} \odot c_{t1} + b_i)$$
$$f_t = s(W_{fx} x_t + W_{fm} m_{t1} + W_{fc} \odot c_{t1} + b_f)$$
$$c_t = f_t \odot c_{t1} + i_t \odot g(W_{cx} x_t + W_{cm} m_{t1} + b_c)$$
$$o_t = s(W_{ox} x_t + W_{om} m_{t1} + W_{oc} \odot c_t + b_o)$$
$$m_t = o_t \odot h(c_t)$$
... where
i
,f
, ando
are the input, forget and output gates.c
is the cell state andm
is the actual output.Weight matrices for connections from
c
(W_ic
,W_fc
andW_oc
) are diagonal and represented by just the vector of diagonal values. These connections are only added ifPEEPHOLES
is true.A notable difference from the paper is that in addition to being a single lump,
x_t
(INPUTS
) can also be a list of lumps. Whenever some activation is to be calculated based onx_t
, it is going to be the sum of individual activations. For example,W_ix * x_t
is reallysum_j W_ijx * inputs_j
.If
CELLINIT
is nonNIL, then it must be aCLUMP
ofSIZE
form which stands for the initial state of the value cell (c_{1}
).CELLINIT
beingNIL
is equivalent to the state of all zeros.ACTIVATIONFN
defaults to>ACTIVATION
(0
1
), but it can be for example>BATCHNORMALIZEDACTIVATION
. In general, functions like the aforementioned two with signature like (INPUTS
&KEY
NAME
SIZE
PEEPHOLES
) can be passed asACTIVATIONFN
.
Sequence Barrier Lump

In an
RNN
, processing of stripes (instances in the batch) may require different number of time step so the final state for stripe 0 is in stripe 0 of some lump L at time step 7, while for stripe 1 it is in stripe 1 of sump lump L at time step 42.This lump copies the perstripe states from different lumps into a single lump so that further processing can take place (typically when the
RNN
is embedded in another network).The
SIZE
(0
1
) of this lump is automatically set to the size of the lump returned by(FUNCALL SEQELTFN 0)
.
[reader] SEQELTFN >SEQBARRIER (:SEQELTFN)
A function of an [INDEX] argument that returns the lump with that index in some sequence.
[accessor] SEQINDICES >SEQBARRIER
A sequence of length batch size of indices. The element at index
I
is the index to be passed toSEQELTFN
to find the lump whose stripeI
is copied to stripeI
of this this lump.
11.5 Utilities
[function] RENORMALIZEACTIVATIONS >V*MLUMPS L2UPPERBOUND
If the l2 norm of the incoming weight vector of the a unit is larger than
L2UPPERBOUND
then renormalize it toL2UPPERBOUND
. The list of>V*MLUMPS
is assumed to be eventually fed to the same lump.To use it, group the activation clumps into the same GDOPTIMIZER and hang this function on
AFTERUPDATEHOOK
, that latter of which is done for youARRANGEFORRENORMALIZINGACTIVATIONS
.See "Improving neural networks by preventing coadaptation of feature detectors (Hinton, 2012)", http://arxiv.org/pdf/1207.0580.pdf.
[function] ARRANGEFORRENORMALIZINGACTIVATIONS BPN OPTIMIZER L2UPPERBOUND
By pushing a lambda to
AFTERUPDATEHOOK
ofOPTIMIZER
arrange for all weights beings trained byOPTIMIZER
to be renormalized (as inRENORMALIZEACTIVATIONS
withL2UPPERBOUND
).It is assumed that if the weights either belong to an activation lump or are simply added to the activations (i.e. they are biases).
12 Boltzmann Machines
13 Gaussian Processes
14 Natural Language Processing
[in package MGLNLP]
This in nothing more then a couple of utilities for now which may grow into a more serious toolset for NLP eventually.
[function] MAKENGRAMMAPPEE FUNCTION N
Make a function of a single argument that's suitable as the function argument to a mapper function. It calls
FUNCTION
with everyN
element.(map nil (makengrammappee #'print 3) '(a b c d e)) .. .. (A B C) .. (B C D) .. (C D E)
[function] BLEU CANDIDATES REFERENCES &KEY CANDIDATEKEY REFERENCEKEY (N 4)
Compute the BLEU score for bilingual CORPUS.
BLEU
measures how good a translation is compared to human reference translations.CANDIDATES
(keyed byCANDIDATEKEY
) andREFERENCES
(keyed byREFERENCEKEY
) are sequences of sentences. A sentence is a sequence of words. Words are compared withEQUAL
, and may be any kind of object (not necessarily strings).Currently there is no support for multiple reference translations.
N
determines the largest ngrams to consider.The first return value is the
BLEU
score (between 0 and 1, not as a percentage). The second value is the sum of the lengths ofCANDIDATES
divided by the sum of the lengths ofREFERENCES
(orNIL
, if the denominator is 0). The third is a list of ngram precisions (also between 0 and 1 orNIL
), one for each element in [1..N
][].This is basically a reimplementation of multibleu.perl.
(bleu '((1 2 3 4) (a b)) '((1 2 3 4) (1 2))) => 0.8408964 => 1 => (;; 1gram precision: 4/6 2/3 ;; 2gram precision: 3/4 3/4 ;; 3gram precision: 2/2 1 ;; 4gram precision: 1/1 1)
14.1 Bag of Words

ENCODE
all features of a document with a sparse vector. Get the features of document fromMAPPER
, encode each feature withFEATUREENCODER
.FEATUREENCODER
may returnNIL
if the feature is not used. The result is a vector of encodedfeature/value conses. encodedfeatures are unique (underENCODEDFEATURETEST
) within the vector but are in no particular order.Depending on
KIND
, value is calculated in various ways:For
:FREQUENCY
it is the number of times the corresponding feature was found inDOCUMENT
.For
:BINARY
it is always 1.For
:NORMALIZEDFREQUENCY
and:NORMALIZEDBINARY
are like the unnormalized counterparts except that as the final step values in the assembled sparse vector are normalized to sum to 1.Finally,
:COMPACTEDBINARY
is like:BINARY
but the return values is not a vector of conses, but a vector of elementtypeENCODEDFEATURETYPE
.
(let* ((featureindexer (makeindexer (alexandria:alisthashtable '(("I" . 3) ("me" . 2) ("mine" . 1))) 2)) (bagofwordsencoder (makeinstance 'bagofwordsencoder :featureencoder featureindexer :featuremapper (lambda (fn document) (map nil fn document)) :kind :frequency))) (encode bagofwordsencoder '("All" "through" "day" "I" "me" "mine" "I" "me" "mine" "I" "me" "mine"))) => #((0 . 3.0d0) (1 . 3.0d0))
 [reader] FEATUREENCODER BAGOFWORDSENCODER (:FEATUREENCODER)
 [reader] FEATUREMAPPER BAGOFWORDSENCODER (:FEATUREMAPPER)
 [reader] ENCODEDFEATURETEST BAGOFWORDSENCODER (:ENCODEDFEATURETEST = #'
EQL
)
 [reader] ENCODEDFEATURETYPE BAGOFWORDSENCODER (:ENCODEDFEATURETYPE =
T
)
 [reader] BAGOFWORDSKIND BAGOFWORDSENCODER (:KIND =
:BINARY
)