# MAT Manual

## 1 mgl-mat ASDF System Details

• Version: 0.1.0
• Description: MAT is library for working with multi-dimensional arrays which supports efficient interfacing to foreign and CUDA code. BLAS and CUBLAS bindings are available.
• Licence: MIT, see COPYING.
• Author: Gábor Melis
• Mailto: mega@retes.hu
• Homepage: http://quotenil.com

## 2 Introduction

### 2.1 What's MGL-MAT?

MGL-MAT is library for working with multi-dimensional arrays which supports efficient interfacing to foreign and CUDA code with automatic translations between cuda, foreign and lisp storage. BLAS and CUBLAS bindings are available.

### 2.2 What kind of matrices are supported?

Currently only row-major single and double float matrices are supported, but it would be easy to add single and double precision complex types too. Other numeric types, such as byte and native integer, can be added too, but they are not supported by CUBLAS. There are no restrictions on the number of dimensions, and reshaping is possible. All functions operate on the visible portion of the matrix (which is subject to displacement and shaping), invisible elements are not affected.

### 2.3 Where to Get it?

All dependencies are in quicklisp except for CL-CUDA that needs to be fetched from github. Just clone CL-CUDA and MGL-MAT into quicklisp/local-projects/ and you are set. MGL-MAT itself lives at github, too.

Prettier-than-markdown HTML documentation cross-linked with other libraries is available as part of PAX World.

## 3 Tutorial

We are going to see how to create matrices, access their contents.

Creating matrices is just like creating lisp arrays:

(make-mat '6)
==> #<MAT 6 A #(0.0d0 0.0d0 0.0d0 0.0d0 0.0d0 0.0d0)>

(make-mat '(2 3) :ctype :float :initial-contents '((1 2 3) (4 5 6)))
==> #<MAT 2x3 AB #2A((1.0 2.0 3.0) (4.0 5.0 6.0))>

(make-mat '(2 3 4) :initial-element 1)
==> #<MAT 2x3x4 A #3A(((1.0d0 1.0d0 1.0d0 1.0d0)
-->                    (1.0d0 1.0d0 1.0d0 1.0d0)
-->                    (1.0d0 1.0d0 1.0d0 1.0d0))
-->                   ((1.0d0 1.0d0 1.0d0 1.0d0)
-->                    (1.0d0 1.0d0 1.0d0 1.0d0)
-->                    (1.0d0 1.0d0 1.0d0 1.0d0)))>

The most prominent difference from lisp arrays is that MATs are always numeric and their element type (called CTYPE here) defaults to :DOUBLE.

Individual elements can be accessed or set with MREF:

(let ((m (make-mat '(2 3))))
(setf (mref m 0 0) 1)
(setf (mref m 0 1) (* 2 (mref m 0 0)))
(incf (mref m 0 2) 4)
m)
==> #<MAT 2x3 AB #2A((1.0d0 2.0d0 4.0d0) (0.0d0 0.0d0 0.0d0))>

Compared to AREF MREF is a very expensive operation and it's best used sparingly. Instead, typical code relies much more on matrix level operations:

(princ (scal! 2 (fill! 3 (make-mat 4))))
.. #<MAT 4 BF #(6.0d0 6.0d0 6.0d0 6.0d0)>
==> #<MAT 4 ABF #(6.0d0 6.0d0 6.0d0 6.0d0)>

The content of a matrix can be accessed in various representations called facets. MGL-MAT takes care of synchronizing these facets by copying data around lazily, but doing its best to share storage for facets that allow it.

Notice the ABF in the printed results. It illustrates that behind the scenes FILL! worked on the BACKING-ARRAY facet (hence the B) that's basically a 1d lisp array. SCAL! on the other hand made a foreign call to the BLAS dscal function for which it needed the FOREIGN-ARRAY facet (F). Finally, the A stands for the ARRAY facet that was created when the array was printed. All facets are up-to-date (else some of the characters would be lowercase). This is possible because these three facets actually share storage which is never the case for the CUDA-ARRAY facet. Now if we have a CUDA-capable GPU, CUDA can be enabled with WITH-CUDA*:

(with-cuda* ()
(princ (scal! 2 (fill! 3 (make-mat 4)))))
.. #<MAT 4 C #(6.0d0 6.0d0 6.0d0 6.0d0)>
==> #<MAT 4 A #(6.0d0 6.0d0 6.0d0 6.0d0)>

Note the lonely C showing that only the CUDA-ARRAY facet was used for both FILL! and SCAL!. When WITH-CUDA* exits and destroys the CUDA context, it destroys all CUDA facets, moving their data to the ARRAY facet, so the returned MAT only has that facet.

When there is no high-level operation that does what we want, we may need to add new operations. This is usually best accomplished by accessing one of the facets directly, as in the following example:

(defun logdet (mat)
"Logarithm of the determinant of MAT. Return -1, 1 or 0 (or
equivalent) to correct for the sign, as the second value."
(with-facets ((array (mat 'array :direction :input)))
(lla:logdet array)))


Notice that LOGDET doesn't know about CUDA at all. WITH-FACETS gives it the content of the matrix as a normal multidimensional lisp array, copying the data from the GPU or elsewhere if necessary. This allows new representations (FACETs) to be added easily and it also avoids copying if the facet is already up-to-date. Of course, adding CUDA support to LOGDET could make it more efficient.

Adding support for matrices that, for instance, live on a remote machine is thus possible with a new facet type and existing code would continue to work (albeit possibly slowly). Then one could optimize the bottleneck operations by sending commands over the network instead of copying data.

It is a bad idea to conflate resource management policy and algorithms. MGL-MAT does its best to keep them separate.

## 4 Basics

• A MAT is a data CUBE that is much like a lisp array, it supports DISPLACEMENT, arbitrary DIMENSIONS and INITIAL-ELEMENT with the usual semantics. However, a MAT supports different representations of the same data. See Tutorial for an introduction.

• [reader] MAT-DISPLACEMENT MAT (:DISPLACEMENT = 0)

A value in the [0,MAX-SIZE] interval. This is like the DISPLACED-INDEX-OFFSET of a lisp array, but displacement is relative to the start of the underlying storage vector.

Like ARRAY-DIMENSIONS. It holds a list of dimensions, but it is allowed to pass in scalars too.

• [function] MAT-DIMENSION MAT AXIS-NUMBER

Return the dimension along AXIS-NUMBER. Similar to ARRAY-DIMENSION.

• [reader] MAT-INITIAL-ELEMENT MAT (:INITIAL-ELEMENT = 0)

If non-nil, then when a facet is created, it is filled with INITIAL-ELEMENT coerced to the appropriate numeric type. If NIL, then no initialization is performed.

• The number of elements in the visible portion of the array. This is always the product of the elements MAT-DIMENSIONS and is similar to ARRAY-TOTAL-SIZE.

The number of elements for which storage may be allocated. This is DISPLACEMENT + MAT-SIZE + SLACK where SLACK is the number of trailing invisible elements.

• [function] REPLACE! MAT SEQ-OF-SEQS

Replace the contents of MAT with the elements of SEQ-OF-SEQS. SEQ-OF-SEQS is a nested sequence of sequences similar to the INITIAL-CONTENTS argument of MAKE-ARRAY. The total number of elements must match the size of MAT. Returns MAT.

SEQ-OF-SEQS may contain multi-dimensional arrays as leafs, so the following is legal:

(replace! (make-mat '(1 2 3)) '(#2A((1 2 3) (4 5 6))))
==> #<MAT 1x2x3 AB #3A(((1.0d0 2.0d0 3.0d0) (4.0d0 5.0d0 6.0d0)))>

## 5 Element types

• This is basically (MEMBER :FLOAT :DOUBLE).

• By default MATs are created with this ctype. One of :FLOAT or :DOUBLE.

## 6 Printing

• Controls whether the contents of a MAT object are printed as an array (subject to the standard printer control variables).

## 7 Shaping

We are going to discuss various ways to change the visible portion and dimensions of matrices. Conceptually a matrix has an underlying non-displaced storage vector. For (MAKE-MAT 10 :DISPLACEMENT 7 :MAX-SIZE 21) this underlying vector looks like this:

displacement | visible elements  | slack
. . . . . . . 0 0 0 0 0 0 0 0 0 0 . . . .


Whenever a matrix is reshaped (or displaced to in lisp terminology), its displacement and dimensions change but the underlying vector does not.

The rules for accessing displaced matrices is the same as always: multiple readers can run in parallel, but attempts to write will result in an error if there are either readers or writers on any of the matrices that share the same underlying vector.

### 7.1 Comparison to Lisp Arrays

One way to reshape and displace MAT objects is with MAKE-MAT and its DISPLACED-TO argument whose semantics are similar to that of MAKE-ARRAY in that the displacement is relative to the displacement of DISPLACED-TO.

(let* ((base (make-mat 10 :initial-element 5 :displacement 1))
(mat (make-mat 6 :displaced-to base :displacement 2)))
(fill! 1 mat)
(values base mat))
==> #<MAT 1+10+0 A #(5.0d0 5.0d0 1.0d0 1.0d0 1.0d0 1.0d0 1.0d0 1.0d0 5.0d0
-->                  5.0d0)>
==> #<MAT 3+6+2 AB #(1.0d0 1.0d0 1.0d0 1.0d0 1.0d0 1.0d0)>

There are important semantic differences compared to lisp arrays all which follow from the fact that displacement operates on the underlying conceptual non-displaced vector.

• Matrices can be displaced and have slack even without DISPLACED-TO just like BASE in the above example.

• It's legal to alias invisible elements of DISPLACED-TO as long as the new matrix fits into the underlying storage.

• Negative displacements are allowed with DISPLACED-TO as long as the adjusted displacement is non-negative.

• Further shaping operations can make invisible portions of the DISPLACED-TO matrix visible by changing the displacement.

• In contrast to ARRAY-DISPLACEMENT, MAT-DISPLACEMENT only returns an offset into the underlying storage vector.

### 7.2 Functional Shaping

The following functions are collectively called the functional shaping operations, since they don't alter their arguments in any way. Still, since storage is aliased modification to the returned matrix will affect the original.

• [function] RESHAPE-AND-DISPLACE MAT DIMENSIONS DISPLACEMENT

Return a new matrix of DIMENSIONS that aliases MAT's storage at offset DISPLACEMENT. DISPLACEMENT 0 is equivalent to the start of the storage of MAT regardless of MAT's displacement.

• [function] RESHAPE MAT DIMENSIONS

Return a new matrix of DIMENSIONS whose displacement is the same as the displacement of MAT.

• [function] DISPLACE MAT DISPLACEMENT

Return a new matrix that aliases MAT's storage at offset DISPLACEMENT. DISPLACEMENT 0 is equivalent to the start of the storage of MAT regardless of MAT's displacement. The returned matrix has the same dimensions as MAT.

### 7.3 Destructive Shaping

The following destructive operations don't alter the contents of the matrix, but change what is visible. ADJUST! is the odd one out, it may create a new MAT.

• [function] RESHAPE-AND-DISPLACE! MAT DIMENSIONS DISPLACEMENT

Change the visible (or active) portion of MAT by altering its displacement offset and dimensions. Future operations will only affect this visible portion as if the rest of the elements were not there. Return MAT.

DISPLACEMENT + the new size must not exceed MAT-MAX-SIZE. Furthermore, there must be no facets being viewed (with WITH-FACETS) when calling this function as the identity of the facets is not stable.

• Reshape the 2d MAT to make only a single ROW visible. This is made possible by the row-major layout, hence no column counterpart. Return MAT.

• [macro] WITH-SHAPE-AND-DISPLACEMENT (MAT &OPTIONAL (DIMENSIONS NIL DIMENSIONSP) (DISPLACEMENT NIL DISPLACEMENTP)) &BODY BODY

Reshape and displace MAT if DIMENSIONS and/or DISPLACEMENT is given and restore the original shape and displacement after BODY is executed. If neither is specificed, then nothing will be changed, but BODY is still allowed to alter the shape and displacement.

## 8 Assembling

The functions here assemble a single MAT from a number of MATs.

• [function] STACK! AXIS MATS MAT

Stack MATS along AXIS into MAT and return MAT. If AXIS is 0, place MATS into MAT below each other starting from the top. If AXIS is 1, place MATS side by side starting from the left. Higher AXIS are also supported. All dimensions except for AXIS must be the same for all MATS.

## 9 Caching

Allocating and initializing a MAT object and its necessary facets can be expensive. The following macros remember the previous value of a binding in the same thread and /place/. Only weak references are constructed so the cached objects can be garbage collected.

While the cache is global, thread safety is guaranteed by having separate subcaches per thread. Each subcache is keyed by a /place/ object that's either explicitly specified or else is unique to each invocation of the caching macro, so different occurrences of caching macros in the source never share data. Still, recursion could lead to data sharing between different invocations of the same function. To prevent this, the cached object is removed from the cache while it is used so other invocations will create a fresh one which isn't particularly efficient but at least it's safe.

• [macro] WITH-THREAD-CACHED-MAT (VAR DIMENSIONS &REST ARGS &KEY (PLACE :SCRATCH) (CTYPE '*DEFAULT-MAT-CTYPE*) (DISPLACEMENT 0) MAX-SIZE (INITIAL-ELEMENT 0) INITIAL-CONTENTS) &BODY BODY

Bind VAR to a matrix of DIMENSIONS, CTYPE, etc. Cache this matrix, and possibly reuse it later by reshaping it. When BODY exits the cached object is updated with the binding of VAR which BODY may change.

There is a separate cache for each thread and each PLACE (under EQ). Since every cache holds exactly one MAT per CTYPE, nested WITH-THREAD-CACHED-MAT often want to use different PLACEs. By convention, these places are called :SCRATCH-1, :SCRATCH-2, etc.

• [macro] WITH-THREAD-CACHED-MATS SPECS &BODY BODY

A shorthand for writing nested WITH-THREAD-CACHED-MAT calls.

(with-thread-cached-mat (a ...)
...))

is equivalent to:

(with-thread-cached-mat ((a ...)
(b ...))
...)

• [macro] WITH-ONES (VAR DIMENSIONS &KEY (CTYPE '*DEFAULT-MAT-CTYPE*) (PLACE :ONES)) &BODY BODY

Bind VAR to a matrix of DIMENSIONS whose every element is 1. The matrix is cached for efficiency.

## 10 BLAS Operations

Only some BLAS functions are implemented, but it should be easy to add more as needed. All of them default to using CUDA, if it is initialized and enabled (see USE-CUDA-P).

Level 1 BLAS operations

• [function] AXPY! ALPHA X Y &KEY (N (MAT-SIZE X)) (INCX 1) (INCY 1)

Set Y to ALPHA * X + Y. Return Y.

Level 3 BLAS operations

• [function] GEMM! ALPHA A B BETA C &KEY TRANSPOSE-A? TRANSPOSE-B? M N K LDA LDB LDC

Basically C = ALPHA * A' * B' + BETA * C. A' is A or its transpose depending on TRANSPOSE-A?. B' is B or its transpose depending on TRANSPOSE-B?. Returns C.

A' is an MxK matrix. B' is a KxN matrix. C is an MxN matrix.

LDA is the width of the matrix A (not of A'). If A is not transposed, then K <= LDA, if it's transposed then M <= LDA.

LDB is the width of the matrix B (not of B'). If B is not transposed, then N <= LDB, if it's transposed then K <= LDB.

In the example below M=3, N=2, K=5, LDA=6, LDB=3, LDC=4. The cells marked with + do not feature in the calculation.

           N
--+
--+
K -B+
--+
--+
+++
K
-----+  --++
M --A--+  -C++
-----+  --++
++++++  ++++


## 11 Destructive API

• [function] .EXPT! X POWER

Raise matrix X to POWER in an elementwise manner. Return X. Note that CUDA and non-CUDA implementations may disagree on the treatment of NaNs, infinities and complex results. In particular, the lisp implementation always computes the REALPART of the results while CUDA's pow() returns NaNs instead of complex numbers.

• [function] .+! ALPHA X

Add the scalar ALPHA to each element of X destructively modifying X. Return X.

• [function] GEERV! ALPHA A X BETA B

GEneric Elementwise Row - Vector multiplication. B = beta * B + alpha * a * diag(x). In other words, perform elementwise multiplication on each row of A with the vector X and add the scaled result to the corresponding row of B. Return B. This is not a standard BLAS routine.

• For each element of X and Y set Y to 1 if the element in Y is greater than the element in X, and to 0 otherwise. Return Y.

• [function] .MIN! ALPHA X

Set each element of X to ALPHA if it's greater than ALPHA. Return X.

• [function] .MAX! ALPHA X

Set each element of X to ALPHA if it's less than ALPHA. Return X.

• [function] ADD-SIGN! ALPHA A BETA B

Add the elementwise sign (-1, 0 or 1 for negative, zero and positive numbers respectively) of A times ALPHA to BETA * B. Return B.

• [function] SUM! X Y &KEY AXIS (ALPHA 1) (BETA 0)

Sum matrix X along AXIS and add ALPHA * SUMS to BETA * Y destructively modifying Y. Return Y. On a 2d matrix (nothing else is supported currently), if AXIS is 0, then columns are summed, if AXIS is 1 then rows are summed.

• [function] SCALE-ROWS! SCALES A &KEY (RESULT A)

Set RESULT to DIAG(SCALES)*A and return it. A is an MxN matrix, SCALES is treated as a length M vector.

• [function] SCALE-COLUMNS! SCALES A &KEY (RESULT A)

Set RESULT to A*DIAG(SCALES) and return it. A is an MxN matrix, SCALES is treated as a length N vector.

Finally, some neural network operations.

• [function] CONVOLVE! X W Y &KEY START STRIDE ANCHOR BATCHED

Y = Y + conv(X, W) and return Y. If BATCHED, then the first dimension of X and Y is the number of elements in the batch (B), else B is assumed to be 1. The rest of the dimensions encode the input (X) and output (Y} N dimensional feature maps. START, STRIDE and ANCHOR are lists of length N. START is the multi-dimensional index of the first element of the input feature map (for each element in the batch) for which the convolution must be computed. Then (ELT STRIDE (- N 1)) is added to the last element of START and so on until (ARRAY-DIMENSION X 1) is reached. Then the last element of START is reset, (ELT STRIDE (- N 2)) is added to the first but last element of START and we scan the last dimension again. Take a 2d example, START is (0 0), STRIDE is (1 2), and X is a B*2x7 matrix.

W is:

1 2 1
2 4 2
1 2 1


and ANCHOR is (1 1) which refers to the element of W whose value is 4. This anchor point of W is placed over elements of X whose multi dimensional index is in numbers in this figure (only one element in the batch is shown):

0,0 . 0,2 . 0,4 . 0,6
1,0 . 1,2 . 1,4 . 1,6


When applying W at position P of X, the convolution is the sum of the products of overlapping elements of X and W when W's ANCHOR is placed at P. Elements of W over the edges of X are multiplied with 0 so are effectively ignored. The order of application of W to positions defined by START, STRIDE and ANCHOR is undefined.

Y must be a B*2x4 (or 2x4 if not BATCHED) matrix in this example, just large enough to hold the results of the convolutions.

• [function] DERIVE-CONVOLVE! X XD W WD YD &KEY START STRIDE ANCHOR BATCHED

Add the dF/dX to XD and and dF/dW to WD where YD is dF/dY for some function F where Y is the result of convolution with the same arguments.

• [function] DERIVE-MAX-POOL! X XD Y YD &KEY START STRIDE ANCHOR BATCHED POOL-DIMENSIONS

Add the dF/dX to XD and and dF/dW to WD where YD is dF/dY for some function F where Y is the result of MAX-POOL! with the same arguments.

## 12 Non-destructive API

• Return a copy of the active portion with regards to displacement and shape of A.

• [function] COPY-ROW A ROW

Return ROW of A as a new 1d matrix.

• Return the first element of A. A must be of size 1.

• [function] SCALAR-AS-MAT X &KEY (CTYPE (LISP->CTYPE (TYPE-OF X)))

Return a matrix of one dimension and one element: X. CTYPE, the type of the matrix, defaults to the ctype corresponding to the type of X.

• [function] M= A B

Check whether A and B, which must be matrices of the same size, are elementwise equal.

• Return the transpose of A.

• [function] M* A B &KEY TRANSPOSE-A? TRANSPOSE-B?

Compute op(A) * op(B). Where op is either the identity or the transpose operation depending on TRANSPOSE-A? and TRANSPOSE-B?.

• [function] MM* M &REST ARGS

Convenience function to multiply several matrices.

(mm* a b c) => a * b * c

• [function] M- A B

Return A - B.

• [function] M+ A B

Return A + B.

• Return the inverse of A.

• Logarithm of the determinant of MAT. Return -1, 1 or 0 (or equivalent) to correct for the sign, as the second value.

## 13 Mappings

• [function] MAP-CONCAT FN MATS MAT &KEY KEY PASS-RAW-P

Call FN with each element of MATS and MAT temporarily reshaped to the dimensions of the current element of MATS and return MAT. For the next element the displacement is increased so that there is no overlap.

MATS is keyed by KEY just like the CL sequence functions. Normally, FN is called with the matrix returned by KEY. However, if PASS-RAW-P, then the matrix returned by KEY is only used to calculate dimensions and the element of MATS that was passed to KEY is passed to FN, too.

(map-concat #'copy! (list (make-mat 2) (make-mat 4 :initial-element 1))
(make-mat '(2 3)))
==> #<MAT 2x3 AB #2A((0.0d0 0.0d0 1.0d0) (1.0d0 1.0d0 1.0d0))>

• [function] MAP-DISPLACEMENTS FN MAT DIMENSIONS &KEY (DISPLACEMENT-START 0) DISPLACEMENT-STEP

Call FN with MAT reshaped to DIMENSIONS, first displaced by DISPLACEMENT-START that's incremented by DISPLACEMENT-STEP each iteration while there are enough elements left for DIMENSIONS at the current displacement. Returns MAT.

(let ((mat (make-mat 14 :initial-contents '(-1 0 1 2 3
4 5 6 7
8 9 10 11 12))))
(reshape-and-displace! mat '(4 3) 1)
(map-displacements #'print mat 4))
..
.. #<MAT 1+4+9 B #(0.0d0 1.0d0 2.0d0 3.0d0)>
.. #<MAT 5+4+5 B #(4.0d0 5.0d0 6.0d0 7.0d0)>
.. #<MAT 9+4+1 B #(8.0d0 9.0d0 10.0d0 11.0d0)> 

• [function] MAP-MATS-INTO RESULT-MAT FN &REST MATS

Like CL:MAP-INTO but for MAT objects. Destructively modifies RESULT-MAT to contain the results of applying FN to each element in the argument MATS in turn.

## 14 Random numbers

Unless noted these work efficiently with CUDA.

• Return a copy of STATE be it a lisp or cuda random state.

• [function] UNIFORM-RANDOM! MAT &KEY (LIMIT 1)

Fill MAT with random numbers sampled uniformly from the [0,LIMIT) interval of MAT's type.

• [function] GAUSSIAN-RANDOM! MAT &KEY (MEAN 0) (STDDEV 1)

Fill MAT with independent normally distributed random numbers with MEAN and STDDEV.

• [function] MV-GAUSSIAN-RANDOM &KEY MEANS COVARIANCES

Return a column vector of samples from the multivariate normal distribution defined by MEANS (Nx1) and COVARIANCES (NxN). No CUDA implementation.

• [function] ORTHOGONAL-RANDOM! M &KEY (SCALE 1)

Fill the matrix M with random values in such a way that M^T * M is the identity matrix (or something close if M is wide). Return M.

## 16 Debugging

The largest class of bugs has to do with synchronization of facets being broken. This is almost always caused by an operation that mispecifies the DIRECTION argument of WITH-FACET. For example, the matrix argument of SCAL! should be accessed with direciton :IO. But if it's :INPUT instead, then subsequent access to the ARRAY facet will not see the changes made by AXPY!, and if it's :OUTPUT, then any changes made to the ARRAY facet since the last update of the CUDA-ARRAY facet will not be copied and from the wrong input SCAL! will compute the wrong result.

Using the SLIME inspector or trying to access the CUDA-ARRAY facet from threads other than the one in which the corresponding CUDA context was initialized will fail. For now, the easy way out is to debug the code with CUDA disabled (see *CUDA-ENABLED*).

Another thing that tends to come up is figuring out where memory is used.

• [function] MAT-ROOM &KEY (STREAM *STANDARD-OUTPUT*) (VERBOSE T)

• [macro] WITH-MAT-COUNTERS (&KEY COUNT N-BYTES) &BODY BODY

Count all MAT allocations and also the number of bytes they may require. May require here really means an upper bound, because (MAKE-MAT (EXPT 2 60)) doesn't actually uses memory until one of its facets is accessed (don't simply evaluate it though, printing the result will access the ARRAY facet if *PRINT-MAT*). Also, while facets today all require the same number of bytes, this may change in the future. This is a debugging tool, don't use it in production.

(with-mat-counters (:count count :n-bytes n-bytes)
(assert (= count 0))
(assert (= n-bytes 0))
(make-mat '(2 3) :ctype :double)
(assert (= count 1))
(assert (= n-bytes (* 2 3 8)))
(with-mat-counters (:n-bytes n-bytes-1 :count count-1)
(make-mat '7 :ctype :float)
(assert (= count-1 1))
(assert (= n-bytes-1 (* 7 4))))
(assert (= n-bytes (+ (* 2 3 8) (* 7 4))))
(assert (= count 2)))


## 17 Facet API

### 17.1 Facets

A MAT is a CUBE (see Cube Manual) whose facets are different representations of numeric arrays. These facets can be accessed with WITH-FACETS with one of the following FACET-NAME locatives:

• The corresponding facet's value is a one dimensional lisp array or a static vector that also looks exactly like a lisp array but is allocated in foreign memory. See *FOREIGN-ARRAY-STRATEGY*.

• The facet's value is a CUDA-ARRAY which is an OFFSET-POINTER wrapping a CL-CUDA.DRIVER-API:CU-DEVICE-PTR, allocated with CL-CUDA.DRIVER-API:CU-MEM-ALLOC and freed automatically.

Facets bound by with WITH-FACETS are to be treated as dynamic extent: it is not allowed to keep a reference to them beyond the dynamic scope of WITH-FACETS.

For example, to implement the FILL! operation using only the BACKING-ARRAY, one could do this:

(let ((displacement (mat-displacement x))
(size (mat-size x)))
(with-facets ((x* (x 'backing-array :direction :output)))
(fill x* 1 :start displacement :end (+ displacement size))))

DIRECTION is :OUTPUT because we clobber all values in X. Armed with this knowledge about the direction, WITH-FACETS will not copy data from another facet if the backing array is not up-to-date.

To transpose a 2d matrix with the ARRAY facet:

(destructuring-bind (n-rows n-columns) (mat-dimensions x)
(with-facets ((x* (x 'array :direction :io)))
(dotimes (row n-rows)
(dotimes (column n-columns)
(setf (aref x* row column) (aref x* column row))))))

Note that DIRECTION is :IO, because we need the data in this facet to be up-to-date (that's the input part) and we are invalidating all other facets by changing values (that's the output part).

To sum the values of a matrix using the FOREIGN-ARRAY facet:

(let ((sum 0))
(with-facets ((x* (x 'foreign-array :direction :input)))
(let ((pointer (offset-pointer x*)))
(loop for index below (mat-size x)
do (incf sum (cffi:mem-aref pointer (mat-ctype x) index)))))
sum)

See DIRECTION for a complete description of :INPUT, :OUTPUT and :IO. For MAT objects, that needs to be refined. If a MAT is reshaped and/or displaced in a way that not all elements are visible then those elements are always kept intact and copied around. This is accomplished by turning :OUTPUT into :IO automatically on such MATs.

We have finished our introduction to the various facets. It must be said though that one can do anything without ever accessing a facet directly or even being aware of them as most operations on MATs take care of choosing the most appropriate facet behind the scenes. In particular, most operations automatically use CUDA, if available and initialized. See WITH-CUDA* for detail.

### 17.2 Foreign arrays

One facet of MAT objects is FOREIGN-ARRAY which is backed by a memory area that can be a pinned lisp array or is allocated in foreign memory depending on *FOREIGN-ARRAY-STRATEGY*.

• [variable] *FOREIGN-ARRAY-STRATEGY* "-see below-"

One of :PINNED, :STATIC and :CUDA-HOST (see type FOREIGN-ARRAY-STRATEGY). This variable controls how foreign arrays are handled and it can be changed at any time.

If it's :PINNED (only supported if (PINNING-SUPPORTED-P), then no separate storage is allocated for the foreign array. Instead, it aliases the lisp array (via the BACKING-ARRAY facet).

If it's :STATIC, then the lisp backing arrays are allocated statically via the static-vectors library. On some implementations, explicit freeing of static vectors is necessary, this is taken care of by finalizers or can be controlled with WITH-FACET-BARRIER. DESTROY-CUBE and DESTROY-FACET may also be of help.

:CUDA-HOST is the same as :STATIC, but any copies to/from the GPU (i.e. the CUDA-ARRAY facet) will be done via the CUDA-HOST-ARRAY facet whose memory pages will also be locked and registered with cuMemHostRegister which allows quicker and asynchronous copying to and from CUDA land.

The default is :PINNED if available, because it's the most efficient. If pinning is not available, then it's :STATIC.

• Return true iff the lisp implementation efficiently supports pinning lisp arrays. Pinning ensures that the garbage collector doesn't move the array in memory. Currently this is only supported on SBCL gencgc platforms.

• [function] FOREIGN-ROOM &KEY (STREAM *STANDARD-OUTPUT*) (VERBOSE T)

Print a summary of foreign memory usage to STREAM. If VERBOSE, make the output human easily readable, else try to present it in a very concise way. Sample output with VERBOSE:

Foreign memory usage:
foreign arrays: 450 (used bytes: 3,386,295,808)

The same data presented with VERBOSE false:

f: 450 (3,386,295,808)

### 17.3 CUDA

• [function] CUDA-AVAILABLE-P &KEY (DEVICE-ID 0)

Check that a cuda context is already in initialized in the current thread or a device with DEVICE-ID is available.

• Set or bind this to false to disable all use of cuda. If this is done from within WITH-CUDA, then cuda becomes temporarily disabled. If this is done from outside WITH-CUDA, then it changes the default values of the ENABLED argument of any future WITH-CUDA*s which turns off cuda initialization entirely.

• Incremented each time a host to device copy is performed. Bound to 0 by WITH-CUDA*. Useful for tracking down performance problems.

• Incremented each time a device to host copy is performed. Bound to 0 by WITH-CUDA*. Useful for tracking down performance problems.

#### 17.3.1 CUDA Memory Management

The GPU (called device in CUDA terminology) has its own memory and it can only perform computation on data in this device memory so there is some copying involved to and from main memory. Efficient algorithms often allocate device memory up front and minimize the amount of copying that has to be done by computing as much as possible on the GPU.

MGL-MAT reduces the cost of device of memory allocations by maintaining a cache of currently unused allocations from which it first tries to satisfy allocation requests. The total size of all the allocated device memory regions (be they in use or currently unused but cached) is never more than N-POOL-BYTES as specified in WITH-CUDA*. N-POOL-BYTES being NIL means no limit.

• [condition] CUDA-OUT-OF-MEMORY STORAGE-CONDITION

If an allocation request cannot be satisfied (either because of N-POOL-BYTES or physical device memory limits being reached), then CUDA-OUT-OF-MEMORY is signalled.

• [function] CUDA-ROOM &KEY (STREAM *STANDARD-OUTPUT*) (VERBOSE T)

When CUDA is in use (see USE-CUDA-P), print a summary of memory usage in the current CUDA context to STREAM. If VERBOSE, make the output human easily readable, else try to present it in a very concise way. Sample output with VERBOSE:

CUDA memory usage:
device arrays: 450 (used bytes: 3,386,295,808, pooled bytes: 1,816,657,920)
host arrays: 14640 (used bytes: 17,380,147,200)
host->device copies: 154,102,488, device->host copies: 117,136,434

The same data presented with VERBOSE false:

d: 450 (3,386,295,808 + 1,816,657,920), h: 14640 (17,380,147,200)
h->d: 154,102,488, d->h: 117,136,434

That's it about reducing the cost allocations. The other important performance consideration, minimizing the amount copying done, is very hard to do if the data doesn't fit in device memory which is often a very limited resource. In this case the next best thing is to do the copying concurrently with computation.

Also note that often the easiest thing to do is to prevent the use of CUDA (and consequently the creation of CUDA-ARRAY facets, and allocations). This can be done either by binding *CUDA-ENABLED* to NIL or by setting CUDA-ENABLED to NIL on specific matrices.

## 18 Writing Extensions

New operations are usually implemented in lisp, CUDA, or by calling a foreign function in, for instance, BLAS, CUBLAS, CURAND.

### 18.1 Lisp Extensions

• [macro] DEFINE-LISP-KERNEL (NAME &KEY (CTYPES '(:FLOAT :DOUBLE))) (&REST PARAMS) &BODY BODY

This is very much like DEFINE-CUDA-KERNEL but for normal lisp code. It knows how to deal with MAT objects and can define the same function for multiple CTYPES. Example:

(define-lisp-kernel (lisp-.+!)
((alpha single-float) (x :mat :input) (start-x index) (n index))
(loop for xi of-type index upfrom start-x
below (the! index (+ start-x n))
do (incf (aref x xi) alpha)))

Parameters are either of the form (<NAME> <LISP-TYPE) or (<NAME> :MAT <DIRECTION>). In the latter case, the appropriate CFFI pointer is passed to the kernel. <DIRECTION> is passed on to the WITH-FACET that's used to acquire the foreign array. Note that the return type is not declared.

Both the signature and the body are written as if for single floats, but one function is defined for each ctype in CTYPES by transforming types, constants and code by substituting them with their ctype equivalents. Currently this only means that one needs to write only one kernel for SINGLE-FLOAT and DOUBLE-FLOAT. All such functions get the declaration from *DEFAULT-LISP-KERNEL-DECLARATIONS*.

Finally, a dispatcher function with NAME is defined which determines the ctype of the MAT objects passed for :MAT typed parameters. It's an error if they are not of the same type. Scalars declared SINGLE-FLOAT are coerced to that type and the appropriate kernel is called.

### 18.2 CUDA Extensions

• [function] CHOOSE-1D-BLOCK-AND-GRID N MAX-N-WARPS-PER-BLOCK

Return two values, one suitable as the :BLOCK-DIM, the other as the :GRID-DIM argument for a cuda kernel call where both are one-dimensional (only the first element may be different from 1).

The number of threads in a block is a multiple of *CUDA-WARP-SIZE*. The number of blocks is between 1 and and *CUDA-MAX-N-BLOCKS*. This means that the kernel must be able handle any number of elements in each thread. For example, a strided kernel that adds a constant to each element of a length N vector looks like this:

(let ((stride (* block-dim-x grid-dim-x)))
(do ((i (+ (* block-dim-x block-idx-x) thread-idx-x)
(+ i stride)))
((>= i n))
(set (aref x i) (+ (aref x i) alpha))))

It is often the most efficient to have MAX-N-WARPS-PER-BLOCK around 4. Note that the maximum number of threads per block is limited by hardware (512 for compute capability < 2.0, 1024 for later versions), so *CUDA-MAX-N-BLOCKS* times MAX-N-WARPS-PER-BLOCK must not exceed that limit.

• [function] CHOOSE-2D-BLOCK-AND-GRID DIMENSIONS MAX-N-WARPS-PER-BLOCK

Return two values, one suitable as the :BLOCK-DIM, the other as the :GRID-DIM argument for a cuda kernel call where both are two-dimensional (only the first two elements may be different from 1).

The number of threads in a block is a multiple of *CUDA-WARP-SIZE*. The number of blocks is between 1 and and *CUDA-MAX-N-BLOCKS*. Currently - but this may change - the BLOCK-DIM-X is always *CUDA-WARP-SIZE* and GRID-DIM-X is always 1.

This means that the kernel must be able handle any number of elements in each thread. For example, a strided kernel that adds a constant to each element of a HEIGHT*WIDTH matrix looks like this:

(let ((id-x (+ (* block-dim-x block-idx-x) thread-idx-x))
(id-y (+ (* block-dim-y block-idx-y) thread-idx-y))
(stride-x (* block-dim-x grid-dim-x))
(stride-y (* block-dim-y grid-dim-y)))
(do ((row id-y (+ row stride-y)))
((>= row height))
(let ((i (* row width)))
(do ((column id-x (+ column stride-x)))
((>= column width))
(set (aref x i) (+ (aref x i) alpha))
(incf i stride-x)))))

• [function] CHOOSE-3D-BLOCK-AND-GRID DIMENSIONS MAX-N-WARPS-PER-BLOCK

Return two values, one suitable as the :BLOCK-DIM, the other as the :GRID-DIM argument for a cuda kernel call where both are two-dimensional (only the first two elements may be different from 1).

The number of threads in a block is a multiple of *CUDA-WARP-SIZE*. The number of blocks is between 1 and and *CUDA-MAX-N-BLOCKS*. Currently - but this may change - the BLOCK-DIM-X is always *CUDA-WARP-SIZE* and GRID-DIM-X is always 1.

This means that the kernel must be able handle any number of elements in each thread. For example, a strided kernel that adds a constant to each element of a THICKNESS * HEIGHT * WIDTH 3d array looks like this:

(let ((id-x (+ (* block-dim-x block-idx-x) thread-idx-x))
(id-y (+ (* block-dim-y block-idx-y) thread-idx-y))
(id-z (+ (* block-dim-z block-idx-z) thread-idx-z))
(stride-x (* block-dim-x grid-dim-x))
(stride-y (* block-dim-y grid-dim-y))
(stride-z (* block-dim-z grid-dim-z)))
(do ((plane id-z (+ plane stride-z)))
((>= plane thickness))
(do ((row id-y (+ row stride-y)))
((>= row height))
(let ((i (* (+ (* plane height) row)
width)))
(do ((column id-x (+ column stride-x)))
((>= column width))
(set (aref x i) (+ (aref x i) alpha))
(incf i stride-x))))))

• [macro] DEFINE-CUDA-KERNEL (NAME &KEY (CTYPES '(:FLOAT :DOUBLE))) (RETURN-TYPE PARAMS) &BODY BODY

This is an extended CL-CUDA:DEFKERNEL macro. It knows how to deal with MAT objects and can define the same function for multiple CTYPES. Example:

(define-cuda-kernel (cuda-.+!)
(void ((alpha float) (x :mat :input) (n int)))
(let ((stride (* block-dim-x grid-dim-x)))
(do ((i (+ (* block-dim-x block-idx-x) thread-idx-x)
(+ i stride)))
((>= i n))
(set (aref x i) (+ (aref x i) alpha)))))

The signature looks pretty much like in CL-CUDA:DEFKERNEL, but parameters can take the form of (<NAME> :MAT <DIRECTION>) too, in which case the appropriate CL-CUDA.DRIVER-API:CU-DEVICE-PTR is passed to the kernel. <DIRECTION> is passed on to the WITH-FACET that's used to acquire the cuda array.

Both the signature and the body are written as if for single floats, but one function is defined for each ctype in CTYPES by transforming types, constants and code by substituting them with their ctype equivalents. Currently this only means that one needs to write only one kernel for FLOAT and DOUBLE.

Finally, a dispatcher function with NAME is defined which determines the ctype of the MAT objects passed for :MAT typed parameters. It's an error if they are not of the same type. Scalars declared FLOAT are coerced to that type and the appropriate kernel is called.

#### 18.2.1 CUBLAS

In a WITH-CUDA* BLAS Operations will automatically use CUBLAS. No need to use these at all.

#### 18.2.2 CURAND

This the low level CURAND API. You probably want Random numbers instead.