MAT Manual

[in package MGL-MAT]

1 mgl-mat ASDF System Details

2 Introduction

2.1 What's MGL-MAT?

MGL-MAT is a library for working with multi-dimensional arrays. It supports efficient interfacing to foreign and CUDA code, with automatic translation between CUDA, foreign and lisp storage. BLAS and CUBLAS bindings are available.

2.2 What kind of matrices are supported?

Currently only row-major single and double float matrices are supported, but it would be easy to add single and double precision complex types too. Other numeric types, such as byte and native integer, can be added too, but they are not supported by CUBLAS. There are no restrictions on the number of dimensions, and reshaping is possible. All functions operate on the visible portion of the matrix (which is subject to displacement and shaping); invisible elements are not affected.

2.3 Where to Get it?

All dependencies are in quicklisp except for CL-CUDA, which needs to be fetched from GitHub. Just clone CL-CUDA and MGL-MAT into quicklisp/local-projects/ and you are set. MGL-MAT itself lives on GitHub, too.
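Once both repositories are cloned, loading the library might look like the following. This is only a sketch: it assumes Quicklisp is already installed and set up in the running lisp image.

```lisp
;; Assumes Quicklisp is installed and that CL-CUDA and MGL-MAT were
;; cloned into quicklisp/local-projects/ as described above.
(ql:quickload :mgl-mat)
```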

Prettier-than-markdown HTML documentation cross-linked with other libraries is available as part of PAX World.

3 Tutorial

We are going to see how to create matrices and access their contents.

Creating matrices is just like creating lisp arrays:

(make-mat '6)
==> #<MAT 6 A #(0.0d0 0.0d0 0.0d0 0.0d0 0.0d0 0.0d0)>

(make-mat '(2 3) :ctype :float :initial-contents '((1 2 3) (4 5 6)))
==> #<MAT 2x3 AB #2A((1.0 2.0 3.0) (4.0 5.0 6.0))>

(make-mat '(2 3 4) :initial-element 1)
==> #<MAT 2x3x4 A #3A(((1.0d0 1.0d0 1.0d0 1.0d0)
-->                    (1.0d0 1.0d0 1.0d0 1.0d0)
-->                    (1.0d0 1.0d0 1.0d0 1.0d0))
-->                   ((1.0d0 1.0d0 1.0d0 1.0d0)
-->                    (1.0d0 1.0d0 1.0d0 1.0d0)
-->                    (1.0d0 1.0d0 1.0d0 1.0d0)))>

The most prominent difference from lisp arrays is that MATs are always numeric and their element type (called CTYPE here) defaults to :DOUBLE.
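As a small illustration of element types, the sketch below reads back the CTYPE of a few matrices. It assumes MGL-MAT is loaded, that its exported symbols are accessible in the current package, and that *DEFAULT-MAT-CTYPE* still has its default value; CTYPE-DEMO is a made-up name.

```lisp
;; Assumes MGL-MAT is loaded and its symbols are accessible.
;; CTYPE-DEMO is a hypothetical helper, not part of the library.
(defun ctype-demo ()
  (list (mat-ctype (make-mat 3))                ; default element type
        (mat-ctype (make-mat 3 :ctype :float))  ; explicit single float
        ;; Rebinding the default affects MAKE-MAT calls in its dynamic scope.
        (let ((*default-mat-ctype* :float))
          (mat-ctype (make-mat 3)))))
```

With the default settings, (ctype-demo) should return (:DOUBLE :FLOAT :FLOAT).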

Individual elements can be accessed or set with MREF:

(let ((m (make-mat '(2 3))))
  (setf (mref m 0 0) 1)
  (setf (mref m 0 1) (* 2 (mref m 0 0)))
  (incf (mref m 0 2) 4)
  m)
==> #<MAT 2x3 AB #2A((1.0d0 2.0d0 4.0d0) (0.0d0 0.0d0 0.0d0))>

Compared to AREF, MREF is a very expensive operation and it's best used sparingly. Instead, typical code relies much more on matrix level operations:

(princ (scal! 2 (fill! 3 (make-mat 4))))
.. #<MAT 4 BF #(6.0d0 6.0d0 6.0d0 6.0d0)>
==> #<MAT 4 ABF #(6.0d0 6.0d0 6.0d0 6.0d0)>

The content of a matrix can be accessed in various representations called facets. MGL-MAT takes care of synchronizing these facets by copying data around lazily, but doing its best to share storage for facets that allow it.

Notice the ABF in the printed results. It illustrates that behind the scenes FILL! worked on the BACKING-ARRAY facet (hence the B), which is basically a 1d lisp array. SCAL!, on the other hand, made a foreign call to the BLAS dscal function, for which it needed the FOREIGN-ARRAY facet (F). Finally, the A stands for the ARRAY facet that was created when the array was printed. All facets are up-to-date (else some of the characters would be lowercase). This is possible because these three facets actually share storage, which is never the case for the CUDA-ARRAY facet.

Now if we have a CUDA-capable GPU, CUDA can be enabled with WITH-CUDA*:

(with-cuda* ()
  (princ (scal! 2 (fill! 3 (make-mat 4)))))
.. #<MAT 4 C #(6.0d0 6.0d0 6.0d0 6.0d0)>
==> #<MAT 4 A #(6.0d0 6.0d0 6.0d0 6.0d0)>

Note the lonely C showing that only the CUDA-ARRAY facet was used for both FILL! and SCAL!. When WITH-CUDA* exits and destroys the CUDA context, it destroys all CUDA facets, moving their data to the ARRAY facet, so the returned MAT only has that facet.

When there is no high-level operation that does what we want, we may need to add new operations. This is usually best accomplished by accessing one of the facets directly, as in the following example:

(defun logdet (mat)
  "Logarithm of the determinant of MAT. Return -1, 1 or 0 (or
  equivalent) to correct for the sign, as the second value."
  (with-facets ((array (mat 'array :direction :input)))
    (lla:logdet array)))

Notice that LOGDET doesn't know about CUDA at all. WITH-FACETS gives it the content of the matrix as a normal multidimensional lisp array, copying the data from the GPU or elsewhere if necessary. This allows new representations (FACETs) to be added easily and it also avoids copying if the facet is already up-to-date. Of course, adding CUDA support to LOGDET could make it more efficient.

Adding support for matrices that, for instance, live on a remote machine is thus possible with a new facet type and existing code would continue to work (albeit possibly slowly). Then one could optimize the bottleneck operations by sending commands over the network instead of copying data.

It is a bad idea to conflate resource management policy and algorithms. MGL-MAT does its best to keep them separate.

4 Basics

5 Element types

6 Printing

7 Shaping

We are going to discuss various ways to change the visible portion and dimensions of matrices. Conceptually a matrix has an underlying non-displaced storage vector. For (MAKE-MAT 10 :DISPLACEMENT 7 :MAX-SIZE 21) this underlying vector looks like this:

displacement | visible elements  | slack
. . . . . . . 0 0 0 0 0 0 0 0 0 0 . . . .

Whenever a matrix is reshaped (or displaced to in lisp terminology), its displacement and dimensions change but the underlying vector does not.
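The three regions of the diagram above can be read back from a MAT. The sketch below assumes MGL-MAT is loaded and its symbols are accessible; STORAGE-LAYOUT-DEMO is a made-up name.

```lisp
;; Assumes MGL-MAT is loaded and its symbols are accessible.
;; STORAGE-LAYOUT-DEMO is a hypothetical helper, not part of the library.
(defun storage-layout-demo ()
  (let ((m (make-mat 10 :displacement 7 :max-size 21)))
    (list (mat-displacement m)   ; elements skipped before the visible part
          (mat-size m)           ; number of visible elements
          (mat-max-size m))))    ; length of the underlying vector
```

Here (storage-layout-demo) returns (7 10 21), which leaves 21 - 7 - 10 = 4 elements of slack, matching the four trailing dots in the diagram.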

The rules for accessing displaced matrices are the same as always: multiple readers can run in parallel, but attempts to write will result in an error if there are either readers or writers on any of the matrices that share the same underlying vector.

7.1 Comparison to Lisp Arrays

One way to reshape and displace MAT objects is with MAKE-MAT and its DISPLACED-TO argument, whose semantics are similar to those of MAKE-ARRAY's in that the displacement is relative to the displacement of DISPLACED-TO.

(let* ((base (make-mat 10 :initial-element 5 :displacement 1))
       (mat (make-mat 6 :displaced-to base :displacement 2)))
  (fill! 1 mat)
  (values base mat))
==> #<MAT 1+10+0 A #(5.0d0 5.0d0 1.0d0 1.0d0 1.0d0 1.0d0 1.0d0 1.0d0 5.0d0
-->                  5.0d0)>
==> #<MAT 3+6+2 AB #(1.0d0 1.0d0 1.0d0 1.0d0 1.0d0 1.0d0)>

There are important semantic differences compared to lisp arrays, all of which follow from the fact that displacement operates on the underlying conceptual non-displaced vector.

7.2 Functional Shaping

The following functions are collectively called the functional shaping operations, since they don't alter their arguments in any way. Still, since storage is aliased, modifications to the returned matrix will affect the original.
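The aliasing can be seen in a small sketch using RESHAPE, one of the functional shaping operations. It assumes MGL-MAT is loaded and its symbols are accessible; RESHAPE-ALIASING-DEMO is a made-up name.

```lisp
;; Assumes MGL-MAT is loaded and its symbols are accessible.
;; RESHAPE-ALIASING-DEMO is a hypothetical helper, not part of the library.
(defun reshape-aliasing-demo ()
  (let* ((m (make-mat '(2 3) :initial-element 0))
         ;; RESHAPE returns a new MAT that views the same storage as M.
         (v (reshape m '(3 2))))
    (fill! 1 v)
    (list (mat-dimensions m)   ; M keeps its own shape
          (mat-dimensions v)
          (mref m 1 2))))      ; but sees the write made through V
```

The call (reshape-aliasing-demo) returns ((2 3) (3 2) 1.0d0): M was not reshaped, yet its contents changed.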

7.3 Destructive Shaping

The following destructive operations don't alter the contents of the matrix, but change what is visible. ADJUST! is the odd one out: it may create a new MAT.
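For contrast with the functional variants, here is a minimal sketch of RESHAPE! changing a matrix in place. It assumes MGL-MAT is loaded and its symbols are accessible; DESTRUCTIVE-SHAPING-DEMO is a made-up name.

```lisp
;; Assumes MGL-MAT is loaded and its symbols are accessible.
;; DESTRUCTIVE-SHAPING-DEMO is a hypothetical helper, not part of the library.
(defun destructive-shaping-demo ()
  (let ((m (make-mat '(2 3) :initial-contents '((1 2 3) (4 5 6)))))
    ;; RESHAPE! changes M itself; no new MAT is created.
    (reshape! m '(3 2))
    (list (mat-dimensions m)
          (mref m 2 1))))   ; row-major order is kept, so this is still 6
```

The result is ((3 2) 6.0d0): the dimensions changed, the stored elements did not.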

8 Assembling

The functions here assemble a single MAT from a number of MATs.
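As an illustration, the sketch below stacks two matrices with STACK along axis 0. It assumes MGL-MAT is loaded and its symbols are accessible; STACK-DEMO is a made-up name.

```lisp
;; Assumes MGL-MAT is loaded and its symbols are accessible.
;; STACK-DEMO is a hypothetical helper, not part of the library.
(defun stack-demo ()
  (let ((a (make-mat '(1 3) :initial-contents '((1 2 3))))
        (b (make-mat '(2 3) :initial-contents '((4 5 6) (7 8 9)))))
    ;; Axis 0 places the matrices below each other.
    (let ((stacked (stack 0 (list a b))))
      (list (mat-dimensions stacked)
            (mref stacked 2 0)))))
```

The one-row and two-row inputs give a 3x3 result, so (stack-demo) returns ((3 3) 7.0d0).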

9 Caching

Allocating and initializing a MAT object and its necessary facets can be expensive. The following macros remember the previous value of a binding in the same thread and /place/. Only weak references are constructed so the cached objects can be garbage collected.

While the cache is global, thread safety is guaranteed by having separate subcaches per thread. Each subcache is keyed by a /place/ object that's either explicitly specified or else is unique to each invocation of the caching macro, so different occurrences of caching macros in the source never share data. Still, recursion could lead to data sharing between different invocations of the same function. To prevent this, the cached object is removed from the cache while it is used, so other invocations will create a fresh one, which isn't particularly efficient, but at least it's safe.
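A typical use of the cache is a temporary matrix inside a function that is called often. The sketch below uses WITH-THREAD-CACHED-MAT for that; it assumes MGL-MAT is loaded and its symbols are accessible. SUM-OF-SQUARES and the :SUM-OF-SQUARES-TMP place are made-up names for illustration.

```lisp
;; Assumes MGL-MAT is loaded and its symbols are accessible.
;; SUM-OF-SQUARES is a hypothetical helper, not part of the library.
(defun sum-of-squares (x)
  ;; TMP comes from the per-thread cache when a suitable MAT is there,
  ;; else it is freshly allocated. The :PLACE key is arbitrary.
  (with-thread-cached-mat (tmp (mat-dimensions x)
                               :ctype (mat-ctype x)
                               :place :sum-of-squares-tmp)
    (copy! x tmp)     ; leave X untouched
    (.square! tmp)    ; elementwise square
    (asum tmp)))      ; sum of absolute values
```

For example, (sum-of-squares (make-mat 3 :initial-contents '(1 2 3))) evaluates to 14.0d0, and repeated calls in the same thread can reuse the temporary instead of allocating it again.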

10 BLAS Operations

Only some BLAS functions are implemented, but it should be easy to add more as needed. All of them default to using CUDA, if it is initialized and enabled (see USE-CUDA-P).

Level 1 BLAS operations

Level 3 BLAS operations
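To give a feel for both levels, here is a small sketch that calls AXPY! (level 1) and GEMM! (level 3). It assumes MGL-MAT is loaded with a working BLAS and that its symbols are accessible; BLAS-DEMO is a made-up name.

```lisp
;; Assumes MGL-MAT is loaded and its symbols are accessible.
;; BLAS-DEMO is a hypothetical helper, not part of the library.
(defun blas-demo ()
  (let ((x (make-mat 3 :initial-contents '(1 2 3)))
        (y (make-mat 3 :initial-element 1))
        (a (make-mat '(2 3) :initial-contents '((1 2 3) (4 5 6))))
        (b (make-mat '(3 2) :initial-contents '((1 0) (0 1) (1 1))))
        (c (make-mat '(2 2))))
    ;; Level 1: y := 2*x + y
    (axpy! 2 x y)
    ;; Level 3: c := 1*a*b + 0*c
    (gemm! 1 a b 0 c)
    ;; Convert to lisp arrays for easy inspection.
    (list (mat-to-array y) (mat-to-array c))))
```

The first array holds 3, 5, 7 and the second is the 2x2 product ((4 5) (10 11)), all as double floats.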

11 Destructive API

Finally, some neural network operations.

12 Non-destructive API

13 Mappings

14 Random numbers

Unless noted otherwise, these work efficiently with CUDA.
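A minimal sketch of filling matrices with random numbers, using UNIFORM-RANDOM! and GAUSSIAN-RANDOM!. It assumes MGL-MAT is loaded and its symbols are accessible; RANDOM-DEMO is a made-up name.

```lisp
;; Assumes MGL-MAT is loaded and its symbols are accessible.
;; RANDOM-DEMO is a hypothetical helper, not part of the library.
(defun random-demo ()
  (let ((u (make-mat 5))
        (g (make-mat 5)))
    (uniform-random! u :limit 2)               ; uniform in [0, 2)
    (gaussian-random! g :mean 10 :stddev 0.5)  ; normal with the given moments
    (values u g)))
```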

15 I/O

16 Debugging

The largest class of bugs has to do with synchronization of facets being broken. This is almost always caused by an operation that misspecifies the DIRECTION argument of WITH-FACET. For example, the matrix argument of SCAL! should be accessed with direction :IO. But if it's :INPUT instead, then subsequent access to the ARRAY facet will not see the changes made by SCAL!, and if it's :OUTPUT, then any changes made to the ARRAY facet since the last update of the CUDA-ARRAY facet will not be copied, and SCAL! will compute the wrong result from the wrong input.

Using the SLIME inspector or trying to access the CUDA-ARRAY facet from threads other than the one in which the corresponding CUDA context was initialized will fail. For now, the easy way out is to debug the code with CUDA disabled (see *CUDA-ENABLED*).

Another thing that tends to come up is figuring out where memory is used.
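One tool for this is the WITH-MAT-COUNTERS macro. The sketch below is written from memory of its interface, so treat the details as assumptions: the macro is assumed to bind the variables named by :COUNT and :N-BYTES within its body, and to count an upper bound on the bytes a MAT may need. It also assumes MGL-MAT is loaded and its symbols are accessible; COUNT-ALLOCATIONS is a made-up name.

```lisp
;; Assumes MGL-MAT is loaded and its symbols are accessible.
;; COUNT-ALLOCATIONS is a hypothetical helper, not part of the library.
(defun count-allocations ()
  ;; COUNT and N-BYTES are assumed to be bound by the macro itself.
  (with-mat-counters (:count count :n-bytes n-bytes)
    (make-mat '(2 3) :ctype :double)   ; 6 elements of 8 bytes
    (make-mat 4 :ctype :float)         ; 4 elements of 4 bytes
    (list count n-bytes)))
```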

17 Facet API

17.1 Facets

A MAT is a CUBE (see Cube Manual) whose facets are different representations of numeric arrays. These facets can be accessed with WITH-FACETS with one of the following FACET-NAME locatives:

Facets bound by WITH-FACETS are to be treated as having dynamic extent: it is not allowed to keep a reference to them beyond the dynamic scope of WITH-FACETS.

For example, to implement the FILL! operation using only the BACKING-ARRAY, one could do this:

(let ((displacement (mat-displacement x))
      (size (mat-size x)))
 (with-facets ((x* (x 'backing-array :direction :output)))
   (fill x* 1 :start displacement :end (+ displacement size))))

DIRECTION is :OUTPUT because we clobber all values in X. Armed with this knowledge about the direction, WITH-FACETS will not copy data from another facet if the backing array is not up-to-date.

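To transpose a square 2d matrix in place with the ARRAY facet:

;; In-place transposition only makes sense for a square matrix.
(destructuring-bind (n-rows n-columns) (mat-dimensions x)
  (assert (= n-rows n-columns))
  (with-facets ((x* (x 'array :direction :io)))
    (dotimes (row n-rows)
      ;; Swap each element above the diagonal with its mirror image.
      (loop for column from (1+ row) below n-columns
            do (rotatef (aref x* row column) (aref x* column row))))))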

Note that DIRECTION is :IO, because we need the data in this facet to be up-to-date (that's the input part) and we are invalidating all other facets by changing values (that's the output part).

To sum the values of a matrix using the FOREIGN-ARRAY facet:

(let ((sum 0))
  (with-facets ((x* (x 'foreign-array :direction :input)))
    (let ((pointer (offset-pointer x*)))
      (loop for index below (mat-size x)
            do (incf sum (cffi:mem-aref pointer (mat-ctype x) index)))))
  sum)

See DIRECTION for a complete description of :INPUT, :OUTPUT and :IO. For MAT objects, that needs to be refined. If a MAT is reshaped and/or displaced in a way that not all elements are visible then those elements are always kept intact and copied around. This is accomplished by turning :OUTPUT into :IO automatically on such MATs.

We have finished our introduction to the various facets. It must be said though that one can do anything without ever accessing a facet directly or even being aware of them, as most operations on MATs take care of choosing the most appropriate facet behind the scenes. In particular, most operations automatically use CUDA, if available and initialized. See WITH-CUDA* for details.

17.2 Foreign arrays

One facet of MAT objects is FOREIGN-ARRAY, which is backed by a memory area that is either a pinned lisp array or allocated in foreign memory, depending on *FOREIGN-ARRAY-STRATEGY*.

17.3 CUDA

17.3.1 CUDA Memory Management

The GPU (called device in CUDA terminology) has its own memory and it can only perform computation on data in this device memory so there is some copying involved to and from main memory. Efficient algorithms often allocate device memory up front and minimize the amount of copying that has to be done by computing as much as possible on the GPU.

MGL-MAT reduces the cost of device memory allocations by maintaining a cache of currently unused allocations from which it first tries to satisfy allocation requests. The total size of all the allocated device memory regions (be they in use or currently unused but cached) is never more than N-POOL-BYTES as specified in WITH-CUDA*. N-POOL-BYTES being NIL means no limit.
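Setting the pool limit might look like the sketch below. It assumes MGL-MAT is loaded and its symbols are accessible; the 1 GiB figure and the name SCAL-WITH-POOL-LIMIT are arbitrary choices for illustration. Without a CUDA device, WITH-CUDA* is expected to fall back on the non-CUDA code paths.

```lisp
;; Assumes MGL-MAT is loaded and its symbols are accessible.
;; SCAL-WITH-POOL-LIMIT is a hypothetical helper, not part of the library.
(defun scal-with-pool-limit ()
  ;; Cap allocated plus cached device memory at 1 GiB.
  (with-cuda* (:n-pool-bytes (expt 2 30))
    (scal! 2 (fill! 3 (make-mat 4)))))
```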

That's it about reducing the cost of allocations. The other important performance consideration, minimizing the amount of copying done, is very hard to do if the data doesn't fit in device memory, which is often a very limited resource. In this case the next best thing is to do the copying concurrently with computation.

Also note that often the easiest thing to do is to prevent the use of CUDA (and consequently the creation of CUDA-ARRAY facets, and allocations). This can be done either by binding *CUDA-ENABLED* to NIL or by setting CUDA-ENABLED to NIL on specific matrices.
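Both ways of opting out can be sketched as follows. The code assumes MGL-MAT is loaded and its symbols are accessible; SCAL-ON-CPU and MAKE-CPU-ONLY-MAT are made-up names.

```lisp
;; Assumes MGL-MAT is loaded and its symbols are accessible.
;; Both functions are hypothetical helpers, not part of the library.

;; Disable CUDA for everything in this dynamic scope.
(defun scal-on-cpu ()
  (let ((*cuda-enabled* nil))
    (scal! 2 (fill! 3 (make-mat 4)))))

;; Opt out a single matrix, at creation time or later.
(defun make-cpu-only-mat (dimensions)
  (let ((m (make-mat dimensions :cuda-enabled nil)))
    ;; Redundant here; shown because it also works on existing MATs.
    (setf (cuda-enabled m) nil)
    m))
```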

18 Writing Extensions

New operations are usually implemented in lisp, CUDA, or by calling a foreign function in, for instance, BLAS, CUBLAS, CURAND.

18.1 Lisp Extensions

18.2 CUDA Extensions

18.2.1 CUBLAS

Within WITH-CUDA*, BLAS Operations automatically use CUBLAS, so there is usually no need to use these bindings directly.

18.2.2 CURAND

This is the low-level CURAND API. You probably want Random numbers instead.