NIPS 2017. Summary, Thoughts, Photos (of research posters I liked).


NVIDIA Caffe fork install in Fedora 25 Workstation, CUDA 9, CUDNN 7, no nccl

So I was at the NVIDIA Deep Learning Institute Lunch & Labs at the NIPS 2017 (Neural Information Processing Systems) conference and the first lab was using Caffe with DIGITS.  I was sitting in front of a colleague who had experience with Caffe offline, and he suggested I install the NVIDIA fork of Caffe, NVCaffe.  I found it to be slightly easier to install than Berkeley’s Caffe, with its source here.   Nevertheless, installing Caffe is nontrivial.

My hardware specs/software setup:

  • Intel Xeon CPU E5-1650 v3 @ 3.50 GHz x 12 (cores)
  • NVIDIA GeForce GTX 980Ti (any hardware donation for a 1080Ti or Titan V would be greatly appreciated!)
  • 32 GB RAM, 1 TB hard drive
  • Fedora 25 Workstation (I did not upgrade to Fedora 26, 27 because its gcc 7 WILL NOT work with CUDA 9, because I already checked removing manually any version check flags; problem is incompatibility with the math libraries used for gcc 7 vs. CUDA 9)
  • I installed NVIDIA’s proprietary driver, CUDA 9, CUDNN 7 and followed their given installation instructions strictly (so I didn’t follow instructions given by others for driver, CUDA 9, CUDNN 7).
  • Other than those 3 (NVIDIA’s driver, CUDA 9, CUDNN 7), I only use dnf to install desired software packages.  No sudo pipNo Anaconda (Python).  Fedora 25 setup is about as “stock” as it can be.
  1. I went to NVCaffe, copied link given by clicking Download button and did, in my desired directory (usually ~/ )
    1. git clone
  2. I followed these instructions from Caffe’s RHEL / Fedora / CentOS Installation but modified for dnf:
    1. sudo dnf install protobuf-devel
    2. sudo dnf install leveldb-devel
    3. sudo dnf install snappy-devel
    4. sudo dnf install opencv-devel
    5. sudo dnf install boost-devel
    6. sudo dnf install hdf5-devel
    7. sudo dnf install gflags-devel
    8. sudo dnf install glog-devel
    9. sudo dnf install lmdb-devel
  3. For BLAS, I did
    1. sudo dnf install atlas-devel
  4. After doing all that in my administrator account, I went back to the user account I work in, and proceeded with compilation.  I did, in the first level directory containing the cloned NVCaffe,
    1. cp Makefile.config.example Makefile.config
    2. These are the changes I made to my Makefile.config.  I uncommented out the following lines:
      1.  USE_CUDNN := 1 # line 6
      2.  OPENCV_VERSION := 3 # line 24
    3. I did not uncomment out NCCL because I only have 1 GPU and I don’t know where exactly, in the installation of NCCL, following strictly NVIDIA’s given installation instructions, where to put symbolic links to NCCL to make it work.
  5. Then I did
    1. make all
    2. make test
    3. make runtest
    4. Each make took a long time and I should probably have done make all -j12 next time, since I had a multi-core CPU.  Here were some errors I came up with doing make all:
      1. NVCC src/caffe/util/
        NVCC src/caffe/
        AR -o .build_release/lib/libcaffe-nv.a
        LD -o .build_release/lib/
        /usr/bin/ld: cannot find -lnvidia-ml
        collect2: error: ld returned 1 exit status
        Makefile:601: recipe for target '.build_release/lib/' failed
        make: *** [.build_release/lib/] Error 1
        1. This appears to be related to CUDA 9 and where it puts (cf. cannot find -lnvidia-ml #1022 ).  I made a soft symbolic link to it in the same directory I found, /usr/lib64
          1. sudo ln -s
            1. I did this in /usr/lib64, where  I had followed the CUDA 9 instructions strictly from NVIDIA’s instructions directly from their website.  Not sure why I had to make this link.
      2. CXX/LD -o .build_release/examples/cifar10/convert_cifar_data.bin
        CXX examples/cpp_classification/classification.cpp
        CXX/LD -o .build_release/examples/cpp_classification/classification.bin
        /usr/bin/ld: .build_release/examples/cpp_classification/classification.o: undefined reference to symbol '_ZN2cv6imreadERKNS_6StringEi'
        /usr/lib64/ error adding symbols: DSO missing from command line
        collect2: error: ld returned 1 exit status
        Makefile:660: recipe for target '.build_release/examples/cpp_classification/classification.bin' failed
        make: *** [.build_release/examples/cpp_classification/classification.bin] Error 1
        1. This was caused by me forgetting to uncomment out OPENCV_VERSION := 3 # line 24 in Makefile.config to use the latest OpenCV.  Once I did, I didn’t get this error anymore.  cf. stackoverflow, Caffe: opencv error
    5. make test compiled without a hitch, other than these warnings I received:
      1. CXX src/caffe/test/test_im2col_layer.cpp
        CXX src/caffe/test/test_batch_norm_layer.cpp
        src/caffe/test/test_batch_norm_layer.cpp: In instantiation of ‘void caffe::CuDNNBatchNormLayerTest_TestForward_Test<gtest_TypeParam_>::TestBody() [with gtest_TypeParam_ = double]’:
        src/caffe/test/test_batch_norm_layer.cpp:241:1: required from here
        src/caffe/test/test_batch_norm_layer.cpp:217:9: warning: unused variable ‘mean’ [-Wunused-variable]
         Dtype mean, var;
        src/caffe/test/test_batch_norm_layer.cpp:217:15: warning: unused variable ‘var’ [-Wunused-variable]
         Dtype mean, var;
        src/caffe/test/test_batch_norm_layer.cpp: In instantiation of ‘void caffe::CuDNNBatchNormLayerTest_TestForward_Test<gtest_TypeParam_>::TestBody() [with gtest_TypeParam_ = float]’:
        src/caffe/test/test_batch_norm_layer.cpp:241:1: required from here
        src/caffe/test/test_batch_norm_layer.cpp:217:9: warning: unused variable ‘mean’ [-Wunused-variable]
         Dtype mean, var;
        src/caffe/test/test_batch_norm_layer.cpp:217:15: warning: unused variable ‘var’ [-Wunused-variable]
         Dtype mean, var;
        CXX src/caffe/test/test_flatten_layer.cpp
        CXX src/caffe/test/test_hdf5data_layer.cpp
    6. make runtest compiled without a hitch, passing all tests with Green, and being greeted by this glorious version check first:
      1. [topolo@localhost caffe]$ make runtest
        I1211 21:35:53.802299 4652 caffe.cpp:470] This is NVCaffe 0.16.4 started at Mon Dec 11 21:35:52 2017
        I1211 21:35:53.802570 4652 caffe.cpp:473] CuDNN version: 7003
        I1211 21:35:53.802579 4652 caffe.cpp:474] CuBLAS version: 9000
        I1211 21:35:53.802590 4652 caffe.cpp:475] CUDA version: 9000
        I1211 21:35:53.802598 4652 caffe.cpp:476] CUDA driver version: 9000
      2. ... [RUN ] RandomNumberGeneratorTest/1.TestRngUniformTimesBernoulli
        [ OK ] RandomNumberGeneratorTest/1.TestRngUniformTimesBernoulli (0 ms)
        [ RUN ] RandomNumberGeneratorTest/1.TestRngUniform
        [ OK ] RandomNumberGeneratorTest/1.TestRngUniform (1 ms)
        [----------] 18 tests from RandomNumberGeneratorTest/1 (50 ms total)
        [----------] Global test environment tear-down
        [==========] 2148 tests from 287 test cases ran. (972856 ms total)
        [ PASSED ] 2148 tests.

Sanity Check: Classifying ImageNet: using the C++ API

I figured we should do some actual machine learning as a sanity check.  First, I tried, in my directory with NVCaffe:

./scripts/ models/bvlc_reference_caffenet
This led to an error:

[topolo@localhost caffe]$ ./scripts/ models/bvlc_reference_caffenet
Traceback (most recent call last):
 File "./scripts/", line 5, in <module>
 import yaml
ImportError: No module named yaml

Because I hadn’t dnf installed PyYAML yet.  I did this in my administrator account:

sudo dnf install PyYAML
Last metadata expiration check: 0:54:05 ago on Mon Dec 11 21:48:11 2017.
Dependencies resolved.
 Package Arch Version Repository Size
 PyYAML x86_64 3.11-13.fc25 fedora 170 k
 libyaml x86_64 0.1.6-8.fc24 fedora 58 k

Transaction Summary
Install 2 Packages

which allowed me to redo the command

./scripts/ models/bvlc_reference_caffenet

so to use the pre-trained CaffeNet model with the classification example, downloading it from the “Model Zoo”.

I got the ImageNet labels file (also called the synset file), required to map a prediction to the name of the class by doing this:


Using the files that were downloaded, we can classify the provided cat image (examples/images/cat.jpg) using this command:

./build/examples/cpp_classification/classification.bin \
  models/bvlc_reference_caffenet/deploy.prototxt \
  models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel \
  data/ilsvrc12/imagenet_mean.binaryproto \
  data/ilsvrc12/synset_words.txt \

and I got

---------- Prediction for examples/images/cat.jpg ----------
0.3134 - "n02123045 tabby, tabby cat"
0.2380 - "n02123159 tiger cat"
0.1235 - "n02124075 Egyptian cat"
0.1003 - "n02119022 red fox, Vulpes vulpes"
0.0715 - "n02127052 lynx, catamount"

Further sanity check one can do is to run

watch -n1 nvidia-smi

in another terminal window and to run that command for classification.bin and see the GPU utilization go up and also be 1 of the processes with GPU Memory Usage in real-time.








Installation and getting things to work together: Fedora 25 Workstation (not 26), gcc 6 (not 7) for CUDA 9, CUDNN 7, not Tensorflow 1.3 (maybe Tensorflow 1.4?), not theano 1.0 (native CUDA backend no longer exists on theano =((((

I was on Fedora 23 Workstation and then beginning of last month, I decided to upgrade.  Fedora Linux’s upgrade cycle is “particular” (every half year or so, no LTS (long-term support) like Ubuntu Linux, it’s an entirely new kernel so for a “clean install” so that you’re not having an old version laying around, you’ll have to remove entirely the previous version (and your files, so I backed up before) and while this update cycle allows Fedora developers to try out and push for adoption of completely new features, it’s both a mini-heart attack and 2 day endeavor to update, so update both at your peril and infrequently, only when needed).

My priority of what I want to run was this, from most urgent, to last:

  1. Latest NVIDIA drivers for my GeForce GTX 980Ti (if anyone wants to do a hardware donation for a GTX 1080Ti, thanks in advance!  Looking at you, Titan Workstation computers, ASUS, EVGA, NVIDIA, …. =)))) )
  2. CUDA 9, CUDNN 7 (I want the latest versions of CUDA 9, CUDNN 7 because I write in CUDA C/C++14 and when I saw C++14 support for CUDA 9, CUDNN 7, I just had to have that).
    1. cuda-gdb – CUDA debugger, look at the actual memory addresses
  3. Latest gcc, g++, preferably gcc 7, g++ 7
  4. Latest Fedora Linux workstation, preferably 26, 25 is fine.
  5. Most stable version of Tensorflow, preferably not a release candidate, but official release available on pip, through virtualenv install.
  6. Most stable, latest official, version of theano, from only a pip install, –user level only (not system wide; I just learned this, to NOT sudo pip install.

NVIDIA driver version 380.90 for GeForce GTX 980Ti, Sept. 21, 2017, Linux 64, Manually find yours.

I always want the latest NVIDIA driver for my GPU.  Watch any of the many Linus Tech Tips videos benchmarking GPUs and you can empirically conclude that driver updates, better software with the driver updates, alone, yields increased performance for games, video, and compute even at the high-teens percentage (!!!).  Using older drivers lead to suboptimal performance.

Since I’m doing a fresh install, trashing everything, I start from scratch with my video driver install.  This also means that you should use the oldest monitor you have laying around because you’re going without video output (or video output compatible to the latest video device, like a 4K monitor) for a while, and should be good with terminal command line.  Also, you should have another computer at hand (ironic) like a Mac OS X box/laptop that’s easy to use because you may have to troubleshoot things when you don’t even have video output.

CUDA-GDB, getting it to work

I’m on Fedora 25 Linux Workstation and it is on version 6 for ncurses.  But CUDA-GDB wants version 5.  It appears that it’s because CUDA-GDB uses

Bottom line: You need to

sudo dnf install ncurses-compat-libs

for, so to run cuda-gdb.

Useful commands to get there:

# find the library/file first

dnf list ncurse* # or

dnf list ncurses*

rpm -ql ncurses  # find out where the packages' files were installed locally

rpm -ql ncurses-libs # find out where the packages' files were installed locally

CUDA incompatible with my gcc version 

CUDA 9 won’t work with gcc, g++ 7.  Period.

The reasons go beyond simply editing out the version guard checks, e.g.

“I found the line responsible in 81:/usr/local/cuda/include/host_config.h and changed it to:”

//#if GNUC > 4 || (GNUC == 4 && GNUC_MINOR > 4)
#if GNUC > 4 || (GNUC == 4 && GNUC_MINOR > 6)

cf. CUDA incompatible with my gcc version

Nor is the solution of adding softlinks (you’ll have to login as root/admin and go to /usr/.. , root directory and add them, not a recommended thing to do if not necessary) helps solve the fundamental problem.


Cristoforo Colombo, 1492, 2017

I do my best to work out DAILY to avoid the back pain, migraines, high blood pressure/cholesterol, colon problems (real talk, if you sit on your ass in front of a computer at work and at home for years, it becomes a problem (hemorrhoids, anal tearing, colon hemorrhaging)). In California, with yearly temperate and hot weather, I find it super convenient to workout outside, in the park. I run 1.7 km, do 3 sets of pull ups, chin ups, leg lifts, and some Insanity workout video in the park daily and try to get it all done in less than 30 mins.

Walking over to the adult pull up bar after 1.7 km, some skinny Latino kid was taunting another fat Latino kid who was trying to jump up and reach the pull bar.

“You fat fucking shit, you can’t jump up because you’re fat you fat shit!”

Goddamn these Latino hood kids are mean little shits.

“Help me aye, help me up so I can do one” pleaded the fat Latino kid.

“No, if I help you, I’m gonna break my back because you’re so fat, fat shit.”

So the fat kid gave up and walked over to where the skinny Latino kid and their female friend were next to the leg lifts. Admittedly I felt viscerally some pain synchronously with the kid because my middle and freshman HS school experience was shitty, especially about my lack of athletic prowess. Words are just words and I can see how being politically correct has gone overboard these days, but mean words do hurt.

I go and do my set (crushing it! Great form, going all the way down SLOW and back up, no jumping off).

“Hey, you want to do one? Yeah, let me help lift you up!”

So the fat kid runs over eagerly and he’s able to jump up on his own, but struggled to pull himself up. So I help him up. But holy shit he was corporally heavier than I had estimated. Goddamn America, the obesity epidemic at the children level is seriously going to be a problem.

He does 1 chin up and I guide him down gently, to show him good form (you were your other muscles coming down slowly).

“You wanna do 1 more?”
“No, I’m done. Yay, I did one! I did one!” The kid was joyous, happily running back to his group of friends.

“Just do 1 a day. Just try to do 1 a day. Even if it’s half, just try everyday. You’ll get there.”

And yeah, it’s Cristoforo Columbus Day (Oct. 12, 1492) today and good, bad, evil, I’m fairly sure I’d be stuck in imperial China now and those kids, half in Spain, and the other half in what is Mexico if America wasn’t discovered, however you interpret the past. For whatever reasons in the past, we’re here and in this together, so let’s all try to not be assholes to each other and help one another.

Bringing CUDA into the year 2011: C++11 smart pointers with CUDA, CUB, nccl, streams, and CUDA Unified Memory Management with CUB and CUBLAS

The github repository folder for all this code is here:


First, I was motivated by the need to load large arrays onto device GPU global memory, sometimes from batches of CPU host memory for machine learning/deep learning applications. This could also be necessitated by the bottleneck of having only so much data available externally, while GPU utilization is optimized for large device GPU arrays.

I show how this can be resolved with C++11 smart pointers. Usage of these C++11 smart pointers, not only being the latest, best practices, automates the freeing up of memory and provides a safe way to point to the raw pointer when needed.

I also show how to use CUDA Unified Memory Management to automate memory transfers between CPU to GPU and multi-GPUs. I show its use with CUB (for parallel reduce and scan algorithms), and CUBLAS, for linear algebra.

Also for CUB, nccl (parallel reduce and scan for multi-GPUs), CUDA streams, I show how to wrap device GPU arrays with C++11 smart pointers, to, again, automate the freeing up of memory and provide a safe way to point to the raw pointer when needed.

While I’ve seen and have only been able to encounter a great amount of CUDA code written in CUDA C, I’ve sought to show best practices in using CUDA C++11, setting up the stage for the next best practices standards, when CUDA will use C++17.

A brief recap of CUDA Unified Memory Management

The salient and amazing feature of CUDA Unified Memory Management is that CUDA is automating how to address the memory to be allocated for the desired array of data both on the CPU and the GPU. This is especially useful for multi-GPU setups. You don’t want to manually address memory on a number of GPUs.

Motivation; before CUDA Unified Memory Management, before CUDA 6

Before CUDA Unified Memory Management, before CUDA 6, one had to allocate, separately, 2 arrays, 1 on the host, and another, of exact, same size, on the device.

For example (cf.,

# host array
int *host_ret = (int *)malloc(1000 * sizeof(int));

# device array
int *ret;
cudaMalloc(&ret, 1000 * sizeof(int));

# after computation on GPU, it would be useful to leave the result on the GPU; we have to get reuslt out to the user
cudaMemcpy(host_ret, ret, 1000*sizeof(int), cudaMemcpyDeviceToHost);


Note that one needs to allocate (and free!) 2 separate arrays, host and device, and then cudaMemcpy between host and device – and CPU-GPU memory transfers are (relatively) slow!

With CUDA Unified Memory Management; cudaMallocManaged

With CUDA Unified Memory, allocate (and destroy) only 1 array with cudaMallocManaged (cf.

int *ret;
cudaMallocManaged(&ret, 1000*sizeof(int));
AplusB<<<1,1000>>>(ret, 10,100);

* In non-managed example, synchronous cudaMemcpy() routine is used both
* to synchronize the kernel (i.e. wait for it to finish running), &
* transfer data to host.
* The Unified Memory examples do not call cudaMemcpy() and so
* require an explicit cudaDeviceSynchronize() before host program
* can safely use output from GPU.
for (int i=0; i<1000; i++) {
printf("%d: A+B = %d\n", i,ret[i]);

It is very important that you now have to be considerate of synchronizing of a kernel run on the GPU with GPU-CPU data transfers, as mentioned above in the code. Thus, cudaDeviceSynchronize was inserted in between the example kernel (run on the GPU) AplusB and the printing of the array on the host CPU (printf).

See also

For completeness, one can also declare globally (“at the top of your code”)

__device__ __managed__ int ret[1000];


However, I’ve found that, unless, an array is specifically needed to have global scope, such as with OpenGL interoperability, it’s unwielding to hardcode a specific array for global scope (“at the top of the code”).



Again, for completeness, I will briefly describe cudaMallocHost.

cudaMallocHost allows for the allocation of page-locked memory on the host – meaning pinned memory; the memory is allocated “firmly” or its address is fixed on the host so that CUDA knows where it exactly is, and can automatically optimize CPU-GPU data transfers between this fixed host memory and device GPU memory (remember, CUDA cannot directly access host CPU memory!).

A full, working example is here,, but the gist of the creation (and important destruction) of a cudaMallocHost’ed array is here:

float *a;

cudaMallocHost(&a, N_0*sizeof(float));


cf. 4.9 Memory Management, CUDA Runtime API, CUDA Toolkit v9.0.176 Documentation


Doing (C++11) smart pointer arithmetic directly on device GPU memory so to load “batches” of data from the host onto portions of the device GPU array! (!!!)

This is one of the milestones of this discussion.

I was concerned deeply with the transfer of data on the CPU (RAM) to the device GPU memory in the application to running deep learning models.

In practice, the bottlenecks are the slow transfer of data between the CPU and GPU. Second of all, to optimize the utilization of the GPU, one should launch as many threads as possible (e.g. 65536 total allowed threads for the “Max. grid dimensions” on this GTX 1050), and, roughly speak, each of those threads should have as much data to work with as possible on GPU global memory.

As much data to be processed should be loaded from the CPU and onto a device GPU array as possible, and the device GPU array should be as large as possible so to provide all those threads with stuff to compute.

In fact, suppose the goal is to load an array of length (i.e. number of elements) Lx onto the device GPU global memory.

Suppose we can only load it in “batches”, say n=2 batches. Some information from the outside may come, sequentially (in time), before the other.

In this simple (but instructive) n=2 case, say we have data for the first Lx/2 elements coming in on 1 array from the host CPU, and the other Lx/2 elements on another array.

Thus, we’d want to do some pointer arithmetic to load half of the device GPU array with data, and the other half (starting from element Lx/2, in 0-based counting, counting from 0) later.

We should also do this in a “civilized manner”, utilizing best practices from C++11 to make accessing a raw pointer safe.

So say we’ve allocated host arrays (I’ll use std::vector and std::shared_ptr from C++11 on the CPU to show how, novelly, how it can interact nicely with CUDA C/C++11 in each cases), each of size Lx/n=Lx/2:

// Allocate host arrays
std::vector<float> f_vec(Lx/2,1.f);
std::shared_ptr<float> sp(new float[Lx/2],std::default_delete<float[]>());

Then allocate the device GPU array, 1 big array of size Lx:

// Allocate problem device arrays
auto deleter=[&](float* ptr){ cudaFree(ptr); };
std::shared_ptr<float> d_sh_in(new float[Lx], deleter);
cudaMalloc((void **) &d_sh_in, Lx * sizeof(float));

Then, here’s how to do cudaMemcpy with (smart) pointer arithmetic:

cudaMemcpy(d_sh_in.get(),, Lx/2*sizeof(float),cudaMemcpyHostToDevice);
cudaMemcpy(d_sh_in.get()+Lx/2, sp.get(), Lx/2*sizeof(float),cudaMemcpyHostToDevice);

We can also do this with std::unique_ptr:

auto deleter=[&](float* ptr){ cudaFree(ptr); };

// device pointers
std::unique_ptr<float[], decltype(deleter)> d_u_in(new float[Lx], deleter);
cudaMalloc((void **) &d_u_in, Lx * sizeof(float));

cudaMemcpy(d_u_in.get(), sp.get(), Lx/2*sizeof(float),cudaMemcpyHostToDevice);
cudaMemcpy(d_u_in.get()+Lx/2,, Lx/2*sizeof(float),cudaMemcpyHostToDevice);

The code is available here.


CUB and CUDA Unified Memory, and then with C++11 smart pointers; CUB allows for parallel reduce and scan for a single GPU

To use parallel reduce and scan algorithms (they are, briefly, doing summation or the product of numbers and doing a running summation, like a check book, respectively) for a single GPU, using CUB is the only way (for a library being actively updated to be optimized for the latest CUDA release). nccl cannot be used to do reduce and scan for a single GPU (cf. my stackoverflow question)

CUB with CUDA Unified Memory

An example of using CUDA Unified Memory (with global scope) with CUB for parallel reduce on a single GPU is here,

// Allocate arrays
__device__ __managed__ float f[Lx];
__device__ __managed__ float g;


cub::DeviceReduce::Sum( d_temp_storage, temp_storage_bytes, f, &g, Lx );

The result or output of this is this:

temp_storage_bytes : 1
n : 1499
Taken to the 2th power
summation : 1.12388e+09

Using CUDA Unified Memory with CUB (or even using CUB in general) is nontrivial because we need 2 “variables” (an array and then a single variable that’ll also act as a pointer to a single value), we need to request and allocate temporary storage to find out the “size of the problem” and do cub::DeviceReduce::Sum twice.

CUB with C++11 smart pointers

C++11 smart pointers makes working with CUB easier (or at least more organized) because:
* Use C++11 smart pointers to build in a deleter and so we don’t forget to free up memory at the end of the code
* make pointing to the raw pointer safe with .get()

Look at

// Allocate problem device arrays
auto deleter=[&](float* ptr){ cudaFree(ptr); };
std::shared_ptr<float> d_in(new float[Lx], deleter);
cudaMalloc((void **) &d_in, Lx * sizeof(float));

// Initialize device input
cudaMemcpy(d_in.get(),, Lx*sizeof(float),cudaMemcpyHostToDevice);

// Allocate device output array
std::shared_ptr<float> d_out(new float(0.f), deleter);
cudaMalloc((void **) &d_out, 1 * sizeof(float));

// Request and allocate temporary storage
std::shared_ptr<void> d_temp_storage(nullptr, deleter);

size_t temp_storage_bytes = 0;

cub::DeviceReduce::Sum( d_temp_storage.get(), temp_storage_bytes, d_in.get(),d_out.get(),Lx);

cudaMalloc((void **) &d_temp_storage, temp_storage_bytes);

// Run

Notice how we can use std::shared_ptr with CUB and not std::unique_ptr with CUB. I’ve found (with extensive experimentation) that it’s because CUB needs to “share” the pointer when it’s allocating the size of the problem and memory to work on given that size.

With std::shared_ptr, we can use .get() to get the raw pointer safely, it makes the creation and allocation of device arrays for CUB much more clearer (organized), and one can also use this with CUDA Unified Memory (I’ll have to try this)

nccl and C++11 smart pointers, and as a bonus, C++11 smart pointers for CUDA streams.

I have also wrapped nccl (briefly, it is for parallel reduce and scan algorithms, but for a multi-GPU setup) into C++11 smart pointers, for automatic cleaning up and safe pointing to the raw pointer.
Looking at

// managing a device
auto comm_deleter=[&](ncclComm_t* comm){ ncclCommDestroy( *comm ); };
std::unique_ptr<ncclComm_t, decltype(comm_deleter)> comm(new ncclComm_t, comm_deleter);

// device pointers
auto deleter=[&](float* ptr){ cudaFree(ptr); };
std::unique_ptr<float[], decltype(deleter)> d_in(new float[size], deleter);
cudaMalloc((void **) &d_in, size * sizeof(float));

std::unique_ptr<float[], decltype(deleter)> d_out(new float[size], deleter);
cudaMalloc((void **) &d_out, size * sizeof(float));

// CUDA stream smart pointer stream
auto stream_deleter=[&](cudaStream_t* stream){ cudaStreamDestroy( *stream ); };
std::unique_ptr<cudaStream_t, decltype(stream_deleter)> stream(new cudaStream_t, stream_deleter);



//initializing NCCL
ncclCommInitAll(comm.get(), nDev, devs);

ncclAllReduce( d_in.get(), d_out.get(), size, ncclFloat, ncclSum, *comm.get(), *stream.get() );

I want to emphasize that using std::unique_ptr makes the freeing up of device GPU memory automatic and safe, accessing the raw pointer safe, with .get().

Then also, with (concurrent) streams, we can wrap those up with a C++11 smart pointer, std::unique_ptr, automate the freeing up of the device stream (cudaStreamDestroy), and make pointing to the raw pointer safe with .get().

CUBLAS and CUDA Unified Memory Management

One can use CUDA Unified Memory with CUBLAS. As an example, for an array with global scope on the device GPU’s unified memory, and for doing matrix multiplication y = a1*a*x + bet*y, where a is a m x n matrix, x is a n-vector, y is a m-vector, and a1,bet are scalars, then 1 can do this:

__device__ __managed__ float a[m*n]; // a - m x n matrix on the managed device
__device__ __managed__ float x[n]; // x - n-vector on the managed device
__device__ __managed__ float y[m]; // y - m-vector on the managed device

int main(void) {
cudaError_t cudaStat; // cudaMalloc status
cublasStatus_t stat; // CUBLAS functions status
cublasHandle_t handle; // CUBLAS context





Note the use of cudaDeviceSynchronize() that is necessitated if you then need to use the array on the host CPU.

Code for this is found here.

Short Glossary of APIs (i.e. API documentation)


__host__ cudaError_t cudaMallocHost(void** ptr, size_t size)

Allocates page-locked memory on the host.

ptr – Pointer to allocated host memory
size – Requested allocation size in bytes

(brief) Description

Allocates size bytes of host memory that is page-locked and accessible to the device. The drive tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy*(). Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc().


Continue reading “Bringing CUDA into the year 2011: C++11 smart pointers with CUDA, CUB, nccl, streams, and CUDA Unified Memory Management with CUB and CUBLAS”

GPU accelerated tensor networks.

After participating in the Global AI Hackathon San Diego (June 23-25, 2017), where I implemented my own Python classes for Deep Neural Networks (with theano), I decided to “relax” by trying to keep abreast of the latest developments in theoretical physics by watching YouTube videos of lectures on the IHÉS channel (Institut des Hautes Études Scientifiques).

After watching Barbon’s introductory talk, I was convinced that numerical computations involving tensor networks are ripe for GPU acceleration. As a first step, I implemented the first few iterations of the construction of a matrix product state (MPS) of a 1-dim. quantum many body system – which involves applying singular value decomposition (SVD) and (dense) matrix multiplication to exponentially large matrices (2^L entries of complex double-precision numbers) – using CUDA C/C++, CUBLAS, and CUSOLVER, and with the entire computation taking place on the GPU (to eliminate slow CPU-GPU memory transfers). 2 iterations for L=16 complete in about 1.5 secs. on a nVidia GTX 980 Ti.

I’ve placed that code in this subdirectory
See files and, and verification of simple cases (before scaling up) with Python NumPy in cuSOLVERgesvd.ipynb

Tensor networks have only been developed within the last decade; 3 applications are interesting:

  • quantum many body physics: while the Hilbert space exponentially grows with the number of spins in the system, the methods of tensor networks, from MPS to so-called PEPS, which both involved applying SVD, QR decomposition, etc., reduces the state space that the system’s ground state could possibly be in. It has become a powerful tool for condensed matter physicists in the numerical simulation of quantum many-body physics problems, from high-termperature superconductors to strongly interacting ultracold atom gases.  cf. 1
  • Machine Learning: there is a use case for supervised learning and feature extraction with tensor networks. cf. 2,3
  • Quantum Gravity: wormholes (Einstein-Rosen (ER) bridge) and condensates of entangled quantum pairs (Einstein-Podolsky-Rosen (EPR) pairs) have been conjectured to be intimately connected – accumulation of a large density of EPR pairs (S>>1) seem to generate a wormhole, ER, the so-called EPR=ER relation. This relation is implied from the AdS/CFT conjecture. Tensor network representations have been applied to various entangled CFT states – large scale GPU-accelerated numerical computation of these tensor network representations and their dynamics could be useful (and unprecedented) simulations for the gravity dual (graviton) in the bulk, through AdS/CFT. cf. 4

I believe there is valuable work to be done for GPU acceleration of tensor networks. I am seeking 2 things that I am asking here for help with: 1. colleagues, advisors, mentors to collaborate with, so to obtain useful feedback 2. support, namely financial support for stipend(s), hardware, and software support (nVidia? The Simons Foundation?). Any help with meeting or placing me in contact with helpful persons would be helpful. Thanks!


  1. Ulrich Schollwoeck. The density-matrix renormalization group in the age of matrix product states. Annals of Physics 326, 96 (2011). arXiv:1008.3477 [cond-mat.str-el]
  2. Johann A. Bengua, Ho N. Phien, Hoang D. Tuan, and Minh N. D. Matrix Product State for Feature Extraction of Higher-Order Tensors. arXiv:1503.00516 [cs.CV]
  3. E. Miles Stoudenmire, David J. Schwab. Supervised Learning with Quantum-Inspired Tensor Networks. arXiv:1605.05775 [stat.ML]
  4. Juan Maldacena, Leonard Susskind. ”Cool horizons for entangled black holes.” arXiv:1306.0533 [hep-th]

Machine Learning (ML), Deep Learning stuff; including CUDA C/C++ stuff (utilizing and optimizing with CUDA C/C++)

(Incomplete) Table of Contents

  • GPU-accelerated Tensor Networks
  • “Are Neural Networks a black box?” My take.
  • Log
  • CUDA C/C++ stuff (utilizing CUDA and optimizing CUDA C/C++ code)
  • Fedora Linux installation of Docker for nVidia’s DIGITS – my experience
  • Miscellaneous Links

A lot has already been said about Machine Learning (ML), Deep Learning, and Neural Networks.  Note that this blog post (which I’ll infrequently update) is the “mirror” to my github repository github: ernestyalumni/MLgrabbag . Go to the github repo for the most latest updates, code, and jupyter notebooks.

A few things bother me that I sought to rectify myself:

  • There ought to be a clear dictionary between the mathematical formulation, Python’s sci-kit learn, Theano, and Tensorflow implementation.  I see math equations; here’s how to implement it, immediately.  I mean, if I was in class lectures, and with the preponderance of sample data, I ought to be able to play with examples immediately.
  • Someone ought to generalize the mathematical formulation, drawing from algebra, category theory, and differential geometry/topology.
  • CPUs have been a disappointment (see actual gamer benchmarks for Kaby Lake on YouTube); everything ought to be written in parallel for the GPU.  And if you’re using a wrapper that’s almost as fast as CUDA C/C++ or about as fast as CUDA C/C++, guess what?  You ought to rewrite the thing in CUDA C/C++.

So what I’ve started doing is put up my code and notes for these courses:

The github repository MLgrabbag should have all my stuff for it.  I’m cognizant that there are already plenty of notes and solutions out there.  What I’m trying to do is to, as above,

  1. write the code in Python’s sci-kit learn and Theano, first and foremost,
  2. generalize the mathematical formulation,
  3. implement on the GPU

I think those aspects are valuable and I don’t see anyone else have either such a clear implementation or real examples (not toy examples).

GPU-accelerated Tensor Networks

Go here:

Are neural networks a “black box”? My take.

I was watching a webinar HPC Exascale and AI given by Tom Gibbs for nVidia, and the first question for Q&A was whether neural networks were a “black box” or not, in that, how could anything be learned about the data presented (experimental or from simulation), if it’s unknown what neural networks do?

Here is my take on the question and how I’d push back.

For artificial neural networks (ANN), or the so-called “fully-connected layers” of Convolutional Neural Networks (CNN), Hornik, et. al. (1991) had already shown that neural networks act as a universal function approximator in that the neural networks uniformly converges to a function mapping the input data X to output y. The proof should delight pure math majors in that it employs the Stone-Weierstrass theorem. The necessary number of layers L is not known; it simply must be sufficiently large. But that a sufficiently deep neural network can converge uniformly to an approximate function that maps input data X to output y should be very comforting (and confidence-building in the technique).

For CNNs, this was an insight that struck me because I wrote a lot of incompressible Navier-Stokes equations solvers for Computational Fluid Dynamics (CFD) with finite-difference methods in CUDA C/C++: stencil operations in CUDA (or numerical computation in general) are needed for the finite-difference method for computing gradients, and further, the Hessian (second-order partial derivatives). CNNs formally do exactly these stencil operations, with the “weights” on the finite-difference being arbitrary (adjustable). Each successive convolution “layer” does a higher-order (partial) derivative from the previous; this is exactly what stencil operations for finite-difference does as well. This is also evidenced by how with each successive convolution “layer”, the total size of a block “shrinks” (if we’re not padding the boundaries), exactly as with the stencil operation for finite difference.

CNNs learn first-order and successively higher-order gradients, Hessians, partial derivatives as features from the input data. The formal mathematically structure for the whole sequence of partial derivatives over a whole set of input data are jet bundles. I would argue that this (jet bundles) should be the mathematical structure to consider for CNNs.

Nevertheless, in short, ANNs or the “fully-connected layers” was shown to be a universal function approximator for the function that maps input data X to output data y already by Hornik, et. al. (1991). CNNs are learning the gradients, and higher order derivatives associated with the image (and how the colors change across the grid) or video. They’re not as black box as a casual observer might think.



  • 20170209 Week 2 Linear Regression stuff for Coursera’s ML by Ng implemented in Python numpy, and some in Theano, see sklearn_ML.ipynb and theano_ML.ipynb, respectively.

CUDA C/C++ stuff (utilizing CUDA and optimizing CUDA C/C++ code)

cuSOLVER – Singular Value Decomposition (SVD), with and without CUDA unified memory management

I implemented simple examples illustrating Singular Value Decomposition (SVD) both with and without CUDA unified memory management, starting from the examples in the CUDA Toolkit Documentation.

Find those examples in the moreCUDA/CUSOLVER subdirectory of my CompPhys github repository.

Fedora Linux installation of Docker for nVidia’s DIGITS – my experience

I wanted to share my experience with installing Docker on Fedora Linux because I wanted to run nVidia’s DIGITS; I really want to make Docker work for Fedora Linux Workstation (23 as of today, 20170825; I will install 25 soon), but I’m having a few issues, some related to Docker, some related to Fedora:

  1. For some reason, in a user (non-admin account), when I do dnf list, I obtain the following error:
    1. ImportError: dynamic module does not define init function (PyInit__posixsubprocess)

Nevertheless, I did the following to install DIGITS:

git clone

python install


Miscellaneous Links