Cristoforo Colombo, 1492, 2017

I do my best to work out DAILY to avoid back pain, migraines, high blood pressure/cholesterol, and colon problems (real talk: if you sit on your ass in front of a computer at work and at home for years, it becomes a problem (hemorrhoids, anal tearing, colon hemorrhaging)). In California, with temperate and hot weather year-round, I find it super convenient to work out outside, in the park. I run 1.7 km, do 3 sets of pull ups, chin ups, and leg lifts, and some Insanity workout video in the park daily, and try to get it all done in less than 30 mins.

Walking over to the adult pull up bar after my 1.7 km, I saw some skinny Latino kid taunting another, fat Latino kid who was trying to jump up and reach the pull up bar.

“You fat fucking shit, you can’t jump up because you’re fat you fat shit!”

Goddamn these Latino hood kids are mean little shits.

“Help me aye, help me up so I can do one” pleaded the fat Latino kid.

“No, if I help you, I’m gonna break my back because you’re so fat, fat shit.”

So the fat kid gave up and walked over to where the skinny Latino kid and their female friend were, next to the leg lifts. Admittedly, I viscerally felt some of that kid’s pain, because my middle school and freshman high school experience was shitty, especially regarding my lack of athletic prowess. Words are just words, and I can see how political correctness has gone overboard these days, but mean words do hurt.

I go and do my set (crushing it! Great form, going all the way down SLOW and back up, no jumping off).

“Hey, you want to do one? Yeah, let me help lift you up!”

So the fat kid runs over eagerly, and he’s able to jump up on his own, but he struggled to pull himself up. So I help him up. But holy shit, he was far heavier than I had estimated. Goddamn, America, the childhood obesity epidemic is seriously going to be a problem.

He does 1 chin up and I guide him down gently, to show him good form (you use your other muscles when coming down slowly).

“You wanna do 1 more?”
“No, I’m done. Yay, I did one! I did one!” The kid was joyous, happily running back to his group of friends.

“Just do 1 a day. Just try to do 1 a day. Even if it’s half, just try everyday. You’ll get there.”

And yeah, it’s Columbus Day today (Cristoforo Colombo landed Oct. 12, 1492), and good, bad, or evil, I’m fairly sure that if America hadn’t been discovered, I’d be stuck in imperial China right now, and those kids would be half in Spain and half in what is now Mexico, however you interpret the past. For whatever reasons in the past, we’re here and in this together, so let’s all try not to be assholes to each other and help one another.


Bringing CUDA into the year 2011: C++11 smart pointers with CUDA, CUB, nccl, streams, and CUDA Unified Memory Management with CUB and CUBLAS

The github repository folder for all this code is here:


First, I was motivated by the need to load large arrays into device GPU global memory, sometimes from batches of CPU host memory, for machine learning/deep learning applications. This can also be necessitated by the bottleneck of only so much data being available externally at a time, while GPU utilization is optimized for large device GPU arrays.

I show how this can be resolved with C++11 smart pointers. Using C++11 smart pointers is not only the latest best practice; it automates the freeing of memory and provides a safe way to access the raw pointer when needed.

I also show how to use CUDA Unified Memory Management to automate memory transfers between the CPU and GPU, and across multi-GPU setups. I show its use with CUB (for parallel reduce and scan algorithms) and CUBLAS, for linear algebra.

Also, for CUB, nccl (parallel reduce and scan for multi-GPUs), and CUDA streams, I show how to wrap device GPU arrays with C++11 smart pointers to, again, automate the freeing of memory and provide a safe way to access the raw pointer when needed.

While almost all of the CUDA code I’ve been able to encounter is written in CUDA C, I’ve sought to show best practices for CUDA C++11, setting the stage for the next best practices standards, when CUDA will use C++17.

A brief recap of CUDA Unified Memory Management

The salient and amazing feature of CUDA Unified Memory Management is that CUDA automates how memory is addressed and allocated for the desired array of data, both on the CPU and the GPU. This is especially useful for multi-GPU setups: you don’t want to manually manage memory addresses across a number of GPUs.

Motivation; before CUDA Unified Memory Management, before CUDA 6

Before CUDA Unified Memory Management, before CUDA 6, one had to allocate, separately, 2 arrays: 1 on the host, and another, of the exact same size, on the device.

For example (cf.,

// host array
int *host_ret = (int *)malloc(1000 * sizeof(int));

// device array
int *ret;
cudaMalloc((void **)&ret, 1000 * sizeof(int));

// After computation on the GPU, it would be useful to leave the result on the
// GPU, but we have to get the result out to the user:
cudaMemcpy(host_ret, ret, 1000 * sizeof(int), cudaMemcpyDeviceToHost);


Note that one needs to allocate (and free!) 2 separate arrays, host and device, and then cudaMemcpy between host and device – and CPU-GPU memory transfers are (relatively) slow!

With CUDA Unified Memory Management; cudaMallocManaged

With CUDA Unified Memory, allocate (and destroy) only 1 array with cudaMallocManaged (cf.

int *ret;
cudaMallocManaged(&ret, 1000 * sizeof(int));
AplusB<<<1, 1000>>>(ret, 10, 100);

/* In the non-managed example, the synchronous cudaMemcpy() routine is used both
 * to synchronize the kernel (i.e. wait for it to finish running), and to
 * transfer the data to the host.
 * The Unified Memory examples do not call cudaMemcpy() and so require an
 * explicit cudaDeviceSynchronize() before the host program can safely use the
 * output from the GPU. */
cudaDeviceSynchronize();
for (int i = 0; i < 1000; i++) {
    printf("%d: A+B = %d\n", i, ret[i]);
}
cudaFree(ret);
It is very important to note that you now have to consider synchronizing a kernel run on the GPU with GPU-CPU data transfers, as mentioned in the code comments above. Thus, cudaDeviceSynchronize is inserted between the example kernel AplusB (run on the GPU) and the printing of the array on the host CPU (printf).

See also

For completeness, one can also declare globally (“at the top of your code”)

__device__ __managed__ int ret[1000];


However, I’ve found that, unless an array specifically needs global scope, such as with OpenGL interoperability, it’s unwieldy to hardcode a specific array at global scope (“at the top of the code”).



Again, for completeness, I will briefly describe cudaMallocHost.

cudaMallocHost allows for the allocation of page-locked, or pinned, memory on the host: the memory is allocated “firmly”, its address fixed on the host, so that CUDA knows exactly where it is and can automatically optimize CPU-GPU data transfers between this fixed host memory and device GPU memory (remember, the GPU cannot directly access pageable host CPU memory!).

A full, working example is here, but the gist of the creation (and, importantly, destruction) of a cudaMallocHost’ed array is here:

float *a;

cudaMallocHost((void **)&a, N_0 * sizeof(float));

// ... use a ...

cudaFreeHost(a);  // pinned memory must be freed with cudaFreeHost, not free


cf. 4.9 Memory Management, CUDA Runtime API, CUDA Toolkit v9.0.176 Documentation


Doing (C++11) smart pointer arithmetic directly on device GPU memory so as to load “batches” of data from the host onto portions of the device GPU array (!!!)

This is one of the milestones of this discussion.

I was deeply concerned with the transfer of data from CPU (RAM) to device GPU memory, in application to running deep learning models.

In practice, the main bottleneck is the slow transfer of data between the CPU and GPU. Second, to optimize GPU utilization, one should launch as many threads as possible (e.g. 65536 total allowed threads for the “Max. grid dimensions” on this GTX 1050), and, roughly speaking, each of those threads should have as much data as possible to work with in GPU global memory.

As much data as possible should be loaded from the CPU onto a device GPU array, and the device GPU array should be as large as possible, so as to provide all those threads with stuff to compute.

In fact, suppose the goal is to load an array of length (i.e. number of elements) Lx onto the device GPU global memory.

Suppose we can only load it in “batches”, say n=2 batches. Some information from the outside may come, sequentially (in time), before the other.

In this simple (but instructive) n=2 case, say we have data for the first Lx/2 elements coming in on 1 array from the host CPU, and the other Lx/2 elements on another array.

Thus, we’d want to do some pointer arithmetic to load half of the device GPU array with data first, and the other half (starting from element Lx/2, counting from 0) later.

We should also do this in a “civilized manner”, utilizing best practices from C++11 to make accessing a raw pointer safe.

So say we’ve allocated host arrays (I’ll use std::vector and std::shared_ptr from C++11 on the CPU to show, novelly, how each can interact nicely with CUDA C/C++11), each of size Lx/n = Lx/2:

// Allocate host arrays
std::vector<float> f_vec(Lx/2,1.f);
std::shared_ptr<float> sp(new float[Lx/2],std::default_delete<float[]>());

Then allocate the device GPU array, 1 big array of size Lx:

// Allocate problem device arrays
auto deleter=[&](float* ptr){ cudaFree(ptr); };
float *d_raw = nullptr;
cudaMalloc((void **) &d_raw, Lx * sizeof(float));
std::shared_ptr<float> d_sh_in(d_raw, deleter); // wrap the device pointer; cudaFree runs automatically

Then, here’s how to do cudaMemcpy with (smart) pointer arithmetic:

cudaMemcpy(d_sh_in.get(), f_vec.data(), Lx/2*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_sh_in.get()+Lx/2, sp.get(), Lx/2*sizeof(float), cudaMemcpyHostToDevice);

We can also do this with std::unique_ptr:

auto deleter=[&](float* ptr){ cudaFree(ptr); };

// device pointers
float *d_raw = nullptr;
cudaMalloc((void **) &d_raw, Lx * sizeof(float));
std::unique_ptr<float[], decltype(deleter)> d_u_in(d_raw, deleter);

cudaMemcpy(d_u_in.get(), sp.get(), Lx/2*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_u_in.get()+Lx/2, f_vec.data(), Lx/2*sizeof(float), cudaMemcpyHostToDevice);

The code is available here.


CUB and CUDA Unified Memory, and then with C++11 smart pointers; CUB allows for parallel reduce and scan for a single GPU

To use parallel reduce and scan algorithms (briefly: reduce computes the sum or product of an array’s numbers, and scan computes a running sum, like a checkbook) on a single GPU, CUB is the only way to go (for a library being actively updated and optimized for the latest CUDA release). nccl cannot be used to do reduce and scan on a single GPU (cf. my stackoverflow question).

CUB with CUDA Unified Memory

An example of using CUDA Unified Memory (with global scope) with CUB for parallel reduce on a single GPU is here,

// Allocate arrays
__device__ __managed__ float f[Lx];
__device__ __managed__ float g;

// Request and allocate temporary storage, then run the reduction
void *d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::Sum( d_temp_storage, temp_storage_bytes, f, &g, Lx ); // 1st call: compute temp_storage_bytes
cudaMalloc(&d_temp_storage, temp_storage_bytes);
cub::DeviceReduce::Sum( d_temp_storage, temp_storage_bytes, f, &g, Lx ); // 2nd call: run the sum

The result or output of this is this:

temp_storage_bytes : 1
n : 1499
Taken to the 2th power
summation : 1.12388e+09

Using CUDA Unified Memory with CUB (or using CUB in general) is nontrivial: we need 2 “variables” (an array, and then a single variable that also acts as a pointer to a single value), we need to request and allocate temporary storage to find out the “size of the problem”, and so we must call cub::DeviceReduce::Sum twice.

CUB with C++11 smart pointers

C++11 smart pointers make working with CUB easier (or at least more organized) because we can:
* use C++11 smart pointers to build in a deleter, so we don’t forget to free up memory at the end of the code, and
* make pointing to the raw pointer safe with .get()

Look at

// Allocate problem device arrays
auto deleter=[&](float* ptr){ cudaFree(ptr); };
float *d_in_raw = nullptr;
cudaMalloc((void **) &d_in_raw, Lx * sizeof(float));
std::shared_ptr<float> d_in(d_in_raw, deleter);

// Initialize device input
cudaMemcpy(d_in.get(), f_vec.data(), Lx*sizeof(float), cudaMemcpyHostToDevice);

// Allocate device output array
float *d_out_raw = nullptr;
cudaMalloc((void **) &d_out_raw, 1 * sizeof(float));
std::shared_ptr<float> d_out(d_out_raw, deleter);

// Request and allocate temporary storage
size_t temp_storage_bytes = 0;

cub::DeviceReduce::Sum( nullptr, temp_storage_bytes, d_in.get(), d_out.get(), Lx ); // size query

void *d_temp_raw = nullptr;
cudaMalloc(&d_temp_raw, temp_storage_bytes);
std::shared_ptr<void> d_temp_storage(d_temp_raw, [&](void* ptr){ cudaFree(ptr); });

// Run
cub::DeviceReduce::Sum( d_temp_storage.get(), temp_storage_bytes, d_in.get(), d_out.get(), Lx );

Notice how we use std::shared_ptr with CUB and not std::unique_ptr. I’ve found (with extensive experimentation) that this is because CUB needs to “share” the pointer between the call that sizes the problem and the call that does the work on memory allocated for that size.

With std::shared_ptr, we can use .get() to access the raw pointer safely; it makes the creation and allocation of device arrays for CUB much clearer (more organized), and one can also use this with CUDA Unified Memory (I’ll have to try this).

nccl and C++11 smart pointers, and as a bonus, C++11 smart pointers for CUDA streams.

I have also wrapped nccl (briefly, it is for parallel reduce and scan algorithms, but for a multi-GPU setup) into C++11 smart pointers, for automatic cleaning up and safe pointing to the raw pointer.
Looking at

// managing a device
auto comm_deleter=[&](ncclComm_t* comm){ ncclCommDestroy( *comm ); };
std::unique_ptr<ncclComm_t, decltype(comm_deleter)> comm(new ncclComm_t, comm_deleter);

// device pointers
auto deleter=[&](float* ptr){ cudaFree(ptr); };
float *d_in_raw = nullptr;
cudaMalloc((void **) &d_in_raw, size * sizeof(float));
std::unique_ptr<float[], decltype(deleter)> d_in(d_in_raw, deleter);

float *d_out_raw = nullptr;
cudaMalloc((void **) &d_out_raw, size * sizeof(float));
std::unique_ptr<float[], decltype(deleter)> d_out(d_out_raw, deleter);

// CUDA stream smart pointer
auto stream_deleter=[&](cudaStream_t* stream){ cudaStreamDestroy( *stream ); };
std::unique_ptr<cudaStream_t, decltype(stream_deleter)> stream(new cudaStream_t, stream_deleter);
cudaStreamCreate(stream.get());

// initializing NCCL
ncclCommInitAll(comm.get(), nDev, devs);

ncclAllReduce( d_in.get(), d_out.get(), size, ncclFloat, ncclSum, *comm.get(), *stream.get() );

I want to emphasize that using std::unique_ptr makes the freeing of device GPU memory automatic and safe, and makes accessing the raw pointer safe, with .get().

Then also, with (concurrent) streams, we can wrap those up in a C++11 smart pointer, std::unique_ptr, automating the destruction of the stream (cudaStreamDestroy) and making pointing to the raw pointer safe with .get().

CUBLAS and CUDA Unified Memory Management

One can use CUDA Unified Memory with CUBLAS. As an example, for arrays with global scope in the device GPU’s unified memory, and for the matrix-vector multiplication y = a1*a*x + bet*y, where a is an m x n matrix, x is an n-vector, y is an m-vector, and a1, bet are scalars, one can do this:

__device__ __managed__ float a[m*n]; // a - m x n matrix on the managed device
__device__ __managed__ float x[n]; // x - n-vector on the managed device
__device__ __managed__ float y[m]; // y - m-vector on the managed device

int main(void) {
    cublasStatus_t stat; // CUBLAS functions status
    cublasHandle_t handle; // CUBLAS context
    stat = cublasCreate(&handle); // initialize CUBLAS context

    // ... initialize a, x, y directly from the host; no cudaMemcpy is needed
    // with unified memory ...

    stat = cublasSgemv(handle, CUBLAS_OP_N, m, n, &a1, a, m, x, 1, &bet, y, 1); // y = a1*a*x + bet*y
    cudaDeviceSynchronize(); // needed before the host can safely read y

    cublasDestroy(handle); // destroy CUBLAS context
}
Note the use of cudaDeviceSynchronize(), which is necessitated if you then need to use the array on the host CPU.

Code for this is found here.

Short Glossary of APIs (i.e. API documentation)


__host__ cudaError_t cudaMallocHost(void** ptr, size_t size)

Allocates page-locked memory on the host.

ptr – Pointer to allocated host memory
size – Requested allocation size in bytes

(brief) Description

Allocates size bytes of host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy*(). Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc().



GPU accelerated tensor networks.

After participating in the Global AI Hackathon San Diego (June 23-25, 2017), where I implemented my own Python classes for Deep Neural Networks (with theano), I decided to “relax” by trying to keep abreast of the latest developments in theoretical physics by watching YouTube videos of lectures on the IHÉS channel (Institut des Hautes Études Scientifiques).

After watching Barbon’s introductory talk, I was convinced that numerical computations involving tensor networks are ripe for GPU acceleration. As a first step, I implemented the first few iterations of the construction of a matrix product state (MPS) of a 1-dim. quantum many body system – which involves applying singular value decomposition (SVD) and (dense) matrix multiplication to exponentially large matrices (2^L entries of complex double-precision numbers) – using CUDA C/C++, CUBLAS, and CUSOLVER, with the entire computation taking place on the GPU (to eliminate slow CPU-GPU memory transfers). 2 iterations for L=16 complete in about 1.5 secs. on an nVidia GTX 980 Ti.

I’ve placed that code in this subdirectory
See the files in that subdirectory, and verification of simple cases (before scaling up) with Python NumPy in cuSOLVERgesvd.ipynb

Tensor networks have only been developed within the last decade; 3 applications are interesting:

  • quantum many body physics: while the Hilbert space grows exponentially with the number of spins in the system, the methods of tensor networks, from MPS to the so-called PEPS, both of which involve applying SVD, QR decomposition, etc., reduce the state space in which the system’s ground state could possibly be. They have become a powerful tool for condensed matter physicists in the numerical simulation of quantum many-body physics problems, from high-temperature superconductors to strongly interacting ultracold atom gases.  cf. 1
  • Machine Learning: there is a use case for supervised learning and feature extraction with tensor networks. cf. 2,3
  • Quantum Gravity: wormholes (Einstein-Rosen (ER) bridges) and condensates of entangled quantum pairs (Einstein-Podolsky-Rosen (EPR) pairs) have been conjectured to be intimately connected – the accumulation of a large density of EPR pairs (S>>1) seems to generate a wormhole (ER), the so-called EPR=ER relation. This relation is implied by the AdS/CFT conjecture. Tensor network representations have been applied to various entangled CFT states – large scale GPU-accelerated numerical computation of these tensor network representations and their dynamics could provide useful (and unprecedented) simulations of the gravity dual (graviton) in the bulk, through AdS/CFT. cf. 4
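To sketch the MPS construction mentioned in the first bullet (following the standard presentation in ref. 1): reshape the 2^L-component state vector into a matrix, SVD it, keep the left singular vectors as the first MPS tensor, and iterate on the remainder:

```latex
% Reshape the 2^L-component state \Psi_{s_1 \cdots s_L} into a 2 x 2^{L-1} matrix and SVD:
\Psi_{s_1,(s_2\cdots s_L)}
  = \sum_{\alpha_1} U_{s_1\alpha_1}\,\Lambda_{\alpha_1}\,
      (V^{\dagger})_{\alpha_1,(s_2\cdots s_L)},
\qquad A^{s_1}_{\alpha_1} \equiv U_{s_1\alpha_1}
% Iterating the reshape-and-SVD on the remainder \Lambda V^\dagger yields
\Psi_{s_1 s_2 \cdots s_L} = A^{s_1} A^{s_2} \cdots A^{s_L}
```

Each SVD is a dense linear algebra call on an exponentially large matrix, which is why CUSOLVER and CUBLAS on the GPU are a natural fit.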

I believe there is valuable work to be done on GPU acceleration of tensor networks. I am asking here for help with 2 things: 1. colleagues, advisors, and mentors to collaborate with, so as to obtain useful feedback, and 2. support, namely financial support for stipend(s), hardware, and software support (nVidia? The Simons Foundation?). Any help with meeting or placing me in contact with helpful persons would be appreciated. Thanks!


  1. Ulrich Schollwoeck. The density-matrix renormalization group in the age of matrix product states. Annals of Physics 326, 96 (2011). arXiv:1008.3477 [cond-mat.str-el]
  2. Johann A. Bengua, Ho N. Phien, Hoang D. Tuan, and Minh N. Do. Matrix Product State for Feature Extraction of Higher-Order Tensors. arXiv:1503.00516 [cs.CV]
  3. E. Miles Stoudenmire, David J. Schwab. Supervised Learning with Quantum-Inspired Tensor Networks. arXiv:1605.05775 [stat.ML]
  4. Juan Maldacena, Leonard Susskind. Cool horizons for entangled black holes. arXiv:1306.0533 [hep-th]

Machine Learning (ML), Deep Learning stuff; including CUDA C/C++ stuff (utilizing and optimizing with CUDA C/C++)

(Incomplete) Table of Contents

  • GPU-accelerated Tensor Networks
  • “Are Neural Networks a black box?” My take.
  • Log
  • CUDA C/C++ stuff (utilizing CUDA and optimizing CUDA C/C++ code)
  • Fedora Linux installation of Docker for nVidia’s DIGITS – my experience
  • Miscellaneous Links

A lot has already been said about Machine Learning (ML), Deep Learning, and Neural Networks.  Note that this blog post (which I’ll infrequently update) is the “mirror” to my github repository github: ernestyalumni/MLgrabbag . Go to the github repo for the latest updates, code, and jupyter notebooks.

A few things bother me that I sought to rectify myself:

  • There ought to be a clear dictionary between the mathematical formulation, Python’s sci-kit learn, Theano, and Tensorflow implementation.  I see math equations; here’s how to implement it, immediately.  I mean, if I was in class lectures, and with the preponderance of sample data, I ought to be able to play with examples immediately.
  • Someone ought to generalize the mathematical formulation, drawing from algebra, category theory, and differential geometry/topology.
  • CPUs have been a disappointment (see actual gamer benchmarks for Kaby Lake on YouTube); everything ought to be written in parallel for the GPU.  And if you’re using a wrapper that’s almost as fast as CUDA C/C++ or about as fast as CUDA C/C++, guess what?  You ought to rewrite the thing in CUDA C/C++.

So what I’ve started doing is put up my code and notes for these courses:

The github repository MLgrabbag should have all my stuff for it.  I’m cognizant that there are already plenty of notes and solutions out there.  What I’m trying to do is to, as above,

  1. write the code in Python’s sci-kit learn and Theano, first and foremost,
  2. generalize the mathematical formulation,
  3. implement on the GPU

I think those aspects are valuable, and I don’t see anyone else with either such a clear implementation or real examples (not toy examples).

GPU-accelerated Tensor Networks

Go here:

Are neural networks a “black box”? My take.

I was watching a webinar HPC Exascale and AI given by Tom Gibbs for nVidia, and the first question for Q&A was whether neural networks were a “black box” or not, in that, how could anything be learned about the data presented (experimental or from simulation), if it’s unknown what neural networks do?

Here is my take on the question and how I’d push back.

For artificial neural networks (ANN), or the so-called “fully-connected layers” of Convolutional Neural Networks (CNN), Hornik et al. (1991) had already shown that neural networks act as universal function approximators, in that a neural network converges uniformly to a function mapping the input data X to output y. The proof should delight pure math majors in that it employs the Stone-Weierstrass theorem. The necessary size of the network is not known; it simply must be sufficiently large. But that a sufficiently large neural network can converge uniformly to an approximate function that maps input data X to output y should be very comforting (and confidence-building in the technique).
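Roughly stated (in the single-hidden-layer form; σ is a suitable activation function, K a compact set of inputs), the theorem says that for any continuous f and any ε > 0, there exist a width N, weights w_j, biases b_j, and coefficients α_j such that

```latex
\sup_{x \in K}\,\Big|\, f(x) \;-\; \sum_{j=1}^{N} \alpha_j\,
    \sigma\!\left( w_j \cdot x + b_j \right) \Big| \;<\; \epsilon
```

i.e. the network’s output is uniformly within ε of f everywhere on K.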

For CNNs, an insight struck me because I wrote a lot of incompressible Navier-Stokes equation solvers for Computational Fluid Dynamics (CFD) with finite-difference methods in CUDA C/C++: stencil operations in CUDA (or in numerical computation in general) are needed in the finite-difference method for computing gradients and, further, the Hessian (second-order partial derivatives). CNNs formally do exactly these stencil operations, with the “weights” of the finite difference being arbitrary (adjustable). Each successive convolution “layer” takes a higher-order (partial) derivative of the previous one; this is exactly what stencil operations for finite differences do as well. This is also evidenced by how, with each successive convolution “layer”, the total size of a block “shrinks” (if we’re not padding the boundaries), exactly as with the stencil operation for finite differences.

CNNs learn first-order and, successively, higher-order gradients, Hessians, and partial derivatives as features of the input data. The formal mathematical structure for the whole sequence of partial derivatives over a whole set of input data is the jet bundle. I would argue that this (jet bundles) should be the mathematical structure to consider for CNNs.

Nevertheless, in short, ANNs, or the “fully-connected layers”, were shown to be universal function approximators for the function that maps input data X to output data y already by Hornik et al. (1991). CNNs learn the gradients and higher-order derivatives associated with an image (how the colors change across the grid) or video. They’re not as black box as a casual observer might think.



  • 20170209 Week 2 Linear Regression stuff for Coursera’s ML by Ng implemented in Python numpy, and some in Theano, see sklearn_ML.ipynb and theano_ML.ipynb, respectively.

CUDA C/C++ stuff (utilizing CUDA and optimizing CUDA C/C++ code)

cuSOLVER – Singular Value Decomposition (SVD), with and without CUDA unified memory management

I implemented simple examples illustrating Singular Value Decomposition (SVD) both with and without CUDA unified memory management, starting from the examples in the CUDA Toolkit Documentation.

Find those examples in the moreCUDA/CUSOLVER subdirectory of my CompPhys github repository.

Fedora Linux installation of Docker for nVidia’s DIGITS – my experience

I wanted to share my experience with installing Docker on Fedora Linux because I wanted to run nVidia’s DIGITS; I really want to make Docker work for Fedora Linux Workstation (23 as of today, 20170825; I will install 25 soon), but I’m having a few issues, some related to Docker, some related to Fedora:

  1. For some reason, in a user (non-admin account), when I do dnf list, I obtain the following error:
    1. ImportError: dynamic module does not define init function (PyInit__posixsubprocess)

Nevertheless, I did the following to install DIGITS:

git clone

python install


Miscellaneous Links




I will try to collect my notes and solutions on math and physics, and links to them here.

Open-source; PayPal only

From the beginning of 2016, I decided to cease all explicit crowdfunding for any of my materials on physics, math. I failed to raise any funds from previous crowdfunding efforts. I decided that if I was going to live in abundance, I must lose a scarcity attitude. I am committed to keeping all of my material open-sourced. I give all my stuff for free.

In the beginning of 2017, I received a very generous donation, through PayPal, from a reader in Norway who found these notes useful. If you find these notes useful, feel free to donate directly and easily through PayPal, which doesn’t go through a 3rd party such as indiegogo, kickstarter, or patreon.

Otherwise, under the open-source MIT license, feel free to copy, edit, paste, make your own versions, share, use as you wish.

Algebraic Geometry

(symbolic computational) Algebraic Geometry with Sage Math on a jupyter notebook


I did a Google search for “Sage Math groebner” and came across Martin Albrecht’s slides on “Groebner Bases” (22 October 2013).  I implemented in Sage Math all the topics on the slides up to the F4 algorithm.  In particular, I implemented in Sage Math/Python the generalized division algorithm and Buchberger’s Algorithm, with and without the first criterion (I did plenty of Google searches and couldn’t find anyone who had a working implementation in Sage Math/Python).  Another bonus is the interactivity of having it on a jupyter notebook.  If this jupyter notebook helps you (reader), students, or colleagues, that’d be good; I picked up the basics and foundations of computational algebraic geometry quickly (over a weekend) from looking at the slides and working them out running Sage Math on a jupyter notebook.

I’ll update the github file as much as I can as I’m going through Cox, Little, O’Shea (2015), Ideals, Varieties, and Algorithms, and implementing what I need from there.

Algebraic Geometry and Algebraic Topology dump (AGDT_dump.tex and AGDT_dump.pdf)

20171002 – I’ve consolidated my notes on Algebraic Geometry and Algebraic Topology.  Because central extensions of groups, Lie groups, and Lie algebras play an important role in Conformal Field Theory, I include notes on Conformal Field Theory (CFT) in these notes.

Of note, I compare 2 definitions of semi-direct product and show how they’re related and the same.

Differential Geometry and Differential Topology dump (DGDT_dump.tex and DGDT_dump.pdf)

I continue to take notes on differential geometry and differential topology and their relation to physics, with an emphasis on topological quantum field theory.  I dump all my notes and thoughts immediately into the LaTeX and compiled pdf file, here and here.  I don’t try to polish or organize these notes in any way, as I am learning at my own pace.  I’ve put this out there, with a permanent home on github, to invite anyone to copy, edit, reorganize, and use these notes in any way they’d like (the power of crowdsourcing).


20170423 update.

I have been reviewing holonomy by reading Conlon (2008), Clarke and Santoro (2012, 1206.3170 [math.DG]), and Schreiber and Waldorf (2007, 0705.0452 [math.DG]) concurrently.  I’ve already put these notes on my github repository mathphysics , in DGDT_dump.tex and DGDT_dump.pdf.


Computational Physics (CompPhys), Computational Fluid Dynamics (CFD)

I went through Ch.10 of Hjorth-Jensen (2015) and wrote up as many C++ scripts to illustrate all the (serial) PDE solvers: forward, backward Euler, Crank-Nicolson, Jacobi method.

Cpp/progs/ch10pde of CompPhys github repository

Lid-driven cavity with incompressible, viscous fluid on a 512×512 staggered grid, in CUDA C++11, with finite difference method for 2-dim., unsteady Navier-Stokes equations solver



Compare this with p. 69 of Ch. 5, Example Applications, of Griebel, Dornsheifer, Neunhoeffer.


Michael Griebel, Thomas Dornsheifer, Tilman Neunhoeffer.  Numerical Simulation in Fluid Dynamics: A Practical Introduction (Monographs on Mathematical Modeling and Computation).  SIAM.  1997.

Cantera installation tips (on Fedora Linux, namely Fedora 23 Workstation Linux)

I spent an obscene amount of time documenting my installation on Fedora 23 Workstation Linux of Cantera on my github repository subdirectory cantera_install_tips in Markdown. I’ll try copying markdown in here, in wordpress. Otherwise, go here: github:Propulsion/cantera_stuff/cantera_install_tips/

Cantera Installation Tips

Installing Cantera on Fedora Linux directly from the github repository, all the way to being compiled with scons, was nontrivial, mostly because of the installation prerequisites, which, in retrospect, can be easily installed if one knows what they are, i.e. what they’re called in terms of Fedora/CentOS/RedHat dnf.

codename (directory) – reference webpage (if any) – description:

  • cantera_install_success (in ./) – A verbose, but complete, Terminal log of the cantera installation on Fedora Workstation 23 Linux, from git clone, cloning the github repository for cantera directly, all the way to a successful scons install.
  • ClassThermoPhaseExam.cpp (in ./) – cf. Computing Thermodynamic Properties, Class ThermoPhase, Cantera C++ Interface User’s Guide – A simple, complete program that creates an object representing a gas mixture and prints its temperature.
  • chemeqex.cpp (in ./) – cf. Chemical Equilibrium Example Program, Cantera C++ Interface User’s Guide – The equilibrate method is called to set the gas to a state of chemical equilibrium, holding temperature and pressure fixed.
  • verysimplecppprog.cpp (in ./)

Installation Prerequisites, ala Fedora Linux, Fedora/CentOS/RedHat dnf

While Cantera mainpage’s Cantera Compilation Guide gave the packages in terms of Ubuntu/Debian’s package manager:

g++ python scons libboost-all-dev libsundials-serial-dev

and for the python module

cython python-dev python-numpy python-numpy-dev

for other Linux distributions/flavors, the same libraries have different names under different package managers, and some libraries come already installed with the “stock” OS while others don’t (as I found in my situation). For example, Cantera’s main page, in the Ubuntu/Debian installation (compilation) instructions, neglects to mention boost because it’s already installed there (which I found wasn’t the case for Fedora 23 Workstation Linux).

Installation Prerequisites for Fedora 23 Workstation Linux (make sure to do these dnf installs first, and the installation with scons will go more smoothly).

I found that there’s no getting around needing administrator rights for dnf install — be sure to be on a sudo-enabled or admin account to be able to do dnf installs. I also found that compiling Cantera had to be done on a sudo-enabled or administrator account; in particular, access is needed to root directories such as /opt/, etc. (more on that later).

Also, in general, you’ll want to install the developer version of each library as well, usually suffixed with -devel, mostly because the header files will then be placed in the right /usr/* subdirectory so they can be found by the system (when compiling C++ files or installing).

  • g++ and gcc – For something else (namely the CUDA Toolkit), I had already successfully installed, by dnf install, gcc 5, the C/C++ compiler with support for the new C++11/C++14 standards. The C++11 standard is necessary for compiling C++ files that use Cantera (so the flag -std=c++11 is needed with g++).
  • scons – be sure to install scons – there seems to be a push to use scons, a Python-based build tool, for installation and (package) compilation, as opposed to (old-school) CMake or Make.
  • boost – Boost is a set of free, peer-reviewed, portable C++ source libraries.
sudo dnf install boost.x86_64
sudo dnf install boost-devel.x86_64
  • lapack – LAPACK, the Linear Algebra PACKage. Don’t take it for granted that lapack is already installed (I had to troubleshoot this myself, beyond the Cantera mainpage documentation, and find where it is). I had to install it because I found it was missing through the Cantera scons build:
dnf list lapack*  # find lapack in dnf
sudo dnf install lapack.x86_64
sudo dnf install lapack-devel.x86_64
  • blas – BLAS, the Basic Linear Algebra Subprograms. Don’t take it for granted that blas is already installed (I had to troubleshoot this myself, beyond the Cantera mainpage documentation, and find where it is). I had to install it because I found it was missing through the Cantera scons build:
dnf list blas*  # find blas in dnf
sudo dnf install blas.x86_64
sudo dnf install blas-devel.x86_64
  • python-devel – In the same spirit of installing the developer’s version of a library alongside the library itself — so that the headers and symbolic links are installed into the respective root /usr/* subdirectories and your system knows how to include the files — you’ll want to install the Python developer’s libraries.
sudo dnf install python-devel

On this note, for Fedora Linux, I did not find with dnf list python-numpy nor python-numpy-dev which, supposedly, is found in Ubuntu/Debian – this is an example of how Fedora/CentOS/RedHat package manager is different from Ubuntu/Debian.
  • sundials – sundials provides the (essential) nonlinear and ODE solvers.

sudo dnf install sundials.x86_64
sudo dnf install sundials-devel.x86_64
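Before moving on to the build, a quick sanity check along these lines can save a failed scons run — a sketch that only checks that the main build tools are on the PATH (it does not check the -devel header packages, whose names vary by distro):

```shell
# Check that the basic build tools are visible before running scons.
# (Sketch: checks binaries only, not the -devel headers.)
for tool in g++ scons python; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "ok: $tool"
    else
        echo "MISSING: $tool"
    fi
done
```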

Clean install, from git clone to scons install

git clone

scons build -j12

scons build by itself is fine; I added the flag -j12 (correct me if I’m wrong) to run the compilation with 12 parallel jobs, one per core on my machine. So if you’re on a quad-core CPU, you’d do -j4.
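If you don’t know your core count offhand, nproc (from GNU coreutils) reports it, so you can derive the -j value instead of hard-coding it — a small sketch:

```shell
# Derive the scons job count from the number of available cores
# (assumes GNU coreutils' nproc is installed).
JOBS=$(nproc)
echo "scons build -j${JOBS}"   # run the printed command in the cantera source tree
```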
scons test
In my experience, if all the necessary libraries and prerequisite software are installed, then scons test should result in all tests passing, none failing.
sudo scons install

There’s no getting around using sudo for scons install.

A successful sudo scons install should end up looking like this at the very end:

Cantera has been successfully installed.

File locations:

  applications                /usr/local/bin
  library files               /usr/local/lib64
  C++ headers                 /usr/local/include
  samples                     /usr/local/share/cantera/samples
  data files                  /usr/local/share/cantera/data 
  Python 2 package (cantera)  /usr/local/lib64/python2.7/site-packages
  Python 2 samples            /usr/local/lib64/python2.7/site-packages/cantera/examples 
  setup script                /usr/local/bin/setup_cantera

The setup script configures the environment for Cantera. It is recommended that
you run this script by typing:

  source /usr/local/bin/setup_cantera

before using Cantera, or else include its contents in your shell login script.

scons: done building targets.

It’s good to know where all the files were installed.
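Since it’s good to know where things landed, a quick check of the locations reported above can confirm the install actually wrote them — a sketch using the paths from the scons output; missing entries are just reported, not fatal:

```shell
# Verify the install locations that `sudo scons install` reported.
for p in /usr/local/bin/setup_cantera \
         /usr/local/include/cantera \
         /usr/local/lib64; do
    if [ -e "$p" ]; then
        echo "ok: $p"
    else
        echo "missing: $p"
    fi
done
```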

Compiling very simple C++ programs as a sanity check (that Cantera was installed)

The Cantera mainpage’s C++ Interface User’s Guide, under Compiling Cantera C++ Programs, gives three ways to compile C++ programs: pkg-config, SCons, and Make.

However, after a brief perusal of Cantera.mak, you’ll see that the flags included are numerous, daunting, and complicated:

# Required Cantera libraries
CANTERA_CORE_LIBS=-pthread -L/usr/local/lib64 -lcantera

CANTERA_CORE_LIBS_DEP = /usr/local/lib64/libcantera.a

CANTERA_CORE_FTN=-L/usr/local/lib64 -lcantera_fortran -lcantera

CANTERA_FORTRAN_SYSLIBS=-lpthread -lstdc++

#            BOOST

CANTERA_SUNDIALS_LIBS= -lsundials_cvodes -lsundials_ida -lsundials_nvecserial

Do you need sundials all the time? Does anyone (still) program in Fortran in 2016? Do we really need to include the /usr/local/lib64 directory every time? What’s the minimal number of flags needed?

Thus, in this repository’s subdirectory, I included the simple programs that I was able to compile without a complicated Makefile such as Cantera.mak.

I found these compilation commands worked:

g++ -std=c++11 verysimplecppprog.cpp -o verysimplecppprog -lcantera -l pthread
g++ -std=c++11 chemeqex.cpp -o chemeqex -lcantera -l pthread
g++ -std=c++11 ClassThermoPhaseExam.cpp -o ClassThermoPhaseExam -lcantera -l pthread

These flags also worked, but seemed unnecessary:

g++ -std=c++11 chemeqex.cpp -o chemeqex -lcantera -L/usr/local/lib64 -lsundials_cvodes -lsundials_ida -lsundials_nvecserial -L/usr/local/lib -l pthread
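For repeated builds, the working commands above translate into a minimal Makefile — a sketch assuming the /usr/local install locations shown earlier, not a replacement for Cantera.mak:

```make
# Minimal Makefile sketch for the three example programs above.
# Assumes libcantera and its headers are installed under /usr/local.
CXX      = g++
CXXFLAGS = -std=c++11
LDLIBS   = -lcantera -l pthread

all: verysimplecppprog chemeqex ClassThermoPhaseExam

%: %.cpp
	$(CXX) $(CXXFLAGS) $< -o $@ $(LDLIBS)
```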

Troubleshooting installation errors that pop up

  • fatal error: Python.h: No such file or directory

fatal error: Python.h: No such file or directory
scons: *** [build/temp-py/_cantera2.os] Error 1

I found that I had to dnf install python-devel to get the header files installed onto the appropriate /usr/* root subdirectories.

  • scons: *** [/usr/local/include/cantera/Edge.h] /usr/local/include/cantera/Edge.h: Permission denied

Do sudo scons install.

  • error: could not create `/usr/local/lib64/python2.7': Permission denied

Do sudo scons install.

  • scons: *** [/opt/cantera] /opt/cantera: Permission denied

scons: *** [/opt/cantera] /opt/cantera: Permission denied
scons: building terminated because of errors.

Do sudo scons install.

Troubleshooting C++ compilation errors that pop up

I realized that I needed to include the Cantera library in this way:

-lcantera

when compiling with g++.
  • Package cantera was not found in the pkg-config search path.

Package cantera was not found in the pkg-config search path.
Perhaps you should add the directory containing `cantera.pc'
to the PKG_CONFIG_PATH environment variable
No package 'cantera' found
verysimplecppprog.cpp:9:29: fatal error: cantera/Cantera.h: No such file or directory
compilation terminated.

In my experience, I found that pkg-config, even though installed, didn’t work in compiling a simple program.
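If you still want pkg-config to work, the usual fix for that message is to point PKG_CONFIG_PATH at the directory holding cantera.pc — a sketch; the lib64 path is an assumption, so locate cantera.pc on your system first:

```shell
# Add the directory containing cantera.pc to pkg-config's search path.
# /usr/local/lib64/pkgconfig is an assumption; find the real location with:
#   find /usr/local -name cantera.pc
export PKG_CONFIG_PATH=/usr/local/lib64/pkgconfig:$PKG_CONFIG_PATH
# then retry, e.g.:
#   g++ -std=c++11 prog.cpp -o prog $(pkg-config --cflags --libs cantera)
echo "$PKG_CONFIG_PATH"
```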
  • /usr/lib64/ error adding symbols: DSO missing from command line

I Google searched and found this webpage:
cf. “error adding symbols: DSO missing from command line” while compiling g13-driver, Ask Ubuntu

From this page, I saw the use of the line LIBS = -lusb-1.0 -l pthread, and the idea of using the flag -l pthread ended up being the solution.
  • /usr/include/c++/5.3.1/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support must be enabled with the -std=c++11 or -std=gnu++11 compiler options.

You must include the -std=c++11 flag to use the new C++11 standard. Indeed:

/usr/include/c++/5.3.1/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
 #error This file requires compiler and library support \
In file included from /usr/local/include/cantera/base/fmt.h:2:0,
                 from /usr/local/include/cantera/base/ctexceptions.h:14,
                 from /usr/local/include/cantera/thermo/Phase.h:12,
                 from /usr/local/include/cantera/thermo/ThermoPhase.h:14,
                 from /usr/local/include/cantera/thermo.h:12,

So you’ll have to compile like this:

g++ -std=c++11

and include this flag in Makefiles.
  • /usr/bin/ld: cannot find -l

Include the -lcantera flag in the C++ compilation.

Images gallery (that may help you with your installation process; it can be daunting)

dnf list boost*

sudo dnf install boost-devel.x86_64

dnf list lapack*  # find lapack in dnf
sudo dnf install lapack-devel.x86_64

sudo dnf install python-devel

sudo dnf install sundials.x86_64
sudo dnf install sundials-devel.x86_64

git clone

fatal error: Python.h: No such file or directory
scons: *** [build/temp-py/_cantera2.os] Error 1

Successful installation/compilation (what we want, what it should look like)

scons build

scons test

scons test success

sudo scons install

There’s no way, I found, of getting around having to use sudo for scons install – you’ll have to be logged in on a sudo-enabled or administrator account.

These errors are fixed by doing sudo scons install:

scons: *** [/usr/local/include/cantera/Edge.h] /usr/local/include/cantera/Edge.h: Permission denied
error: could not create `/usr/local/lib64/python2.7': Permission denied

sudo scons install

sudo scons install success