In the subdirectory scan in Lesson Code Snippets 3 is an implementation in CUDA C++11 and C++11, with global memory, of the Hillis/Steele (inclusive) scan, Blelloch (prefix; exclusive) scan(s), each in both parallel and serial implementation.
As you can see, for large float arrays, running parallel implementations in CUDA C++11, where I used the GeForce GTX 980 Ti smokes being run serially on the CPU (I use for a CPU the Intel® Xeon(R) CPU E5-1650 v3 @ 3.50GHz × 12.
I have a thorough write up on the README.md of my fork of Udacity’s cs344 on github.
Note that I was learning about the Hillis/Steele and Blelloch (i.e. Prefix) scan(s) methods in conjunction with Udacity’s cs344, Lesson 3 – Fundamental GPU Algorithms (Reduce, Scan, Histogram), i.e. Unit 3.. I have a writeup of the notes I took related to these scans, formulating them mathematically, on my big CompPhys.pdf, Computational Physics notes.