Fedora 23 workstation (Linux)+NVIDIA GeForce GTX 980 Ti: my experience, log of what I do (and find out)

OhNoFedora

Learn from my mistakes. Getting to this screen in Fedora 23 (Linux) is a mini-nightmare.

 

 


I recently took delivery of a Titan workstation computer, the Titan X199 (thank you Titan Workstation computers!). Here is some of the configuration:

  • Processor: Intel Xeon E5-1650 v3 Haswell 3.5GHz (3.8 GHz Turbo Boost) 140W 15MB L3 Cache 6 Core
  • Motherboard: MSI X99A SLI PLUS LGA 2011-v3 Intel X99 SATA 6GB/s USB 3.1 USB3.0 ATX Intel Motherboard
  • Memory: 32GB (8x4GB) 288-Pin DDR4 2133 (PC4-17000) Desktop Memory
  • Power: 650W – Deepcool 650W ATX12V SLI Ready CrossFire Ready 80 PLUS GOLD Certified Active PFC Power Supply
  • Video Card 1: NVIDIA GeForce GTX 980 Ti 6GB 384-Bit DDR5 GDDR5 Video Card

I wanted this workstation for three reasons. First, I wanted to try out some CFD computations, parallelized to the processor(s) and GPU (hence the NVIDIA GeForce GTX 980 Ti, $649.99!), and some Deep Learning/Machine Learning computations, again parallelized out to the GPU. With the NVIDIA card, I wanted to learn CUDA. Second, I wanted to build Sage Math from source (it requires a whopping 6 GB of hard drive space) and wanted a more capable computer to deal with building Sage Math all the time. Third, I wanted a workstation dedicated to Linux because a lot of the scientific/numerical computation programs work better (and install, compile, or make “better”) in Linux. I went with Fedora Linux after doing a Google search and reading, more or less, about the “best linux distro for scientific/numerical computation” (e.g. on Quora).

By the way, I am interested in CFD, Deep Learning/Machine Learning computations, and computational physics, and hence this new workstation, solely because I am currently “seeking opportunities in propulsion development” (i.e. I really want to work in the new companies for commercial space industry, SpaceX, Virgin Galactic, Blue Origin) and am trying to develop my skills (set) to help out in that area.

Coming back to this wordpress post: it will be continuously updated, just like my other posts on TQFT, General Relativity, Propulsion (for aerospace stuff), and Computers. I wanted to focus on 4 main topics and collect all my writings into 4 blog posts, 1 for each topic, because I wanted to allow for deeper insight rather than fire off cursory blog posts that spam followers. For instance, the Computers post is a running log of various tips on programming, software, and installation; Gravity has the stuff I pick up on GR, or links to it, including links to my github repo. This post will link back to Computers because this is my experience dealing with computers, so you can always easily navigate to this post from my simple menu with only 4 topics: TQFT, Gravity, Propulsion, Computers. PS. I wish wordpress had a github-like way of doing version control on blog posts, and a way to publish or push blog posts and media files from the command line instead of the browser. I’m finding my github repositories much easier (and more fun) to update, either from the command line or browser.

Now, I was/am solely a Mac OS X/iOS user (I find myself losing my memory of how to use Windows as I haven’t used it in a long time; I used to edit my Windows registry keys for fun), and switching to Fedora Linux has so far been a steep learning curve. I’m going to go ahead and write down tips, hints, advice, and things that I’ve learned even if they might be rudimentary or too simple (or silly) for advanced users, because they were not simple to me (and hopefully they help others)!

Oh no Fedora! Something has gone wrong; A problem has occurred (with Nvidia drm, rpm and nouveau drivers with a new Fedora kernel); panic, and how I recovered my system

Don't let this happen to your distro.
Learn from my experience: don’t let this happen to you and your Fedora 23 Workstation distro.

I had Fedora 23 Workstation (Linux) up and running and, with the fresh install, I first installed the NVIDIA proprietary drivers by simply following the instructions on their official website and in the driver itself.

Much later in the day, Fedora asked me to upgrade, via the Software program in Activities, and I did that with dnf system upgrade.

Now when I turn on the computer, it can’t go into X i.e. the GUI and it flickers sometimes:

[Image: flickering, failed video device startup]

Taking a look at the built-in EFI boot(er) (I tried and tried again and again to reinstall off a Live USB disk, but it didn’t work because it went straight to this built-in EFI),

[Image: built-in EFI boot menu for the Titan X199]

Fedora 4.2.3 23 is the original kernel; Fedora 4.4.8 is the offending kernel (right after it installed and restarted automatically, Fedora’s X graphics environment didn’t work anymore). None of the 3 options can load the graphics environment, and I’m not sure how to check which driver or package install was bad so that I can remove it, reconfigure, and try to run again. With any of the 3 options I keep getting this until I press Ctrl+Alt+F2.

Also, I was receiving error messages when I booted up and couldn’t get into my X11 X (graphical) windowing environment; I was stuck at the low-resolution command line.

From my Xorg.0.log, it said

(EE)
fedora linux Nvidia Failed to initialize the Nvidia kernel module please see the Nvidia system's kernel log for additional error messages and consult the NVIDIA README for details

No devices detected

Fatal server error:
no screens found(EE)

Lines marked (EE) are where errors occurred during boot up.

What happened to me has happened to other people when they used (or, in a kernel update, were switched over to) the nouveau drivers (the open-source ones) for their NVIDIA GTX video card.

cf.
Nvidia driver causes boot hang when upgrading to Fedora 23

and also

Nvidia drivers not loading correctly on Fedora 23. However, I would not follow the advice given in either one: neither the advice to downgrade X11 in the first, nor the advice in that stackexchange question.

Fix

Instead, what worked for me was following, to the letter, the If not true then false Fedora 23/22/21 nVidia Drivers Install Guide. This guide worked for me for reinstalling the proprietary NVIDIA drivers after a conflicting kernel upgrade and the accidental installation of the nouveau or nvidia-drm drivers. Go there.

I also ended up back at the official NVIDIA Linux-64 bit drivers page, especially their Additional Information subpage, for instructions on how to install their proprietary drivers; this page is what helped me install the first time, and then reinstall their driver. Also, keep in mind that you can uninstall using the same installer file

sh ./NVIDIA-Linux-x86_64-346.35.run --uninstall

but with the --uninstall flag added (use the file name of the driver version you actually installed, and check the installer’s --help output if in doubt).

Before the steps mentioned above, I tried removing the kernel that was the last major change (the updater said it needed to install and restart, and that upgrade and restart led to the “Oh no” screen of death).
cf.
http://www.labtestproject.com/using_linux/remove_fedora_kernel.html

From that page, I did the following commands:
rpm -qa | grep ^kernel

You want to be sure that you’re not removing the current kernel you’re running:
uname -r

Finally, the remove:
sudo yum remove kernel-4.4.8-300.fc23.x86_64 kernel-headers-4.4.8-300.fc23.x86_64 kernel-devel-4.4.8-300.fc23.x86_64 kernel-core-4.4.8-300.fc23.x86_64 kernel-modules-4.4.8-300.fc23.x86_64
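
The check-then-remove sequence above can be wrapped so that the remove step refuses to touch the running kernel. This is only a sketch: `safe_to_remove` is a hypothetical helper name, not something from the referenced page, and the version string in the usage comment is the one from my case.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the given kernel version-release
# string differs from the kernel that is currently running (uname -r).
safe_to_remove() {
    local target="$1"
    local running
    running="$(uname -r)"
    if [ "$target" = "$running" ]; then
        echo "refusing: $target is the running kernel" >&2
        return 1
    fi
    echo "ok: $target is not the running kernel ($running)"
}

# Usage: list the installed kernel packages, then guard the remove step.
#   rpm -qa | grep ^kernel
#   safe_to_remove 4.4.8-300.fc23.x86_64 && sudo yum remove kernel-core-4.4.8-300.fc23.x86_64
```

The comparison works because `uname -r` prints the same version-release.arch string that appears in the kernel package names.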

Then I uninstalled (from the command line) and reinstalled (following the if not True then False guide) the NVidia drivers.

Wrap up

Finally, you’d want to do things like display the video card (note that this lists the device, not the driver version):

lspci | grep VGA
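
To go one step further and see which kernel driver is bound to the card, and which version of the NVIDIA kernel module is loaded, something like the following sketch can help. It assumes the proprietary driver exposes /proc/driver/nvidia/version while its module is loaded; `show_nvidia_driver` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print the NVIDIA kernel module version file, or explain why it is missing.
show_nvidia_driver() {
    local version_file="${1:-/proc/driver/nvidia/version}"
    if [ -r "$version_file" ]; then
        cat "$version_file"
    else
        echo "NVIDIA kernel module does not appear to be loaded (cannot read $version_file)" >&2
        return 1
    fi
}

# lspci -k adds a "Kernel driver in use" line under each device,
# which shows whether nouveau or nvidia currently owns the card.
if command -v lspci >/dev/null 2>&1; then
    lspci -k | grep -i -A 3 'VGA' || true
fi

show_nvidia_driver || true
```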

So in conclusion, my advice from my experience, and of almost losing my X11, X, startX, graphical windowing environment is to

  • Be extremely careful about doing a dnf or yum system upgrade or kernel upgrade, and watch out what dependencies get installed when you do install a new program
  • If you run into trouble, check dnf history to see what steps to (manually) undo
  • In my case, I had to uninstall the new, offending kernel off the built-in EFI boot(er), following http://www.labtestproject.com/using_linux/remove_fedora_kernel.html
  • Uninstall and reinstall the proprietary Linux driver; just follow what it says.

Make a USB (live) boot disk of your distro

Whoops, the NVIDIA GTX 980 Ti did not like that last Fedora 23 upgrade. I have no idea which entry in the logs shows what went wrong, and my GUI or X (startX) isn’t starting. Unfortunately, the only thing left to do is to reinstall from a Live USB boot disk.

I did this to find out where my USB disk is on my Mac OS X:

diskutil list

I made a note of the /dev/diskn identifier (where n is 1, 2, 3, etc., e.g. /dev/disk2) and of the partition number (e.g. it said #2: SANDISKCRUZ, where SANDISKCRUZ is the name I gave the disk when I formatted the USB stick, and so #2 it is).

After downloading the 64-bit iso I needed for Fedora, I did, e.g.


sudo dd if=Fedora_Live-Workstation-x86_64-23-10.iso of=/dev/rdisk2s2

I read that adding the ‘r’ in ‘rdisk2s2’ speeds things up.

This process took about 86 minutes on a MacBook Pro, Late-2013 (!!!). To check the status I did Ctrl-T and it gave me records in, records out, and total bytes transferred. I tried pkill and sending a signal with kill but couldn’t work that out.
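
For reference, a common variant of the dd step on Mac OS X (an assumption on my part, not what I ran above) is to unmount the whole USB disk first, write to the whole raw disk rather than to a slice, and pass a larger block size. `write_iso` is a hypothetical helper; it only prints the commands unless DRY_RUN=0 is set, because dd overwrites the target disk without asking.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (default) or run the two commands that
# write an ISO image to a USB stick on Mac OS X.
write_iso() {
    local iso="$1"    # path to the downloaded .iso
    local disk="$2"   # disk identifier from `diskutil list`, e.g. disk2
    local run="echo"  # default is a dry run that just prints the commands
    if [ "${DRY_RUN:-1}" = "0" ]; then
        run=""
    fi
    # Unmount every volume on the stick so dd can write to the raw device.
    $run diskutil unmountDisk "/dev/$disk"
    # bs=1m is the BSD dd spelling on Mac OS X (GNU dd on Linux uses bs=1M).
    $run sudo dd if="$iso" of="/dev/r$disk" bs=1m
}

# Dry run: prints the commands so the disk identifier can be double-checked.
write_iso Fedora_Live-Workstation-x86_64-23-10.iso disk2
```

While dd is running, Ctrl-T in its terminal makes it print progress. From another terminal, sending SIGINFO to the dd process (kill -INFO followed by its process ID) should do the same on Mac OS X; GNU dd on Linux listens for SIGUSR1 instead.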

cf. How to Copy an ISO to a USB Drive from Mac OS X with dd (super useful article/link); is ‘dd’ command taking too long?, Show progress of dd command (clarified many things; his experience on using dd)

Also, keep in mind the official Fedora documentation for making a Live USB:

https://fedoraproject.org/wiki/How_to_create_and_use_Live_USB

Off to try to reinstall with this USB disk…

…And it didn’t help. I discovered on my own that Titan workstation computers keep Fedora 23 Workstation linux on the built-in EFI (EFI is the firmware interface, with its own boot manager, that replaced BIOS). No matter how many times I try to boot off the USB disk by changing the boot order or disabling the SATA disk drive, or any disk drive, the workstation boots directly to the built-in EFI boot loader for the Fedora 23 Workstation. Aaaaaaaaaaaaaaahhhh. AAAArggggg.

See the above section, Oh no Fedora! Something has gone wrong; A problem has occurred (with Nvidia drm, rpm and nouveau drivers with a new Fedora kernel); panic, and how I recovered my system, to see how I manually, from the command line, recovered my X (graphical) environment.

Installation of NVIDIA CUDA on Fedora 23 Workstation (Linux)

See also my github repository MLgrabbag, the README.md file, for the latest update, as well.

Installation of NVIDIA’s CUDA Toolkit on a Fedora 23 Workstation was nontrivial; part of the reason is that 7.5 appears to be the latest version of the CUDA Toolkit (as of 20160512), and 7.5 only supports (for sure) Fedora 21. Also, this 7.5 version supports (out of the box) the C compiler gcc up to version 4.* and not gcc 5. But there’s no reason why the later versions, Fedora 23 as opposed to Fedora 21 and gcc 5 as opposed to gcc 4.*, cannot be used (I got CUDA to work on my setup, including samples). I did find that I had to make some nontrivial symbolic links (ln).

I wanted to install CUDA for Udacity’s Intro to Parallel Programming, and in particular, in the very first lesson or video, Intro to the Class, for instructions on running CUDA locally, only the links to the official NVIDIA documentation were given, in particular for Linux,

http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html

But one only needs to do a Google search and read some forum posts to see that installing CUDA, on Windows, Mac, or Linux, is highly nontrivial.

I’ll point out how I did it, and refer to the links that helped me (sometimes you simply follow, to the letter, the instructions there) and other links in which you should follow the instructions, but modify to suit your (my) system, and what NOT to do (from my experience).

Gist, short summary, steps to do (without full details), to just get CUDA to work (no graphics)

My install procedure assumes you are using the latest proprietary NVIDIA Accelerated Graphics Drivers for Linux. I removed and/or blacklisted any other open-source versions of nvidia drivers, and in particular blacklisted nouveau. See my blog post for details and description.

  1. Download the latest CUDA Toolkit (appears to be 7.5 as of 20160512). For my setup, I clicked on the boxes Linux for Operating System, x86_64 for Architecture, Fedora for Distribution, 21 for Version (only one there), runfile (local) for Installer Type (it was the first option that appeared). Then I modified the instructions on their webpage:
    1. Run `sudo sh cuda_7.5.18_linux.run`
    2. Follow the command-line prompts.
    3. Instead, I did


      $ sudo sh cuda_7.5.18_linux.run --override

      with the --override flag to use gcc 5 so I did not have to downgrade to gcc 4.*.

      Here is how I selected my options at the command-line prompts (and part of the result):


      $ sudo sh cuda_7.5.18_linux.run --override

      -------------------------------------------------------------
      Do you accept the previously read EULA? (accept/decline/quit): accept
      You are attempting to install on an unsupported configuration. Do you wish to continue? ((y)es/(n)o) [ default is no ]: yes
      Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 352.39? ((y)es/(n)o/(q)uit): n
      Install the CUDA 7.5 Toolkit? ((y)es/(n)o/(q)uit): y
      Enter Toolkit Location [ default is /usr/local/cuda-7.5 ]:
      Do you want to install a symbolic link at /usr/local/cuda? ((y)es/(n)o/(q)uit): y
      Install the CUDA 7.5 Samples? ((y)es/(n)o/(q)uit): y
      Enter CUDA Samples Location [ default is /home/[yournamehere] ]: /home/[yournamehere]/Public
      Installing the CUDA Toolkit in /usr/local/cuda-7.5 ...
      Missing recommended library: libGLU.so
      Missing recommended library: libX11.so
      Missing recommended library: libXi.so
      Missing recommended library: libXmu.so

      Installing the CUDA Samples in /home/[yournamehere]/ ...
      Copying samples to /home/propdev/Public/NVIDIA_CUDA-7.5_Samples now...
      Finished copying samples.

      Again, Fedora 23 was not a supported configuration, but I wished to continue. I had already installed NVIDIA Accelerated Graphics Driver for Linux (that’s how I was seeing my X graphical environment) but it was a later version 361.* and I did not want to uninstall it and then reinstall, which was recommended by other webpages (I had already gone through the mini-nightmare of reinstalling these drivers before, which can trash your X11 environment that you depend on for a functioning GUI).

    4. Continuing, this was also printed out by CUDA’s installer:


      Installing the CUDA Samples in /home/propdev/Public ...
      Copying samples to /home/propdev/Public/NVIDIA_CUDA-7.5_Samples now...
      Finished copying samples.

      ===========
      = Summary =
      ===========

      Driver: Not Selected
      Toolkit: Installed in /usr/local/cuda-7.5
      Samples: Installed in /home/[yournamehere]/Public, but missing recommended libraries

      Please make sure that
      - PATH includes /usr/local/cuda-7.5/bin
      - LD_LIBRARY_PATH includes /usr/local/cuda-7.5/lib64, or, add /usr/local/cuda-7.5/lib64 to /etc/ld.so.conf and run ldconfig as root

      To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-7.5/bin
      To uninstall the NVIDIA Driver, run nvidia-uninstall

      Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-7.5/doc/pdf for detailed information on setting up CUDA.

      ***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 352.00 is required for CUDA 7.5 functionality to work.
      To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
      sudo <CudaInstaller>.run -silent -driver

      Logfile is /tmp/cuda_install_7123.log

      For “PATH includes /usr/local/cuda-7.5/bin” I do


      $ export PATH=/usr/local/cuda-7.5/bin:$PATH

      as suggested by Chapter 6 of CUDA_Getting_Started_Linux.pdf

      Dealing with the LD_LIBRARY_PATH, I did this: I created a new text file called cuda.conf in /etc/ld.so.conf.d (open up your favorite text editor), e.g. I used emacs:


      sudo emacs /etc/ld.so.conf.d/cuda.conf

      and I pasted in the directory


      /usr/local/cuda-7.5/lib64

      (since my setup is 64-bit) into this text file. I did this because my /etc/ld.so.conf file includes files from /etc/ld.so.conf.d, i.e. it says


      include ld.so.conf.d/*.conf

      Make sure this change takes effect (it updates the dynamic linker’s cache) by running the command


      ldconfig

      as root.

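      I check PATH and LD_LIBRARY_PATH with the echo command each time I reboot, log back in, or start a new Terminal window (with the ld.so.conf.d approach, LD_LIBRARY_PATH itself can stay empty, since the directory is picked up from the linker cache instead):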


      echo $PATH
      echo $LD_LIBRARY_PATH
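
Two details are easy to miss here. An export typed at the prompt lasts only for that shell, and ldconfig updates the dynamic linker's cache rather than the LD_LIBRARY_PATH variable. A minimal sketch of how both could be handled, assuming bash and that these lines are placed in ~/.bashrc (`prepend_path` is a hypothetical helper name):

```shell
#!/usr/bin/env bash
# Prepend a directory to PATH only if it is not already there, so that
# opening nested shells does not keep growing PATH.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, do nothing
        *) PATH="$1:$PATH" ;;
    esac
}

prepend_path /usr/local/cuda-7.5/bin
export PATH

# Check the linker cache directly instead of echoing LD_LIBRARY_PATH.
if command -v ldconfig >/dev/null 2>&1; then
    ldconfig -p | grep -i cuda || echo "no CUDA libraries in the linker cache"
fi
```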

    5. Patch the host_config.h header file

      cf. Install NVIDIA CUDA on Fedora 22 with gcc 5.1 and CUDA incompatible with my gcc version.

      To use gcc 5 instead of gcc 4.*, I needed to patch the host_config.h header file because I kept receiving errors. What worked for me was doing this to the file – original version:


      #if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ > 9)

      #error -- unsupported GNU version! gcc versions later than 4.9 are not supported!

      #endif /* __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ > 9) */

      Commented-out version (these 3 lines)

      // #if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ > 9)

      // #error -- unsupported GNU version! gcc versions later than 4.9 are not supported!

      // #endif /* __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ > 9) */

      Afterwards, I did not have any problems with c compiler gcc incompatibility (yet).
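
The same edit can be scripted. The sketch below is a variation on what I did by hand: it keeps a backup and comments out only the #error line, which leaves an empty but still valid #if/#endif pair. The header location in the usage comment is an assumption (look for host_config.h under the toolkit's include directory), `disable_gcc_check` is a hypothetical name, and `sed -i` is used in its GNU form.

```shell
#!/usr/bin/env bash
# Hypothetical helper: comment out the gcc version #error in a header file.
disable_gcc_check() {
    local header="$1"
    # Keep an untouched copy so the change is easy to revert.
    cp "$header" "$header.orig" || return 1
    # Prefix the #error line with "// "; & stands for the matched text.
    sed -i 's|^#error -- unsupported GNU version!|// &|' "$header"
}

# Usage, from a root shell (the path is an assumption):
#   disable_gcc_check /usr/local/cuda-7.5/include/host_config.h
```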

    6. At this point CUDA runs without problems if no graphics capabilities are needed. For instance, as a sanity check, I built `deviceQuery` from the samples installed with CUDA and ran it:

      $ cd ~/Public/NVIDIA_CUDA-7.5_Samples/1_Utilities/deviceQuery
      $ make -j12
      $ ./deviceQuery

      And then if your output looks something like this, then success!


      ./deviceQuery Starting...

      CUDA Device Query (Runtime API) version (CUDART static linking)

      Detected 1 CUDA Capable device(s)

      Device 0: "GeForce GTX 980 Ti"
      CUDA Driver Version / Runtime Version 8.0 / 7.5
      CUDA Capability Major/Minor version number: 5.2
      Total amount of global memory: 6143 MBytes (6441730048 bytes)
      (22) Multiprocessors, (128) CUDA Cores/MP: 2816 CUDA Cores
      GPU Max Clock rate: 1076 MHz (1.08 GHz)
      Memory Clock rate: 3505 Mhz
      Memory Bus Width: 384-bit
      L2 Cache Size: 3145728 bytes
      Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
      Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
      Total amount of constant memory: 65536 bytes
      Total amount of shared memory per block: 49152 bytes
      Total number of registers available per block: 65536
      Warp size: 32
      Maximum number of threads per multiprocessor: 2048
      Maximum number of threads per block: 1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch: 2147483647 bytes
      Texture alignment: 512 bytes
      Concurrent copy and kernel execution: Yes with 2 copy engine(s)
      Run time limit on kernels: Yes
      Integrated GPU sharing Host Memory: No
      Support host page-locked memory mapping: Yes
      Alignment requirement for Surfaces: Yes
      Device has ECC support: Disabled
      Device supports Unified Addressing (UVA): Yes
      Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
      Compute Mode:

      deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GTX 980 Ti
      Result = PASS

    7. Getting the other samples to run, getting CUDA to have graphics capabilities, soft symbolic linking to the existing libraries.

      The general procedure I ended up with was to use `locate` to find the relevant `*.so.*` or `*.h` file for the missing library or missing header, respectively, and then make soft symbolic links to it with the `ln -s` command. I found that the samples included by NVIDIA do not all agree on which directory the graphical libraries (GL, GLU, X11, glut, etc.) are expected to be in.
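
The locate-then-link flow can be sketched as a small helper. `link_into` is a hypothetical name, and the paths in the usage comment are placeholders: the real ones come from whatever locate reports on your system and from the directory that the failing sample's Makefile searches.

```shell
#!/usr/bin/env bash
# Hypothetical helper: make a soft link to an existing file inside the
# directory where a sample expects to find it.
link_into() {
    local target="$1"     # existing library or header, as found with locate
    local link_dir="$2"   # directory where the sample looks for it
    if [ ! -e "$target" ]; then
        echo "missing: $target" >&2
        return 1
    fi
    # -s makes a soft (symbolic) link; the link keeps the target's file name.
    ln -s "$target" "$link_dir/$(basename "$target")"
}

# Usage (placeholder paths; system directories need a root shell):
#   locate libGLU.so
#   link_into /path/reported/by/locate/libGLU.so /path/the/sample/expects
```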

    To be continued, and see my github repo MLgrabbag, the README.md file, for the latest update (html code is a pain compared to markdown and I don’t want to download any more programs to convert markdown to html; I’m doing a lot of installing already).

    Sage Math: Installing programs on Fedora 23 (Linux)

    First, I tried following the Sage Math install-from-source guide. I followed the steps until Step-by-Step installation procedure, General procedure, 4. Read the README.txt; there I read README.md instead.
    I use emacs, so I did emacs README.md in the directory where sage-7.1 is. Following the section More Detailed Instructions to Build from Source in README.md, I did
    export MAKE="make -j14"
    because my processor has 12 logical processors, which I found out by the following:
    cf. http://www.binarytides.com/linux-cpu-information/

    $ less /proc/cpuinfo
    and
    $ cat /proc/cpuinfo | grep processor | wc -l
    So I use 14 jobs, slightly more than the 12 processors. It took about 25-30 minutes.
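
The job count does not have to be typed by hand. A sketch, assuming a Linux system where either nproc (GNU coreutils) or /proc/cpuinfo is available; `count_cpus` is a hypothetical helper name, and it uses exactly one job per logical processor rather than the slightly higher count I picked:

```shell
#!/usr/bin/env bash
# Print the number of logical processors available to this shell.
count_cpus() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        grep -c '^processor' /proc/cpuinfo
    fi
}

# Export MAKE before building, as in the README step above.
export MAKE="make -j$(count_cpus)"
echo "$MAKE"
```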

    However, it failed to build, again and again, even for libraries I had successfully installed through Anaconda conda (from Continuum) such as git-2.6 and matplotlib. So now I am trying to follow the instructions that Eric Gourgoulhon (LUTH) gave me for building Sage Math from the git develop version, which I had already written up in my Computers post, under Starting or beginning developing (i.e. contributing code) to a major open-source project, in this case, Sage Math.

    If you’re getting errors when building from github Sage Math using my and Gourgoulhon’s instructions,

    Check the errors you’re receiving and the suggested log files. In the “root” directory of the sage directory with the source src, there is a logs/pkgs directory with all the logs of installed or failed packages, and in my particular case, flask_babel-0.9.log failed. Reading the log, it was a “Download Error!” So it was probably a problem with my internet connection (I’ve had problems with Time Warner Cable as a service provider for service interruptions and I cannot recommend TWC).

    Try your make again, but be sure not to rebuild the previously successful packages, by setting this variable at the command prompt in the main sage directory before running make (it needs to be exported so that make sees it):


    export SAGE_KEEP_BUILT_SPKGS=yes

    cf. http://hpc.wm.edu/SciClone/documentation/software/math/sage-5.1/html/en/installation/source.html

    Also, I was able to install from the pre-built Linux binaries.

    In my experience, either building straight from the git development version (following the instructions Eric Gourgoulhon and I give, as stated in my “Computers” blog post) or using the pre-built binary is the way to go for installing Sage Math, in both Mac OS X and Fedora Linux. The build-from-source instructions in Sage Math haven’t worked for me.

    TeXLive Install for LaTeX

    This was straightforward. I did this:
    yum install texlive-scheme-full

    cf. How to fully install Latex in fedora?

    Good intentions; bad advice i.e. DON’T follow these commands carelessly in Fedora 23 (Linux)

    You may be (at least I certainly was) in a rush to fix something and so you are furiously doing Google searches and searching forum posts and trying any kind of command(s) to fix the problem. But here, I collect commands NOT to do (casually).

    http://www.liquidweb.com/kb/how-to-install-and-configure-git-on-fedora-23/

    dnf -y upgrade

    Don’t do dnf upgrade casually. This is because NVIDIA’s proprietary drivers may conflict with the latest kernel. This has happened to others.
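
One way to be less casual about it is to look at what an upgrade would pull in before running it. The sketch below is an assumption about workflow rather than something I did at the time: `kernel_lines` is a hypothetical filter, while `dnf check-update` and the `--exclude` option are standard dnf features.

```shell
#!/usr/bin/env bash
# Read a package listing on stdin and print only lines for kernel packages.
kernel_lines() {
    # grep exits 1 when nothing matches; treat that as "nothing pending".
    grep -E '^kernel' || true
}

# Usage, on the Fedora machine:
#   dnf -q check-update | kernel_lines       # which kernel packages would change?
#   sudo dnf upgrade --exclude='kernel*'     # upgrade everything except kernel packages
```

Holding the kernel back this way leaves time to check that the NVIDIA driver supports the new kernel before switching to it.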

    You don’t need to downgrade X11!!!

    I found that I didn’t need to downgrade my X11 as advised in the article NVIDIA – Incompatible with Fedora 23 Xorg – and a Workaround.., as the latest NVIDIA drivers did just fine.

    How not to replace nouveau drivers in Fedora 23

    cf. HOWTO: Install NVIDIA driver on Fedora – replacing Nouveau

    I wouldn’t do it this way (it didn’t work for me): in the crucial step #4 of theirs, nouveau is blacklisted in /etc/modprobe.d with the commands

    echo 'blacklist nouveau' >> /etc/modprobe.d/disable-nouveau.conf
    echo 'nouveau modeset=0' >> /etc/modprobe.d/disable-nouveau.conf

    Instead, what worked for me, again, as previously linked and written about above, was following, to the letter, the If !1 0 Fedora 23/22/21 nVidia Drivers Install Guide.

    While on that note, advice in the fedoraforum post, entitled
    [SOLVED] Oh no! Something has gone wrong didn’t help me. I was thinking of trying to do a reinstall of Fedora into the built-in EFI, but this post, how to install Fedora 11 in EFI shell and GPT partition? didn’t help.

    I had the same problem as described here (with a similar log), in FC22: nvidia kernel module loads, but X can’t initialize GPU, but the fix the member StefanJ proposed didn’t help in my situation.

    Number of “cores” on Fedora 23 (Linux)? In Linux, they’re called cpus or processors

    grep processor /proc/cpuinfo

    cat /proc/cpuinfo | less

    Also, other system information:

    cat /proc/meminfo | less
    lspci

    cf. http://www.cyberciti.biz/faq/linux-display-cpu-information-number-of-cpus-and-their-speed/

    getting error “Can’t create transaction lock” with rpm

    Solution:

    “Try running your command as root. It worked for me.” –phathutshezo