Tuesday, May 17, 2016

Start your engines: source code for FireRays (AMD's high performance OpenCL based GPU ray tracing framework) available

AMD has just released the full source code of FireRays, their OpenCL based GPU rendering framework, which was first released as an SDK library in August 2015 (see http://raytracey.blogspot.co.nz/2015/08/firerays-amds-opencl-based-high.html). This is an outstanding move by AMD which significantly lowers the threshold for developers to enter the GPU rendering arena and create an efficient OpenCL based path tracing engine that is able to run on hardware from AMD, Intel and Nvidia without extra effort.

Here's a sample render of FireRays provided by AMD:


And an old video from one of the developers:


Nvidia open sourced their high performance CUDA based ray tracing framework in 2009, but hasn't updated it since 2012 (presumably due to the lack of any real competition from AMD in this area) and has since focused more on developing OptiX, a CUDA based closed source ray tracing library. Intel open sourced Embree in 2011, which is being actively developed and updated with new features and performance improvements. They even released another open source high performance ray tracer for scientific visualisation called OSPRay.

FireRays seems to have some advanced features such as ray filtering, geometry and ray masking (to make certain objects invisible to the camera or selectively ignore effects like shadows and reflections) and support for volumetrics. Hopefully AMD will also release some in-depth documentation and getting started tutorials in order to maximise adoption of this new technology among developers who are new to GPU ray tracing.

Tuesday, March 8, 2016

Free introductory minibook on ray tracing

Peter Shirley has just released "Ray tracing: the next week", a free book on Amazon for anyone who wants to learn how to code a basic path tracer in C++: http://psgraphics.blogspot.be/2016/03/new-ray-tracing-mini-book-is-out.html


"Ray tracing the next week" is a follow-up to another mini-book by Shirley, "Ray tracing in one weekend" which was released only last month and covers the very basics of a ray tracer including ray-sphere intersection, path tracing of diffuse, metal and dielectric materials, anti-aliasing, positionable camera and depth-of-field. The Kindle edition is available for free when downloaded within the next five days (until 11 March). The book is excellent for people who quickly want to dive into coding a path tracer from scratch without being overwhelmed by theoretical details. It covers more advanced features such as solid textures, image textures, participating media, motion blur, instancing, and BVH acceleration structures and comes with source code snippets (using C plus classes and operator overloading, easily portable to CUDA). The code even contains some simple but clever optimisation tricks which are not published in any other ray tracing books.

Monday, January 18, 2016

GPU path tracing tutorial 3: GPU-friendly Acceleration Structures. Now you're cooking with GAS!

(In case you were wondering, my pun-loving girlfriend came up with the title for this post). This tutorial is the longest but most crucial one so far and deals with the implementation of a ray tracing acceleration structure that can be traversed on the GPU. The code from the previous tutorial works okay for simple triangle meshes with fewer than 10,000 triangles, but since render times grow linearly or O(n) with the complexity of the scene (each ray needs to test every primitive in the scene for intersection), anything above that number becomes unfeasible. To address this issue, ray tracing researchers came up with several acceleration structures such as grids, octrees, binary space partitioning trees (BSP trees), kd-trees and BVHs (bounding volume hierarchies), allowing render times to scale logarithmically or O(log n) instead of linearly with scene complexity: for a scene with one million triangles, a ray descends roughly log2(1,000,000) ≈ 20 levels of a tree instead of testing all million triangles, a huge improvement in speed and efficiency. Acceleration structures are by far the most important ingredient of a fast ray tracer and an enormous amount of research has gone into improving and refining the algorithms to build and traverse them, both on the CPU and on the GPU (the latter since 2006, around the same time unified shader architecture was introduced on GPUs).

Scratch-a-Pixel (again) has a great introduction to acceleration structures for ray tracing (grids and bounding volume hierarchies) that includes example code: http://www.scratchapixel.com/lessons/advanced-rendering/introduction-acceleration-structure. Peter Shirley's "Realistic Ray Tracing" book also contains a good description and implementation of a BVH with C++ code.

An overview of the latest state-of-the-art research in acceleration structures for GPUs can be found in this blogpost on Robbin Marcus' blog: http://robbinmarcus.blogspot.co.nz/2015/10/real-time-raytracing-part-2.html

This tutorial focuses on the implementation of a BVH acceleration structure on the GPU, and comes with complete annotated source code for BVH construction (on the CPU) and BVH traversal (on the GPU using CUDA). The reason for choosing a BVH over a grid or kd-tree is that BVHs map better to modern GPU architectures and have also been shown to be the acceleration structure allowing the fastest build and render times (see for example https://anteru.net/research/quantitative-analysis-of-voxel-raytracing-acceleration-structures/). Another reason for choosing BVHs is that they are conceptually simple and easy to implement. The Nvidia research paper "Understanding the efficiency of ray traversal on GPUs" by Aila and Laine comes with open source code that contains a highly optimised BVH for CUDA path tracers, which was used in Cycles, Blender's GPU path tracing renderer (http://wiki.blender.org/index.php/Dev:Source/Render/Cycles/BVH).

The code in this tutorial is based on a real-time CUDA ray tracer developed by Thanassis Tsiodras, which can be found on http://users.softlab.ntua.gr/~ttsiod/cudarenderer-BVH.html and which I converted to support path tracing instead. The BVH from this renderer is already quite fast and relatively easy to understand.

For the purpose of clarity and to keep the code concise (as there's quite a lot of code required for BVH construction), I removed quite a few nice features from Thanassis' code which are not essential for this tutorial, such as multithreaded BVH building on the CPU (using SSE intrinsics), various render modes (like point rendering), backface culling, a scheme to divide the image into render tiles in Morton order (along a space-filling Z-curve) and some clever workarounds to deal with CUDA's limitations, such as separate templated kernels for shadow rays and ambient occlusion.

One of the trickier parts of implementing a BVH for ray tracing on the GPU is how to store the BVH structure and BVH node data in a GPU friendly format. CPU ray tracers store a BVH as a hierarchical structure starting with the root node, which contains pointers to its child nodes (in the case of an inner node) or pointers to triangles (in the case of a leaf node). Since a BVH is built recursively, the child nodes in turn contain pointers to their own child nodes, and this keeps going until the leaf nodes are reached. This process involves lots of pointers which might point to scattered locations in memory, a scenario which is not ideal for the GPU. GPUs like coherent, memory aligned data structures such as indexable arrays that avoid the use of too many pointers. In this tutorial, the BVH data (such as nodes, triangle data, triangle indices, precomputed intersection data) are therefore stored in flat one-dimensional arrays (storing elements in depth-first order by recursively traversing the BVH), which can be easily digested by CUDA and are stored on the GPU in either global memory or texture memory in the form of CUDA textures (hardware cached). The BVH in this tutorial uses CUDA texture memory, since global memory on older GPUs is not cached (as opposed to texture memory). Since the introduction of Fermi, however, global memory is also cached and the performance difference when using one or the other is hardly noticeable.

In order to avoid wasting time by rebuilding the BVH every time the program is run, the BVH is built only once and stored in a file. For this to work, the BVH data is converted to a cache-friendly format which takes up as little memory space as possible (but the compactness of the data makes it also harder to read). A clever scheme is used to store BVH leaf nodes and inner nodes using the same data structure: instead of using a separate struct for leaf nodes and inner nodes, both types of nodes occupy the same memory space (using a union), which stores either two child indices to the left and right child when dealing with an inner node or a start index into the list of triangles and a triangle count in case of a leaf node. To distinguish between a leaf node and an inner node, the highest bit of the triangle count variable is set to 1 for a leaf node. The renderer can then determine at runtime if it has intersected an inner node or a leaf node by checking the highest bit (with a bitwise AND operation).  
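
To make this scheme concrete, here's a hedged sketch of what such a union-based node layout can look like (the actual struct in the code is called CacheFriendlyBVHNode; the field names here are illustrative). Both node types fit in the same 32 bytes: two float3s for the AABB bounds plus 8 bytes that mean different things for inner and leaf nodes:

struct CacheFriendlyBVHNode {
 float3 bottom;  // AABB minimum corner (12 bytes)
 float3 top;     // AABB maximum corner (12 bytes)
 union {         // 8 bytes, interpreted according to node type
  struct { unsigned idxLeft; unsigned idxRight; } inner;               // inner node: two child indices
  struct { unsigned count; unsigned startIndexInTriIndexList; } leaf;  // leaf node: triangle count + start index
 } u;
};

// leaf test: the highest bit of the count field flags a leaf node
inline bool isLeaf(const CacheFriendlyBVHNode& node)
{
 return (node.u.leaf.count & 0x80000000) != 0;  // bitwise AND on the top bit
}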

A lot of the triangle intersection data (such as triangle edges, barycentric coordinates, dot products between vertices and edge planes) is precomputed at the scene initialisation stage and stored. Since modern GPUs have much more raw compute power than memory bandwidth, it would be interesting to know whether fetching the precomputed data from memory is faster or slower compared to computing that data directly on the GPU. 

The following is a high level explanation of the algorithm for top-down BVH construction (on the CPU) and traversal (on the GPU). The BVH in this code is built according to the surface area heuristic and uses binning to find the best splitting plane. The details of the BVH algorithm can be found in the following papers:

"On fast construction of SAH based Bounding Volume Hierarchies" by Ingo Wald, 2007. This paper is a must read in order to understand what the code is doing.

- "Ray tracing deformable scenes using dynamic Bounding Volume Hierarchies" by Wald, Boulos and Shirley, 2007

- "On building fast kd-trees for ray tracing, and on doing that in O(N log N)" by Wald and Havran, 2006


Overview of algorithm for building the BVH on the CPU

- the main() function (in main.cpp) calls prepCUDAscene(), which in turn calls UpdateBoundingVolumeHierarchy()

- UpdateBoundingVolumeHierarchy() checks if there is already a BVH for the scene stored (cached) in a file and loads that one or builds a new BVH by calling CreateBVH()

- CreateBVH():
  1. computes a bbox (bounding box) for every triangle and calculates the bounds (top and bottom)
  2. initialises a "working list" bbox to contain all the triangle bboxes
  3. expands the bounds of the working list bbox so it encompasses all triangles in the scene by looping over all the triangle bboxes
  4. computes each triangle bbox centre and adds the triangle bbox to the working list
  5. passes the working list to Recurse(), which builds the BVH tree structure
  6. returns the BVH root node
Recurse() recursively builds the BVH tree from top (root node) to bottom using binning, finding an optimal split plane at each depth. It divides the work bounding box into a number of equally sized "bins" along each axis and chooses the axis and splitting plane resulting in the least cost, determined by the surface area heuristic (SAH: the larger the surface area of a bounding box, the costlier it is to raytrace), i.e. the split yielding child bboxes with minimal surface area:
  1. Check if the working list contains fewer than 4 elements (triangle bboxes), in which case create a leaf node and push each triangle to a triangle list
  2. Create an inner node if the working list contains 4 or more elements
  3. Divide node further into smaller nodes
  4. Start by finding the working list bounds (top and bottom)
  5. Loop over all bboxes in current working list, expanding/growing the working list bbox
  6. compute the surface area of the working list's bounding box from its dimensions
  7. The current bbox has a cost C of N (number of triangles) * SA (Surface Area) or C = N * SA
  8. Loop over all three axises (X, Y, Z) to find best splitting plane using "binning"
  9. Binning: try splitting the current axis at equidistantly spaced planes, in "bins" of size "step"; the size of the "sampling grid" halves the deeper we go: 1024 bins (depth 0), 512 (depth 1), etc. (a condensed code sketch follows after this list)
  10. For each bin (equally spaced bins of size "step"), initialise a left and right bounding box 
  11. For each test split (or bin), allocate all triangles in the current work list based on their bbox centers (this is a fast O(N) pass, no triangle sorting needed): if the center of the triangle bbox is smaller than the test split value, put the triangle in the left bbox, otherwise put the triangle in the right bbox. Count the number of triangles in the left and right bboxes.
  12. Now use the Surface Area Heuristic to see if this split has a better "cost": calculate the surface area of the left and right bbox and calculate the total cost by multiplying the surface area of the left and right bbox by the number of triangles in each. Keep track of cheapest split found so far.
  13. At the end of this loop (which runs for every "bin" or "sample location"), we should have the best splitting plane, best splitting axis and bboxes with minimal traversal cost
  14. If we found no split to improve the cost, create a BVH leaf, otherwise create a BVH inner node with L and R child nodes. Split with the optimal value we found above.
  15. After selection of the best split plane, distribute each of the triangles into the left or right child nodes based on their bbox center
  16. Recursively build the left and right child nodes (repeat steps 1 - 16)
  17. When all recursive function calls have finished, the end result of Recurse() is to return the root node of the BVH
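
As promised above, here's a condensed sketch of the SAH binning loop (steps 8 - 12) for a single axis. AABB, BBoxEntry and findBestSplit() are illustrative stand-ins for the corresponding pieces of the real code; the float3 helpers (fminf, fmaxf, operator-) come from cutil_math.h as in the other tutorials:

#include <cfloat>
#include <vector>

struct AABB {
 float3 bottom = make_float3(1e30f, 1e30f, 1e30f);
 float3 top = make_float3(-1e30f, -1e30f, -1e30f);
 void grow(const AABB& b) { bottom = fminf(bottom, b.bottom); top = fmaxf(top, b.top); }
};

float surfaceArea(const AABB& b) {
 float3 d = b.top - b.bottom;
 return 2.0f * (d.x * d.y + d.x * d.z + d.y * d.z);
}

struct BBoxEntry { AABB bbox; float centre; };  // per-triangle bbox + its centre on this axis

// returns the best splitting position, or FLT_MAX if no split beats noSplitCost
float findBestSplit(const std::vector<BBoxEntry>& workList,
                    float start, float end, float step, float noSplitCost)
{
 float bestCost = noSplitCost;   // cost of leaving the node unsplit: C = N * SA
 float bestSplit = FLT_MAX;
 for (float testSplit = start + step; testSplit < end - step; testSplit += step) {
  AABB leftBox, rightBox;
  int leftCount = 0, rightCount = 0;
  for (const BBoxEntry& e : workList) {            // fast O(N) pass, no sorting needed
   if (e.centre < testSplit) { leftBox.grow(e.bbox); leftCount++; }
   else { rightBox.grow(e.bbox); rightCount++; }
  }
  if (leftCount <= 1 || rightCount <= 1) continue; // skip degenerate splits
  // SAH: surface area of each child weighted by its triangle count
  float cost = surfaceArea(leftBox) * leftCount + surfaceArea(rightBox) * rightCount;
  if (cost < bestCost) { bestCost = cost; bestSplit = testSplit; }
 }
 return bestSplit;   // FLT_MAX here means: create a leaf instead of splitting
}
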
Once the BVH has been created, we can copy its data into a memory saving, cache-friendly format (CacheFriendlyBVHNode occupies exactly 32 bytes, i.e. a cache line) by calling CreateCFBVH(), which recursively counts the triangles and bounding boxes and stores them in depth-first order in one-dimensional arrays by calling PopulateCacheFriendlyBVH().

The data of the cache friendly BVH is copied to the GPU into CUDA global memory by prepCUDAscene() (using the cudaMalloc() and cudaMemcpy() functions). Once the data is in global memory it's ready to be used by the renderer, but the code takes it one step further and binds the BVH data to CUDA textures for performance reasons (texture memory is cached, although global memory is also cached since Fermi). The texture binding is done by cudarender() (in cuda_pathtracer.cu), which calls cudaBindTexture(). After this stage, all scene data is ready to be rendered (rays traversing the BVH and intersecting triangles).
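
For illustration, here's a minimal sketch of what that binding step looks like with the (legacy) CUDA texture reference API that cudaBindTexture() belongs to; the variable names are illustrative, not the ones from cuda_pathtracer.cu:

texture<float4, 1, cudaReadModeElementType> g_bvhDataTexture;  // 1D texture over a flat BVH array

void bindBVHData(const float4* devicePtr, size_t sizeInBytes)
{
 // devicePtr was filled earlier with cudaMalloc()/cudaMemcpy();
 // binding it makes all fetches go through the hardware cached texture path
 cudaBindTexture(NULL, g_bvhDataTexture, devicePtr, sizeInBytes);
}

// device code then reads the data with tex1Dfetch(g_bvhDataTexture, index)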


Overview of algorithm for traversing the BVH on the GPU

- after cudarender() has bound the data to CUDA textures with cudaBindTexture() the first time it's called, it launches the CoreLoopPathTracingKernel(), which runs in parallel over all pixels to render a frame.
- CoreLoopPathTracingKernel() computes a primary ray starting from the interactive camera view (which can differ each frame) and calls path_trace() to calculate the ray bounces 
- path_trace() first tests all spheres in the scene for intersection and then tests if the ray intersects any triangles by calling BVH_IntersectTriangles() which traverses the BVH.
- BVH_IntersectTriangles() (a skeletal code sketch follows after this list):
  1. initialise a stack to keep track of all the nodes the ray has traversed
  2. while the stack is not empty, pop a BVH node from the stack and decrement the stack index
  3. fetch the data associated with this node (indices to left and right child nodes for inner nodes or start index in triangle list + triangle count for leaf nodes)
  4. determine if the node is a leaf node or an inner node by examining the highest bit of the count variable
  5. if inner node, test ray for intersection with AABB (axis aligned bounding box) of node --> if intersection, push left and right child node indices on the stack, and go back to step 2 (pop next node from the stack)
  6. if leaf node, loop over all the triangles in the node (determined by the start index in the list of triangle indices and the triangle count), 
  7. for each triangle in the node, fetch the index, center, normal and precomputed intersection data and check for intersection with the ray
  8. if ray intersects triangle, keep track of the closest hit
  9. keep popping nodes from the stack and processing them (repeat steps 2 - 9) until the stack is empty
  10. once the stack is empty, the end result returned by the function is a bool based on the index of the closest hit triangle (true if index is not -1)
- after the ray has been tested for intersection with the scene, compute the colour of the ray by multiplying with the colour of the intersected object, calculate the direction of the next ray in the path according to the material BRDF and accumulate the colours of the subsequent path segments (see GPU path tracing tutorial 1).
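
Here's the skeletal version of that traversal loop, reusing the union-based node layout sketched in the construction section and the Ray struct from tutorial 1. fetchNode(), rayIntersectsBox() and fetchTriangleIndex() are illustrative placeholders for the texture fetches and the ray/AABB slab test in the real kernel:

#define BVH_STACK_SIZE 32

// illustrative placeholders for the real texture fetch helpers and AABB test:
__device__ CacheFriendlyBVHNode fetchNode(int idx);
__device__ int fetchTriangleIndex(int idx);
__device__ bool rayIntersectsBox(const Ray& ray, const float3& bottom, const float3& top);

__device__ bool BVH_IntersectTriangles(const Ray& ray /* , hit record, ... */)
{
 int stack[BVH_STACK_SIZE];   // indices of BVH nodes still to be visited
 int stackIdx = 0;
 stack[stackIdx++] = 0;       // push the root node
 int bestTriIdx = -1;         // closest hit triangle found so far

 while (stackIdx) {
  int boxIdx = stack[--stackIdx];                 // pop the next node
  CacheFriendlyBVHNode node = fetchNode(boxIdx);  // fetch node data (texture memory)

  if (!(node.u.leaf.count & 0x80000000)) {        // highest bit clear: inner node
   if (rayIntersectsBox(ray, node.bottom, node.top)) {  // ray vs AABB test
    stack[stackIdx++] = node.u.inner.idxRight;         // push both children
    stack[stackIdx++] = node.u.inner.idxLeft;
   }
  }
  else {                                          // leaf node: test its triangles
   unsigned count = node.u.leaf.count & 0x7fffffff;    // mask off the leaf flag
   for (unsigned i = 0; i < count; i++) {
    int triIdx = fetchTriangleIndex(node.u.leaf.startIndexInTriIndexList + i);
    // intersect ray with triangle triIdx using the precomputed data;
    // on a closer hit, update bestTriIdx and the closest hit distance
   }
  }
 }
 return bestTriIdx != -1;     // true if the ray hit any triangle
}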

In addition to the BVH, I added an interactive camera based on the interactive CUDA path tracer code from Yining Karl Li and Peter Kutz (https://github.com/peterkutz/GPUPathTracer). The camera's view direction and position can be changed interactively with mouse and keyboard (a new orthonormal basis for the camera is computed each frame). The camera produces an antialiased image by jittering the primary ray directions. By allowing primary rays to start randomly on a simulated disk shaped lens instead of from a point, a camera aperture (the opening in the diaphragm) with a focal plane can be simulated, providing a cool, photographic depth-of-field effect. The focal distance can also be adjusted interactively.
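
A hedged sketch of that thin-lens trick (parameter and helper names are illustrative, not taken from the repository): jitter the primary ray origin over a disk-shaped aperture and aim every lens sample at the same point on the focal plane, so geometry at the focal distance stays sharp while everything else blurs. It assumes a normalized pixel ray direction and reuses the Ray struct and getrandom() from tutorial 1:

__device__ Ray makePrimaryRay(float3 camPos, float3 rayDirThroughPixel,
                              float apertureRadius, float focalDistance,
                              float3 camRight, float3 camUp,
                              unsigned int* s1, unsigned int* s2)
{
 // point on the focal plane that the unperturbed ray would hit
 float3 focalPoint = camPos + rayDirThroughPixel * focalDistance;

 // pick a random point on the disk shaped lens (polar coordinates;
 // the sqrt() gives a uniform density over the disk area)
 float angle = 2.0f * M_PI * getrandom(s1, s2);
 float radius = apertureRadius * sqrtf(getrandom(s1, s2));
 float3 lensPoint = camPos + camRight * (cosf(angle) * radius) + camUp * (sinf(angle) * radius);

 // all lens samples converge on the same focal point
 return Ray(lensPoint, normalize(focalPoint - lensPoint));
}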

The material system for this tutorial allows five basic materials: ideal diffuse, ideal specular, ideal refractive, Phong metal (based on code from Peter Shirley's "Realistic Ray Tracing" book) with a hardcoded exponent and a coat (acrylic) material (based on Karl Li and Peter Kutz' CUDA path tracer).
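
As an illustration of the Phong metal case, here's roughly how such a bounce with a hardcoded exponent can be sampled inside the bounce loop, following the cos^n lobe sampling described in Shirley's "Realistic Ray Tracing" (variable names match the tutorial 1 radiance() code; the exponent value is arbitrary):

float phongExponent = 30.0f;                                      // hardcoded shininess
float r1 = 2.0f * M_PI * getrandom(s1, s2);                       // random azimuth around the lobe
float r2 = getrandom(s1, s2);
float cosTheta = powf(1.0f - r2, 1.0f / (phongExponent + 1.0f));  // sample the cos^n Phong lobe
float sinTheta = sqrtf(1.0f - cosTheta * cosTheta);

// orthonormal basis around the ideal mirror reflection direction
float3 w = normalize(r.dir - nl * 2.0f * dot(nl, r.dir));
float3 u = normalize(cross((fabs(w.x) > .1f ? make_float3(0, 1, 0) : make_float3(1, 0, 0)), w));
float3 v = cross(w, u);

// perturb the mirror direction inside the Phong lobe
r.dir = normalize(u * cosf(r1) * sinTheta + v * sinf(r1) * sinTheta + w * cosTheta);
r.orig = x + nl * 0.05f;  // offset the new origin to prevent self-intersection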


CUDA/C++ source code

The source code for this tutorial can be found at 

https://github.com/straaljager/GPU-path-tracing-tutorial-3/

As in the previous tutorials, I aimed to keep the code as simple and clear as possible and added plenty of comments throughout (in addition to the original comments). If some steps still aren't clear, let me know. Detailed compilation instructions for Windows and Visual Studio are in the readme file: https://github.com/straaljager/GPU-path-tracing-tutorial-3/blob/master/README.md


Download executable (Windows only)

https://github.com/straaljager/GPU-path-tracing-tutorial-3/releases

All scene elements in this executable are hardcoded. Changing the scene objects, materials and lights is only possible by directly editing and re-compiling the source code. 


Screenshots

Screenshots produced with the code from this tutorial (Stanford Dragon and Happy Buddha .ply models from the Stanford 3D scanning repository)

Glossy Stanford dragon model (871,000 triangles)



Happy Buddha model (1,088,000 triangles) with Phong metal material

The next tutorial will add even more speed: I'll dive deeper into the highly optimised BVH acceleration structure for GPU traversal from Aila and Laine, which uses spatial splitting to build higher quality (and faster) trees. It's also the framework that the GPU part of Blender Cycles is using.

Other features for upcoming tutorials are support for textures, sun and sky lighting, environment lighting, more general and accurate materials using Fresnel, area light support, direct light sampling and multiple importance sampling.

References

- CUDA based sphere path tracer by Peter Kutz and Yining Karl Li
- "Realistic Ray Tracing" by P. Shirley

Tuesday, December 1, 2015

GPU path tracing tutorial 2: interactive triangle mesh path tracing

While the tutorial from the previous post was about path tracing simple scenes made of spheres, this tutorial will focus on how to build a very simple path tracer with support for loading and rendering triangle meshes. Instead of rendering the entire image in the background and saving it to a file as was done in the last tutorial, this path tracer displays an interactive viewport which shows progressively rendered updates. This way we can see the rendered image from the first pass and watch it converge to a noise free result (which can take some time in the case of path tracing triangle meshes without using acceleration structures).
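
The progressive refinement itself boils down to very little code. A minimal sketch of the idea (buffer and variable names are illustrative): every frame adds one new sample per pixel to an accumulation buffer, and the viewport displays the running average, which converges towards the final image:

__global__ void accumulate_kernel(float3* accumbuffer, float3* displaybuffer, unsigned int framenumber)
{
 unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
 unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
 unsigned int i = y * (blockDim.x * gridDim.x) + x;  // pixel index (image width = blockDim.x * gridDim.x)

 float3 newSample = make_float3(0.0f, 0.0f, 0.0f);   // in the real code this comes from the path tracing step
 accumbuffer[i] += newSample;                        // running sum of all samples so far
 displaybuffer[i] = accumbuffer[i] / (float)framenumber;  // running average shown in the viewport
}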

For this tutorial I decided to modify the code of a real-time CUDA ray tracer developed by Peter Trier from the Alexandra Institute in 2009 (described in this blog post), because it's very compact, does not use any external libraries (except for CUDA-OpenGL interoperability) and provides a simple obj loader for triangle meshes. I modified the ray tracing kernel to handle path tracing (without recursion) using the path tracing code from the previous tutorial, and added support for perfectly reflective and refractive materials (like glass) based on the code of smallpt. The random number generator from the previous post has been replaced with cuRAND, CUDA's own random number generation library, which is less prone to patterns at low sample rates and has more uniform distribution properties. The seed calculation is based on a trick described in a post on RichieSam's blog.
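
For reference, a minimal sketch of per-thread cuRAND usage (the exact seed formula in the tutorial follows RichieSam's trick of hashing the frame number; the version below is illustrative):

#include <curand_kernel.h>

__global__ void render_kernel(float3* output, unsigned int hashedframenumber)
{
 unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
 unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
 unsigned int threadId = (blockIdx.x + blockIdx.y * gridDim.x) * (blockDim.x * blockDim.y)
                       + (threadIdx.y * blockDim.x) + threadIdx.x;

 // one RNG state per thread, reseeded every frame to decorrelate samples
 curandState randState;
 curand_init(hashedframenumber + threadId, 0, 0, &randState);

 float r1 = curand_uniform(&randState);  // uniform float in (0, 1]
 float r2 = curand_uniform(&randState);
 // ... use r1/r2 for sampling exactly like getrandom() in tutorial 1 ...
}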


Features of this path tracer

- primitive types: supports spheres, boxes and triangles/triangle meshes
- material types: support for perfectly diffuse, perfectly reflective and perfectly refractive materials
- progressive rendering
- interactive viewport displaying intermediate rendering results

Scratch-a-Pixel has some excellent lessons on ray tracing triangles and triangle meshes, which discuss barycentric coordinates, backface culling and the fast Möller-Trumbore ray/triangle intersection algorithm that is also used in the code for this tutorial:

- ray tracing triangles: http://www.scratchapixel.com/lessons/3d-basic-rendering/ray-tracing-rendering-a-triangle

- ray tracing polygon meshes: http://www.scratchapixel.com/lessons/3d-basic-rendering/ray-tracing-polygon-mesh

The code is one big CUDA file with lots of comments and can be found on my Github repository.

Github repository link: https://github.com/straaljager/GPU-path-tracing-tutorial-2 


Some screenshots


Performance optimisations

- triangle edges are precomputed to speed up ray intersection computation and triangles are stored as (first vertex, edge1, edge2)
- ray/triangle intersection uses the fast Möller-Trumbore technique (a code sketch follows after this list)
- triangle data is stored in the GPU's texture memory which is cached and is a bit faster than global memory because fetching data from textures is accelerated in hardware. The texture cache is also optimized for 2D spatial locality, so threads that access addresses in texture memory that are close together in 2D will achieve best performance. 
- triangle data is aligned in float4s (128 bits) for coalesced memory access, maximising memory throughput (see https://docs.nvidia.com/cuda/cuda-c-programming-guide/#device-memory-accesses and http://blog.spectralstudios.net/raytracing/realtime-raytracing-part-3/#more-573)
- for expensive functions (such as sin() and sqrt()), compute fast approximations using single precision intrinsic math functions such as __sinf(), __powf(), __fdividef(): these functions are performed in hardware by the special function units (SFUs) on the GPU and are much faster than the standard divide and sin/cos functions, at the cost of precision and robustness in corner cases (see https://docs.nvidia.com/cuda/cuda-c-programming-guide/#intrinsic-functions)
- to speed up the ray tracing, an axis-aligned bounding box is created around the triangle mesh. Only rays hitting this box are intersected with the mesh. Without this box, all rays would have to be tested against every triangle for intersection, which is unbearably slow.
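
Here's the promised sketch of the Möller-Trumbore test, assuming the (first vertex, edge1, edge2) triangle layout mentioned above and the float3 helpers from cutil_math.h; the function name and the "miss" sentinel value are illustrative:

// returns the distance t along the ray, or a huge value if there is no hit
__device__ float intersect_triangle(const float3& orig, const float3& dir,
                                    const float3& v0, const float3& edge1, const float3& edge2)
{
 float3 pvec = cross(dir, edge2);
 float det = dot(edge1, pvec);
 if (fabsf(det) < 1e-8f) return 1e20f;   // ray is parallel to the triangle plane
 float invdet = 1.0f / det;
 float3 tvec = orig - v0;
 float u = dot(tvec, pvec) * invdet;     // first barycentric coordinate
 if (u < 0.0f || u > 1.0f) return 1e20f;
 float3 qvec = cross(tvec, edge1);
 float v = dot(dir, qvec) * invdet;      // second barycentric coordinate
 if (v < 0.0f || u + v > 1.0f) return 1e20f;
 return dot(edge2, qvec) * invdet;       // distance along the ray
}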



In the next tutorial, we'll have a look at implementing an acceleration structure, which speeds up the rendering by several orders of magnitude. This blog post provides a good overview of the most recent research in ray tracing acceleration structures for the GPU. There will also be an interactive camera to allow real-time navigation through the scene with depth of field and supersampled anti-aliasing (and there are still lots of optimisations left to cover).

Monday, October 5, 2015

GPU path tracing tutorial 1: Drawing First Blood

In early 2011 I developed a simple real-time path traced Pong game together with Kerrash on top of an open source GPU path tracer called tokaspt (developed by Thierry Berger-Perrin), which could only render spheres but was bloody fast at it. The physics were bodged, but the game proved that path tracing of very simple scenes at 30 fps was feasible, although a bit noisy. You can still download it from https://code.google.com/p/tokap-the-once-known-as-pong/. Since that time I've always wanted to write a short and simple tutorial about GPU path tracing to show how to make your GPU draw an image with high quality ray traced colour bleeding with a minimum of code, and now is a good time to do exactly that.

This tutorial is not meant as an introduction to ray tracing or path tracing as there are plenty of excellent ray tracing tutorials for beginners online such as Scratch-a-Pixel (also check out the old version which contains more articles) and Minilight (more links at the bottom of this article). The goal of this tutorial is simply to show how incredibly easy it is to turn a simple CPU path tracer into a CUDA accelerated version. Being a fan of the KISS principle from design and engineering (Keep It Simple Stupid) and aiming to avoid unnecessary complexity, I've chosen to cudafy Kevin Beason's smallpt, the most basic but still fully functional CPU path tracer around. It's a very short piece of code that doesn't require the user to install any tedious libraries to compile the code (apart from Nvidia's CUDA Toolkit).

The full CPU version of smallpt can be found at http://www.kevinbeason.com/smallpt/. Due to its compactness the code is not very easy to read, but fortunately David Cline made a great PowerPoint presentation explaining what each line in smallpt is doing, with references to Peter Shirley's "Realistic Ray Tracing" book.

To keep things simple and free of needless clutter, I've stripped out the code for the tent filter, supersampling, Russian Roulette and the material BRDFs for reflective and refractive materials, leaving only the diffuse BRDF. The 3D vector class from smallpt is replaced by CUDA's own built-in float3 type (built-in CUDA types are more efficient due to automatic memory alignment) which has the same linear algebra math functions as a vector such as addition, subtraction, multiplication, normalize, length, dot product and cross product. For reasons of code clarity, there is no error checking when initialising CUDA. To compile the code, save the code in a file with ".cu" file extension and follow these CUDA installation guides to install Nvidia's GPU Computing Toolkit and configure the programming tools to work with CUDA.

After reading the slides from David Cline, the commented code below should speak for itself, but feel free to drop me a comment below if some things are still not clear.

So without further ado, here's the full CUDA code (also on Github at https://github.com/straaljager/GPU-path-tracing-tutorial-1):

// smallptCUDA by Sam Lapere, 2015
// based on smallpt, a path tracer by Kevin Beason, 2008  
 
#include <iostream>
#include <cuda_runtime.h>
#include <vector_types.h>
#include "device_launch_parameters.h"
#include <cutil_math.h> // from http://www.icmc.usp.br/~castelo/CUDA/common/inc/cutil_math.h

#define M_PI 3.14159265359f  // pi
#define width 512  // screenwidth
#define height 384 // screenheight
#define samps 1024 // samples 

// __device__ : executed on the device (GPU) and callable only from the device

struct Ray { 
 float3 orig; // ray origin
 float3 dir;  // ray direction 
 __device__ Ray(float3 o_, float3 d_) : orig(o_), dir(d_) {} 
};

enum Refl_t { DIFF, SPEC, REFR };  // material types, used in radiance(), only DIFF used here

struct Sphere {

 float rad;            // radius 
 float3 pos, emi, col; // position, emission, colour 
 Refl_t refl;          // reflection type (e.g. diffuse)

__device__ float intersect_sphere(const Ray &r) const { 
          
 // ray/sphere intersection
 // returns distance t to intersection point, 0 if no hit  
 // ray equation: p(x,y,z) = ray.orig + t*ray.dir
 // general sphere equation: x^2 + y^2 + z^2 = rad^2 
 // classic quadratic equation of form ax^2 + bx + c = 0 
 // solution x = (-b +- sqrt(b*b - 4ac)) / 2a
 // solve t^2*ray.dir*ray.dir + 2*t*(orig-p)*ray.dir + (orig-p)*(orig-p) - rad*rad = 0 
 // more details in "Realistic Ray Tracing" book by P. Shirley or Scratchapixel.com

  float3 op = pos - r.orig;    // distance from ray.orig to center sphere 
  float t, epsilon = 0.0001f;  // epsilon required to prevent floating point precision artefacts
  float b = dot(op, r.dir);    // b in quadratic equation
  float disc = b*b - dot(op, op) + rad*rad;  // discriminant quadratic equation
  if (disc<0) return 0;       // if disc < 0, no real solution (we're not interested in complex roots) 
   else disc = sqrtf(disc);    // if disc >= 0, check for solutions using negative and positive discriminant
  return (t = b - disc)>epsilon ? t : ((t = b + disc)>epsilon ? t : 0); // pick closest point in front of ray origin
 }
};

// SCENE
// 9 spheres forming a Cornell box
// small enough to be in constant GPU memory
// { float radius, { float3 position }, { float3 emission }, { float3 colour }, refl_type }
__constant__ Sphere spheres[] = {
 { 1e5f, { 1e5f + 1.0f, 40.8f, 81.6f }, { 0.0f, 0.0f, 0.0f }, { 0.75f, 0.25f, 0.25f }, DIFF }, //Left 
 { 1e5f, { -1e5f + 99.0f, 40.8f, 81.6f }, { 0.0f, 0.0f, 0.0f }, { .25f, .25f, .75f }, DIFF }, //Right 
 { 1e5f, { 50.0f, 40.8f, 1e5f }, { 0.0f, 0.0f, 0.0f }, { .75f, .75f, .75f }, DIFF }, //Back 
 { 1e5f, { 50.0f, 40.8f, -1e5f + 600.0f }, { 0.0f, 0.0f, 0.0f }, { 1.00f, 1.00f, 1.00f }, DIFF }, //Frnt 
 { 1e5f, { 50.0f, 1e5f, 81.6f }, { 0.0f, 0.0f, 0.0f }, { .75f, .75f, .75f }, DIFF }, //Botm 
 { 1e5f, { 50.0f, -1e5f + 81.6f, 81.6f }, { 0.0f, 0.0f, 0.0f }, { .75f, .75f, .75f }, DIFF }, //Top 
 { 16.5f, { 27.0f, 16.5f, 47.0f }, { 0.0f, 0.0f, 0.0f }, { 1.0f, 1.0f, 1.0f }, DIFF }, // small sphere 1
 { 16.5f, { 73.0f, 16.5f, 78.0f }, { 0.0f, 0.0f, 0.0f }, { 1.0f, 1.0f, 1.0f }, DIFF }, // small sphere 2
 { 600.0f, { 50.0f, 681.6f - .77f, 81.6f }, { 2.0f, 1.8f, 1.6f }, { 0.0f, 0.0f, 0.0f }, DIFF }  // Light
};

__device__ inline bool intersect_scene(const Ray &r, float &t, int &id){

 float n = sizeof(spheres) / sizeof(Sphere), d, inf = t = 1e20;  // t is distance to closest intersection, initialise t to a huge number outside scene
 for (int i = int(n); i--;)  // test all scene objects for intersection
  if ((d = spheres[i].intersect_sphere(r)) && d<t){  // if newly computed intersection distance d is smaller than current closest intersection distance
    t = d;  // keep track of distance along ray to closest intersection point 
    id = i; // and closest intersected object
  }
 return t<inf; // returns true if an intersection with the scene occurred, false when no hit
}

// random number generator from https://github.com/gz/rust-raytracer

__device__ static float getrandom(unsigned int *seed0, unsigned int *seed1) {
 *seed0 = 36969 * ((*seed0) & 65535) + ((*seed0) >> 16);  // hash the seeds using bitwise AND and bitshifts
 *seed1 = 18000 * ((*seed1) & 65535) + ((*seed1) >> 16);

 unsigned int ires = ((*seed0) << 16) + (*seed1);

 // Convert to float
 union {
  float f;
  unsigned int ui;
 } res;

 res.ui = (ires & 0x007fffff) | 0x40000000;  // bitwise AND, bitwise OR

 return (res.f - 2.f) / 2.f;
}

// radiance function, the meat of path tracing 
// solves the rendering equation: 
// outgoing radiance (at a point) = emitted radiance + reflected radiance
// reflected radiance is sum (integral) of incoming radiance from all directions in hemisphere above point, 
// multiplied by reflectance function of material (BRDF) and cosine incident angle 
__device__ float3 radiance(Ray &r, unsigned int *s1, unsigned int *s2){ // returns ray color

 float3 accucolor = make_float3(0.0f, 0.0f, 0.0f); // accumulates ray colour with each iteration through bounce loop
 float3 mask = make_float3(1.0f, 1.0f, 1.0f); 

 // ray bounce loop (no Russian Roulette used) 
 for (int bounces = 0; bounces < 4; bounces++){  // iteration up to 4 bounces (replaces recursion in CPU code)

  float t;           // distance to closest intersection 
  int id = 0;        // index of closest intersected sphere 

// test ray for intersection with scene
  if (!intersect_scene(r, t, id))
   return make_float3(0.0f, 0.0f, 0.0f); // if miss, return black

  // else, we've got a hit!
  // compute hitpoint and normal
  const Sphere &obj = spheres[id];  // hitobject
  float3 x = r.orig + r.dir*t;          // hitpoint 
  float3 n = normalize(x - obj.pos);    // normal
  float3 nl = dot(n, r.dir) < 0 ? n : n * -1; // front facing normal

  // add emission of current sphere to accumulated colour
  // (first term in rendering equation sum) 
  accucolor += mask * obj.emi;

  // all spheres in the scene are diffuse
  // diffuse material reflects light uniformly in all directions
  // generate new diffuse ray:
  // origin = hitpoint of previous ray in path
  // random direction in hemisphere above hitpoint (see "Realistic Ray Tracing", P. Shirley)

  // create 2 random numbers
  float r1 = 2 * M_PI * getrandom(s1, s2); // pick random number on unit circle (radius = 1, circumference = 2*Pi) for azimuth
  float r2 = getrandom(s1, s2);  // pick random number for elevation
  float r2s = sqrtf(r2); 

  // compute local orthonormal basis uvw at hitpoint to use for calculation random ray direction 
  // first vector = normal at hitpoint, second vector is orthogonal to first, third vector is orthogonal to first two vectors
  float3 w = nl; 
  float3 u = normalize(cross((fabs(w.x) > .1 ? make_float3(0, 1, 0) : make_float3(1, 0, 0)), w));  
  float3 v = cross(w,u);

  // compute random ray direction on hemisphere using polar coordinates
  // cosine weighted importance sampling (favours ray directions closer to normal direction)
  float3 d = normalize(u*cos(r1)*r2s + v*sin(r1)*r2s + w*sqrtf(1 - r2));

  // new ray origin is intersection point of previous ray with scene
  r.orig = x + nl*0.05f; // offset ray origin slightly to prevent self intersection
  r.dir = d;

  mask *= obj.col;    // multiply with colour of object       
  mask *= dot(d,nl);  // weigh light contribution using cosine of angle between incident light and normal
  mask *= 2;          // fudge factor
 }

 return accucolor;
}


// __global__ : executed on the device (GPU) and callable only from host (CPU) 
// this kernel runs in parallel on all the CUDA threads

__global__ void render_kernel(float3 *output){

 // assign a CUDA thread to every pixel (x,y) 
 // blockIdx, blockDim and threadIdx are CUDA specific keywords
 // replaces nested outer loops in CPU code looping over image rows and image columns 
 unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;   
 unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

 unsigned int i = (height - y - 1)*width + x; // index of current pixel (calculated using thread index) 

 unsigned int s1 = x;  // seeds for random number generator
 unsigned int s2 = y;

// generate ray directed at lower left corner of the screen
// compute directions for all other rays by adding cx and cy increments in x and y direction
 Ray cam(make_float3(50, 52, 295.6), normalize(make_float3(0, -0.042612, -1))); // first hardcoded camera ray(origin, direction) 
 float3 cx = make_float3(width * .5135 / height, 0.0f, 0.0f); // ray direction offset in x direction
 float3 cy = normalize(cross(cx, cam.dir)) * .5135; // ray direction offset in y direction (.5135 is field of view angle)
 float3 r; // r is final pixel color       
    
 r = make_float3(0.0f); // reset r to zero for every pixel 

 for (int s = 0; s < samps; s++){  // samples per pixel
    
  // compute primary ray direction
  float3 d = cam.dir + cx*((.25 + x) / width - .5) + cy*((.25 + y) / height - .5);
  
  // create primary ray, add incoming radiance to pixelcolor
  r = r + radiance(Ray(cam.orig + d * 40, normalize(d)), &s1, &s2)*(1. / samps); 
 }       // Camera rays are pushed ^^^^^ forward to start in interior 

 // write rgb value of pixel to image buffer on the GPU, clamp value to [0.0f, 1.0f] range
 output[i] = make_float3(clamp(r.x, 0.0f, 1.0f), clamp(r.y, 0.0f, 1.0f), clamp(r.z, 0.0f, 1.0f));
}

inline float clamp(float x){ return x < 0.0f ? 0.0f : x > 1.0f ? 1.0f : x; } 

inline int toInt(float x){ return int(pow(clamp(x), 1 / 2.2) * 255 + .5); }  // convert RGB float in range [0,1] to int in range [0, 255] and perform gamma correction

int main(){

 float3* output_h = new float3[width*height]; // pointer to memory for image on the host (system RAM)
 float3* output_d;    // pointer to memory for image on the device (GPU VRAM)

 // allocate memory on the CUDA device (GPU VRAM)
 cudaMalloc(&output_d, width * height * sizeof(float3));
        
 // dim3 is CUDA specific type, block and grid are required to schedule CUDA threads over streaming multiprocessors
 dim3 block(8, 8, 1);   
 dim3 grid(width / block.x, height / block.y, 1);

 printf("CUDA initialised.\nStart rendering...\n");
 
 // schedule threads on device and launch CUDA kernel from host
 render_kernel <<< grid, block >>>(output_d);  

 // copy results of computation from device back to host
 cudaMemcpy(output_h, output_d, width * height *sizeof(float3), cudaMemcpyDeviceToHost);  
 
 // free CUDA memory
 cudaFree(output_d);  

 printf("Done!\n");

 // Write image to PPM file, a very simple image file format
 FILE *f = fopen("smallptcuda.ppm", "w");          
 fprintf(f, "P3\n%d %d\n%d\n", width, height, 255);
 for (int i = 0; i < width*height; i++)  // loop over pixels, write RGB values
  fprintf(f, "%d %d %d ", toInt(output_h[i].x),
                          toInt(output_h[i].y),
                          toInt(output_h[i].z));
 
 printf("Saved image to 'smallptcuda.ppm'\n");

 delete[] output_h;
 system("PAUSE");
}

Optionally, the following 3D vector algebra functions can be inserted at the top of the file instead of #including "cutil_math.h". Instead of creating a Vector3D class (with 3 floats), CUDA's built-in float3 type is used, as built-in types have automatic memory alignment and provide higher performance. The "__host__ __device__" keywords in front of the functions allow them to run on both the CPU and GPU.

// 3D vector algebra from cutil_math.h
/*DEVICE_BUILTIN*/
struct float3 {float x, y, z;};
typedef struct float3 float3;
// add
inline __host__ __device__ float3 operator+(float3 a, float3 b){return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);}
inline __host__ __device__ void operator+=(float3 &a, float3 b){a.x += b.x; a.y += b.y; a.z += b.z;}
inline __host__ __device__ float3 operator+(float3 a, float b){return make_float3(a.x + b, a.y + b, a.z + b);}
inline __host__ __device__ float3 operator+(float b, float3 a){return make_float3(b + a.x, b + a.y, b + a.z);}
inline __host__ __device__ void operator+=(float3 &a, float b){a.x += b; a.y += b; a.z += b;}
// subtract
inline __host__ __device__ float3 operator-(float3 a, float3 b){return make_float3(a.x - b.x, a.y - b.y, a.z - b.z);}
inline __host__ __device__ void operator-=(float3 &a, float3 b){a.x -= b.x; a.y -= b.y; a.z -= b.z;}
inline __host__ __device__ float3 operator-(float3 a, float b){return make_float3(a.x - b, a.y - b, a.z - b);}
inline __host__ __device__ float3 operator-(float b, float3 a){return make_float3(b - a.x, b - a.y, b - a.z);}
inline __host__ __device__ void operator-=(float3 &a, float b){a.x -= b; a.y -= b; a.z -= b;}
// multiply
inline __host__ __device__ float3 operator*(float3 a, float3 b){return make_float3(a.x * b.x, a.y * b.y, a.z * b.z);}
inline __host__ __device__ void operator*=(float3 &a, float3 b){a.x *= b.x; a.y *= b.y; a.z *= b.z;}
inline __host__ __device__ float3 operator*(float3 a, float b){return make_float3(a.x * b, a.y * b, a.z * b);}
inline __host__ __device__ float3 operator*(float b, float3 a){return make_float3(b * a.x, b * a.y, b * a.z);}
inline __host__ __device__ void operator*=(float3 &a, float b){a.x *= b; a.y *= b; a.z *= b;}
// divide
inline __host__ __device__ float3 operator/(float3 a, float3 b){return make_float3(a.x / b.x, a.y / b.y, a.z / b.z);}
inline __host__ __device__ void operator/=(float3 &a, float3 b){a.x /= b.x; a.y /= b.y; a.z /= b.z;}
inline __host__ __device__ float3 operator/(float3 a, float b){return make_float3(a.x / b, a.y / b, a.z / b);}
inline __host__ __device__ void operator/=(float3 &a, float b){a.x /= b; a.y /= b; a.z /= b;}
inline __host__ __device__ float3 operator/(float b, float3 a){return make_float3(b / a.x, b / a.y, b / a.z);}
// min
inline __host__ __device__ float3 fminf(float3 a, float3 b){return make_float3(fminf(a.x, b.x), fminf(a.y, b.y), fminf(a.z, b.z));}
// max
inline __host__ __device__ float3 fmaxf(float3 a, float3 b){return make_float3(fmaxf(a.x, b.x), fmaxf(a.y, b.y), fmaxf(a.z, b.z));}
// lerp
inline __device__ __host__ float3 lerp(float3 a, float3 b, float t){return a + t*(b - a);}
// clamp value v between a and b
inline __device__ __host__ float clamp(float f, float a, float b){return fmaxf(a, fminf(f, b));}
inline __device__ __host__ float3 clamp(float3 v, float a, float b){return make_float3(clamp(v.x, a, b), clamp(v.y, a, b), clamp(v.z, a, b));}
inline __device__ __host__ float3 clamp(float3 v, float3 a, float3 b){return make_float3(clamp(v.x, a.x, b.x), clamp(v.y, a.y, b.y), clamp(v.z, a.z, b.z));}
// dot product
inline __host__ __device__ float dot(float3 a, float3 b){return a.x * b.x + a.y * b.y + a.z * b.z;}
// length
inline __host__ __device__ float length(float3 v){return sqrtf(dot(v, v));}
// normalize
inline __host__ __device__ float3 normalize(float3 v){float invLen = rsqrtf(dot(v, v));return v * invLen;}
// floor
inline __host__ __device__ float3 floorf(float3 v){return make_float3(floorf(v.x), floorf(v.y), floorf(v.z));}
// frac
inline __host__ __device__ float fracf(float v){return v - floorf(v);}
inline __host__ __device__ float3 fracf(float3 v){return make_float3(fracf(v.x), fracf(v.y), fracf(v.z));}
// fmod
inline __host__ __device__ float3 fmodf(float3 a, float3 b){return make_float3(fmodf(a.x, b.x), fmodf(a.y, b.y), fmodf(a.z, b.z));}
// absolute value
inline __host__ __device__ float3 fabs(float3 v){return make_float3(fabs(v.x), fabs(v.y), fabs(v.z));}
// reflect 
//returns reflection of incident ray I around surface normal N
// N should be normalized, reflected vector's length is equal to length of I
inline __host__ __device__ float3 reflect(float3 i, float3 n){return i - 2.0f * n * dot(n, i);}
// cross product
inline __host__ __device__ float3 cross(float3 a, float3 b){return make_float3(a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x);}

Full code on Github:

https://github.com/straaljager/GPU-path-tracing-with-CUDA-tutorial-1

In this example, it's pretty easy to turn C/C++ code into CUDA code (CUDA is essentially an extension of the C/C++ language). The differences with the CPU version of smallpt are as follows:

  • smallpt's 3D Vector struct is replaced by CUDA's built-in float3 type (linear algebra vector functions for float3 are defined in cutil_math.h)
  • CUDA specific keyword __device__ before functions that should run on the GPU and are only callable from the GPU
  • CUDA specific keyword __global__ in front of the kernel that is called from the host (CPU) and which runs in parallel on all CUDA threads
  • a custom random number generator that runs on the GPU
  • as GPUs don't handle recursion well, the radiance function needs to be converted from a recursive function into an iterative function (see RichieSam's blogpost or Karl Li's slides for more details) with a fixed number of bounces (Russian roulette could be implemented here to terminate paths with a certain probability, but I took it out for simplicity)
  • in a CPU raytracer, you loop over each pixel of the image with two nested loops (one for image rows and one for image columns). On the GPU the loops are replaced by a kernel which runs for each pixel in parallel. A global thread index is computed instead from the grid dimensions, block dimensions and local thread index. See http://www.3dgep.com/introduction-to-cuda-using-visual-studio-2008/ for more details
  • the main() function calls CUDA specific functions to allocate memory on the CUDA device (cudaMalloc()), launch the CUDA kernel using the "<<< grid, block >>>" syntax and copy the results (in this case the rendered image) from the GPU back to the CPU, where the image is saved in PPM format (a supersimple image format)


When running the code above, we get the following image (1024 samples per pixel, brute force path tracing):



Path traced colour bleeding rendered entirely on the GPU! On my laptop's GPU (GeForce 840M) it renders about 24x faster than the multithreaded CPU version (laptop Core i7 clocked at 2.00 GHz). The neat thing here is that it only took about 100 lines (if you take out the comments) to get path tracing working on the GPU. The beauty lies in its simplicity.

Even though the path tracing code already works well, it is actually very unoptimized and there are many techniques to speed it up:

  • explicit light sampling (or next event estimation): sample the light source directly instead of using brute force path tracing. This makes an enormous difference in reducing noise.
  • jittered sampling (also called stratified sampling): instead of sampling a pixel randomly, divide the pixel up into a number of layers (strata) in which random sampling is performed. According to Peter Shirley's book, this way of sampling (which is partly structured and partly random) is one of the most important noise reduction methods (see the sketch after this list)
  • better random number generators
  • various importance sampling strategies: this code already performs cosine weighted importance sampling for diffuse rays, favouring rays with directions that are closer to the normal (as they contribute more to the final image). See http://www.rorydriscoll.com/2009/01/07/better-sampling/.  
  • ray tracing acceleration structures: kd-trees, octrees, grids, bounding volume hierarchies provide massive speedups
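
As an example of the jittered sampling idea mentioned above, here's a hedged sketch of a 2x2 stratified version of render_kernel's sample loop from the code in this post: the pixel is divided into four strata and each stratum gets its own random offsets, while the total sample count and per-sample weight stay the same:

for (int sy = 0; sy < 2; sy++)          // stratum row inside the pixel
 for (int sx = 0; sx < 2; sx++)         // stratum column inside the pixel
  for (int s = 0; s < samps / 4; s++){  // samples per stratum
   float jx = (sx + getrandom(&s1, &s2)) * 0.5f;   // jittered x offset in [0,1)
   float jy = (sy + getrandom(&s1, &s2)) * 0.5f;   // jittered y offset in [0,1)
   float3 d = cam.dir + cx*((jx + x) / width - .5) + cy*((jy + y) / height - .5);
   r = r + radiance(Ray(cam.orig + d * 40, normalize(d)), &s1, &s2)*(1. / samps);
  }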

GPU specific optimisations (see http://www.3dgep.com/optimizing-cuda-applications/ and Karl Li's course slides linked below):
  • using shared memory and registers whenever possible is many times faster than using global/local memory
  • memory alignment for coalesced reads from GPU memory
  • thread compaction: since CUDA launches a kernel in groups of 32 threads in parallel ("warps"), threads taking different code paths can give rise to thread divergence which reduces the GPU's occupancy. Thread compaction aims to mitigate the effects of thread divergence by bundling threads following similar code paths

I plan to cover the following topics (with CUDA implementations) in upcoming tutorials whenever I find some time:
  • an interactive viewport camera with progressive rendering, 
  • textures (and bump mapping), 
  • environment lighting, 
  • acceleration structures,  
  • triangles and triangle meshes
  • building more advanced features on top of Aila and Laine's GPU ray tracing framework which is also used by Blender's Cycles GPU renderer
  • dissecting some code snippets from the Cycles GPU renderer or SmallLuxGPU

References used: