When easy is difficult, and vice versa

Published on dim 07 août 2011 in Clover, (Comments)


Some days ago, I was thinking of what I have to work on first for Clover. The API was nearly complete (only samplers and clFlush/clFinish remaining), there was built-ins to implement, etc.

I decided to take some time to read the OpenCL spec part speaking about the built-in functions. The first chunk is implemented, the next ones (mathematical functions) very easy to do (but boring because despite the fact that Clang allows function overloading in C, it doesn't allow templates and each function can accept float, float2, float3, float4, float8, float16 and the same with int, short, etc). Then came the "memory fences" that don't do anything on a CPU device. Then image functions, fairly easy (all the algorithms are given in the spec). I finally read the part entitled "Synchronization functions".

This part contains only one function : barrier(). This function stops the current work-item and waits for the others to reach the same function call. This doesn't seem difficult, but in fact it is, because Clover runs the work-items of a work-group sequentially.

So, I decided I was better beginning by implementing the Sampler objects, an easy task.

As the title says, it wasn't so easy and barrier() wasn't so difficult. I love challenging problems, so I couldn't help having my brain thinking about barrier(). I thought that the best way to solve this problem is to launch a work-item, and when it encounters a barrier(), to stop it and start the next, until it reaches barrier(), launching the next, etc. The idea is good, but difficult to achieve. I wanted to use nested calls, but it could have resulted to stack overflows.

After a few minutes, I remembered what I've read in an old book dating from the Windows NT 4 days (it was a Microsoft Press book). In an appendix, it speaks about a strange thing called "fibers". In fact, during my reflection, I wanted something like threads but managed cooperatively by the application. I don't know why, but this small ten-pages appendix (in a book close to 1000 pages) I've read nearly 6 years ago came immediately in my mind.

After a quick Google search to check that fibers are still available in modern operating system, I found my dream : setcontext (and other functions in the same family).

These functions allow a thread to save its context (stack, CPU registers, flags, instruction pointer) and to jump into another context (like an interprocedural goto). It's exactly what I want !

Each work-item will have its own context consisting of a small stack and CPU registers. When a barrier is encountered, the context is halted and CPUKernelWorkGroup jumps to the next context. It will also encounter the barrier, jump to the next context, and so forth. When the last context reached barrier(), it can continue, get to another barrier or finish its execution. When that arises, the unfinished contexts are resumed and can continue their execution.

It's way easier and more efficient than having one thread per work-item, I don't need to use any locking machinery, the context switch is exactly when the application needs it, etc.

So, this barrier() problem was relatively easily solved. It allowed me to read some papers about using Clang and LLVM to implement OpenCL (and it seems that either nVidia or AMD, I don't remember, are using a complex LLVM IR rewriting to chain blocks, handling barriers, to allow complex auto-vectorization to take place). I hope that my solution will work and will not be too slow (the contexts will be created at the first barrier() call, so no overhead when no barriers are needed).

I then came back to my samplers, the "boring" thing I have to do before being able to work on barrier(), using very exciting POSIX functions.

The implementation of class Sampler was boring as expected. I need to convert three flag arguments to one bitfield, and then to convert it back to the flag arguments in clGetSamplerInfo. Boring, yes. The API was also easy : copy/paste of the one of Context, with only a small set of modifications.

Then, I had to plug the samplers in the Kernel code. I went to the big switch handling arguments type, and saw an horror : sampler_t isn't a pointer to an opaque struct, it is an unsigned int !

What does it mean ? An opaque struct has a name, so I can use " struct_type->getName() == "image2d"; " to know that the argument is of type Image2D. With an integer, all I get from LLVM IR is a "i32" type, indistinguishable from a simple "uint" type.

The problem was bigger than expected, and no Microsoft book could help me : LLVM simply doesn't supply enough type information, and I don't want to use Clang's debug informations (they are too big and complex).

After a few days of thinking (barrier() only too a few hours to sort out), I came to a solution that works in 99,99999% of the cases : storing in a std::list the known samplers. In setKernelArg, when the user tries to fit a pointer into an i32, I check if the pointer is a known sampler. On 32-bit architectures, sizeof(void *) == 4, so I always check if the i32 isn't in fact a pointer to a sampler.

On 64-bit machines, the code is perfect. The samplers are detected, normal i32s are left untouched, and plain wrong i64-to-i32 conversions are spotted and result to an error. On 32-bit machines though, all valid i32s are checked against the table of known samplers, and it's possible that a valid i32 corresponds to a valid pointer to a sampler. In this corner case, the i32 gets replaced by the sampler's bitfield value, that is data corruption.

I have no better solution, and the infrastructure needed to do that allowed me to make Clover more robust, so I keep it for the moment. I'll now implement clFlush and clFinish, then tests for the command queue events, then barrier().

I hope to have barrier() finished by August 15, the "soft pencil down date". It will mean that I was able to code during my summer a nearly complete OpenCL implementation, lacking only some small built-ins. The following days until August 22 will be used to write the documentation of what I did, and maybe some builtins (the image ones, I want to do first the most complex thing, because it's what I have to do).

By the way, if you know applications or examples using OpenCL (but nearly no built-in functions), I will be glad to test OpenCL with them. It will be especially interesting if these applications use many clEnqueueWaitForEvents, clEnqueueBarrier, barrier() and out-of-order command queues, that is the part of Clover the most difficult to test using testsuites.

« Boring work for API completeness   The End is Approaching »