A fix and an optimization - Denis Steckelmacher

Published on sam 23 juillet 2011 in Clover, (Comments)

Hello,

Last week, I was able to run an OpenCL-compiled kernel using Clover. The problem was that there was a problem : every once, the kernel failed.

I thought it was a locking problem. In fact, it was : the work groups called kernel->callFunction (to get a zero-argument function calling the actual kernel with its arguments) without any lock.

After a fix, the kernels were mostly working but continued to fail after a few iterations. The problem took one week to find and was tricky.

In fact, when an event is "finished", it can be deleted. The logic was simple : run each work group of a kernel, and when the last work group has finished running, kill the kernel. This seemed to work on paper, but in reality, on multi-core machines running multi-threaded OSes, it doesn't work. What happened is that the last work group finished before other work groups, because they were on busier CPU cores, or were preempted. The kernel deleted, clWaitForEvent returned, and the program began to read the buffer while it wasn't already ready.

This fix was to add a counter of the finished work groups. When all the work groups are finished, we can delete the kernel. A simple fix for a complex bug.

I then added an optimization regarding the get_global_id builtin function. This function returns an ID based on the current work group and work item. OpenCL also adds a "work-offset" and some other things. The result was a function with two additions, a multiplication, and several getter calls (I hadn't enabled link-time optimization).

Now, some of these computations are cached in a private variable, and get_global_id is free of any function call and contains only a simple addition. The speedup is interesting : for a kernel with two very slow modulo operations (what a processor does the slowest), the speedup is of 1.72x. I haven't tried a simpler kernel, but it seems that the speedup for only get_global_id is already very interesting.

The other builtin functions are simpler and don't need to be optimized. Another slow part of Clover is the thread-local storage. Each call to a builting needs a call to __tls_get_storage or a function like this.

The plan for the following days is to implement samplers and images (to have a complete API), and then a Qt-based example applications allowing to write kernels, add buffers and images and run programs. This will allow me to write nice test kernels for the testsuite, that will have a new testrunner taking a kernel as argument (and information about the buffers) and testing the kernel. With all of this, I will be able to write and test the builtin functions :) .