Command queues, mutexes and threads

Published on Wed 08 June 2011 in Clover

Hello,

Sorry for posting so little lately, but I'm in my exam period and I have school work to do. My biggest exam is on Friday, and it's followed by a long three-day weekend. Still, I found time to work on my project these past few days, and I implemented all the “events and command queues” stuff.

When reading the spec, it can seem like an easy task, but it really isn't. The thing is that I want a proper and fast implementation, compatible with both software and hardware rendering. That's what makes it difficult.

To read from or write to a buffer object, you have to use a function like clEnqueueReadBuffer. This function creates an event and pushes it onto a command queue created with clCreateCommandQueue. Once an event has been pushed, it can be submitted to the device for execution.

The trick is that I'm implementing software rendering, but I want it to be efficient and multi-threaded. So I began by implementing a function that returns the number of CPU cores in the computer (mine has only one core but two hardware threads, so I can test anyway). Then, when a CPUDevice is created, Clover launches one thread per core.
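As a rough illustration (not Clover's actual code), here is how "count the cores, then start one thread per core" can look in standard C++. `workerCount` and `ToyCpuDevice` are hypothetical names, and `std::thread::hardware_concurrency()` is an assumption about how the count is obtained; it may return 0 when the value is unknown, so the sketch falls back to one worker.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Number of workers to start: one per hardware thread reported by the
// standard library, or 1 when the library cannot tell (it returns 0).
unsigned workerCount()
{
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : n;
}

// Hypothetical stand-in for a CPU device: starts its workers when it is
// created and joins them when it is destroyed.
class ToyCpuDevice
{
public:
    ToyCpuDevice()
    {
        unsigned n = workerCount();
        assert(n >= 1);
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([] { /* a real worker would wait for events here */ });
    }

    ~ToyCpuDevice()
    {
        for (std::thread &t : workers_)
            t.join();
    }

    std::size_t threadCount() const { return workers_.size(); }

private:
    std::vector<std::thread> workers_;
};
```

On a machine like mine (one core, two hardware threads), `threadCount()` would typically report 2.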

One part of the work is done in CPUDevice: each thread waits for events to process, reading them from a list of events that are assigned to the CPU and can be run in any order.
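A minimal sketch of such a worker loop, assuming a mutex and a condition variable protect the list. `ReadyList` and `runEvents` are names made up for this example, and events are modeled as plain callables rather than real OpenCL events.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical list of runnable events shared by all worker threads.
class ReadyList
{
public:
    // Add a runnable event and wake one sleeping worker.
    void push(std::function<void()> event)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            events_.push_back(std::move(event));
        }
        cond_.notify_one();
    }

    // Ask the workers to exit once the list has been drained.
    void stop()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cond_.notify_all();
    }

    // Body of each worker: sleep until an event is available, take it,
    // run it outside the lock, and start again.
    void workerLoop()
    {
        for (;;) {
            std::function<void()> event;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cond_.wait(lock, [this] { return stopping_ || !events_.empty(); });
                if (events_.empty())
                    return; // stopping, and nothing left to run
                event = std::move(events_.front());
                events_.pop_front();
            }
            event();
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    std::deque<std::function<void()>> events_;
    bool stopping_ = false;
};

// Run `count` trivial events on `threads` workers and return how many ran.
int runEvents(int count, int threads)
{
    assert(threads > 0);
    ReadyList list;
    std::atomic<int> done(0);
    std::vector<std::thread> workers;
    for (int i = 0; i < threads; ++i)
        workers.emplace_back([&list] { list.workerLoop(); });
    for (int i = 0; i < count; ++i)
        list.push([&done] { ++done; });
    list.stop();
    for (std::thread &t : workers)
        t.join();
    return done.load();
}
```

The event is run after the lock is released, so the other workers can keep taking events from the list while one of them is busy.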

But this list has to be fed, and that's where the difficulties are. The principle isn't too complex: when we enqueue events, the command queue checks which events it can push to the device (events whose dependencies have all completed, events that don't come after a barrier, etc). It's simple and it works, but not always. For example, an application can push two sequential events and then do heavy I/O. The first event is pushed to the device, then we try to push the second, but it fails because the first hasn't finished. The device executes the first and “unlocks” the second, but the main thread is waiting for I/O, so the second event isn't pushed to the device. Worse, if the main thread doesn't touch the command queue until it flushes it or releases it (which does an implicit flush), the second event won't be pushed until that flush. It's inefficient: the worker threads sleep for nothing.
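The check itself can be sketched like this. It is a simplified model under my own assumptions, not Clover's code: `ToyEvent`, `Status` and `pushable` are hypothetical, dependencies are indices into the same queue, and a barrier is assumed to wait for everything queued before it and to hide everything queued after it until it completes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Status { Queued, Submitted, Complete };

// Hypothetical, stripped-down event.
struct ToyEvent
{
    Status status = Status::Queued;
    bool barrier = false;              // assumed: waits for all earlier events
    std::vector<std::size_t> waitList; // indices of events that must complete first
};

// Return the indices of the queued events that may be pushed to the device now.
std::vector<std::size_t> pushable(const std::vector<ToyEvent> &queue)
{
    std::vector<std::size_t> ready;
    bool earlierAllComplete = true;

    for (std::size_t i = 0; i < queue.size(); ++i) {
        const ToyEvent &e = queue[i];

        if (e.status == Status::Queued) {
            bool depsDone = true;
            for (std::size_t dep : e.waitList) {
                assert(dep < queue.size());
                if (queue[dep].status != Status::Complete)
                    depsDone = false;
            }
            if (depsDone && (!e.barrier || earlierAllComplete))
                ready.push_back(i);
        }

        // Nothing queued after an unfinished barrier may start.
        if (e.barrier && e.status != Status::Complete)
            break;

        if (e.status != Status::Complete)
            earlierAllComplete = false;
    }
    return ready;
}
```

With two events where the second waits on the first, only the first is returned until its status becomes `Complete`. That is exactly the situation described above: somebody has to call the check again after the first event finishes.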

So, when a worker thread finishes an event, it can itself ask the command queue to push another event. If there's nothing to push, then it'll wait, but we are sure that it's because no event can be pushed and that we don't waste time.
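Here is a small sketch of that idea, assuming an in-order queue where an event can only start once the previous one has completed. `ToyQueue` and its methods are hypothetical names, and events are reduced to integer ids.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

// Hypothetical in-order command queue.
class ToyQueue
{
public:
    // Application side: add an event to the queue.
    void enqueue(int id)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(id);
    }

    // Try to hand the next event to the device. Returns its id, or -1
    // when nothing can be pushed right now (queue empty, or the previous
    // event is still running).
    int pushNext()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (running_ || pending_.empty())
            return -1;
        int id = pending_.front();
        pending_.pop_front();
        running_ = true;
        return id;
    }

    // Worker side: mark the running event as complete, then immediately
    // ask the queue for more work instead of waiting for the application
    // thread to come back.
    int completeAndPushNext()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(running_);
            running_ = false;
        }
        return pushNext();
    }

private:
    std::mutex mutex_;
    std::deque<int> pending_;
    bool running_ = false;
};
```

After `enqueue(1)` and `enqueue(2)`, the first `pushNext()` returns 1 and the second returns -1. The worker's `completeAndPushNext()` then returns 2 even if the application thread is stuck in I/O.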

OK, that's solved, but this is where the problems start. The biggest one is the synchronization of all of this. It took me hours to add mutexes and wait conditions everywhere they're needed.

Another is a small line of code that was in CommandQueue until I removed it:

// Called when an event is completed
clReleaseEvent(event);

Seems harmless? The problem was that, with all these threads and all these functions, this line could end up being called from inside command queue code. That's dangerous: if the event's reference count drops to zero, the event deletes itself and dereferences its command queue, and then the command queue's reference count can also drop to zero, so it deletes itself too.

The solution is to delete the event from the worker thread, outside any command queue code. The only thing I still have to sort out is how to synchronize the event's destruction with the other threads. I think I'll change how I handle reference counts so that I can use a mutex and be sure that the event gets deleted only once.
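One possible shape for that, as a sketch of the idea and not what Clover does today: the counter is protected by a mutex, and `dereference` tells its caller whether it dropped the last reference, so exactly one thread ends up deleting the object. `ToyObject`, `release` and `destroyedAfterConcurrentRelease` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical reference-counted object; `destroyed` counts destructor runs.
class ToyObject
{
public:
    explicit ToyObject(std::atomic<int> &destroyed)
        : refs_(1), destroyed_(destroyed) {}

    ~ToyObject() { ++destroyed_; }

    void reference()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        ++refs_;
    }

    // Returns true for exactly one caller: the one that dropped the
    // last reference and must therefore delete the object.
    bool dereference()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(refs_ > 0);
        return --refs_ == 0;
    }

private:
    std::mutex mutex_;
    unsigned refs_;
    std::atomic<int> &destroyed_;
};

// The delete happens after the lock has been released.
void release(ToyObject *obj)
{
    if (obj->dereference())
        delete obj;
}

// Give the object `owners` references, release each one from its own
// thread, and return how many times the destructor ran.
int destroyedAfterConcurrentRelease(int owners)
{
    assert(owners >= 1);
    std::atomic<int> destroyed(0);
    ToyObject *obj = new ToyObject(destroyed); // holds the first reference
    for (int i = 1; i < owners; ++i)
        obj->reference();

    std::vector<std::thread> threads;
    for (int i = 0; i < owners; ++i)
        threads.emplace_back(release, obj);
    for (std::thread &t : threads)
        t.join();
    return destroyed.load();
}
```

However the threads interleave, only the call that sees the count reach zero deletes the object.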

So, my code is probably full of ugly bugs that will each take hours to fix, but I hope there won't be too many. My early tests pass, including the nicest one:

char data[16];
result = clEnqueueReadBuffer(queue, subbuf, 1, 0, 5, data, 0, 0, 0);

fail_if(
    result != CL_SUCCESS,
    "unable to read the buffer"
);
fail_if(
    strncmp(data, "world", 5),
    "the subbuffer must contain \"world\""
);

By passing, this test tells me that clWaitForEvents works, along with all the machinery I wrote about in this post. It also validates that what I coded before works. Most importantly, this test passing means that as of today my OpenCL implementation is starting to be useful and to do things! (Yes, we can use it to copy buffers around in a multi-threaded fashion. Interesting, isn't it?)

So, thanks for reading. I'll work on my maths exam and give you more news about my progress this weekend.

Denis Steckelmacher

Free Software and Research
