The End is Approaching

Published on lun 15 août 2011 in Clover, (Comments)

Hello,

Today is the "soft pencil down" day. It means that I have to mostly stop my work now and concentrate on writing a good documentation for my project.

The first step of that is this blog post, a summary of what I've done during the summer, what is still to do, etc.

The Project

My project was to work on an existing small code-base called Clover, an Open Source OpenCL implementation. This project was started by a Mesa developer, Zack Rusin.

Before I started my project, and as I said in one of my first blog post, the code was very small and did nearly nothing, but the structure was well in place and easy to extend.

This project having been started by a Mesa developer, many people wanted it to be able to take advantage of hardware acceleration.

Status

My target was do to as much as I could during the summer, in order to have a feature-rich code base with many difficult details already solved. I wanted Clover to be able to launch some simple OpenCL demos on the CPU, with all in place to be able to use Gallium3D to provide hardware acceleration.

I'm happy to say that Clover evolved fairly well during the summer. It's now an API-complete library, relatively fast (but it will become even more in the following days, I just thought about a potentially huge optimization), and with all the needed pieces in place for hardware acceleration.

The only missing things are some built-in functions available to the kernels. I already implemented the most difficult of them, and the remaining ones are things like clamp(), etc.

Plans for this week

Now that most of the work is finished, I will be able to clean the code and write documentation. It will be in Doxygen format and will cover the DeviceInterface API (abstraction layer between the core API and the devices, for the moment only CPUDevice), the core classes, and will also explain some difficult parts of Clover :

  • The way memory objects work.
  • The structure of events, command queues and CPU worker threads
  • How I implemented barrier() (a cleaner and more comprehensible explanation than the one I posted on August 7)
  • How I use Clang and LLVM to compile and launch kernels.

I'll also improve the "stub functions" of a kernel. They are currently small functions taking no argument at all and calling the kernel function with its arguments given as constants. It works fairly well but is a bit slow when __local pointers are used (these pointers must be allocated for every work group, so their value changes constantly, and the stub function needs to be recreated).

My plan is to create only one stub function taking one argument : void **locals. This table of pointers will be used by the stub to call the kernel with varying arguments without needing to be re-created and re-JITed every time.

Furthermore, a huge optimization is made possible by that : I implemented the image built-ins, and they are full of switch statements. The code is very big and could highly take advantage of aggressive inlining. I want to implement the images not as pointers to opaque structures ("typedef struct image2d image2d_t;"), but as real structures. When the stub function is created, it will put interesting data about the images in these structs, and the kernel will have the data at hand without needing to call native functions like __cpu_get_image_width. That will allow inlining.

Current stub:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
void stub_for_one_work_group()
{
    kernel(16, (image2d_t)0xfee80000, (float4 *)0x8945a000);
}

void kernel(uint size, __read_only image2d_t image, __local float4 *temp)
{
    // Many read_imagef calls that cannot be inlined and reduced to say
    // pixel = int_to_float(uchar_to_int(*(uchar4 *)(image->data + 256)));
}

Future stub before optimization:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
void stub_for_all_the_kernel(void **locals)
{
    image2d image;
    image.data_order = CLK_RGBA;   // Constant built from info available in Image2D
    image.data_type = CLK_UNORM_INT8;
    image.width = 4;
    image.height = 4;
    image.data = (uchar4 *)0xdeadbeef;

    kernel(16, &image, (float4 *)locals[0]);
}

void kernel(...) {...}

After an LLVM optimization pass, kernel() can be inlined into the stub, and all the image functions flattened to become lightweight direct memory accesses. The fact that images are stored as constants and not opaque pointers allows LLVM to do its constant propagation pass.

« When easy is difficult, and vice versa   Documentation online »