Boring work for API completeness - Denis Steckelmacher

Published on Tue 02 August 2011 in Clover

Hello,

Clover's development may have seemed slow lately, but in fact it wasn't. I'm currently polishing everything I've already done. Not because I'm near the end of the project, but because the last part of my Google Summer of Code project will begin in the next few days, and I want the code I'll build it on to be solid.

So, my first target for Clover was to be able to launch compiled OpenCL kernels. To do that, the implementation needed to support several things: buffers, events, command queues, contexts, etc. Now that kernels can run (but without any interesting built-in function), I decided to finish the public OpenCL API.

In the git repository, you can therefore see many commits like "Implement clFoo and clFooBar". I've read through the whole API and implemented the missing functions.

Lately, I have focused on the "enqueue" functions, that is, the functions used to queue events, the actions OpenCL can perform. These functions are:

clEnqueueRead/WriteBufferRect: a complex function copying data between a buffer and host memory, but only a rectangle (if we consider that the buffer contains 2D data) or a cuboid (3D data). This event is particularly important because I built all the image-related events upon it.
clEnqueueCopyBuffer: a simple event copying one buffer to another.
clEnqueueCopyBufferRect.
clCreateImage2D and clCreateImage3D, to add image support to Clover.
clEnqueueReadImage and clEnqueueWriteImage, built upon CopyBufferRect.
clEnqueueCopyImage (really the mirror of CopyBufferRect).
clEnqueueCopyImageToBuffer and clEnqueueCopyBufferToImage.
clEnqueueMapImage.
clGetSupportedImageFormats.
And finally clEnqueueBarrier, clEnqueueMarker and clEnqueueWaitForEvents.

The whole "enqueue" API is now complete. I still have to implement the samplers, clFlush and clFinish. Then I will be able to implement the interesting built-in functions (from simple mathematical functions to barrier(), which could take a fair amount of thought before I find a way to implement it).

The functions I just implemented are based on the "events" framework of Clover, a set of classes inheriting from Coal::Event and organized in a complex inheritance tree. This enabled me to implement all the events and their checks with only 1500 lines of code in events.cpp (the biggest file of Clover). All the "rectangle-related" events (that is to say Read/Write/CopyBufferRect, and the image events) are implemented in less than 100 lines of worker code in CPUDevice (but the code isn't really readable, and I relied heavily on the test suite to check it). For reference, here is the code doing all the 2D and 3D copies in CPUDevice:

case Event::ReadBufferRect:
case Event::WriteBufferRect:
case Event::CopyBufferRect:
case Event::ReadImage:
case Event::WriteImage:
case Event::CopyImage:
case Event::CopyBufferToImage:
case Event::CopyImageToBuffer:
{
    // src = buffer and dst = mem if not copy
    ReadWriteCopyBufferRectEvent *e = (ReadWriteCopyBufferRectEvent *)event;
    CPUBuffer *src_buf = (CPUBuffer *)e->source()->deviceBuffer(device);
    unsigned char *src = (unsigned char *)src_buf->data();
    unsigned char *dst;

    switch (t)
    {
        case Event::CopyBufferRect:
        case Event::CopyImage:
        case Event::CopyImageToBuffer:
        case Event::CopyBufferToImage:
        {
            CopyBufferRectEvent *cbre = (CopyBufferRectEvent *)e;
            CPUBuffer *dst_buf = (CPUBuffer *)cbre->destination()->deviceBuffer(device);
            dst = (unsigned char *)dst_buf->data();
            break;
        }

        default:
        {
            // dst = host memory location
            ReadWriteBufferRectEvent *rwbre = (ReadWriteBufferRectEvent *)e;
            dst = (unsigned char *)rwbre->ptr();
        }
    }

    // Iterate over the lines to copy and use memcpy
    for (size_t z=0; z<e->region(2); ++z)
    {
        for (size_t y=0; y<e->region(1); ++y)
        {
            unsigned char *s;
            unsigned char *d;

            d = imageData(dst,
                          e->dst_origin(0),
                          y + e->dst_origin(1),
                          z + e->dst_origin(2),
                          e->dst_row_pitch(),
                          e->dst_slice_pitch(),
                          1);
            s = imageData(src,
                          e->src_origin(0),
                          y + e->src_origin(1),
                          z + e->src_origin(2),
                          e->src_row_pitch(),
                          e->src_slice_pitch(),
                          1);

            // Copying between an image and a buffer may need to add an
            // offset to the buffer address (the buffer's rectangular
            // origin is always (0, 0, 0)).
            if (t == Event::CopyBufferToImage)
            {
                CopyBufferToImageEvent *cptie = (CopyBufferToImageEvent *)e;
                s += cptie->offset();
            }
            else if (t == Event::CopyImageToBuffer)
            {
                CopyImageToBufferEvent *citbe = (CopyImageToBufferEvent *)e;
                d += citbe->offset();
            }

            if (t == Event::WriteBufferRect || t == Event::WriteImage)
                std::memcpy(s, d, e->region(0)); // Write dest (memory) in src
            else
                std::memcpy(d, s, e->region(0)); // Write src (buffer) in dest (memory), or copy the buffers
        }
    }

    break;
}
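
The excerpt above depends on Clover's event classes, so it cannot be compiled on its own. To show the row-by-row copy in isolation, here is a minimal sketch of the same idea. The RectSide struct and the copyRect and copyCentre functions are hypothetical names made up for this illustration, not part of Clover. As in the code above, the x origin and region[0] are assumed to be already expressed in bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical description of one side (source or destination) of the copy.
// origin[0] is in bytes; origin[1] and origin[2] count rows and slices.
struct RectSide
{
    std::size_t origin[3];
    std::size_t row_pitch;   // bytes between two consecutive rows
    std::size_t slice_pitch; // bytes between two consecutive slices
};

// Copy a region of region[0] bytes x region[1] rows x region[2] slices,
// one row at a time with memcpy.
static void copyRect(unsigned char *dst, const RectSide &d,
                     const unsigned char *src, const RectSide &s,
                     const std::size_t region[3])
{
    assert(dst != nullptr && src != nullptr);

    for (std::size_t z = 0; z < region[2]; ++z)
    {
        for (std::size_t y = 0; y < region[1]; ++y)
        {
            const unsigned char *from = src +
                (z + s.origin[2]) * s.slice_pitch +
                (y + s.origin[1]) * s.row_pitch +
                s.origin[0];
            unsigned char *to = dst +
                (z + d.origin[2]) * d.slice_pitch +
                (y + d.origin[1]) * d.row_pitch +
                d.origin[0];

            std::memcpy(to, from, region[0]);
        }
    }
}

// Usage: extract the 2x2 centre of a 4x4 single-slice buffer.
static std::vector<unsigned char> copyCentre()
{
    std::vector<unsigned char> src(16);
    for (std::size_t i = 0; i < src.size(); ++i)
        src[i] = static_cast<unsigned char>(i);

    std::vector<unsigned char> dst(4, 0);
    const RectSide s = {{1, 1, 0}, 4, 16}; // start at column 1, row 1
    const RectSide d = {{0, 0, 0}, 2, 4};  // tightly packed 2x2 target
    const std::size_t region[3] = {2, 2, 1};

    copyRect(dst.data(), d, src.data(), s, region);
    return dst; // holds bytes 5, 6, 9, 10
}
```

Only the pitches and origins differ between the source and destination sides, which is why a single loop can serve buffer rectangles and images alike.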

The imageData function used in the CPUDevice code is a simple function returning the address of a pixel given its coordinates. It currently works only on little-endian architectures. You'll see that bytes_per_pixel (the last argument of imageData) is always 1 in that code. This is intended: the Event objects have already done the multiplications where needed.

static unsigned char *imageData(unsigned char *base, size_t x, size_t y,
                                size_t z, size_t row_pitch, size_t slice_pitch,
                                unsigned int bytes_per_pixel)
{
    unsigned char *result = base;
    result += (z * slice_pitch) +
              (y * row_pitch) +
              (x * bytes_per_pixel);
    return result;
}
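
As a small worked example of why passing 1 for bytes_per_pixel is enough, the sketch below computes the address of the same pixel twice: once with x in pixels and the real pixel size, and once with x already converted to bytes. The image layout (4 x 3 x 2 pixels, 4 bytes per pixel, tightly packed) and the function name premultipliedOffsetMatches are assumptions chosen for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same computation as imageData() above, repeated so the example compiles alone.
static unsigned char *imageData(unsigned char *base, std::size_t x, std::size_t y,
                                std::size_t z, std::size_t row_pitch,
                                std::size_t slice_pitch,
                                unsigned int bytes_per_pixel)
{
    return base + (z * slice_pitch) + (y * row_pitch) + (x * bytes_per_pixel);
}

// Hypothetical demo: returns true when both calling conventions agree.
static bool premultipliedOffsetMatches()
{
    // Assumed layout: 4 x 3 x 2 pixels, 4 bytes per pixel, tightly packed.
    const unsigned int bpp = 4;
    const std::size_t row_pitch = 4 * bpp;         // 16 bytes per row
    const std::size_t slice_pitch = 3 * row_pitch; // 48 bytes per slice
    std::vector<unsigned char> image(2 * slice_pitch);

    // Pixel (2, 1, 1), with x given in pixels.
    unsigned char *in_pixels =
        imageData(image.data(), 2, 1, 1, row_pitch, slice_pitch, bpp);
    // Same pixel, with x already converted to bytes by the caller.
    unsigned char *in_bytes =
        imageData(image.data(), 2 * bpp, 1, 1, row_pitch, slice_pitch, 1);

    // Both are 48 + 16 + 8 = 72 bytes past the start of the image.
    assert(in_pixels - image.data() == 72);
    return in_pixels == in_bytes;
}
```

Both calls land on the same byte, so a caller that has already scaled x can safely pass 1 as the last argument.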

I'm nearing the end of my project. I don't know if I will be able to implement all the built-in functions by August 25. I'll start with the "difficult" ones (barrier(), image reading and writing) in the hope that I will be able to implement the remaining ones after the Summer of Code program. Those remaining ones are fairly simple functions already implemented in many third-party mathematical libraries, so I can simply call them or copy their code.