Thread-local storage

Published on Monday, July 11, 2011 in Clover

Hello,

Here is a post about my thoughts on how to implement functions like "get_global_id()" in a threaded environment.

This function, and others like it, returns an identifier of the current OpenCL kernel invocation. It is used, for instance, to have the same kernel running in parallel on multiple elements of a data set. The trick is that its return value cannot be inferred from its arguments alone; it comes from data provided either by the GPU or by the current CPU thread.
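To give an idea of why this matters, here is a small, purely hypothetical OpenCL C kernel (not taken from Clover): every work-item runs the same code, and only get_global_id() tells each instance which element it should touch.

/* Hypothetical kernel: each work-item scales one element of the buffer.
   The index i comes from get_global_id(), not from the kernel arguments. */
__kernel void scale(__global float *data, float factor)
{
    size_t i = get_global_id(0);
    data[i] = data[i] * factor;
}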

Let's see how I want to implement that from top to bottom.

The LLVM IR binding

The very good thing about LLVM IR is that it allows calling external native functions. It's a bit dangerous (I have to figure out how to prevent a kernel from calling arbitrary C functions resolvable from the current application, like exec()), but it's very powerful because it lets me implement functions like get_global_id() in pure compiled C++.

The first part is to declare the function in the LLVM IR. Using Clang as the compiler, it's as easy as writing a declaration in a header, as done in src/runtime/stdlib.h: size_t get_global_size(uint dimindx);

That's all for the LLVM IR part. Now, I'll use the LLVM JIT to run kernels, so I dove into the LLVM documentation and found the llvm::ExecutionEngine::InstallLazyFunctionCreator() function. It takes a function pointer that gets called every time the JIT sees a function not defined in LLVM IR, so Clover will be able to provide a native function for each such name.

So, the code:

static void *createFunction(const std::string &name)
{
    if (name == "get_global_id")
        return (void *)&native_get_global_id;
    else if (name == "get_work_dim")
        return (void *)&native_get_work_dim;
    // Other built-in functions...

    return 0;   // Unknown function: let the JIT handle the failure
}

So, it's simple (but not really fast because of the string comparisons; fortunately, this function is called once per call site, not for every call). Let's implement these functions.
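For completeness, something like the following could plug this creator into the JIT. This is only a rough sketch: here module stands for the llvm::Module holding the compiled kernels, and the actual setup code in Clover may look different.

// Sketch: create a JIT execution engine for the module, then install the
// lazy function creator so calls without an IR body resolve to native code.
llvm::ExecutionEngine *engine = llvm::EngineBuilder(module).create();
engine->InstallLazyFunctionCreator(createFunction);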

The native part

My first plan was to put the thread-local variable in the stdlib bitcode and to implement functions like get_global_id like this:

__thread void *tls;                    // Value set by native code, but stored by LLVM

size_t _get_global_id(void *tls, uint dimindx); // The extern native function

size_t get_global_id(uint dimindx)
{
    return _get_global_id(tls, dimindx);
}

The idea was to have the native functions receive the thread-local value as a parameter (it could be a pointer to a CPUKernelInstance). Some inlining during program compilation would then have replaced all the get_global_id calls with _get_global_id calls.

It was a bad idea.

The first reason is that TLS is not well supported by the LLVM JIT. It works on x86 but not on x86_64, and I don't know how good it is on ARM and other targets. And I want OpenCL to be usable everywhere, even on MeeGo-powered ARM devices using free drivers :).

So, after a few minutes of thought, I found an obvious and much better solution: keeping the TLS value in native code, not in the LLVM IR.

So, I removed all the stub functions from stdlib.c. Now, a call to get_global_id remains a call to that function, with no added parameters and no renaming.

It's time to implement these functions:

__thread struct
{
    size_t work_dim;
    size_t current_global_id[CPUDEVICE_MAX_DIMS];
    // Other variables needed to implement the functions
} g_tls;

size_t native_get_work_dim()
{
    return g_tls.work_dim;
}

size_t native_get_global_id(uint dimindx)
{
    if (dimindx >= CPUDEVICE_MAX_DIMS)
        return 0; // Follow the OpenCL spec

    return g_tls.current_global_id[dimindx];
}

It's not too complicated. It's a pity that I cannot keep the TLS variable in LLVM IR (inlining would have removed the bounds check when dimindx is known to be valid, and it would have avoided an expensive x86 CALL just to read a variable), but I hope it will be fast enough.

The last part to implement is the CPUKernelInstance. I don't know yet exactly what this class will do, but I think it will represent a work-group. OpenCL divides a task into work-groups, which are themselves divided into work-items. A work-group can also be divided into sub-work-groups if needed (I think this only happens on GPUs). A work-group is a "hint" on how to divide the work into parts bigger than a work-item, and all the work-groups must have the same size. If I have correctly understood the specification, a work-item could be a pixel in an image and a work-group a line. "local" OpenCL storage is also shared between the work-items of a work-group, but not with the ones outside it (again, if I have correctly understood the spec, which is complex here).
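As a purely illustrative example of that division (the sizes are made up, nothing Clover-specific), the arithmetic for the pixel/line case looks like this:

// Made-up sizes: a 1024x768 image, one work-item per pixel,
// one work-group per line of the image.
size_t global_size[2] = {1024, 768};  // total work-items per dimension
size_t local_size[2]  = {1024, 1};    // work-items per work-group
size_t num_groups = (global_size[0] / local_size[0])
                  * (global_size[1] / local_size[1]);  // 768 work-groups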

Clover will use these work-groups, but not to run all the work-items in parallel. That would be way too slow (every work-item would require a fetch from the command queue, and even modern processors don't have thousands of cores). I intend to run the work-groups in parallel (one command-queue fetch per work-group) and the work-items sequentially:

void KernelInstance::runWorkGroup()
{
    // Number of work-items in a work_group
    size_t items_per_work_group = getItemsPerWorkGroup();

    // Index of the current work-group (in fact, this index and the global id
    // below are CPUDEVICE_MAX_DIMS-dimensional arrays, but I simplify to one
    // dimension here)
    size_t work_group_index = getWorkGroupIndex();

    // Function to call
    llvm::Function *func = p_kernel->getFunction(p_device);

    // Arguments
    std::vector<llvm::GenericValue> &args = p_kernel->llvmFunctionArgs();

    // Run the work-items
    for (size_t i=0; i<items_per_work_group; ++i)
    {
        // TLS data
        g_tls.current_global_id = (work_group_index * items_per_work_group) + i;

        // Call the function
        p_jit->runFunction(func, args);
    }
}

I think everything is in place to run kernels. None of this code is written yet, but it will come in the next few days (I have some infrastructure to build first). I am open to any suggestions.
