When LLVM is slow

Published on mer 13 juillet 2011 in Clover, (Comments)

Hello,

A quick blog post about what I've just seen in the LLVM codebase.

First, a note about the progress I made since the previous post. The kernels are done, I've implemented clGetKernelInfo, clGetKernelWorkGroupInfo (currently a stub), clSetKernelArg, and even clEnqueueNDRangeKernel and clEnqueueTask. It means that API-wise, we can run compiled OpenCL kernels. But it isn't the case yet. I still need to implement many things in the core of Clover, but every hour makes me closer to finish that.

One of the thing I have to implement is a fast mean to call a kernel. “Fast” is really important here, since most kernels are small (thus run fast) but are a lot to be run.

I naively began by envisaging using llvm::ExecutionEngine::runFunction. This function takes two parameters : the function to run, and a vector of its arguments. This vector didn't make me happy, I found it too slow. So I looked at the code of this function for the JIT (the engine used on x86, x86_64 and ARM, the three targets I want to run fast). I was horrified by this function !

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
// ... Many "fast" special cases

// Okay, this is not one of our quick and easy cases. Because we don't have a
// full FFI, we have to codegen a nullary stub function that just calls the
// function we are interested in, passing in constants for all of the
// arguments. Make this function and return.

// First, create the function.
FunctionType *STy=FunctionType::get(RetTy, false);
Function *Stub = Function::Create(STy, Function::InternalLinkage, "",
                                  F->getParent());

// Insert a basic block.
BasicBlock *StubBB = BasicBlock::Create(F->getContext(), "", Stub);

// ... Code generation for the args, to have a function like
//     T f() { return function(3, "foobar", (int *)0x9fffff90); }

// Finally, call our nullary stub function.
GenericValue Result = runFunction(Stub, std::vector<GenericValue>());

// Erase it, since no other function can have a reference to it.
Stub->eraseFromParent();

The last line is the worst : the function is removed, each call will need to use code generation !

Fortunately, I can avoid that for kernels. A kernel is a function called many times with the same parameters. So, I can build this stub function (avoiding some corner cases LLVM handles), keep it in the Coal::Kernel object, and call it many times with a simple :

1
2
3
void (*func)() = device_kernel->nativeFunctionAddress();
for (...)
    func();

Simple, and really fast !

« Thread-local storage   Compile and launch simple OpenCL C kernels with Clover »