Hello,
A quick blog post about what I've just seen in the LLVM codebase.
First, a note about the progress I made since the previous post. The kernels are done, I've implemented clGetKernelInfo, clGetKernelWorkGroupInfo (currently a stub), clSetKernelArg, and even clEnqueueNDRangeKernel and clEnqueueTask. It means that API-wise, we can run compiled OpenCL kernels. But it isn't the case yet. I still need to implement many things in the core of Clover, but every hour makes me closer to finish that.
One of the thing I have to implement is a fast mean to call a kernel. “Fast” is really important here, since most kernels are small (thus run fast) but are a lot to be run.
I naively began by envisaging using llvm::ExecutionEngine::runFunction. This function takes two parameters : the function to run, and a vector of its arguments. This vector didn't make me happy, I found it too slow. So I looked at the code of this function for the JIT (the engine used on x86, x86_64 and ARM, the three targets I want to run fast). I was horrified by this function !
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | // ... Many "fast" special cases
// Okay, this is not one of our quick and easy cases. Because we don't have a
// full FFI, we have to codegen a nullary stub function that just calls the
// function we are interested in, passing in constants for all of the
// arguments. Make this function and return.
// First, create the function.
FunctionType *STy=FunctionType::get(RetTy, false);
Function *Stub = Function::Create(STy, Function::InternalLinkage, "",
F->getParent());
// Insert a basic block.
BasicBlock *StubBB = BasicBlock::Create(F->getContext(), "", Stub);
// ... Code generation for the args, to have a function like
// T f() { return function(3, "foobar", (int *)0x9fffff90); }
// Finally, call our nullary stub function.
GenericValue Result = runFunction(Stub, std::vector<GenericValue>());
// Erase it, since no other function can have a reference to it.
Stub->eraseFromParent();
|
The last line is the worst : the function is removed, each call will need to use code generation !
Fortunately, I can avoid that for kernels. A kernel is a function called many times with the same parameters. So, I can build this stub function (avoiding some corner cases LLVM handles), keep it in the Coal::Kernel object, and call it many times with a simple :
1 2 3 | void (*func)() = device_kernel->nativeFunctionAddress();
for (...)
func();
|
Simple, and really fast !