Hello,
Here is a small post to say I just discovered a "killer optimization" in Clang that makes me confident that it will be a very good OpenCL compiler.
I'm currently implementing all the functions described in the "The OpenCL Platform Layer" chapter of the OpenCL spec. This chapter contains three functions whose names end with "Info".
These functions have a signature like this:

```c
cl_int clGetSomethingInfo(cl_something, cl_enum info, size_t len, void *buf, size_t *real_len);
```
- cl_something is the object for which we want a piece of information
- info is the info we want (CL_CONTEXT_DEVICES for example)
- len is the size of the application-allocated buffer that will contain the info. This buffer must be large enough to contain what the function will return
- buf is the buffer
- real_len is set by the function and says how many bytes the info actually takes
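From the application side, the usual pattern is two calls: one with a NULL buffer to learn the size, then one to fetch the value. Here is a minimal sketch of that pattern. The names get_something_info and query_info_string are hypothetical, and the stand-in function always reports the same string so that the example builds without an OpenCL library.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a clGet*Info function: it always reports
 * the same string, so no OpenCL library is needed to try it. */
int get_something_info(size_t len, void *buf, size_t *real_len)
{
    static const char info[] = "OpenCL 1.0";

    if (buf && len < sizeof(info))
        return -1;                    /* buffer too small */
    if (buf)
        memcpy(buf, info, sizeof(info));
    if (real_len)
        *real_len = sizeof(info);     /* size of the info, NUL included */
    return 0;
}

/* Application side: ask for the size first, then fetch the value.
 * Returns a malloc'ed string the caller must free, or NULL on error. */
char *query_info_string(void)
{
    size_t needed = 0;
    char *buf;

    if (get_something_info(0, NULL, &needed) != 0)
        return NULL;

    buf = malloc(needed);
    if (!buf)
        return NULL;

    if (get_something_info(needed, buf, NULL) != 0) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

Both buf and real_len can be NULL here, which is exactly why the implementation side needs so many checks.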
These functions are convenient to use from the application perspective, but are a pain to implement, because info can take dozens of values, and each value has a type. And for each value, we must check that it will fit in the application-provided buffer. So, it's a big switch full of copy/paste.
When I took over the code from its original author, I saw that he implemented this kind of function like this (lines to copy/paste are marked with an @; lines originally written by the author are marked with a #, because I added the others to bring this code on par with mine checks-wise):

```c
#define VERSION_STR "OpenCL 1.0"
#define VERSION_STR_LEN 11 // 10 characters + the terminating NUL that strcpy also writes

cl_int foo()
{
    switch (param_name)
    {
        case CL_PLATFORM_VERSION:
            if (param_value_size < VERSION_STR_LEN && param_value)  // @
                return CL_INVALID_VALUE;                            // @
            if (param_value)                                        // @
                strcpy((char*)param_value, VERSION_STR);            // #
            if (param_value_size_ret)                               // @
                *param_value_size_ret = VERSION_STR_LEN;            // #
            break;

        default:
            return CL_INVALID_VALUE;
    }

    return CL_SUCCESS;
}
```
We can see that each case statement needs to be full of verification code. Did I say that nearly all the parameters are optional?
So, I found an elegant solution to this problem: move the verification code out of the switch, so that the case statements are as empty as possible. The resulting code can be found here, and looks like this:

```cpp
cl_int foo()
{
    void *value;            // The pointer we'll use outside the switch
    size_t value_length;    // Nearly each case has a different value size

    // Then, to save space on the stack, a small union
    // (anonymous, so its members are used directly: this is C++)
    union
    {
        cl_uint cl_uint_var;
        cl_device_type cl_device_type_var;
    };

    // Use some macros to clarify the code
    #define SIMPLE_ASSIGN(type, _value) \
        value_length = sizeof(type);    \
        type##_var = _value;            \
        value = & type##_var;

    #define STRING_ASSIGN(string)           \
    {                                       \
        static const char str[] = string;   \
        value_length = sizeof(str);         \
        value = (void *)str;                \
    }

    // And now the switch
    switch (param_name)
    {
        case CL_DEVICE_TYPE:
            SIMPLE_ASSIGN(cl_device_type, CL_DEVICE_TYPE_CPU);
            break;

        case CL_DEVICE_VENDOR_ID:
            SIMPLE_ASSIGN(cl_uint, 0);
            break;

        // Dozens of cases

        case CL_DEVICE_OPENCL_C_VERSION:
            STRING_ASSIGN("OpenCL C 1.1 LLVM 3.0"); // TODO: LLVM version
            break;

        default:
            return CL_INVALID_VALUE;
    }

    // Now that we know all we have to, we can check everything in one place
    if (param_value && param_value_size < value_length)
        return CL_INVALID_VALUE;

    if (param_value_size_ret)
        *param_value_size_ret = value_length;

    if (param_value)
        memcpy(param_value, value, value_length);

    return CL_SUCCESS;
}
```
For one or two cases, my code is longer, but it is easier to read and less error-prone when there are more cases (copy/paste should always be avoided in programming).
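To make the three outcomes of the shared checks concrete, here is a reduced sketch in plain C that builds without the OpenCL headers. Everything named in it (get_device_info, the INFO_* constants, the returned values) is a stand-in made up for illustration, not the project's code. Since C has no anonymous union at block scope, the union is named and the macros are dropped.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins (assumptions) for the OpenCL constants. */
enum { INFO_TYPE, INFO_VENDOR_ID, INFO_VERSION };
enum { INFO_OK = 0, INFO_INVALID_VALUE = -1 };

int get_device_info(int param_name, size_t param_value_size,
                    void *param_value, size_t *param_value_size_ret)
{
    static const char version[] = "OpenCL C 1.1";
    const void *value;
    size_t value_length;
    union { uint32_t u32; uint64_t u64; } tmp;

    /* Each case only says where the value is and how big it is. */
    switch (param_name) {
    case INFO_TYPE:
        tmp.u64 = 2;                        /* made-up device type */
        value = &tmp;
        value_length = sizeof(tmp.u64);
        break;
    case INFO_VENDOR_ID:
        tmp.u32 = 0;
        value = &tmp;
        value_length = sizeof(tmp.u32);
        break;
    case INFO_VERSION:
        value = version;
        value_length = sizeof(version);     /* includes the NUL */
        break;
    default:
        return INFO_INVALID_VALUE;
    }

    /* The checks are written once, for every case. */
    if (param_value && param_value_size < value_length)
        return INFO_INVALID_VALUE;
    if (param_value_size_ret)
        *param_value_size_ret = value_length;
    if (param_value)
        memcpy(param_value, value, value_length);
    return INFO_OK;
}
```

Calling it with a NULL buffer only reports the size, a buffer that is too small gets INFO_INVALID_VALUE, and a large enough buffer receives the value.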
So, I was happy with that. Then I wanted to look at the code produced by Clang to see how it handles all of this. What I saw is that it is very good at optimizing, and that my solution was not yet the best. Here is a C version of what it does:

```c
void foo()
{
    // Yeah, Clang agrees that it's a good idea :)
    void *value;
    size_t value_length;

    // In C, we need my union because of static typing, but assembly code uses register
    // overlapping (eax in rax, etc)
    union
    {
        uint32_t i32;
        uint64_t i64;
    } my_union;

    // Then, Clang saw that my code is nearly always using SIMPLE_ASSIGN, that's to say
    // my_union
    value = (void *)&my_union;

    // And it also knows that more than half the values are i32
    value_length = 4;

    // Now a stripped-down version of the switch
    switch (param_name)
    {
        case CL_DEVICE_TYPE:
            my_union.i32 = CL_DEVICE_TYPE_CPU; // Yes, each case is a simple assign operation!
            break;

        case CL_DEVICE_VENDOR_ID:
            my_union.i32 = 0;
            break;

        case CL_DEVICE_MAX_WORK_GROUP_SIZE:
            my_union.i64 = 1;
            value_length = 8; // Oh oh, size_t is bigger than an i32!
            break;

        case CL_DEVICE_VERSION:
            value = (void *)"OpenCL 1.1 Mesa 0.1"; // Oh oh, we don't use the union!
            value_length = 20;
            break;
    }

    // Then we have my checks and the memcpy, they aren't touched by the optimizer
}
```
I'm happy: Clang managed to produce code faster than mine, and without needing macros to keep it short. Congrats!
I didn't test it, but it is possible that even the old version, with all the copy/paste, would be optimized like that by Clang (the union is created by the optimizer pass that "unions" vars that are never used at the same time, and then it's like my version).
So, hats off to the Clang and LLVM developers, you made a wonderful tool! (And by the way, I use Clang exclusively to compile my projects: it's faster and produces better warnings and messages.)
I also looked at the code produced by GCC, but it is less beautiful than the one produced by Clang. Every case has three moves: the value length, the value in the union, and then the address of the union in value.
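Written back as C, that shape would look roughly like the sketch below. It is only an illustration of the "three moves per case" pattern under my own assumptions, not GCC's literal output: select_gcc_style, union scratch and the INFO_* constants are hypothetical names, and the stored values are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins (assumptions) for the OpenCL constants. */
enum { INFO_TYPE, INFO_VENDOR_ID, INFO_MAX_WORK_GROUP_SIZE };

union scratch { uint32_t i32; uint64_t i64; };

/* Nothing is hoisted out of the switch: every case stores the length,
 * the value and the address of the union. Returns 0 on success. */
int select_gcc_style(int param_name, union scratch *tmp,
                     const void **value, size_t *value_length)
{
    switch (param_name) {
    case INFO_TYPE:
        *value_length = 4;      /* move 1: the value length */
        tmp->i32 = 2;           /* move 2: the value in the union */
        *value = tmp;           /* move 3: the address of the union */
        break;
    case INFO_VENDOR_ID:
        *value_length = 4;
        tmp->i32 = 0;
        *value = tmp;
        break;
    case INFO_MAX_WORK_GROUP_SIZE:
        *value_length = 8;
        tmp->i64 = 1;
        *value = tmp;
        break;
    default:
        return -1;
    }
    return 0;
}
```

Compared with the Clang version, the address and the most common length are repeated in each case instead of being set once before the switch.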