Data … as usual

All things about data by Laurent Leturgez

SIMD Extensions in and out Oracle 12.1.0.2

Leave a comment Posted by Laurent on April 22, 2015

First of all, I would like to thank Tanel Pöder from Enkitec Accenture for its review of this post and some precious information he gave me.

—-

Recently I posted a link on twitter which explains basics of SIMD Programming (https://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-docs/CellProgrammingTutorial/BasicsOfSIMDProgramming.html), and I had a reply which asked me if it was Oracle 12c style, and the answer is … yes and no.

What is a SIMD extension?

A SIMD Extension is a CPU instruction that computes many data in only one instruction (Single Instruction Multiple Data). Imagine, you have 2 arrays of 4 integers, and you want to compute a sum of those 2 arrays. A classical way will be to loop on each value and add them one by one and to get the result in another array. This operation will produce 4 operations.

Now Imagine, your arrays are now located in a vector of 4 integers, those 2 vectors are in fact specific registers and with only one CPU instruction, you will add those 2 vectors by producing only one vector. You reduce CPU instructions by 4 … for the same result.

If it’s not clear, don’t go away … I have written small C sample code to demonstrate this.

A bit of history

SIMD extensions are not quite recent. They have been created in 1970 with vector programming.

In 1996, SIMD extensions have been widely deployed with MMX extensions (which are SIMD extensions), then Alvitec systems with motorola processors and IBM Power systems have developed more powerful instructions. Then Intel reveals its new SSE extensions in 1999 that have been improved by other extension SSE2, SSE3, SSSE3, SSE4 and now AVX, AVX2 and AVX512 extensions.

So Oracle is not using a specific extension but those which are available on your platform, because all CPUs are not offering the same extensions. For example, modern processors have AVX extensions, but most recent extension (AVX-512) are only available in Xeon Phi Knights Landing and Xeon Skylake microarchitectures (broadwell successors).

Data Structures

SIMD extensions are based on data structures or vectors.

A vector is an array data structure (don’t be confused with an array datatype) which have a fixed length and which is, in fact, a succession of scalars of one type.

For example, if you have a vector of 64 bits (8 bytes), you can put in it 2 integers because an integer has a 4 bytes size (in x86-64 arch), 8 chars (1 bytes) but only one double (8 bytes long).

Those data structures are located is CPU registers dedicated for those SIMD instructions.

Let’s take an example, you want to process the sum of two vectors in a processor which uses only MMX instructions (old one 😉 ) have 8 registers (MM0 through MM7). Each register holds 64 bits.

First vector content is 1,2 and second one is 1,2. First vector is copied from memory to MM0 register and the second in MM1, and then the CPU launch the SIMD instruction that will produce in MM0 the sum on MM1 and MM0, and then MM0 is copied in memory as a result.

Now imagine, your vector doesn’t hold 64 bits but 128, 256, 512 or 1024 … you will put in it more data and those data will be computed with only one operation …

It’s one of the key of SIMD evolution, MMX uses 64 bits registers (MM0 to MM7), SSE (1/2/3 and 4) uses 128 bits registers (XMM), AVX (1/2) uses 256 bits registers (YMM), and AVX-512 uses 512 bits registers (ZMM).

For Intel processors, vector datatypes are __m64, __mm128, __mm256, and __mm512 (each vector will contain floating point value aka float), you have the equivalent for double precision values (__mm128d, __mm256d, __mm512d) and for other types : int, short, char (__mm128i, __mm256i, __mm512i).

Note: Note that all those types are automatically aligned on a 8, 16, 32 or 64 bytes boundaries.

Now computing data

Now you know how will be computed your data, you can perform operation on it. You can add, multiply your vectors, perform bit shifting etc.

You have the choice to do “classical” operations, or you can use Intel’s intrinsics which are functions which computes a specific operation (basic mathematics, bit shifting, comparisons etc.). All of Intel’s Intrinsics are available at this URL: https://software.intel.com/sites/landingpage/IntrinsicsGuide/. On this page you can also see performance information of each function on different processors.

Examples

For all examples above, I used C langage.

Compiling “SIMD aware” programs (with GCC)

If you want to compile SIMD aware program, you have to include “immintrin.h” header file which is available with GCC. This header will test which extension you have, and you have used for you compilation. (Just find this file and open it). Depending on your CPU and compilation, it will include another header file:

mmintrin.h for MMX instructions and datatypes:
xmmintrin.h for SSE
emmintrin.h for SSE2
pmmintrin.h for SSE3
tmmintrin.h for SSSE3
smmintrin.h for SSE4.1 and SSE4.2
avxintrin.h for AVX

When you compile your program, some extensions are not included by default. Indeed if your CPU supports AVX extensions, if you don’t give the correct option to the compiler, AVX won’t be used.

Main options are:

O3: this option enable vectorization loops optimization.
msse4.1: this option enable SSE4.1 extension
msse4.2: this option enable SSE4.2 extension
mavx: this option enable AVX extension
mavx2: this option enable AVX2 extension

Other options are available here: https://gcc.gnu.org/onlinedocs/gcc-4.4.7/gcc/i386-and-x86_002d64-Options.html

To demonstrate this, I used a small program:


#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>

void print_extensions () {
#ifdef __MMX__
printf("MMX ... OK\n");
#else
printf("MMX ... KO\n");
#endif

#ifdef __SSE__
printf("SSE ... OK\n");
#else
printf("SSE ... KO\n");
#endif

#ifdef __SSE2__
printf("SSE2 ... OK\n");
#else
printf("SSE2 ... KO\n");
#endif

#ifdef __SSE3__
printf("SSE3 ... OK\n");
#else
printf("SSE3 ... KO\n");
#endif

#ifdef __SSSE3__
printf("SSSE3 ... OK\n");
#else
printf("SSSE3 ... KO\n");
#endif

#if defined (__SSE4_2__) || defined (__SSE4_1__)
printf("SSE4_1/2 ... OK\n");
#else
printf("SSE4_1/2 ... KO\n");
#endif

#if defined (__AES__) || defined (__PCLMUL__)
printf("AES/PCLMUL ... OK\n");
#else
printf("AES/PCLMUL ... KO\n");
#endif

#ifdef __AVX__
printf("AVX ... OK\n");
#else
printf("AVX ... KO\n");
#endif
}

int main(int argc, char** argv) {
print_extensions();
return 0;
}

If you run it with only O3 optimization, you will get this result:


macbook-laurent:simd $ sysctl -a | egrep 'cpu.*features'
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: SMEP ENFSTRG RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM
machdep.cpu.extfeatures: SYSCALL 1GBPAGE EM64T LAHF RDTSCP TSCI

macbook-laurent:simd $ cc -O3 -o simd_ext simd_ext.c
macbook-laurent:simd $ ./simd_ext
MMX ... OK
SSE ... OK
SSE2 ... OK
SSE3 ... OK
SSSE3 ... OK
SSE4_1/2 ... KO
AES/PCLMUL ... KO
AVX ... KO

If you run with correct options, your program can use AVX or SSE4 extensions:

macbook-laurent:simd $ cc -O3 -msse4.2 -o simd_ext simd_ext.c
macbook-laurent:simd $ ./simd_ext
MMX ... OK
SSE ... OK
SSE2 ... OK
SSE3 ... OK
SSSE3 ... OK
SSE4_1/2 ... OK
AES/PCLMUL ... KO
AVX ... KO
macbook-laurent:simd $ cc -O3 -mavx -o simd_ext simd_ext.c
macbook-laurent:simd $ ./simd_ext
MMX ... OK
SSE ... OK
SSE2 ... OK
SSE3 ... OK
SSSE3 ... OK
SSE4_1/2 ... OK
AES/PCLMUL ... KO
AVX ... OK

Note that if you enable AVX extension, SSE4 extensions are enabled by default.

Example of SSE2 usage in a basic operation (sum)

The C code above will show you how to perform a sum of two arrays of 16 integers each without using Intel intrinsics:


void func2_sse() {
int a[16] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
int b[16] = {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1};
__m128i* aptr;
__m128i* bptr;
int i;
int loopcnt=0;
printf("sizeof(__m128i)=%lu\n",sizeof(__m128i));
printf("sizeof(a)=%lu\n",sizeof(a));

// Above, we cast integer arrays to vectors of integers

aptr=(__m128i*)a;
bptr=(__m128i*)b;

// and now we compute the sum
for (i=0;i<sizeof(a)/sizeof(__m128i);i++) {
loopcnt++;
bptr[i]=aptr[i]+bptr[i];
}

int* c=(int*)bptr;

printf("loopcount = %d\nresult= ",loopcnt);
for (i=0;i<16;i++) {
printf("%d ",c[i]);
}
printf("\n");
}

and the result, my sum has been computed in only 4 loops:


SSE
--------------------
sizeof(__m128i)=16
sizeof(a)=64
loopcount = 4
result= 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

Same example with AVX extension:


void func2_avx() {
 int a[16] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
 int b[16] = {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1};
 __m256i* aptr;
 __m256i* bptr;
 int i;
 int loopcnt=0;
 printf("sizeof(__m256i)=%lu\n",sizeof(__m256i));
 printf("sizeof(a)=%lu\n",sizeof(a));
 aptr=(__m256i*)a;
 bptr=(__m256i*)b;

 for (i=0;i<sizeof(a)/sizeof(__m256i);i++) {
 loopcnt++;
 bptr[i]=aptr[i]+bptr[i];
 }

 int* c=(int*)bptr;

 printf("loopcount = %d\nresult= ",loopcnt);
 for (i=0;i<16;i++) {
 printf("%d ",c[i]);
 }
 printf("\n");
}

and the result, my sum has been computed in only 2 loops:


AVX
--------------------
sizeof(__m256i)=32
sizeof(a)=64
loopcount = 2
result= 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

Now, let’s compare two data sets with SIMD extension

Next code sample concerns a vector where we want to search the value 10. To do that, we use a comparison function and a function to build a 256bits (AVX) vector full of the value we search. The comparison function works with 32bits packets (useful to compare integers) and returns 0xFFFFFFFF if both values are equal, 0x0 otherwise. As it’s an AVX function, our initial vector composed by 16 values is processed in only 2 CPU cycles.

void func2_compare_32bitsPack() {
    int a[16] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    __m256i* aptr;
    __m256i b;
    int i;
    int loopcnt=0;
    aptr=(__m256i*)a;
    // b is a vector full off int(32bits) equal to 10 (the value we search)
    b=_mm256_set1_epi32(10);

    for (i=0;i<sizeof(a)/sizeof(__m256i);i++) {
        loopcnt++;
        // comparison intrinsic function: packed by 32 bits(specific for int: if equal set 0xFFFFFFFF, 0x0 otherwise)
        aptr[i]=_mm256_cmpeq_epi32(aptr[i],b);
    }

    // print results
    int* c=(int*)aptr;

    printf("loopcount = %d\nresult= ",loopcnt);
    for (i=0;i<16;i++) {
        printf("0x%x   ",c[i]);
    }
    printf("\n");
}

And the result:


macbook-laurent:simd $ ./simd
Comparison
loopcount = 2
result= 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0xffffffff 0x0 0x0 0x0 0x0 0x0 0x0

It becomes easy to identify that the value 10 is located at the index 10 in our initial array.

Ok, and how SIMD extensions are used in Oracle 12c In Memory ?

If you have read my last post on how to activate SSE4 extensions on VirtualBox guests (https://laurent-leturgez.com/2015/04/14/enable-simd-sse4-extension-in-oracle-virtualbox/) , and Tanel Pöder’s post (https://blog.tanelpoder.com/2014/10/05/oracle-in-memory-column-store-internals-part-1-which-simd-extensions-are-getting-used/), you have noticed that Oracle can run IM with only SSE2 extension (default), but if your CPUs have SSE4, or AVX extensions, Oracle will use some specific libraries that uses SSE4 (libshpksse4212.so) and AVX (libshpkavx12.so).

If we have a look at functions in those libraries, we will see that every function starts with “kdzk”

[oracle@oel64-112 ~]$ readelf -a /u01/app/oracle/product/12.1.0/dbhome_1/lib/libshpksse4212.a | grep FUNC
 6: 0000000000000030 256 FUNC LOCAL DEFAULT 3 kdzk_overload_opc_name
 23: 0000000000000130 80 FUNC LOCAL DEFAULT 3 kdzk_flag_name
 26: 0000000000000180 112 FUNC LOCAL DEFAULT 3 kdzk_enc_name
 31: 00000000000001f0 320 FUNC LOCAL DEFAULT 3 kdzk_datawidth_name
 64: 0000000000002b70 544 FUNC LOCAL DEFAULT 3 kdzk_eq_dict_1bit
 65: 0000000000002d90 544 FUNC LOCAL DEFAULT 3 kdzk_lt_dict_1bit
 66: 0000000000002fb0 544 FUNC LOCAL DEFAULT 3 kdzk_gt_dict_1bit
 67: 00000000000031d0 592 FUNC LOCAL DEFAULT 3 kdzk_le_dict_1bit
 68: 0000000000003420 592 FUNC LOCAL DEFAULT 3 kdzk_ge_dict_1bit
 69: 0000000000003670 544 FUNC LOCAL DEFAULT 3 kdzk_ne_dict_1bit
 70: 0000000000003890 224 FUNC LOCAL DEFAULT 3 kdzk_gt_lt_dict_1bit
 71: 0000000000003970 576 FUNC LOCAL DEFAULT 3 kdzk_gt_le_dict_1bit
 72: 0000000000003bb0 576 FUNC LOCAL DEFAULT 3 kdzk_ge_lt_dict_1bit
 73: 0000000000003df0 992 FUNC LOCAL DEFAULT 3 kdzk_ge_le_dict_1bit
 74: 00000000000041d0 512 FUNC LOCAL DEFAULT 3 kdzk_eq_dict_1bit_null
 75: 00000000000043d0 192 FUNC LOCAL DEFAULT 3 kdzk_lt_dict_1bit_null
 76: 0000000000004490 512 FUNC LOCAL DEFAULT 3 kdzk_gt_dict_1bit_null
 77: 0000000000004690 512 FUNC LOCAL DEFAULT 3 kdzk_le_dict_1bit_null
 78: 0000000000004890 464 FUNC LOCAL DEFAULT 3 kdzk_ge_dict_1bit_null
 79: 0000000000004a60 512 FUNC LOCAL DEFAULT 3 kdzk_ne_dict_1bit_null
 80: 0000000000004c60 192 FUNC LOCAL DEFAULT 3 kdzk_gt_lt_dict_1bit_null
 81: 0000000000004d20 528 FUNC LOCAL DEFAULT 3 kdzk_gt_le_dict_1bit_null
 82: 0000000000004f30 192 FUNC LOCAL DEFAULT 3 kdzk_ge_lt_dict_1bit_null
 83: 0000000000004ff0 528 FUNC LOCAL DEFAULT 3 kdzk_ge_le_dict_1bit_null
 84: 0000000000005200 848 FUNC LOCAL DEFAULT 3 kdzk_eq_dict_2bit_selecti
 85: 0000000000005550 960 FUNC LOCAL DEFAULT 3 kdzk_eq_dict_2bit
 89: 0000000000005910 848 FUNC LOCAL DEFAULT 3 kdzk_lt_dict_2bit_selecti
 90: 0000000000005c60 1056 FUNC LOCAL DEFAULT 3 kdzk_lt_dict_2bit
 91: 0000000000006080 848 FUNC LOCAL DEFAULT 3 kdzk_gt_dict_2bit_selecti
 92: 00000000000063d0 1008 FUNC LOCAL DEFAULT 3 kdzk_gt_dict_2bit
 93: 00000000000067c0 848 FUNC LOCAL DEFAULT 3 kdzk_le_dict_2bit_selecti
 94: 0000000000006b10 1024 FUNC LOCAL DEFAULT 3 kdzk_le_dict_2bit
 95: 0000000000006f10 848 FUNC LOCAL DEFAULT 3 kdzk_ge_dict_2bit_selecti
 96: 0000000000007260 1056 FUNC LOCAL DEFAULT 3 kdzk_ge_dict_2bit
 97: 0000000000007680 848 FUNC LOCAL DEFAULT 3 kdzk_ne_dict_2bit_selecti
 98: 00000000000079d0 960 FUNC LOCAL DEFAULT 3 kdzk_ne_dict_2bit
 99: 0000000000007d90 928 FUNC LOCAL DEFAULT 3 kdzk_gt_lt_dict_2bit_sele
 100: 0000000000008130 1328 FUNC LOCAL DEFAULT 3 kdzk_gt_lt_dict_2bit
 101: 0000000000008660 928 FUNC LOCAL DEFAULT 3 kdzk_gt_le_dict_2bit_sele
 102: 0000000000008a00 1296 FUNC LOCAL DEFAULT 3 kdzk_gt_le_dict_2bit
 103: 0000000000008f10 928 FUNC LOCAL DEFAULT 3 kdzk_ge_lt_dict_2bit_sele
 104: 00000000000092b0 1328 FUNC LOCAL DEFAULT 3 kdzk_ge_lt_dict_2bit</pre>

kdzk is the Oracle component that manages compression:


SQL> oradebug doc components

.../...

Components in library ADVCMP:
--------------------------
 ADVCMP_MAIN Archive Compression (kdz)
 ADVCMP_COMP Archive Compression: Compression (kdzc, kdzh, kdza)
 ADVCMP_DECOMP Archive Compression: Decompression (kdzd, kdzs)
 ADVCMP_DECOMP_HPK Archive Compression: HPK (kdzk)
 ADVCMP_DECOMP_PCODE Archive Compression: Pcode (kdp)

An interesting thing to see is that, even you use an Oracle Kernel without any SSE4 nor AVX extension active (so your process doesn’t use libshpksse4212.so nor libshpkavx12.so library), you use kdz functions when you query and filter a table which is managed in Memory.

In a session I run the statements above:


SQL> select segment_name,BYTES,BYTES_NOT_POPULATED from v$im_segments

SEGMENT_NAME         BYTES         BYTES_NOT_POPULATED
-------------------- ------------- -------------------
S                         37748736                   0

SQL> select spid from v$process where addr=(select paddr from v$session where sid=sys_context('USERENV','SID'));

SPID
------------------------
3619

SQL> select count(*) from s where amount_sold>1700;

Just before launching the command, I attach my process and run gdb to catch every call to kdz functions:


[oracle@oel64-112 ~]$ pmap -x 3619 | egrep 'sse|avx'

[oracle@oel64-112 ~]$ gdb -pid 3619
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-64.el6_5.2)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
.../...
Loaded symbols for /u01/app/oracle/product/12.1.0/dbhome_1/lib/libnque12.so
0x000000362ea0e740 in __read_nocancel () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6_5.4.x86_64 libaio-0.3.107-10.el6.x86_64 numactl-2.0.7-8.el6.x86_64
(gdb) rbreak ^kdz

.../...

(gdb) commands
Type commands for breakpoint(s) 1-2165, one per line.
End with a line saying just "end".
>continue
>end

If you study the output, you will see that a lot of functions are called, and in the list, you will find some interesting functions: kdzdcol_get_minval, kdzdcol_get_maxval, kdzk_build_vector etc. Oracle clearly uses vectors to process IM compression units.

In my opinion, it’s normal to use functions related to compression because the kernel manipulates “Compression Units”, and it should integrates SIMD functions in its libraries.

A last curiosity with Oracle 12c (12.1.0.2)

Ok now you had a look to your installation, your machine is “AVX enabled”, and Oracle processes uses the AVX compatible library (libshpkavx212.so), everything is OK and you think you will use all this stuff.

But if you use objdump on this library, and you search for AVX registers, you won’t find anything:


[oracle@oel64-112 ~]$ grep -i ymm objdump_out.1 | wc -l
0

Tanel Pöder gave me the answer !!! Oracle database code is compiled to be compatible with Redhat/Oracle Linux 5, so it must be compatible with kernel 2.6.18. But linux scheduler can work with YMM registers from version 2.6.30 onwards.

You can use new instructions without the kernel knowing about us, but you can’t use registers that are not yet supported by the kernel.

I think next version of Oracle will improve this, maybe in 12.2.

To conclude, there is not Oracle 12c style for SIMD instructions. Oracle has developed functions that uses SIMD instructions, for Intel CPUs, they uses SSE, SSE2, SSE3, SSE4 or AVX depending on the CPU architecture, on IBM AIX these libraries use VMX extension (SIMD instruction on Power) etc.

Sources:

http://blog.tanelpoder.com/2014/10/05/oracle-in-memory-column-store-internals-part-1-which-simd-extensions-are-getting-used/

http://en.wikipedia.org/wiki/Data_structure_alignment

http://en.wikipedia.org/wiki/Advanced_Vector_Extensions

http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions

http://en.wikipedia.org/wiki/SIMD

https://software.intel.com/sites/landingpage/IntrinsicsGuide/

https://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-docs/CellProgrammingTutorial/BasicsOfSIMDProgramming.html

https://laurent-leturgez.com/2015/04/14/enable-simd-sse4-extension-in-oracle-virtualbox/