Wednesday, December 16, 2009

SSE2 vs CUDA in parallel bit stream transposition / similarity. Part 2.


We made a second attempt to work with CUDA to better understand if it looks like a suitable technology for database and data-mining acceleration.
With a very valuable help from our readers we prepared a second version of GPU benchmark and tried to analyze performance and power consumption.

CUDA vs SSE. Part 2.

Friday, December 11, 2009

Win7 vs Ubuntu 9.10 (32-bit vs 64-bit)

This benchmark is designed to illustrate platform-compiler choice for BitMagic derived computational algorithms.

Test total numbers (shorter - faster)

From Blog


Test Platform: Nehalem Core i5, 2.67GHz (Turbo Boost enabled)
BitMagic Library 3.6.2 (not yet released)

Windows 7 32-bit used default 32-bit compile using MSVC8. SSE4 optimization disabled.

Ubuntu 9.10 64-bit GCC 4.4.1 with compile options ( -march=core2 -m64 -msse4 )

BitMagic 64-bit optimizations + SSE4 optimizations:
#define BMSSE42OPT
#define BM64OPT

(this combination will be supported in BitMagic Library 3.6.2).

All measurements in seconds (shorter - faster).

From Blog

As you can see, 32-bit default compile using stock compiler looses in almost every test to 64-bit SSE4.2 version, tuned for the platform.

Yes, this experiment intentionally staged for Windows7+MSVC to loose (by using unfavorable compilation and optimization settings: stock x86 32-bit software suffers from register starvation, ofter ignores presence of SIMD SSE unit. Having said that such resource under-utilization often happens in real-life and it is often ignored as insignificant. From this data we can see that sometimes insignificant things can add-up to measurable values.

Sunday, November 22, 2009

BitMagic library version 3.6.1 released

  • Elias Gamma decoding optimizations significantly improved bit-vector de-serialization performance. Compressed bit-vectors now save disk space and IO bandwidth without big CPU costs
  • Fixed stack overrun introduced in 3.6.0
  • First version of SSE 4.2 optimizations for Nehalem (Core i5, i7, etc). Tested with MSVC 2008 and GCC 4.4.1. Use BMSSE42OPT to turn on SSE4.2 support in the code.

Friday, November 13, 2009

64-bit portability checklist

Useful article from developers of PVS Studio tool on porting to 64-bit systems.

20 issues of porting C++ code on the 64-bit platform

SSE4.2 readyness. GCC vs MSVC. Deathmatch.

GCC 4.3.2 (cygwin) vs MS VC++ 2008 (Windows 7).

First results porting BitMagic Library to Intel Nehalem SSE4.2 instruction set. :-)




1>Generating code
1>c:\dev\bm\src\bmsse4.h(140) : fatal error C1001: An internal error has occurred in the compiler.
1>(compiler file 'f:\dd\vctools\compiler\utc\src\p2\main.c[0x60375DBB:0x00000000]', line 182)
1> To work around this problem, try simplifying or changing the program near the locations listed above.
1>Please choose the Technical Support command on the Visual C++
1> Help menu, or open the Technical Support help file for more information
1>LINK : fatal error LNK1000: Internal error during IMAGE::BuildImage
1> Version 9.00.30729.01
1> ExceptionCode = C0000005
1> ExceptionFlags = 00000000
1> ExceptionAddress = 60375DBB (60300000) "C:\Program Files\Microsoft Visual Studio 9.0\VC\bin\c2.dll"
1> NumberParameters = 00000002
1> ExceptionInformation[ 0] = 00000000
1> ExceptionInformation[ 1] = 00000000
1>CONTEXT:
1> Eax = 031F666C Esp = 0034EC78
1> Ebx = 00000000 Ebp = 0034ECA4
1> Ecx = 00000000 Esi = 03214298
1> Edx = 00000000 Edi = 03201608
1> Eip = 60375DBB EFlags = 00010246
1> SegCs = 0000001B SegDs = 00000023
1> SegSs = 00000023 SegEs = 00000023
1> SegFs = 0000003B SegGs = 00000000
1> Dr0 = 00000000 Dr3 = 00000000
1> Dr1 = 00000000 Dr6 = 00000000
1> Dr2 = 00000000 Dr7 = 00000000
1>Build log was saved at "file://c:\dev\bm\bmperf\Release\BuildLog.htm"
1>bmperf - 2 error(s), 5 warning(s)




$ make
======= Building : BitMagic Library Performance Test cygwin GNU_CC
< perf.cpp >
g++ -msse4 -march=core2 -Wall -c -D_REENTRANT -DCYGWIN_NT -D_GNU_SOURCE -Wno-d
eprecated -I/cygdrive/c/dev/bm/src perf.cpp -o /cygdrive/c/dev/bm/tests/perf
/Release/perf.o -g0 -O2 -march=core2 -fomit-frame-pointer -pipe
perf.cpp: In function 'void EnumeratorTest()':
perf.cpp:497: internal compiler error: in memory_address_length, at config/i386/
i386.c:16380
Please submit a full bug report,
with preprocessed source if appropriate.
See for instructions.
make: *** [/cygdrive/c/dev/bm/tests/perf/Release/perf.o] Error 1



Monday, October 19, 2009

Notes on Elias Gamma encoding in BitMagic 3.6.0

Version 3.6.0 implements a variant of Elias Gamma Encoding as part of bit-vector serialization.
Wikipedia has an article on this encoding here:
http://en.wikipedia.org/wiki/Elias_gamma_coding

The description is so good that one should be able to implement this encoding after reading a couple of pages.

Older versions of BM library used D-Gap compression which is a variant of run length encoding. Or used plain lists of integers to encode very sparse sets. All this compression methods used word aligned algorithms and rely on presence of long runs of identical bits. Word aligned compression is also very fast.

A lot of practical tests demonstrated that BM based search systems are bandwidth limited. It means that CPU is starved, waiting for memory operations or disk retrieval. Disk retrieval is particularly important because IO and network subsystems never followed terms of Moore Law. Non-Word-Aligned encoding seems to a good answer to this problem.

Elias Gamma encoding developed by Peter Elias postulates bit-encoding of positive (non-zero) integers . Positive non-zero integers are essentially the native format for BM library to store sparse bit-vectors or inverted lists (basic building blocks of efficient bitmap indexes).

Advantages:
1.Simple implementation, requires no compression dictionary. It is easy to compress just a few ints without constructing and storing a dictionary. BM library works with independent blocks of bits (potentially computed in parallel).
2.Implied probability favors small integers – good for BM library since it fragments bit-vector into blocks of 65K bits. Methods allows some blocks to be encoded, and some to remain in GAP or raw bit-block format (if statistical distribution of bits suddenly changes).
3.Compact implementation reduces code bloat, fits into a C++ template library paradigm.

Disadvantages:
1.True adaptive statistical methods (like arithmetic compression) can(should?) give better compression ratio.
2.Encoding and decoding method is sequential, it seems hard to implement a parallel encoder/decoder using SIMD, SSE or GP-GPU. (BM blocks can potentially be decoded in parallel so maybe there is a way to use parallel methods for this).

Saturday, October 17, 2009

BitMagic Library v.3.6.0 released

  • improved bit-vector serialization with bitset compression.
    New serialization adds Gamma encoding on top of D-Gap compression. This should
    reduce IO for databases and search systems using serialized bit-vectors.
  • fixed functionality bug in deserialization with logical AND
  • new serialization API (old API kept for backward compatibility)