BitMagic Library Blog: August 2009

Saturday, August 22, 2009

BitMagic version 3.5.4 released

The scope of this release is general code cleanup (only partially successful) and SSE2 for GCC compatibility fixes.

GCC supports SSE2 intrinsics, so this port was relatively easy. Practically the only change needed was correct syntax for 16-byte alignment declaration. SSE2 optimizations are now available for Intel C++, GCC and MSVC.

Download: https://sourceforge.net/projects/bmagic/files/

Monday, August 17, 2009

CUDA vs SSE

We all know that GPUs can be incredibly fast at some single precision floating point algorithms.
Developers of BitMagic Library Igor Tolstoy and Anatoliy Kuznetsov made an attempt to implement parallel bit-stream transposition and similarity algorithm (a lot of bit-shifting, population counting and general purpose integer logic) to better understand CUDA and GP-GPU computing in comparison with current (and future) Intel vectorization architectures (SSE2).

Benchmarking, CUDA and SSE source codes, some speculation about Larabee and implications for data-mining and large scale databases here:
http://bmagic.sourceforge.net/bmcudasse2.html

Wednesday, August 12, 2009

DoxyDocumentation

Doxygen documentation made available on the WEB:
http://bmagic.sourceforge.net/doxy/modules.html

Sunday, August 9, 2009

Intel Atom: Best GCC settings

After several experiments with 32-bit GCC 4.3.3 on Intel Atom N270 here some recommendations for optimization parameters.
-march=core2 -O2 -fomit-frame-pointer -pipe
Another good alternative:
-mtune=native -O2 -fomit-frame-pointer -pipe

In some BM benchmarks -march=core2 is up to 20% better than -march=prescott
So it looks like, Intel Atom is more Core2 than Pentium4. For low power devices like Atom it looks very important to use right settings since platform balances on a very fine edge in terms of usability. Given the growing popularity of this CPU it would be nice to see binary builds of software products optimized for the target (or doing internal CPUID auto-detection) to load the right binaries.

SSE2 integer optimization seems to get a nice additional boost too. Current version of BitMagic Library (3.5.3) s not compatible with GCCs syntax for variable alignment (one of the key elements for getting best memory bandwidth with SSE). Next version is going to offer better compatibility.

Saturday, August 8, 2009

BitMagic version 3.5.3 released

improved performance of AND logical operations and derived DOT product algorithms (cound_and(), any_and())
improved compression of serialized bvectors (serialization format remains backward compatible)
fixed compatibility issues with MSVC8 and Intel C++ 11.1 (SSE2 optimization mode)
improved automatic variables disambiguation using 'restrict'

Download: https://sourceforge.net/projects/bmagic/files/

Friday, August 7, 2009

Search for Efficient Bitwise Traversal

One of the fundamental problems BitMagic library faces is optimization of bit traverse algorithms.
BitMagic implements a class bvector<>::emunerator to iterate 1 bits just as they are sequence of integers. Under the hood of the high level algorithm is a problem how to unpack a single integer value into a variable length array of 1 bit indexes.

For example an bit string 10011 should translate into { 0, 3, 4 }.

This problem is somewhat related to a famous POPCNT family of bit counting algorithms. POPCNT is beaten to death on the Internet, while there are not to many sources doing bit traversal (which looks very relevant too).

It is hard to implement a fast and generic algorithm to behave equally well for words with high and low bit density. In this post we want to encourage others to try finding alternative methods, optimizing for 64-bit, SSE2, maybe GPU, CUDA, etc.

Solution 1.
This algorithm uses 4-bit unrolled loop with a straight-forward check. This code is very compact and works well when both code and data fit into L1 CPU cache. Template parameter T represents integer word type, F is a functor-class to respond to found 1 bit.


template < typename T, typename F > 
void bit_for_each(T w, F& func)
{
 for (unsigned octet = 0; w != 0; w >>= 8, octet += 8)
 {
      if (w & 1)   func(octet + 0);
      if (w & 2)   func(octet + 1);
      if (w & 4)   func(octet + 2);
      if (w & 8)   func(octet + 3);
      if (w & 16)  func(octet + 4);
      if (w & 32)  func(octet + 5);
      if (w & 64)  func(octet + 6);
      if (w & 128) func(octet + 7);
 } 
}

Solution 2.
Second algorithm implements a hard coded switch table (C/C++-compiler most likely transforms this into a JUMP address table. As general purpose solution this algorithm looks faster than variant 1. It gives compiler a better compile time optimization plan.
The catch is that functor should support multiple arguments (up to4 in this case). All functor operators () must be inline, otherwise all performance benefits easily lost pushing parameters down to the stack. (This solution is inspired by template meta-programming articles, I believe this code can be written as a compact template meta program at the expense of clarity and maybe compatibility...)


template < typename T, typename F > 
void bit_for_each_4(T w, F& func)
{
 for (unsigned sub_octet = 0; w != 0; w >>= 4, sub_octet += 4)
 {
  switch (w & 15)
  {
  case 0: // 0000
      break;
  case 1: // 0001
      func(sub_octet);
      break;
  case 2: // 0010
      func(sub_octet + 1);
      break;
  case 3: // 0011
      func(sub_octet, sub_octet + 1);
      break;
  case 4: // 0100
      func(sub_octet + 2);
      break;
  case 5: // 0101
      func(sub_octet, sub_octet + 2);
      break;
  case 6: // 0110
      func(sub_octet + 1, sub_octet + 2);
      break;
  case 7: // 0111
      func(sub_octet, sub_octet + 1, sub_octet + 2);
      break;
  case 8: // 1000
      func(sub_octet + 3);
      break;
  case 9: // 1001
      func(sub_octet, sub_octet + 3);
      break;
  case 10: // 1010
      func(sub_octet + 1, sub_octet + 3);
      break;
  case 11: // 1011
      func(sub_octet, sub_octet + 1, sub_octet + 3);
      break;
  case 12: // 1100
      func(sub_octet + 2, sub_octet + 3);
      break;
  case 13: // 1101
      func(sub_octet, sub_octet + 2, sub_octet + 3);
      break;
  case 14: // 1110
      func(sub_octet + 1, sub_octet + 2, sub_octet + 3);
      break;
  case 15: // 1111
      func(sub_octet, sub_octet + 1, sub_octet + 2, sub_octet + 3);
      break;
  default:
      break;
  } 
 }

Again there should be a way to write an assembly code using BSF (bit scan forward), BSR (bit scan reverse). References on relevant samples are highly appreciated.

Wednesday, August 5, 2009

Compilers test ICC vs MSVC vs GCC

BitMagic benchmark test compiled on Windows XP under different compilers (GCC 3.4, MSVC 8 and Intel C++ 11). Tests here are synthetic, but represent very practical functions used in data-mining and database search algorithms. No SSE2 or 64-bit optimization, pure 32-bit mode, single thread. System is running on Core2 Duo.

BitMagic individual tests (shorter - faster)

Overall results (shorter - faster)

Test results are relatively uniform (code is rather well tuned manually), it was a bit of surprise to see Intel C++ 11 slower in XOR count test (Humming distance).
Overall winner for integer algorithms - GNU C++.

BitMagic Library Blog