Saturday, January 9, 2010

Parallel superscalar bit-counting

While working on version 3.6.3 collective Internet wisdom brought yet another interesting bit-counting algorithm, which turns to be efficient for 64-bit systems with superscalar execution (a form of parallelizm, where linear code with data-independent instructions is actually executed in parallel).

This code below implements population count for 4 64-bit integers. (Tested with GCC 4.4.1 64-bit Linux). This becomes part of BM 3.6.3 (when SourceForge recovers from its (I hope temporary) down).

inline
unsigned bitcount64_4way(bm::id64_t x, bm::id64_t y,
                    bm::id64_t u, bm::id64_t v)
{
    const bm::id64_t m1 = 0x5555555555555555;
    const bm::id64_t m2 = 0x3333333333333333;
    const bm::id64_t m3 = 0x0F0F0F0F0F0F0F0F;
    const bm::id64_t m4 = 0x000000FF000000FF;

    x = x - ((x >> 1) & m1);
    y = y - ((y >> 1) & m1);
    u = u - ((u >> 1) & m1);
    v = v - ((v >> 1) & m1);
    x = (x & m2) + ((x >> 2) & m2);
    y = (y & m2) + ((y >> 2) & m2);
    u = (u & m2) + ((u >> 2) & m2);
    v = (v & m2) + ((v >> 2) & m2);
    x = x + y;
    u = u + v;
    x = (x & m3) + ((x >> 4) & m3);
    u = (u & m3) + ((u >> 4) & m3);
    x = x + u;
    x = x + (x >> 8);
    x = x + (x >> 16);
    x = x & m4;
    x = x + (x >> 32);
    return x & 0x000001FF;
}

SourceForge down :-(

Thursday, January 7, 2010

Saturday, December 19, 2009

BitMagic library version 3.6.2 released

New features:
- added configurations for MacOS-X
 - optimizations for SSE 4.2 in 64-bit mode

now it is fine to use:
#define BM64OPT
#define BMSSE42OPT

This two options produce SIMD vectorized code for SSE 4.2 which uses 64-bit aware bit-counting

- new algorithm for picking random subset out of bvector<>
see  sample10.cpp

Wednesday, December 16, 2009

SSE2 vs CUDA in parallel bit stream transposition / similarity. Part 2.


We made a second attempt to work with CUDA to better understand if it looks like a suitable technology for database and data-mining acceleration.
With a very valuable help from our readers we prepared a second version of GPU benchmark and tried to analyze performance and power consumption.

CUDA vs SSE. Part 2.

Friday, December 11, 2009

Win7 vs Ubuntu 9.10 (32-bit vs 64-bit)

This benchmark is designed to illustrate platform-compiler choice for BitMagic derived computational algorithms.

Test total numbers (shorter - faster)

From Blog


Test Platform: Nehalem Core i5, 2.67GHz (Turbo Boost enabled)
BitMagic Library 3.6.2 (not yet released)

Windows 7 32-bit used default 32-bit compile using MSVC8. SSE4 optimization disabled.

Ubuntu 9.10 64-bit GCC 4.4.1 with compile options ( -march=core2 -m64 -msse4 )

BitMagic 64-bit optimizations + SSE4 optimizations:
#define BMSSE42OPT
#define BM64OPT

(this combination will be supported in BitMagic Library 3.6.2).

All measurements in seconds (shorter - faster).

From Blog

As you can see, 32-bit default compile using stock compiler looses in almost every test to 64-bit SSE4.2 version, tuned for the platform.

Yes, this experiment intentionally staged for Windows7+MSVC to loose (by using unfavorable compilation and optimization settings: stock x86 32-bit software suffers from register starvation, ofter ignores presence of SIMD SSE unit. Having said that such resource under-utilization often happens in real-life and it is often ignored as insignificant. From this data we can see that sometimes insignificant things can add-up to measurable values.