Sunday, April 25, 2010

BitMagic Library v.3.7.0 released

- Fixed logical bug in deserialization (multiple deserialization into 1 bit-vector with incorrect OR)
- Performance optimization count_and(), bvector destruction, etc
- Simplified template arguments for bvector<>

Tuesday, March 30, 2010

Use case: Implementing GROUP BY – HAVING with bit-vectors

In this post I wanted to discuss a design pattern for implementing GROUP BY  -- HAVING requests, common in data-mining application.
Problem description:
Let’s say we have a fact table with multiple fields, all indexed using bit-vector based inverted lists.
Our goal is to run a query which best to describe in pseudo SQL :

SELECT fact1, count(*) FROM FactsTable
WHERE fact1 > ‘value’ AND fact2 BETWEEN 10 and 100 OR fact 3 = ‘valueX’
GROUP BY fact1

The proposed “query plan” for this type of query is a two step process.

Step 1. Program runs a WHERE clause, retrieving bit-vectors corresponding to various parts of the logical expression (fact1 > ‘value’,  fact2 BETWEEN 10 and 100,  fact 3 = ‘valueX’). When all vectors are in memory, series of ANDs and Ors gives us the result. This is a low complexity problem.

Step 2.  Program got the bit-vector from Step 1 (Query Bit-Vector).
All we need to do now is to iteratively read all possible bit-vectors corresponding to fact1. Number of vectors here depends on the total number of variants (facts) in the fact1 field (in other words we are looking at fact1 cardinality). It can be quite large, but many intermediate results have a good chance to be filtered out with a HAVING criteria.

To perform GROUP BY – HAVING we propose to execute logical operation AND between the query vector and all fact1 vectors.  HAVING is trivial – all you need is to count bits in the AND product vector.
Lets look at the pseudo code for this:

LOOP (until we have bit-vectors)
{; // read BLOB
   bvect* pbv = new bvect(BM_GAP); // GROUP vect
   // deserialize BLOB
   bm::deserialize(*pbv, buffer); 

   *pbv &= bv_query; // GROUP BY AND 
   if (pbv->count() >= 100 )// HAVING
              // process query results
   delete pbv;


Algorithmic complexity of this approach is directly proportional to number on GROUP BY vectors we have to read , deserialize and AND. The good news is that there is no loop iteration carried dependencies, so this algorithm can be easily parallelized – all deserializations, ANDs and bit-counting can be done in parallel. For dense bit-vectors BitMagic will use vector operations with SSE2, SSE4.2 to use wide-parallel hardware acceleration.  So this algorithm is multi-core friendly and should scale well given that IO subsystem is sufficiently capable delivering new the raw data. 

On a well balanced system we will be able to utilize a large number of CPUs / Cores and solve this problem fast. 

Tuesday, March 23, 2010

BitMagic Library v.3.6.4 released

Release 3.6.4 significantly (up to x2 times) improves performance of deserialization.
This round of optimization should make all bit-vector database IO operations more efficient.

Thursday, March 18, 2010

AMD CodeAnalyst Performance Analyzer for Windows®

Sharing some experience with AMD CodeAnalyst profiler.

This profiler from AMD is free, simple, lightweight and implemented as a Visual Studio plugin (apparently there is Linux version, but I did not try it yet). 
Default profiling mode is time-sampling. 

I took a  few screenshots:

From Blog

From Blog

From Blog

Pros and Cons:

1. Easy to use, you can start your session in a few minutes (+)
2. Hot-spot detection is easy, chart based, results are generally believable (+)
3. Profiler is lean and works fast, profiled program is not slowed down. (+)
3. AMD CodeAnalyst is free (+)
4. Different screens (chart and data) sometimes give contradicting results, which is a bit confusing. (-)
    In the screenshots above 1 and 3 indicate different proportion in hot-spot data (maybe I don't understand something or bug in visualization).
5. I failed to get call stack for hot-spots even though configuration promised to collect it. (-)
    This can be a big issue for large projects with multiple code paths.
    Similar tools like Intel Parallel Studio can show call stacks and allows easy call stack analysis.

Saturday, January 9, 2010

BitMagic library version 3.6.3 released

This release fixes critical bug in 64-bit mode (#define BM64OPT) and improves performance of bit-counting related operations.

Parallel superscalar bit-counting

While working on version 3.6.3 collective Internet wisdom brought yet another interesting bit-counting algorithm, which turns to be efficient for 64-bit systems with superscalar execution (a form of parallelizm, where linear code with data-independent instructions is actually executed in parallel).

This code below implements population count for 4 64-bit integers. (Tested with GCC 4.4.1 64-bit Linux). This becomes part of BM 3.6.3 (when SourceForge recovers from its (I hope temporary) down).

unsigned bitcount64_4way(bm::id64_t x, bm::id64_t y,
                    bm::id64_t u, bm::id64_t v)
    const bm::id64_t m1 = 0x5555555555555555;
    const bm::id64_t m2 = 0x3333333333333333;
    const bm::id64_t m3 = 0x0F0F0F0F0F0F0F0F;
    const bm::id64_t m4 = 0x000000FF000000FF;

    x = x - ((x >> 1) & m1);
    y = y - ((y >> 1) & m1);
    u = u - ((u >> 1) & m1);
    v = v - ((v >> 1) & m1);
    x = (x & m2) + ((x >> 2) & m2);
    y = (y & m2) + ((y >> 2) & m2);
    u = (u & m2) + ((u >> 2) & m2);
    v = (v & m2) + ((v >> 2) & m2);
    x = x + y;
    u = u + v;
    x = (x & m3) + ((x >> 4) & m3);
    u = (u & m3) + ((u >> 4) & m3);
    x = x + u;
    x = x + (x >> 8);
    x = x + (x >> 16);
    x = x & m4;
    x = x + (x >> 32);
    return x & 0x000001FF;