- Performance optimization count_and(), bvector destruction, etc

- Simplified template arguments for bvector<>

BitMagic is C++ library designed and developed to implement efficient platform independent bitsets. BitMagic is Open Source, free library. You can use this software in any commercial or non-commercial projects, free of any charge. Though we encorage you to let us know about your projects, and applications of BitMagic. Please mention the authors in any work derived from this project.

- Fixed logical bug in deserialization (multiple deserialization into 1 bit-vector with incorrect OR)

- Performance optimization count_and(), bvector destruction, etc

- Simplified template arguments for bvector<>

- Performance optimization count_and(), bvector destruction, etc

- Simplified template arguments for bvector<>

In this post I wanted to discuss a design pattern for implementing GROUP BY -- HAVING requests, common in data-mining application.

Problem description:

Let’s say we have a fact table with multiple fields, all indexed using bit-vector based inverted lists.

Our goal is to run a query which best to describe in pseudo SQL :

SELECT fact1, count(*) FROM FactsTable

WHERE fact1 > ‘value’ AND fact2 BETWEEN 10 and 100 OR fact 3 = ‘valueX’

GROUP BY fact1

HAVING COUNT(*) > 100

The proposed “query plan” for this type of query is a two step process.

Our goal is to run a query which best to describe in pseudo SQL :

SELECT fact1, count(*) FROM FactsTable

WHERE fact1 > ‘value’ AND fact2 BETWEEN 10 and 100 OR fact 3 = ‘valueX’

GROUP BY fact1

HAVING COUNT(*) > 100

The proposed “query plan” for this type of query is a two step process.

All we need to do now is to iteratively read all possible bit-vectors corresponding to fact1. Number of vectors here depends on the total number of variants (facts) in the fact1 field (in other words we are looking at fact1 cardinality). It can be quite large, but many intermediate results have a good chance to be filtered out with a HAVING criteria.

To perform GROUP BY – HAVING we propose to execute logical operation AND between the query vector and all fact1 vectors. HAVING is trivial – all you need is to count bits in the AND product vector.

Lets look at the pseudo code for this:

LOOP (until we have bit-vectors)

{

bv_file.read(buffer); // read BLOB

bvect* pbv = new bvect(BM_GAP); // GROUP vect

// deserialize BLOB

// deserialize BLOB

bm::deserialize(*pbv, buffer);

*pbv &= bv_query; // GROUP BY AND

if (pbv->count() >= 100 )// HAVING

{

// process query results

}

delete pbv;

}

Algorithmic complexity of this approach is directly proportional to number on GROUP BY vectors we have to read , deserialize and AND. The good news is that there is no loop iteration carried dependencies, so this algorithm can be easily parallelized – all deserializations, ANDs and bit-counting can be done in parallel. For dense bit-vectors BitMagic will use vector operations with SSE2, SSE4.2 to use wide-parallel hardware acceleration. So this algorithm is multi-core friendly and should scale well given that IO subsystem is sufficiently capable delivering new the raw data.

On a well balanced system we will be able to utilize a large number of CPUs / Cores and solve this problem fast.

Release 3.6.4 significantly (up to x2 times) improves performance of deserialization.

This round of optimization should make all bit-vector database IO operations more efficient.

This round of optimization should make all bit-vector database IO operations more efficient.

Sharing some experience with AMD CodeAnalyst profiler.

This profiler from AMD is free, simple, lightweight and implemented as a Visual Studio plugin (apparently there is Linux version, but I did not try it yet).

Default profiling mode is time-sampling.

I took a few screenshots:

From Blog |

From Blog |

From Blog |

Pros and Cons:

1. Easy to use, you can start your session in a few minutes (+)

2. Hot-spot detection is easy, chart based, results are generally believable (+)

3. Profiler is lean and works fast, profiled program is not slowed down. (+)

3. AMD CodeAnalyst is free (+)

4. Different screens (chart and data) sometimes give contradicting results, which is a bit confusing. (-)

In the screenshots above 1 and 3 indicate different proportion in hot-spot data (maybe I don't understand something or bug in visualization).

5. I failed to get call stack for hot-spots even though configuration promised to collect it. (-)

This can be a big issue for large projects with multiple code paths.

Similar tools like Intel Parallel Studio can show call stacks and allows easy call stack analysis.

This release fixes critical bug in 64-bit mode (#define BM64OPT) and improves performance of bit-counting related operations.

While working on version 3.6.3 collective Internet wisdom brought yet another interesting bit-counting algorithm, which turns to be efficient for 64-bit systems with superscalar execution (a form of parallelizm, where linear code with data-independent instructions is actually executed in parallel).

This code below implements population count for 4 64-bit integers. (Tested with GCC 4.4.1 64-bit Linux). This becomes part of BM 3.6.3 (when SourceForge recovers from its (I hope temporary) down).

This code below implements population count for 4 64-bit integers. (Tested with GCC 4.4.1 64-bit Linux). This becomes part of BM 3.6.3 (when SourceForge recovers from its (I hope temporary) down).

inline

unsigned bitcount64_4way(bm::id64_t x, bm::id64_t y,

bm::id64_t u, bm::id64_t v)

{

const bm::id64_t m1 = 0x5555555555555555;

const bm::id64_t m2 = 0x3333333333333333;

const bm::id64_t m3 = 0x0F0F0F0F0F0F0F0F;

const bm::id64_t m4 = 0x000000FF000000FF;

x = x - ((x >> 1) & m1);

y = y - ((y >> 1) & m1);

u = u - ((u >> 1) & m1);

v = v - ((v >> 1) & m1);

x = (x & m2) + ((x >> 2) & m2);

y = (y & m2) + ((y >> 2) & m2);

u = (u & m2) + ((u >> 2) & m2);

v = (v & m2) + ((v >> 2) & m2);

x = x + y;

u = u + v;

x = (x & m3) + ((x >> 4) & m3);

u = (u & m3) + ((u >> 4) & m3);

x = x + u;

x = x + (x >> 8);

x = x + (x >> 16);

x = x & m4;

x = x + (x >> 32);

return x & 0x000001FF;

}

Subscribe to:
Posts (Atom)

- BitMagic C++ Library
- Bit-vector library designed and developed to implement efficient, platform independent bitsets. http://bmagic.sourceforge.net/