In this post I wanted to discuss a design pattern for implementing GROUP BY -- HAVING requests, common in data-mining application.
Problem description:
Let’s say we have a fact table with multiple fields, all indexed using bit-vector based inverted lists.
Our goal is to run a query which best to describe in pseudo SQL :
SELECT fact1, count(*) FROM FactsTable
WHERE fact1 > ‘value’ AND fact2 BETWEEN 10 and 100 OR fact 3 = ‘valueX’
GROUP BY fact1
HAVING COUNT(*) > 100
The proposed “query plan” for this type of query is a two step process.
Step 1. Program runs a WHERE clause, retrieving bit-vectors corresponding to various parts of the logical expression (fact1 > ‘value’, fact2 BETWEEN 10 and 100, fact 3 = ‘valueX’). When all vectors are in memory, series of ANDs and Ors gives us the result. This is a low complexity problem.
Step 2. Program got the bit-vector from Step 1 (Query Bit-Vector).
All we need to do now is to iteratively read all possible bit-vectors corresponding to fact1. Number of vectors here depends on the total number of variants (facts) in the fact1 field (in other words we are looking at fact1 cardinality). It can be quite large, but many intermediate results have a good chance to be filtered out with a HAVING criteria.
To perform GROUP BY – HAVING we propose to execute logical operation AND between the query vector and all fact1 vectors. HAVING is trivial – all you need is to count bits in the AND product vector.
Lets look at the pseudo code for this:
LOOP (until we have bit-vectors)
{
bv_file.read(buffer); // read BLOB
bvect* pbv = new bvect(BM_GAP); // GROUP vect
// deserialize BLOB
bm::deserialize(*pbv, buffer);
*pbv &= bv_query; // GROUP BY AND
if (pbv->count() >= 100 )// HAVING
{
// process query results
}
delete pbv;
}
Scalability?
Algorithmic complexity of this approach is directly proportional to number on GROUP BY vectors we have to read , deserialize and AND. The good news is that there is no loop iteration carried dependencies, so this algorithm can be easily parallelized – all deserializations, ANDs and bit-counting can be done in parallel. For dense bit-vectors BitMagic will use vector operations with SSE2, SSE4.2 to use wide-parallel hardware acceleration. So this algorithm is multi-core friendly and should scale well given that IO subsystem is sufficiently capable delivering new the raw data.
On a well balanced system we will be able to utilize a large number of CPUs / Cores and solve this problem fast.