Friday, August 7, 2009

Search for Efficient Bitwise Traversal

One of the fundamental problems the BitMagic library faces is optimization of bit traversal algorithms.
BitMagic implements a class bvector<>::enumerator to iterate over 1 bits as if they were a sequence of integers. Under the hood of this high-level algorithm lies the problem of unpacking a single integer value into a variable-length array of 1-bit indexes.

For example, the bit string 10011 should translate into { 0, 1, 4 } (counting from the least significant bit, as the code below does).

This problem is somewhat related to the famous POPCNT family of bit-counting algorithms. POPCNT has been beaten to death on the Internet, while there are not too many sources covering bit traversal (which looks very relevant too).

It is hard to implement a fast, generic algorithm that behaves equally well for words with high and low bit density. In this post we want to encourage others to try finding alternative methods, optimizing for 64-bit, SSE2, maybe GPU, CUDA, etc.

Solution 1.
This algorithm uses an 8-bit unrolled loop with a straightforward check per bit. The code is very compact and works well when both code and data fit into the L1 CPU cache. Template parameter T represents the integer word type; F is a functor class invoked for each 1 bit found.



template < typename T, typename F >
void bit_for_each(T w, F& func)
{
    for (unsigned octet = 0; w != 0; w >>= 8, octet += 8)
    {
        if (w & 1)   func(octet + 0);
        if (w & 2)   func(octet + 1);
        if (w & 4)   func(octet + 2);
        if (w & 8)   func(octet + 3);
        if (w & 16)  func(octet + 4);
        if (w & 32)  func(octet + 5);
        if (w & 64)  func(octet + 6);
        if (w & 128) func(octet + 7);
    }
}



Solution 2.
The second algorithm implements a hard-coded switch table (a C/C++ compiler will most likely transform this into a jump address table). As a general-purpose solution this algorithm looks faster than variant 1: it gives the compiler a better compile-time optimization plan.
The catch is that the functor must support multiple arguments (up to 4 in this case). All functor operator() overloads must be inline, otherwise the performance benefits are easily lost pushing parameters onto the stack. (This solution is inspired by template meta-programming articles; I believe this code could be written as a compact template meta-program, at the expense of clarity and maybe compatibility...)



template < typename T, typename F >
void bit_for_each_4(T w, F& func)
{
    for (unsigned sub_octet = 0; w != 0; w >>= 4, sub_octet += 4)
    {
        switch (w & 15)
        {
        case 0: // 0000
            break;
        case 1: // 0001
            func(sub_octet);
            break;
        case 2: // 0010
            func(sub_octet + 1);
            break;
        case 3: // 0011
            func(sub_octet, sub_octet + 1);
            break;
        case 4: // 0100
            func(sub_octet + 2);
            break;
        case 5: // 0101
            func(sub_octet, sub_octet + 2);
            break;
        case 6: // 0110
            func(sub_octet + 1, sub_octet + 2);
            break;
        case 7: // 0111
            func(sub_octet, sub_octet + 1, sub_octet + 2);
            break;
        case 8: // 1000
            func(sub_octet + 3);
            break;
        case 9: // 1001
            func(sub_octet, sub_octet + 3);
            break;
        case 10: // 1010
            func(sub_octet + 1, sub_octet + 3);
            break;
        case 11: // 1011
            func(sub_octet, sub_octet + 1, sub_octet + 3);
            break;
        case 12: // 1100
            func(sub_octet + 2, sub_octet + 3);
            break;
        case 13: // 1101
            func(sub_octet, sub_octet + 2, sub_octet + 3);
            break;
        case 14: // 1110
            func(sub_octet + 1, sub_octet + 2, sub_octet + 3);
            break;
        case 15: // 1111
            func(sub_octet, sub_octet + 1, sub_octet + 2, sub_octet + 3);
            break;
        default:
            break;
        }
    }
}



Again, there should be a way to write assembly code using BSF (bit scan forward) or BSR (bit scan reverse). References to relevant samples are highly appreciated.

1 comment:

  1. I'm very curious about this too. Would love more ideas on it.
