Monday, August 17, 2009

CUDA vs SSE

We all know that GPUs can be incredibly fast at some single-precision floating-point algorithms.
Igor Tolstoy and Anatoliy Kuznetsov, the developers of the BitMagic Library, set out to implement a parallel bit-stream transposition and similarity algorithm (a lot of bit-shifting, population counting and general-purpose integer logic) to better understand CUDA and GPGPU computing in comparison with current (and future) Intel vectorization architectures (SSE2).

Benchmarks, CUDA and SSE source code, and some speculation about Larrabee and the implications for data mining and large-scale databases are here:
http://bmagic.sourceforge.net/bmcudasse2.html

1 comment:

  1. We are working on similar stuff. To count the number of set bits in a byte, we use a constant uchar array of 256 entries. Each entry in the array contains the number of set bits in the byte value that indexes it... That way, to calculate the number of set bits in a 32-bit number, all you need to do is index into this array 4 times (once per byte) and add the results.
    One can fetch this array through a texture, so it will be easily cached...

    "popcll" is actually a macro that does a whole lot of shifting, adding, multiplication, etc.
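
For readers who want to see the two approaches from the comment side by side, here is a minimal sketch in plain C (host code, not CUDA). The names `bit_count_table`, `popcount32_lut` and `popcount64_swar` are made up for illustration. The first function is the 256-entry table lookup the commenter describes. The second is the well-known shift, add and multiply bit trick, which is only an assumption about what a software population count of the kind the commenter mentions might look like; it is not taken from the CUDA headers or from the BitMagic sources.

```c
#include <stdint.h>

/* Build the 256-entry table at compile time: entry i holds the number
   of set bits in the byte value i. B2 covers 2-bit values, B4 covers
   4-bit values, B6 covers 6-bit values. */
#define B2(n) (n), (n) + 1, (n) + 1, (n) + 2
#define B4(n) B2(n), B2((n) + 1), B2((n) + 1), B2((n) + 2)
#define B6(n) B4(n), B4((n) + 1), B4((n) + 1), B4((n) + 2)

/* Hypothetical name; plays the role of the "constant uchar array of 256". */
const unsigned char bit_count_table[256] = {
    B6(0), B6(1), B6(1), B6(2)
};

/* Table-lookup population count: four lookups, one per byte, then add. */
unsigned popcount32_lut(uint32_t x)
{
    return (unsigned)bit_count_table[x & 0xFFu]
         + (unsigned)bit_count_table[(x >> 8) & 0xFFu]
         + (unsigned)bit_count_table[(x >> 16) & 0xFFu]
         + (unsigned)bit_count_table[(x >> 24) & 0xFFu];
}

/* Shift/add/multiply population count for 64-bit values (assumed
   illustration of a software popcount, not the actual CUDA code). */
unsigned popcount64_swar(uint64_t x)
{
    /* Count bits in each 2-bit field. */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* Sum adjacent 2-bit counts into 4-bit fields. */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* Sum adjacent 4-bit counts into 8-bit fields. */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* The multiply adds all eight byte counts into the top byte. */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}
```

For example, `popcount32_lut(0xF0F0F0F0u)` returns 16, and `popcount64_swar` agrees with the table version on any value that fits in 32 bits. The table version trades arithmetic for memory reads, which is why the commenter cares about how well the table is cached; the arithmetic version needs no memory access at all.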
