tag:blogger.com,1999:blog-63389564003909198632024-03-13T19:35:06.099-07:00BitMagic Library BlogBitMagic is C++ library designed and developed to implement efficient platform independent bitsets.
BitMagic is Open Source, free library. You can use this software in any commercial or non-commercial projects, free of any charge. Though we encorage you to let us know about your projects, and applications of BitMagic.
Please mention the authors in any work derived from this project.BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.comBlogger27125tag:blogger.com,1999:blog-6338956400390919863.post-90299566190847450222010-04-25T13:35:00.000-07:002010-04-25T13:35:00.413-07:00BitMagic Library v.3.7.0 released- Fixed logical bug in deserialization (multiple deserialization into 1 bit-vector with incorrect OR)<br />
- Performance optimization count_and(), bvector destruction, etc<br />
- Simplified template arguments for bvector<><br />
<div><br />
</div>BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com2tag:blogger.com,1999:blog-6338956400390919863.post-79285169779927237092010-03-30T13:34:00.000-07:002010-03-30T13:37:17.569-07:00Use case: Implementing GROUP BY – HAVING with bit-vectors<div class="MsoNormal">In this post I wanted to discuss a design pattern for implementing GROUP BY -- HAVING requests, common in data-mining application.</div><div class="MsoNormal">Problem description:</div><div class="MsoNormal">Let’s say we have a fact table with multiple fields, all indexed using bit-vector based inverted lists.<br />
Our goal is to run a query which best to describe in pseudo SQL :<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> SELECT fact1, count(*) FROM FactsTable<br />
WHERE fact1 > ‘value’ AND fact2 BETWEEN 10 and 100 OR fact 3 = ‘valueX’ <br />
GROUP BY fact1<br />
HAVING COUNT(*) > 100</span> <br />
The proposed “query plan” for this type of query is a two step process.</div><div class="MsoNormal"><br />
</div><div class="MsoNormal"><b>Step 1</b>. Program runs a WHERE clause, retrieving bit-vectors corresponding to various parts of the logical expression (fact1 > ‘value’, fact2 BETWEEN 10 and 100, fact 3 = ‘valueX’). When all vectors are in memory, series of ANDs and Ors gives us the result. This is a low complexity problem.</div><div class="MsoNormal"><br />
</div><div class="MsoNormal"><b>Step 2</b>. Program got the bit-vector from Step 1 (Query Bit-Vector). <br />
All we need to do now is to iteratively read all possible bit-vectors corresponding to fact1. Number of vectors here depends on the total number of variants (facts) in the fact1 field (in other words we are looking at fact1 cardinality). It can be quite large, but many intermediate results have a good chance to be filtered out with a HAVING criteria.<br />
<br />
To perform GROUP BY – HAVING we propose to execute logical operation AND between the query vector and all fact1 vectors. HAVING is trivial – all you need is to count bits in the AND product vector.</div><div class="MsoNormal">Lets look at the pseudo code for this:</div><div class="MsoNormal"><br />
</div><div class="MsoNormal"></div><div class="MsoNormal" style="line-height: normal; margin-bottom: .0001pt; margin-bottom: 0in; mso-layout-grid-align: none; text-autospace: none;"><span style="font-family: 'Courier New'; font-size: 10pt;"><span style="color: blue;">LOOP (until we have bit-vectors)</span><o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: .0001pt; margin-bottom: 0in; mso-layout-grid-align: none; text-autospace: none;"><span style="font-family: 'Courier New'; font-size: 10pt;">{<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: .0001pt; margin-bottom: 0in; mso-layout-grid-align: none; text-autospace: none;"><span style="font-family: 'Courier New'; font-size: 10pt;"> <o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: .0001pt; margin-bottom: 0in; mso-layout-grid-align: none; text-autospace: none;"><span style="font-family: 'Courier New'; font-size: 10pt;"><span style="color: #010001;"> bv_file</span>.<span style="color: #010001;">read</span>(<span style="color: #010001;">buffer</span>); // read BLOB<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: .0001pt; margin-bottom: 0in; mso-layout-grid-align: none; text-autospace: none;"><span style="font-family: 'Courier New'; font-size: 10pt;"> <span style="color: #010001;">bvect</span>* <span style="color: #010001;">pbv</span> = <span style="color: blue;">new</span> <span style="color: #010001;">bvect</span>(<span style="color: #010001;">BM_GAP</span>); // GROUP vect<o:p></o:p></span><br />
<span style="font-family: 'Courier New'; font-size: 10pt;"> // deserialize BLOB</span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: .0001pt; margin-bottom: 0in; mso-layout-grid-align: none; text-autospace: none;"><span style="font-family: 'Courier New'; font-size: 10pt;"> <span style="color: #010001;">bm</span>::<span style="color: #010001;">deserialize</span>(*<span style="color: #010001;">pbv</span>, <span style="color: #010001;">buffer</span>); <o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: .0001pt; margin-bottom: 0in; mso-layout-grid-align: none; text-autospace: none;"><span class="Apple-style-span" style="font-family: 'Courier New'; font-size: 13px;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New'; font-size: 13px;"> *<span style="color: #010001;">pbv</span> &= <span style="color: #010001;">bv_query</span>; // GROUP BY AND </span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: .0001pt; margin-bottom: 0in; mso-layout-grid-align: none; text-autospace: none;"><span style="font-family: 'Courier New'; font-size: 10pt;"> <span style="color: blue;">if</span> (<span style="color: #010001;">pbv</span>-><span style="color: #010001;">count</span>() >= 100 )// HAVING <o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: .0001pt; margin-bottom: 0in; mso-layout-grid-align: none; text-autospace: none;"><span style="font-family: 'Courier New'; font-size: 10pt;"> {<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: .0001pt; margin-bottom: 0in; mso-layout-grid-align: none; text-autospace: none;"><span style="font-family: 'Courier New'; font-size: 10pt;"> // process query results<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: .0001pt; margin-bottom: 0in; mso-layout-grid-align: none; text-autospace: none;"><span style="font-family: 'Courier New'; font-size: 10pt;"> }<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: .0001pt; margin-bottom: 0in; mso-layout-grid-align: none; text-autospace: none;"><span style="font-family: 'Courier New'; font-size: 10pt;"> <span style="color: blue;">delete</span> <span style="color: #010001;">pbv</span>;<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: .0001pt; margin-bottom: 0in; mso-layout-grid-align: none; text-autospace: none;"><span class="Apple-style-span" style="font-family: 'Courier New'; font-size: 13px; line-height: 14px;">}</span></div><div class="MsoNormal"><span style="font-family: 'Courier New'; font-size: 10pt; line-height: 115%;"><br />
</span></div><div class="MsoNormal"><span style="font-family: 'Courier New'; font-size: 10pt; line-height: 115%;"><br />
</span></div><div class="MsoNormal"><span style="font-family: 'Courier New'; font-size: 10pt; line-height: 115%;"></span></div><div class="MsoNormal"><b>Scalability?</b></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Algorithmic complexity of this approach is directly proportional to number on GROUP BY vectors we have to read , deserialize and AND. The good news is that there is no loop iteration carried dependencies, so this algorithm can be easily parallelized – all deserializations, ANDs and bit-counting can be done in parallel. For dense bit-vectors BitMagic will use vector operations with SSE2, SSE4.2 to use wide-parallel hardware acceleration. So this algorithm is multi-core friendly and should scale well given that IO subsystem is sufficiently capable delivering new the raw data. </div><div class="MsoNormal"><br />
</div><div class="MsoNormal">On a well balanced system we will be able to utilize a large number of CPUs / Cores and solve this problem fast. </div><br />
<div class="MsoNormal"><span style="font-family: 'Courier New'; font-size: 10pt; line-height: 115%;"><br />
</span></div>BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com1tag:blogger.com,1999:blog-6338956400390919863.post-44539554299593501692010-03-23T18:57:00.000-07:002010-03-23T18:57:19.815-07:00BitMagic Library v.3.6.4 released<span class="Apple-style-span" style="color: #5e5e5e; font-family: arial, helvetica, clean, sans-serif; font-size: 13px; line-height: 16px;">Release 3.6.4 significantly (up to x2 times) improves performance of deserialization.</span><br />
<span class="Apple-style-span" style="color: #5e5e5e; font-family: arial, helvetica, clean, sans-serif; font-size: small;"><span class="Apple-style-span" style="font-size: 13px; line-height: 16px;">This round of optimization should make all bit-vector database IO operations more efficient.</span></span>BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com0tag:blogger.com,1999:blog-6338956400390919863.post-70931501217762605872010-03-18T17:21:00.000-07:002010-03-18T17:21:46.162-07:00AMD CodeAnalyst Performance Analyzer for Windows®<span class="Apple-style-span" style="color: #666666; font-family: Verdana, Helvetica, Arial, sans-serif; font-size: x-large;"><span class="Apple-style-span" style="font-size: 19px;"><b><div><span class="Apple-style-span" style="color: black; font-family: 'Times New Roman';"><span class="Apple-style-span" style="font-size: medium; font-weight: normal;">Sharing some experience with AMD CodeAnalyst profiler.</span></span></div><div><span class="Apple-style-span" style="color: black; font-family: 'Times New Roman';"><span class="Apple-style-span" style="font-size: medium; font-weight: normal;"><br />
</span></span></div><div><span class="Apple-style-span" style="color: black; font-family: 'Times New Roman';"><span class="Apple-style-span" style="font-size: medium; font-weight: normal;">This profiler from AMD is free, simple, lightweight and implemented as a Visual Studio plugin (apparently there is Linux version, but I did not try it yet). </span></span></div><div><span class="Apple-style-span" style="color: black; font-family: 'Times New Roman';"><span class="Apple-style-span" style="font-size: medium; font-weight: normal;">Default profiling mode is time-sampling. </span></span></div><div><span class="Apple-style-span" style="color: black; font-family: 'Times New Roman';"><span class="Apple-style-span" style="font-size: medium; font-weight: normal;"><br />
</span></span></div><div><span class="Apple-style-span" style="color: black; font-family: 'Times New Roman';"><span class="Apple-style-span" style="font-size: medium; font-weight: normal;">I took a few screenshots:</span></span></div><div><span class="Apple-style-span" style="color: black; font-family: 'Times New Roman';"><span class="Apple-style-span" style="font-size: medium; font-weight: normal;"><br />
</span></span></div><div><span class="Apple-style-span" style="font-weight: normal;"><b><div style="display: inline !important;"><span class="Apple-style-span" style="color: black; font-family: 'Times New Roman'; font-size: medium; font-weight: normal;"><table style="width: auto;"><tbody>
<tr><td><a href="http://picasaweb.google.com/lh/photo/hyr74odrF1VVKU2giPzZPg?feat=embedwebsite"><img src="http://lh6.ggpht.com/_BFw7TLV64do/S6K-PlVRuzI/AAAAAAAABIs/AP0h4xpwnKw/s400/amd-profiler1.png" /></a></td></tr>
<tr><td style="font-family: arial,sans-serif; font-size: 11px; text-align: right;">From <a href="http://picasaweb.google.com/anatoliy.kuznetsov/Blog?feat=embedwebsite">Blog</a></td></tr>
</tbody></table></span></div></b></span></div></b></span></span><br />
<br />
<br />
<table style="width: auto;"><tbody>
<tr><td><a href="http://picasaweb.google.com/lh/photo/wQzPsv4GAHbx_vo2ooohmQ?feat=embedwebsite"><img src="http://lh6.ggpht.com/_BFw7TLV64do/S6K-PxEoZuI/AAAAAAAABIw/xxCe_QKjHns/s400/amd-profiler2.png" /></a></td></tr>
<tr><td style="font-family: arial,sans-serif; font-size: 11px; text-align: right;">From <a href="http://picasaweb.google.com/anatoliy.kuznetsov/Blog?feat=embedwebsite">Blog</a></td></tr>
</tbody></table><br />
<br />
<br />
<br />
<table style="width: auto;"><tbody>
<tr><td><a href="http://picasaweb.google.com/lh/photo/JihDIMSkICBZvKdilLlV_w?feat=embedwebsite"><img src="http://lh6.ggpht.com/_BFw7TLV64do/S6K-QCoXYRI/AAAAAAAABI0/nIHqEVMyF9g/s400/amd-profiler3.png" /></a></td></tr>
<tr><td style="font-family: arial,sans-serif; font-size: 11px; text-align: right;">From <a href="http://picasaweb.google.com/anatoliy.kuznetsov/Blog?feat=embedwebsite">Blog</a></td></tr>
</tbody></table><br />
<br />
Pros and Cons:<br />
<br />
1. Easy to use, you can start your session in a few minutes (+)<br />
2. Hot-spot detection is easy, chart based, results are generally believable (+)<br />
3. Profiler is lean and works fast, profiled program is not slowed down. (+)<br />
3. AMD CodeAnalyst is free (+)<br />
4. Different screens (chart and data) sometimes give contradicting results, which is a bit confusing. (-)<br />
In the screenshots above 1 and 3 indicate different proportion in hot-spot data (maybe I don't understand something or bug in visualization).<br />
5. I failed to get call stack for hot-spots even though configuration promised to collect it. (-)<br />
This can be a big issue for large projects with multiple code paths. <br />
Similar tools like Intel Parallel Studio can show call stacks and allows easy call stack analysis.<br />
<div><br />
</div><div><a href="http://developer.amd.com/cpu/CodeAnalyst/codeanalystwindows/Pages/default.aspx">http://developer.amd.com/cpu/CodeAnalyst/codeanalystwindows/Pages/default.aspx</a></div><div><br />
</div>BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com0tag:blogger.com,1999:blog-6338956400390919863.post-19543809769644142032010-01-11T14:49:00.000-08:002010-01-11T14:50:02.226-08:00BitMagic included in the Softpedia Linux software database <a href="http://linux.softpedia.com/get/Programming/Libraries/BitMagic-library-53513.shtml">http://linux.softpedia.com/get/Programming/Libraries/BitMagic-library-53513.shtml</a><br />
<br />
<div class="separator" style="clear: both; text-align: left;"><a href="http://www.softpedia.com/base_img/softpedia_logo4a_snow.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://www.softpedia.com/base_img/softpedia_logo4a_snow.gif" /></a><br />
</div>BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com0tag:blogger.com,1999:blog-6338956400390919863.post-6730275261798773692010-01-09T08:15:00.001-08:002010-01-09T08:15:21.126-08:00BitMagic library version 3.6.3 releasedThis release fixes critical bug in 64-bit mode (#define BM64OPT) and improves performance of bit-counting related operations.BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com0tag:blogger.com,1999:blog-6338956400390919863.post-48488699155505790242010-01-09T08:06:00.000-08:002010-01-09T08:06:17.370-08:00Parallel superscalar bit-countingWhile working on version 3.6.3 collective Internet wisdom brought yet another interesting bit-counting algorithm, which turns to be efficient for 64-bit systems with superscalar execution (a form of parallelizm, where linear code with data-independent instructions is actually executed in parallel).<br />
<br />
This code below implements population count for 4 64-bit integers. (Tested with GCC 4.4.1 64-bit Linux). This becomes part of BM 3.6.3 (when SourceForge recovers from its (I hope temporary) down).<br />
<br />
<blockquote><span style="font-family: "Courier New",Courier,monospace;">inline </span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;">unsigned bitcount64_4way(bm::id64_t x, bm::id64_t y, </span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> bm::id64_t u, bm::id64_t v)</span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;">{</span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> const bm::id64_t m1 = 0x5555555555555555;</span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> const bm::id64_t m2 = 0x3333333333333333; </span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> const bm::id64_t m3 = 0x0F0F0F0F0F0F0F0F; </span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> const bm::id64_t m4 = 0x000000FF000000FF;</span><br style="font-family: "Courier New",Courier,monospace;" /><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> x = x - ((x >> 1) & m1);</span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> y = y - ((y >> 1) & m1);</span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> u = u - ((u >> 1) & m1);</span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> v = v - ((v >> 1) & m1);</span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> x = (x & m2) + ((x >> 2) & m2);</span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> y = (y & m2) + ((y >> 2) & m2);</span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> u = (u & m2) + ((u >> 2) & m2);</span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> v = (v & m2) + ((v >> 2) & m2);</span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> x = x + y; </span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> u = u + v; </span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> x = (x & m3) + ((x >> 4) & m3);</span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> u = (u & m3) + ((u >> 4) & m3);</span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> x = x + u; </span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> x = x + (x >> 8);</span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> x = x + (x >> 16);</span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> x = x & m4; </span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> x = x + (x >> 32);</span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;"> return x & 0x000001FF;</span><br style="font-family: "Courier New",Courier,monospace;" /><span style="font-family: "Courier New",Courier,monospace;">}</span><br style="font-family: "Courier New",Courier,monospace;" /></blockquote>BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com0tag:blogger.com,1999:blog-6338956400390919863.post-73829051567322812702010-01-09T07:09:00.001-08:002010-01-09T07:54:04.722-08:00SourceForge down :-(<table style="width:auto;"><tr><td><a href="http://picasaweb.google.com/lh/photo/Mi4WHl5rpzl2gzBJTlDUBg?feat=embedwebsite"><img src="http://lh5.ggpht.com/_BFw7TLV64do/S0ibaQA3PEI/AAAAAAAABBY/2IR2rqOlUVY/s288/sf_down.jpg" /></a></td></tr>
</table>BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com0tag:blogger.com,1999:blog-6338956400390919863.post-43942144563223034952010-01-07T11:22:00.000-08:002010-01-07T11:24:15.094-08:00Python wrapper for BitMagic is in the works<b>Author:</b> William Waites <wwaites_at_gmail com=""></wwaites_at_gmail><br />
<a href="http://pypi.python.org/pypi/bitmagic">http://pypi.python.org/pypi/bitmagic</a><br />
<br />
<div style="text-align: left;"><a href="http://www.python.org/images/python-logo.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://www.python.org/images/python-logo.gif" /></a></div><div class="separator" style="clear: both; text-align: center;"><br />
</div>BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com0tag:blogger.com,1999:blog-6338956400390919863.post-82210226391345023212010-01-07T11:19:00.000-08:002010-01-07T11:19:28.808-08:00BitMagic library version 3.6.2 -- brokenBad news, version 3.6.2 confirmed to be broken for 64-bit builds. Urgent patch is coming...BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com0tag:blogger.com,1999:blog-6338956400390919863.post-89425051127110631062009-12-19T17:02:00.000-08:002010-01-07T11:17:23.152-08:00BitMagic library version 3.6.2 releasedNew features:<br />
- added configurations for MacOS-X<br />
- optimizations for SSE 4.2 in 64-bit mode<br />
<br />
now it is fine to use: <br />
#define BM64OPT<br />
#define BMSSE42OPT<br />
<br />
This two options produce SIMD vectorized code for SSE 4.2 which uses 64-bit aware bit-counting<br />
<br />
- new algorithm for picking random subset out of bvector<><br />
see <a href="http://bmagic.sourceforge.net/doxy/a00010.html">sample10.cpp</a>BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com0tag:blogger.com,1999:blog-6338956400390919863.post-14256543094622852512009-12-16T18:54:00.000-08:002009-12-16T18:59:37.117-08:00SSE2 vs CUDA in parallel bit stream transposition / similarity. Part 2.<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bmagic.sourceforge.net/cuda-power.jpg"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 480px; height: 340px;" src="http://bmagic.sourceforge.net/cuda-power.jpg" border="0" alt="" /></a><br />We made a second attempt to work with CUDA to better understand if it looks like a suitable technology for database and data-mining acceleration.<br />With a very valuable help from our readers we prepared a second version of GPU benchmark and tried to analyze performance and power consumption.<br /><br /><a href="http://bmagic.sourceforge.net/bmcudasse2-2.html">CUDA vs SSE. Part 2.</a>BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com0tag:blogger.com,1999:blog-6338956400390919863.post-57891043571988210322009-12-11T16:51:00.000-08:002009-12-11T17:44:04.065-08:00Win7 vs Ubuntu 9.10 (32-bit vs 64-bit)This benchmark is designed to illustrate platform-compiler choice for BitMagic derived computational algorithms.<br /><br /><span style="font-weight:bold;">Test total numbers (shorter - faster)</span><br /><br /><table style="width:auto;"><tr><td><a href="http://picasaweb.google.com/lh/photo/soSzxYlb-UvKnHDSG_DWpA?feat=embedwebsite"><img src="http://lh4.ggpht.com/_BFw7TLV64do/SyLyvRNQVRI/AAAAAAAAA64/Ifm7HMv3h30/s800/bm-test-total.png" /></a></td></tr><tr><td style="font-family:arial,sans-serif; font-size:11px; text-align:right">From <a href="http://picasaweb.google.com/anatoliy.kuznetsov/Blog?feat=embedwebsite">Blog</a></td></tr></table><br /><br />Test Platform: Nehalem Core i5, 2.67GHz (Turbo Boost enabled)<br />BitMagic Library 3.6.2 (not yet released)<br /><br />Windows 7 32-bit used default 32-bit compile using MSVC8. SSE4 optimization disabled.<br /><br />Ubuntu 9.10 64-bit GCC 4.4.1 with compile options ( -march=core2 -m64 -msse4 )<br /><br />BitMagic 64-bit optimizations + SSE4 optimizations:<br />#define BMSSE42OPT<br />#define BM64OPT<br /><br />(this combination will be supported in BitMagic Library 3.6.2).<br /><br /><span style="font-weight:bold;">All measurements in seconds (shorter - faster).</span><br /><br /><table style="width:auto;"><tr><td><a href="http://picasaweb.google.com/lh/photo/eThYbmv1Vhh5vOn2r0Ez-Q?feat=embedwebsite"><img src="http://lh3.ggpht.com/_BFw7TLV64do/SyLpbzZhsMI/AAAAAAAAA6s/hCkwDvL-JK0/s400/bm-chart.png" /></a></td></tr><tr><td style="font-family:arial,sans-serif; font-size:11px; text-align:right">From <a href="http://picasaweb.google.com/anatoliy.kuznetsov/Blog?feat=embedwebsite">Blog</a></td></tr></table><br />As you can see, 32-bit default compile using stock compiler looses in almost every test to 64-bit SSE4.2 version, tuned for the platform. <br /><br />Yes, this experiment intentionally staged for Windows7+MSVC to loose (by using unfavorable compilation and optimization settings: stock x86 32-bit software suffers from register starvation, ofter ignores presence of SIMD SSE unit. Having said that such resource under-utilization often happens in real-life and it is often ignored as insignificant. From this data we can see that sometimes insignificant things can add-up to measurable values.BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com0tag:blogger.com,1999:blog-6338956400390919863.post-91635214947288777352009-11-22T08:19:00.000-08:002009-11-22T08:27:24.367-08:00BitMagic library version 3.6.1 released<ul><li>Elias Gamma decoding optimizations significantly improved bit-vector de-serialization performance. Compressed bit-vectors now save disk space and IO bandwidth without big CPU costs</li><li>Fixed stack overrun introduced in 3.6.0</li><li>First version of SSE 4.2 optimizations for Nehalem (Core i5, i7, etc). Tested with MSVC 2008 and GCC 4.4.1. Use BMSSE42OPT to turn on SSE4.2 support in the code.<br /></li></ul>BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com0tag:blogger.com,1999:blog-6338956400390919863.post-6522167490192861392009-11-13T22:14:00.000-08:002009-11-13T22:17:43.303-08:0064-bit portability checklistUseful article from developers of <span style="font-weight:bold;">PVS Studio</span> tool on porting to 64-bit systems. <br /><br /><a href="http://www.viva64.com/content/articles/64-bit-development/?f=20_issues_of_porting_C++_code_on_the_64-bit_platform.html&lang=en&content=64-bit-development">20 issues of porting C++ code on the 64-bit platform</a>BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com0tag:blogger.com,1999:blog-6338956400390919863.post-69522502553113033322009-11-13T21:50:00.000-08:002009-11-13T21:58:09.495-08:00SSE4.2 readyness. GCC vs MSVC. Deathmatch.GCC 4.3.2 (cygwin) vs MS VC++ 2008 (Windows 7).<br /><br />First results porting BitMagic Library to Intel Nehalem SSE4.2 instruction set. :-)<br /><br /><blockquote><br /><pre><br /><br />1>Generating code<br />1>c:\dev\bm\src\bmsse4.h(140) : fatal error C1001: An internal error has occurred in the compiler.<br />1>(compiler file 'f:\dd\vctools\compiler\utc\src\p2\main.c[0x60375DBB:0x00000000]', line 182)<br />1> To work around this problem, try simplifying or changing the program near the locations listed above.<br />1>Please choose the Technical Support command on the Visual C++ <br />1> Help menu, or open the Technical Support help file for more information<br />1>LINK : fatal error LNK1000: Internal error during IMAGE::BuildImage<br />1> Version 9.00.30729.01<br />1> ExceptionCode = C0000005<br />1> ExceptionFlags = 00000000<br />1> ExceptionAddress = 60375DBB (60300000) "C:\Program Files\Microsoft Visual Studio 9.0\VC\bin\c2.dll"<br />1> NumberParameters = 00000002<br />1> ExceptionInformation[ 0] = 00000000<br />1> ExceptionInformation[ 1] = 00000000<br />1>CONTEXT:<br />1> Eax = 031F666C Esp = 0034EC78<br />1> Ebx = 00000000 Ebp = 0034ECA4<br />1> Ecx = 00000000 Esi = 03214298<br />1> Edx = 00000000 Edi = 03201608<br />1> Eip = 60375DBB EFlags = 00010246<br />1> SegCs = 0000001B SegDs = 00000023<br />1> SegSs = 00000023 SegEs = 00000023<br />1> SegFs = 0000003B SegGs = 00000000<br />1> Dr0 = 00000000 Dr3 = 00000000<br />1> Dr1 = 00000000 Dr6 = 00000000<br />1> Dr2 = 00000000 Dr7 = 00000000<br />1>Build log was saved at "file://c:\dev\bm\bmperf\Release\BuildLog.htm"<br />1>bmperf - 2 error(s), 5 warning(s)<br /><br /><br /><br /><br />$ make<br />======= Building : BitMagic Library Performance Test cygwin GNU_CC<br />< perf.cpp ><br />g++ -msse4 -march=core2 -Wall -c -D_REENTRANT -DCYGWIN_NT -D_GNU_SOURCE -Wno-d<br />eprecated -I/cygdrive/c/dev/bm/src perf.cpp -o /cygdrive/c/dev/bm/tests/perf<br />/Release/perf.o -g0 -O2 -march=core2 -fomit-frame-pointer -pipe<br />perf.cpp: In function 'void EnumeratorTest()':<br />perf.cpp:497: internal compiler error: in memory_address_length, at config/i386/<br />i386.c:16380<br />Please submit a full bug report,<br />with preprocessed source if appropriate.<br />See <http://gcc.gnu.org/bugs.html> for instructions.<br />make: *** [/cygdrive/c/dev/bm/tests/perf/Release/perf.o] Error 1<br /><br /><br /></pre><br /></blockquote>BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com1tag:blogger.com,1999:blog-6338956400390919863.post-91574988562929785362009-10-19T15:41:00.000-07:002009-10-19T15:44:27.509-07:00Notes on Elias Gamma encoding in BitMagic 3.6.0Version 3.6.0 implements a variant of Elias Gamma Encoding as part of bit-vector serialization. <br />Wikipedia has an article on this encoding here:<br /><a href="http://en.wikipedia.org/wiki/Elias_gamma_coding">http://en.wikipedia.org/wiki/Elias_gamma_coding</a><br /><br />The description is so good that one should be able to implement this encoding after reading a couple of pages.<br /><br />Older versions of BM library used D-Gap compression which is a variant of run length encoding. Or used plain lists of integers to encode very sparse sets. All this compression methods used word aligned algorithms and rely on presence of long runs of identical bits. Word aligned compression is also very fast. <br /><br />A lot of practical tests demonstrated that BM based search systems are bandwidth limited. It means that CPU is starved, waiting for memory operations or disk retrieval. Disk retrieval is particularly important because IO and network subsystems never followed terms of Moore Law. Non-Word-Aligned encoding seems to a good answer to this problem.<br /><br />Elias Gamma encoding developed by Peter Elias postulates bit-encoding of positive (non-zero) integers . Positive non-zero integers are essentially the native format for BM library to store sparse bit-vectors or inverted lists (basic building blocks of efficient bitmap indexes). <br /><br /><span style="font-weight:bold;">Advantages:</span><br />1.Simple implementation, requires no compression dictionary. It is easy to compress just a few ints without constructing and storing a dictionary. BM library works with independent blocks of bits (potentially computed in parallel).<br />2.Implied probability favors small integers – good for BM library since it fragments bit-vector into blocks of 65K bits. Methods allows some blocks to be encoded, and some to remain in GAP or raw bit-block format (if statistical distribution of bits suddenly changes).<br />3.Compact implementation reduces code bloat, fits into a C++ template library paradigm.<br /><br /><span style="font-weight:bold;">Disadvantages:</span><br />1.True adaptive statistical methods (like arithmetic compression) can(should?) give better compression ratio.<br />2.Encoding and decoding method is sequential, it seems hard to implement a parallel encoder/decoder using SIMD, SSE or GP-GPU. (BM blocks can potentially be decoded in parallel so maybe there is a way to use parallel methods for this).BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com0tag:blogger.com,1999:blog-6338956400390919863.post-56616830227324176412009-10-17T15:43:00.000-07:002009-10-17T15:47:45.495-07:00BitMagic Library v.3.6.0 released<li>improved bit-vector serialization with bitset compression. <br />New serialization adds Gamma encoding on top of D-Gap compression. This should <br />reduce IO for databases and search systems using serialized bit-vectors.<br /><li>fixed functionality bug in deserialization with logical AND <br /><li>new serialization API (old API kept for backward compatibility)BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com0tag:blogger.com,1999:blog-6338956400390919863.post-13379795943814556972009-08-22T10:22:00.000-07:002009-08-22T10:28:26.186-07:00BitMagic version 3.5.4 releasedThe scope of this release is general code cleanup (only partially successful) and SSE2 for GCC compatibility fixes.<br /><br />GCC supports SSE2 intrinsics, so this port was relatively easy. Practically the only change needed was correct syntax for 16-byte alignment declaration. SSE2 optimizations are now available for Intel C++, GCC and MSVC. <br /><br />Download: <a href="https://sourceforge.net/projects/bmagic/files/">https://sourceforge.net/projects/bmagic/files/</a>BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com0tag:blogger.com,1999:blog-6338956400390919863.post-89952631783282129102009-08-17T18:32:00.000-07:002009-08-17T18:43:51.779-07:00CUDA vs SSEWe all know that GPUs can be incredibly fast at some single precision floating point algorithms.<br />Developers of BitMagic Library Igor Tolstoy and Anatoliy Kuznetsov made an attempt to implement parallel bit-stream transposition and similarity algorithm (a lot of bit-shifting, population counting and general purpose integer logic) to better understand CUDA and GP-GPU computing in comparison with current (and future) Intel vectorization architectures (SSE2). <br /><br />Benchmarking, CUDA and SSE source codes, some speculation about Larabee and implications for data-mining and large scale databases here:<br /><a href="http://bmagic.sourceforge.net/bmcudasse2.html">http://bmagic.sourceforge.net/bmcudasse2.html</a><br /><br /><u><u></u></u>BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com1tag:blogger.com,1999:blog-6338956400390919863.post-74774700101659317152009-08-12T05:51:00.000-07:002009-08-12T05:52:15.519-07:00DoxyDocumentationDoxygen documentation made available on the WEB:<br /><a href="http://bmagic.sourceforge.net/doxy/modules.html">http://bmagic.sourceforge.net/doxy/modules.html</a>BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com0tag:blogger.com,1999:blog-6338956400390919863.post-90392795616190651042009-08-09T13:10:00.000-07:002009-08-09T13:58:03.917-07:00Intel Atom: Best GCC settingsAfter several experiments with 32-bit GCC 4.3.3 on Intel Atom N270 here some recommendations for optimization parameters.<br /><span style="font-family:courier new;">-march=core2 -O2 -fomit-frame-pointer -pipe</span><br />Another good alternative:<br /><span style="font-family:courier new;">-mtune=native -O2 -fomit-frame-pointer -pipe</span><br /><br />In some BM benchmarks<span style="font-family:courier new;"> -march=core2</span> is up to 20% better than <span style="font-family:courier new;">-march=prescott</span><br />So it looks like, Intel Atom is more Core2 than Pentium4. For low power devices like Atom it looks very important to use right settings since platform balances on a very fine edge in terms of usability. Given the growing popularity of this CPU it would be nice to see binary builds of software products optimized for the target (or doing internal CPUID auto-detection) to load the right binaries.<br /><br />SSE2 integer optimization seems to get a nice additional boost too. Current version of BitMagic Library (3.5.3) s not compatible with GCCs syntax for variable alignment (one of the key elements for getting best memory bandwidth with SSE). Next version is going to offer better compatibility.BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com0tag:blogger.com,1999:blog-6338956400390919863.post-37758852119217297852009-08-08T10:02:00.000-07:002009-08-08T10:05:18.054-07:00BitMagic version 3.5.3 released<ul><li>improved performance of AND logical operations and derived DOT product algorithms (cound_and(), any_and())</li><li>improved compression of serialized bvectors (serialization format remains backward compatible)</li><li>fixed compatibility issues with MSVC8 and Intel C++ 11.1 (SSE2 optimization mode)</li><li>improved automatic variables disambiguation using 'restrict'</li></ul><br />Download: <a href="https://sourceforge.net/projects/bmagic/files/">https://sourceforge.net/projects/bmagic/files/</a>BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com0tag:blogger.com,1999:blog-6338956400390919863.post-64134807011169359752009-08-07T18:47:00.000-07:002009-08-07T20:45:54.414-07:00Search for Efficient Bitwise TraversalOne of the fundamental problems BitMagic library faces is optimization of bit traverse algorithms.<br />BitMagic implements a class <span style="font-family:courier new;">bvector<>::emunerator</span> to iterate 1 bits just as they are sequence of integers. Under the hood of the high level algorithm is a problem how to unpack a single integer value into a variable length array of 1 bit indexes.<br /><br />For example an bit string 10011 should translate into { 0, 3, 4 }.<br /><br />This problem is somewhat related to a famous POPCNT family of bit counting algorithms. POPCNT is beaten to death on the Internet, while there are not to many sources doing bit traversal (which looks very relevant too).<br /><br />It is hard to implement a fast and generic algorithm to behave equally well for words with high and low bit density. In this post we want to encourage others to try finding alternative methods, optimizing for 64-bit, SSE2, maybe GPU, CUDA, etc.<br /><br />Solution 1.<br />This algorithm uses 4-bit unrolled loop with a straight-forward check. This code is very compact and works well when both code and data fit into L1 CPU cache. Template parameter T represents integer word type, F is a functor-class to respond to found 1 bit.<br /><br /><blockquote><br /><pre><br />template < typename T, typename F > <br />void bit_for_each(T w, F& func)<br />{<br /> for (unsigned octet = 0; w != 0; w >>= 8, octet += 8)<br /> {<br /> if (w & 1) func(octet + 0);<br /> if (w & 2) func(octet + 1);<br /> if (w & 4) func(octet + 2);<br /> if (w & 8) func(octet + 3);<br /> if (w & 16) func(octet + 4);<br /> if (w & 32) func(octet + 5);<br /> if (w & 64) func(octet + 6);<br /> if (w & 128) func(octet + 7);<br /> } <br />}</pre><br /></blockquote><br /><br />Solution 2.<br />Second algorithm implements a hard coded switch table (C/C++-compiler most likely transforms this into a JUMP address table. As general purpose solution this algorithm looks faster than variant 1. It gives compiler a better compile time optimization plan.<br />The catch is that functor should support multiple arguments (up to4 in this case). All functor operators () must be inline, otherwise all performance benefits easily lost pushing parameters down to the stack. (This solution is inspired by template meta-programming articles, I believe this code can be written as a compact template meta program at the expense of clarity and maybe compatibility...)<br /><br /><blockquote><br /><pre><br />template < typename T, typename F > <br />void bit_for_each_4(T w, F& func)<br />{<br /> for (unsigned sub_octet = 0; w != 0; w >>= 4, sub_octet += 4)<br /> {<br /> switch (w & 15)<br /> {<br /> case 0: // 0000<br /> break;<br /> case 1: // 0001<br /> func(sub_octet);<br /> break;<br /> case 2: // 0010<br /> func(sub_octet + 1);<br /> break;<br /> case 3: // 0011<br /> func(sub_octet, sub_octet + 1);<br /> break;<br /> case 4: // 0100<br /> func(sub_octet + 2);<br /> break;<br /> case 5: // 0101<br /> func(sub_octet, sub_octet + 2);<br /> break;<br /> case 6: // 0110<br /> func(sub_octet + 1, sub_octet + 2);<br /> break;<br /> case 7: // 0111<br /> func(sub_octet, sub_octet + 1, sub_octet + 2);<br /> break;<br /> case 8: // 1000<br /> func(sub_octet + 3);<br /> break;<br /> case 9: // 1001<br /> func(sub_octet, sub_octet + 3);<br /> break;<br /> case 10: // 1010<br /> func(sub_octet + 1, sub_octet + 3);<br /> break;<br /> case 11: // 1011<br /> func(sub_octet, sub_octet + 1, sub_octet + 3);<br /> break;<br /> case 12: // 1100<br /> func(sub_octet + 2, sub_octet + 3);<br /> break;<br /> case 13: // 1101<br /> func(sub_octet, sub_octet + 2, sub_octet + 3);<br /> break;<br /> case 14: // 1110<br /> func(sub_octet + 1, sub_octet + 2, sub_octet + 3);<br /> break;<br /> case 15: // 1111<br /> func(sub_octet, sub_octet + 1, sub_octet + 2, sub_octet + 3);<br /> break;<br /> default:<br /> break;<br /> } <br /> } <br /></pre><br /></blockquote><br /><br />Again there should be a way to write an assembly code using BSF (bit scan forward), BSR (bit scan reverse). References on relevant samples are highly appreciated.BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com1tag:blogger.com,1999:blog-6338956400390919863.post-54609079554831881092009-08-05T15:52:00.000-07:002009-08-07T20:23:00.635-07:00Compilers test ICC vs MSVC vs GCCBitMagic benchmark test compiled on Windows XP under different compilers (GCC 3.4, MSVC 8 and Intel C++ 11). Tests here are synthetic, but represent very practical functions used in data-mining and database search algorithms. No SSE2 or 64-bit optimization, pure 32-bit mode, single thread. System is running on Core2 Duo. <br /><br />BitMagic individual tests (shorter - faster)<br /><table style="width:auto;"><tr><td><a href="http://picasaweb.google.com/lh/photo/TUc73VNHz-62xYyd4ICElw?feat=embedwebsite"><img src="http://lh5.ggpht.com/_BFw7TLV64do/SnoM7rt8lJI/AAAAAAAAA1I/v-6sHpA425g/s400/compilers_test.png" /></a></td></tr></table><br /><br />Overall results (shorter - faster)<br /><table style="width: auto;"><tbody><tr><td><a href="http://picasaweb.google.com/lh/photo/ul9hModOg39atTqxioLrPw?feat=embedwebsite"><img src="http://lh6.ggpht.com/_BFw7TLV64do/SnoM70zUhSI/AAAAAAAAA1M/zloCPwM8ZDM/s800/compiler_test_sum.png" /></a></td></tr></tbody></table><br /><br />Test results are relatively uniform (code is rather well tuned manually), it was a bit of surprise to see Intel C++ 11 slower in XOR count test (Humming distance). <br />Overall winner for integer algorithms - GNU C++.BitMagic C++ Libraryhttp://www.blogger.com/profile/08268313621390173726noreply@blogger.com1