Find First Set and Popcount purpose
Newsgroups: comp.arch
From: pmontgom@cwi.nl (Peter L. Montgomery)
Subject: Re: Find first bit and population count
Organization: CWI, Amsterdam
Date: Mon, 24 Jun 1996 07:44:08 GMT
In article <4qkih7$85p@agate.berkeley.edu> davids@ICSI.Berkeley.EDU
(David Petrie Stoutamire) writes:
>An application I am working on needs to quickly identity the first set
>bit in a word, starting from LSB. Sparc v9 has a population count
>instruction from which find-first-bit can be synthesized:
>
> neg in, temp
> xnor in, temp, temp
> popc temp, result
Be careful, since in = 0 and in = 2^31 yield the
same values for in ^ (-in) using 32-bit operations.
It's safer to use (in - 1) & ~in or in & (-in).
>When I replace v8 hand-written code for find-first-bit with this
>instruction sequence on an ultrasparc I, performance drops severely on
>this section of code. Based on this, I suspect "popc" is not really
>supported in hardware. I couldn't find any on-line documentation from
>Sun one way or the other. Sun provides ffs() as a library, but
>disassembly shows this to not be a clever implementation, nor does it
>use "popc".
>
>This leads to my questions:
>
> 1. Which popular ISAs support find-first-bit or pop count?
Cray (leading zero count, population count).
Intel x86 (find first bit starting from right or left).
CDC Cyber series (population count).
I think the Alliant also had these operations.
Of these, only x86 is `popular' today.
> 3. Can any common applications make use of it?
It occurs in the binary GCD algorithm (ulong = unsigned long):
ulong binary_gcd (ulong x, ulong y)
{ if (x == 0 || y == 0) {
return x | y;
} else {
ulong n1 = x >> trailing_zero_count(x);
ulong n2 = y >> trailing_zero_count(y);
while (n1 != n2) { /* Both are odd */
if (n1 > n2) {
n1 -= n2;
n1 >>= trailing_zero_count(n1);
} else {
n2 -= n1;
n2 >>= trailing_zero_count(n2);
}
}
return n1 << trailing_zero_count(x | y); } } Yes, we can replace n1 >>= trailing_zero_count(n1);
by
do {n1 >>= 1;} while (n1 & 1);
since n1 is known to be even and nonzero at that point.
But we desire an algorithm with predictable branches.
The IF-ELSE can be replaced by a branchless sequence which first
finds MIN(x, y) and MAX(x, y), on architectures with
conditional moves. It's harder to optimize
trailing_zero_count without hardware support such as
the population count function.
In a linear algebra application over the field GF(2),
I need the dot product of two binary vectors v1 and v2.
It is population_count(v1 & v2) & 1.
Using the binary method of exponentiation, computing
a power b**e takes truncated_base2_logarithm(e) squarings
and population_count(e)-1 multiplications, for integer e > 0.
In 1970, after I had taken a statistics course, and was
vending machine manager at my dormitory, I took a
survey in which everyone marked the products he liked.
Let V[v, p] = 1 if voter v likes product p,
and V[v, p] = 0 if voter v dislikes product p.
I computed sum_v V[v, p1] * V[v, p2] for all pairs (p1, p2),
as part of the formula for the correlation between p1 and p2.
A few weeks later, in an assembly language course, I learned
that the machine (a CDC 6400) lacked an integer multiply
instruction even though this was a primitive in the programming
language (Fortran) I had learned earlier.
Had I used binary AND rather than multiplication,
the program would have run faster (recall all values are 0 or 1).
This was in the days when CPU usage was expensive.
By the end of the term, I realized I could have
packed (for fixed p) all V[v, p] values into a few 60-bit words
(3 words since we had 150 voters). A few bitwise ANDs
and population counts would produce the sums I needed.
The operations of population count, finding the leftmost
nonzero bit (i.e., base 2 logarithm), and finding the rightmost
nonzero bit (i.e., finding the exponent of 2 dividing an integer)
occur frequently enough in applications that programming languages
should have these primitives and compilers should inline them.
At this time, however, their performance is not critical
for my applications.
> 4. Are there clever hacks for fast find-first-bit? The pop count hack:
One can compute
temp = in & (-in);
Then temp = 0 if in = 0; otherwise temp is a power of 2.
If p > 32 (or p > 64) is a prime for which 2 is a primitive root,
then you can compute (temp % p) and look up the answer
using a memory reference.
Alas, integer division is slow. If the high half of the
integer product is available, you can get temp/p
using one multiplication and a few simple operations.
I recall that SPARC v9 has the high half of a 32 x 32
product, but not of a 64 x 64 product.
If you are willing to use floating point, try
float f = (float) (signed long) (in | (-in));
Now look at the exponent of f (which is the negative of a
power of 2 unless in = 0).
--
Peter L. Montgomery pmontgom@cwi.nl San Rafael, California
My sex is male. I am ineligible for an abortion.
From: glew@ichips.intel.com (Andy Glew)
Newsgroups: comp.arch
Subject: Re: Find first bit and population count
Date: 30 Jun 1996 07:05:33 GMT
Organization: Intel Corporation
From: d.sand@ix.netcom.com(Duane Sand)
Subject: Re: Find first bit and population count
Newsgroups: comp.arch
Date: 24 Jun 1996 18:19:48 GMT
Organization: Netcom
davids@ICSI.Berkeley.EDU (David Petrie Stoutamire) writes:
> 3. Can any common applications make use of it?
I believe the decryption spooks at the National Security Agency are
very fond of the population-count operation. It seems to be so useful
to them that every US-made govt-subsidized supercomputer has included
this instruction even though there is little other use for it.
Actually, speech recognition people (and probably any pattern
recognition people) have a use for POP count:
POP( A AND B ) = the inner product (dot product) of one bit precision feature
vectors
POP( A XOR B ) = Huffman distance metric
A preliminary stage in many pattern recognition algorithms is to
compute a quick distance metric between the input pattern, and some
subset of the reference patterns in the database. You can do this via
Huffman distance, or you can interpret it is a projection.
When projecting, you can arbitrarily tradeoff precision for speed
- i.e. sometimes you might do the dot product in 64 bit floating
point, sometimes in the 8 bit precision that Intel's MMX and similar
"multimedia" instruction sets provide. 1 bit precision is just the
extrapolation of this.
Thus, you can use POP count to take a quick check of many simple
reference vectors, afterwards zooming in on a more full precision
check.
Come to think of it, this is probably the same reason that the NSA
uses.
---
Andy "Krazy" Glew, glew@ichips.intel.com, Intel,
M/S JF3-206, 2111 NE 25th Ave., Hillsboro, OR 97124
Place URGENT in email subject line for mail filter prioritization.
DISCLAIMER: private posting, not representative of employer.
We're looking for a few good microarchitects for our research labs.
{{ VLIW: the leading edge of the last generation of computer architecture }}
From: preston@Tera.COM (Preston Briggs)
Newsgroups: comp.arch
Subject: Re: Find first bit and population count
Date: 24 Jun 1996 09:20:25 -0700
davids@ICSI.Berkeley.EDU (David Petrie Stoutamire) writes:
> 1. Which popular ISAs support find-first-bit or pop count?
The Tera (certainly the most popular architecture around here) has
instructions for pop-count, find-first-1 and find-first-0, from left
or right).
> 3. Can any common applications make use of it?
Pop count might be used for cardinality of sets represented as bit
vectors. We use find-leftmost-1 (along with bitwise-matrix-multiply)
to support some of the common C string manipulation functions. Lets
us do things like copy or compare NULL-terminated strings at a rate of
8 bytes per 3 instructions.
Preston Briggs
From: preston@Tera.COM (Preston Briggs)
Newsgroups: comp.arch
Subject: Re: Find first bit and population count
Date: 1 Jul 1996 11:10:35 -0700
I wrote:
> We use find-leftmost-1 (along with bitwise-matrix-multiply)
> to support some of the common C string manipulation functions. Lets
> us do things like copy or compare NULL-terminated strings at a rate of
> 8 bytes per 3 instructions.
And Cliff Click wrote:
>Ok, I give. How does find-leftmost-1 help in C string functions?
Imagine we're doing the the easy one -- strlen.
We want to do it in word-sized chunks, 8 bytes per load.
We load a word, then do a bitwise matrix multiply with -1,
then find-leftmost-0 to see if there are any 0 bits in the result.
If so, we've found the end of the string and the position of the bit,
divided by 8, should be added to the accumulated length.
Obviously the trickiness is in the bitwise matrix multiply. This
operation treats a pair of 64-bit words as 8x8 bit matrices and
multiplies them using the boolean operations of AND for multiplying
two bits and either OR of XOR for summing. For this application, we
need the OR variant (the XOR version is useful for computing GF(2)).
Imagine we have 8 bytes of a string that look like
abcdefg\0
organized as an 8x8 bit matrix, we'd have
01100001 11111111 11111111
01100010 11111111 11111111
01100011 11111111 11111111
01100100 x 11111111 = 11111111
01100101 11111111 11111111
01100110 11111111 11111111
01100111 11111111 11111111
00000000 11111111 00000000
If you're counting cycles, recall that we issue 3 operations per tick,
and all operations require the same time. The operations used in the
inner loop are
load
inc
bit-mat-mul
find-leftmost-0
conditional branch
which we software pipeline into 3 instructions.
Preston Briggs
Newsgroups: comp.arch
Subject: Re: Find first bit and population count
From: Bruce@hoult.actrix.gen.nz (Bruce Hoult)
Date: Tue, 2 Jul 1996 20:22:42 +1200 (NZST)
cliffc@ami.sps.mot.com (Cliff Click) writes:
> preston@Tera.COM (Preston Briggs) writes:
>
> davids@ICSI.Berkeley.EDU (David Petrie Stoutamire) writes:
>
> > 1. Which popular ISAs support find-first-bit or pop count?
> > 3. Can any common applications make use of it?
>
> Pop count might be used for cardinality of sets represented as bit
> vectors. We use find-leftmost-1 (along with bitwise-matrix-multiply)
> to support some of the common C string manipulation functions. Lets
> us do things like copy or compare NULL-terminated strings at a rate of
> 8 bytes per 3 instructions.
>
> Preston Briggs
>
> Ok, I give. How does find-leftmost-1 help in C string functions?
The expression...
~n & (n - 0x01010101) & 0x80808080
... gives zero if and only if none of the bytes in the word are nonzero.
If the result is nonzero then the position of the first 1 tells you which
byte was nonzero.
You might want to disassemble your own company's "libmoto", where the C
string functions make use of this property. :-)
-- Bruce