I'm late to the party. But, 2-adic Newton is pretty fast. Here's a 64-bit version:
/* If x is square, return +sqrt(x).
* Otherwise, return -1.
*/
long long isqrt(unsigned long long x) {
/* Precondition: make x odd if possible. */
int sh = __builtin_ctzll(x);
x >>= (sh&~1);
/* Early escape. */
if (x&6) return -1;
/* 2-adic Newton */
int i;
const int ITERATIONS = 5; // log log x - 1
unsigned long long z = (3-x)>>1, y=x*z;
for (i=0; i> 1;
y *= w;
z *= w;
}
assert(x==0 || (y*z == 1 && x*z == y));
/* Get the positive square root. Also the top bit
* might be wrong. */
if (y & (1ull<<62)) y = -y;
y &= ~(1ull<<63);
/* Is it the square root in Z? */
if (y >= 1ull<<32) return -1;
/* Yup. */
return y<<(sh/2);
}
The function listed above takes 31 Haswell cycles when inlined. So it's about half as fast as the double-precision sqrtsd() call. It computes the square root of its argument, which is not what OP asked for. But it takes advantage of the fact that you only care about integer square roots. As a result, it will be much faster on extended-precision integers than an approximate square root algorithm.
This works for longer integer types, with a few changes.
- It's a fixed-precision algorithm. If you use mpn_t for this, it will allocate new limbs instead of wrapping which is not what you want. You will need to explicitly clamp to the lowest n limbs.
- You need log log x - 1 ITERATIONS. So if x is 2048 bits, you need 10 instead of 5. That is, it's Newton's method so each iteration doubles the precision.
- You can compute the first 4 iterations with longs if you want.
- You need to replace the 32 with half the number of bits of your integer type, likewise 62 and 63 with 2 less and 1 less.