
designing a suckless bignum library

Name: Anonymous 2015-11-16 22:11

Let's design a suckless bignum library. (I'm not part of suckless though, just curious about replacing GMP).

I researched a bit into algorithms and the rundown is this:
* long (schoolbook) multiplication: O(n^2)
* Karatsuba: O(n^log2(3)) ≈ O(n^1.585)
* Toom-Cook and Fourier-transform-based methods: faster still, but only worthwhile for numbers upward of ~10k digits, and much more complex.

So we should probably use Karatsuba for most multiplications (falling back to schoolbook below some cutoff, since Karatsuba's overhead loses on small inputs). Squaring can sometimes be done a bit faster than multiplying two distinct numbers.

Now I suggest programming it in assembly: that gives you access to the carry flag (C doesn't expose it directly). Of course we'll use libc and the normal C calling convention so that it's a regular C library.

What to do about memory management? e.g. if you want to add two numbers do we need to allocate a new 'number' as long as the largest to write the result into or do it destructively "x <- x + y"? Maybe the library should support both - then a calculator program would figure out the best primitives to use for a given computation.

It might be nice to also support things like (big modulus) modular arithmetic and polynomials. stuff like exponentiation and modular inverses have interesting algorithms.

What other integer operations would we want? I don't really want to do anything with arb. prec. real numbers - arithmetic with rationals could be done though.

Name: Anonymous 2015-11-20 16:39

op, if you're serious about this, do:

* modular arith only (ie, crypto only). for fast infinite precision, you really need hairy ball of code like GMP, no way to be suckless
* shift-and-add multiplier, then you get modulo for free
* why not schoolbook & Barrett reduction? Sure, that works better on superscalar CPUs with multiple ALUs, but it *will* suck horribly on a CPU without a hardware multiplier (which is everything really low-power these days)
* basics, ie modular add, sub, mul, pow: ~500 loc. prime tester (Miller-Rabin): ~100 loc. other bells and whistles (ie actual PKCS RSA, DHE, maybe even ECDHE): ~1 kloc. ~2 kloc all told.

What to do about memory management?

The golden rule of suckless C: just forget dynamic memory. Dynamically allocated memory was a mistake; K&R were right. You free memory through return or exit(). Just cap your bignum type at some hardcoded limb count. Sacrifice a bit of stack depth, and there's no need for a silly realloc() every time you add two numbers. Not to mention you then don't even need libc and can run baremetal easily.

As an example of a bignum in this style, here is a ~60 loc RSA signature verifier for a (very) ROM-space-constrained embedded MCU bootloader:

#define RSA_BYTES (RSA_BITS/8)
#define DIGIT uint32_t
#define DIGIT_BYTES (sizeof(DIGIT))
#define DIGIT_BITS (DIGIT_BYTES*8)
#define DIGIT_MAX ((DIGIT)(-1))
#define NDIGITS (RSA_BITS/DIGIT_BITS)

static int b_add(DIGIT * restrict r, const DIGIT *x, const DIGIT *y)
{
	DIGIT w, carry = 0;
	for (int i = 0; i < NDIGITS; i++) {
		if ((w = x[i] + carry) < carry)
			w = y[i];	/* x[i] + 1 wrapped to 0; carry stays set */
		else
			carry = ((w += y[i]) < y[i]);
		r[i] = w;
	}
	return carry;
}

static int b_mulmod(DIGIT * restrict res, const DIGIT *xsrc, const DIGIT *y, const DIGIT *mod)
{
	DIGIT rbuf1[NDIGITS], rbuf2[NDIGITS], xbuf1[NDIGITS], xbuf2[NDIGITS];

	DIGIT *r1 = rbuf1;	/* running result */
	DIGIT *r2 = rbuf2;
	DIGIT *x1 = xbuf1;	/* running power-of-two multiple of x */
	DIGIT *x2 = xbuf2;

	DIGIT *swp;

	memset(rbuf1, 0, sizeof rbuf1);
	memcpy(xbuf1, xsrc, sizeof xbuf1);

	for (int i = 0; i < NDIGITS; i++) {
		for (DIGIT bit = 1; bit; bit += bit) {
			if (y[i] & bit) {
				if (b_add(r2, r1, x1))
					return -1;
				if (!b_add(r1, r2, mod)) {	/* mod is negated: r1 = r2 - mod */
					swp = r1;	/* no carry out: r2 < mod, keep r2 */
					r1 = r2;
					r2 = swp;
				}
			}
			if (b_add(x2, x1, x1))	/* x2 = 2*x1 */
				return -1;

			if (!b_add(x1, x2, mod)) {
				swp = x1;
				x1 = x2;
				x2 = swp;
			}
		}
	}
	memcpy(res, r1, sizeof rbuf1);
	return 0;
}

static int rsa_public(DIGIT *output, const DIGIT *input, const DIGIT *modulus)
{
	DIGIT buf2[NDIGITS];
	/* buf2 = input^2 % modulus */
	if (b_mulmod(buf2, input, input, modulus))
		return -1;
	/* output = input^3 % modulus (public exponent e = 3) */
	return b_mulmod(output, buf2, input, modulus);
}


(the modulus is stored negated, so no b_sub is needed)

Name: Anonymous 2015-11-20 17:12

>>41
this is really nice and I appreciate it.

One question: the modular-arithmetic code can be nice like this, but what about another bit of code for infinite precision - does it really have to be hairy? If we just pick one algorithm that works reasonably well, wouldn't it be decent enough and not suck?

Name: Anonymous 2015-11-20 18:05

>>41
for fast infinite precision, you really need hairy ball of code like GMP, no way to be suckless

Prove it.

Name: Anonymous 2015-11-20 18:18

>>1-43
suck less hairy balls

Name: Anonymous 2015-11-20 18:43

Suckless is the Amish of programming.

Name: Anonymous 2015-11-20 18:48

I'm a white hat. I have always been a white hat. Reverse engineering is for
inferior monkey engineers. Humiliation. A glorious engineer makes new
standards.

Name: Anonymous 2015-11-20 19:36

>>35
Make a multiplier next.

Name: Anonymous 2015-11-21 4:24

>>35
>>41
Why not just use libtommath? It's already recommended by suckless. The only problem is that, being pure C, it isn't as optimized as it could be if it were hand-coded assembly.

It's right near the top as an alternative to GMP: http://suckless.org/sucks/

Name: Anonymous 2015-11-21 4:40

>>37
I love these quotes! But what post are they from?

Name: sage 2015-11-21 16:50

>>49
hi! i think you are a bit confused, you're using memes from 4chan's /jp/ board, but this isn't actually /jp/ at all! it's actually on a completely different site, called progrider (not related to 4chan.org). the board you actually meant to post on is at https://boards.4chan.org/jp/ . i'm glad we could clear up this confusion so that you can get back to posting on the right board. well, that's about all, enjoy being on your correct board, and please do try to be a bit more careful in the future. you're welcome!

Name: Anonymous 2015-11-21 17:17

>>48
Indeed libtom is very close to what
>>42
is suggesting - a good compromise in not being too slow for somewhat general use, but still far from a good choice for heavy (ie non-crypto) number crunching. That's why it might be wiser to strive for absolute reductionism (eg specialize to embedded and crypto, where you can do a lot under 2 kloc without being too slow) rather than compete with tommath directly.

The problem here is that general fast infinite precision is simply a collection of dozens of algorithms (libtom is far beyond the 2 kloc suckless limit :). The more you specialize for various cases (including heuristics to auto-choose the method, for easier use), the "faster" the lib is overall. The disadvantage is that you eventually end up with a kitchen sink like GMP, NTL or ARPREC.

>>43
Engineering is empirical - hearsay and superstition. You can find rigorous proofs of complexity for individual algorithms, but not for bignum in general, as that depends on which number-theoretic tricks you actually decide to implement.

>>48
Assembly
Ah, just drop it. Hand-written asm (of which there is not an immense amount in GMP - just inner-loop kernels) can squeeze out maybe 10-20% more. But that only makes sense after you've exhausted algorithmic optimizations, which are often much bigger wins. For example, NTL or ARPREC can still be a magnitude faster than GMP in some cases (say, non-modular multiplication), despite being pure C++.

I think tommath is indeed pretty close to GMP in terms of implemented tricks, but lacks NTT and asm kernels IIRC.

Name: Anonymous 2015-11-21 19:00

>>51
Really strong points. I'll have to rethink the scope somewhat to be able to pull this off.

Regarding assembly: but I don't want to do some stupid thing that relies on the compiler pattern-matching against my code to figure out I meant the carry bit. I just want to say 'check the carry bit'.

Name: Anonymous 2015-11-21 20:31

>>50
Hi! I think you are a bit confused, you're using sage as the 4chan imageboard does, but this isn't actually 4chan at all! it's actually on a completely different site, called progrider (not related to 4chan.org). the board you actually meant to post on is at https://boards.4chan.org/g/ . i'm glad we could clear up this confusion so that you can get back to posting on the right board. well, that's about all, enjoy being on your correct board, and please do try to be a bit more careful in the future. you're welcome!

Name: Anonymous 2015-11-21 22:54

>>1,35,41,48,51,52
https://github.com/suiginsoft/hebimath

This was how I spent my Saturday morning. Not that I'm taking ownership of this or anything, baka, but you're free to use it as a starting point.

It currently supports only x86_64. Run make to build the library and test programs, then make test to execute the test programs in the test dir. There's only one right now: a microbenchmark profiling the addition function.

>>52
Ah, just drop it. Hand written asm (of which there is not immense amount in GMP, just inner loop kernels) can squeeze say, 10-20% more.

$ make test
test/profile_add
>>35-san add: 7.24666365 seconds
Hebimath add: 0.22406670 seconds

Name: Anonymous 2015-11-21 23:00

>>54
We love you!

Name: Anonymous 2015-11-21 23:11

>>54
are you the avatarfag on /tech/?

Name: Anonymous 2015-11-21 23:14

>>54
I like the code, nice one

Name: Anonymous 2015-11-21 23:50

>>56
The what on what?

Name: Anonymous 2015-11-22 0:06

>>55,57
Thanks!

>>56
No. I sometimes post on /tech/, but always anonymously and not with smugface anime meme images.

Name: suigin 2015-11-22 0:37

I made a mistake in test/profile_add.c. I wasn't computing the right number of base 10 digits to generate for >>35-san's BCD adder. Using the correct formula, it turns out the optimized assembly version is nearly 100 times faster. I've submitted the changes to github.

$ make test
test/profile_add
>>35-san add: 22.01722459 seconds
Hebimath add: 0.22316967 seconds

Name: Anonymous 2015-11-22 9:14

What voodoo do you need to compile these down to vector operations using SSE/AVX/etc? Will the Intel compiler pull it off? Because shuffling all these numbers through normal CPU registers is loldumb.

Name: Anonymous 2015-11-22 13:56

suigin, why don't you announce your suckless projects (sstoy, hebimath, etc) to the other suckless devs on irc or the mailing list?
That way they can put them on their git and give the projects more visibility (and development help).
Sorry for the off-topic, just wanted to float the idea. They have git.suckless.org - git.2f30.org - git.r-36.net

Name: Anonymous 2015-11-22 13:58

git sucks, use hg

Name: Anonymous 2015-11-22 14:42

>>41
The golden rule of suckless C - just forget dynamic memory.

Nice to see Forth finally getting some representation in /prog/. Shame about trying to do it in C, though.

Name: Anonymous 2015-11-22 14:43

>>63
suckless used hg, but they migrated to git at the end of 2012.
there was a long discussion on the mailing list that started here: http://lists.suckless.org/dev/1211/13601.html
read the rest for the pros/cons.
anyway, this is the suckless bignum library thread, not the version-control war thread.

Name: Anonymous 2015-11-22 15:04

ok fuck off about version control.
make a separate thread or something.
this is about BIG NUM

Name: Anonymous 2015-11-22 15:05

BIG NUM

Name: suigin 2015-11-22 20:26

>>61
SSE/AVX/NEON don't support addition or multiplication with carries; you have to compute the carry manually, so you're better off using the scalar multiplier. Intel has its ADX extensions, but those have nothing to do with SIMD.

You can still use vector instructions for other things like bitwise and, or, xor, etc.

GCC and Clang are pretty good at automatically vectorizing code these days, so long as you configure the target architecture correctly and align your memory to the SIMD register boundary size.

>>62
Because I'm not a suckless developer. One must first prove himself before asking for a git account and sanctioning of one's own projects.

Name: suigin 2015-11-22 21:03

So, I've been thinking some more. Writing these low-level bignum kernels gets you unsigned arithmetic, but not signed arithmetic.

It's possible to augment the adder to do sign extension during the propagation phase so you can stick with two's-complement notation, but you're still in trouble when it comes to multiplication: you have to track whether an operand is negative, and either negate the subdivided components during reduction (which requires new code paths) or negate the entire operand up front and recompute the sign at the end. I can see why most bignum libraries store the sign externally and use an unsigned internal representation.

So then we need a data structure to store the sign in, and once you do that, it also starts to make sense to include size of the operand in the struct and as a result, it naturally makes sense to handle memory management in the library. I'm not sure if there's a cleaner way to do it.

We can still do a better job than libtommath.

A compact bignum type on a 64-bit platform might look like this:

typedef struct {
	uint64_t *data;
	size_t capacity;
	size_t sign:1;  /* using bit-fields for brevity, but you probably */
	size_t size:63; /* want to do the masking and shifting manually,  */
} Bignum;               /* as the standard only guarantees bit-fields of int */


However, for bignums smaller than 128-bits, it's possible to store the data directly in the bignum type to avoid memory allocations.

typedef struct {
	union {
		struct {
			uint64_t *data;
			size_t capacity;
		} big;
		uint64_t small[2];
	} num;
	size_t sign:1;
	size_t isbig:1; /* 0 = number stored in num.small, 1 = allocated externally */
	size_t size:62;
} Bignum;


But there's something else to consider. What if we were to make the bignum kernels deal only with 256-bit aligned chunks (128-bit on 32-bit platforms)? Our compute kernels would shrink in half, be easier to understand, and be friendlier to CPU instruction caches, since we'd never handle left-over words beyond a multiple of 256 bits. CPUs would literally scream running our code.

But the smallest possible bignum would be 256-bits in size, and so our small-number memory allocation trick will no longer work unless we artificially pad the struct. This may or may not be a bad idea.

It really boils down to the usage patterns. Is the user computing lots of large numbers bigger than 256-bits in size? Or are the majority say 128-bits or less?

What's the best way to weigh these trade-offs? What do you prefer?

Name: Anonymous 2015-11-22 21:46

>>63
hg is bloated crap

Name: >>35 2015-11-22 23:22

>>60
ZOMG, optimized! This little-endian (maybe big-endian, I always get those mixed up) decimal adder is a little over 15% faster than my last version. This time, when you run the benchmark, do it on ARMv9, it really brings out the advantages of my approach.
#define swap(__a,__b) { __typeof__(__a) __t = __a; __a = __b; __b = __t; }

/* This is little endian, 99959 is encoded as "95999" */
void little_endian_add(char* addend, char* addend2, char* result){
	char* a = addend, *b = addend2;
	int a1len = strlen(addend), a2len = strlen(addend2);
	if(a1len < a2len) {
		swap(a, b);
		swap(a1len, a2len);
	}
	int reslen = a1len + 1, carry = 0, i, immval;

	for (i = 0; i < reslen - 1; i++){
		immval = a[i] + ((i < a2len)? b[i] : 48) + carry;
		carry = 0;
		if (immval >= 106){	/* '0'+'0'+10: digit sum reached 10 */
			result[i] = immval - 58;
			carry = 1;
		}
		else{
			result[i] = immval - 48;
		}
	}
	result[reslen - 1] = (carry == 1) ? '1' : '\0';
	result[reslen] = '\0';

	return;
}

Name: suigin 2015-11-23 0:14

>>71
Thanks. I've added it to the profiling test.

I need to port the assembly routines to ARMv7 32-bit. Don't currently have a 64-bit ARM SoC, just a Cubieboard v3 and a Raspberry Pi 2.

Name: Anonymous 2015-11-23 2:38

Just pipe everything through bc — it's the Unix way.

Name: Anonymous 2015-11-23 10:52

>>52
I just want to say 'check the carry bit'

Like it or not, the compiler generally knows what it's doing. It also gives you direct access to the carry - look at the disassembly of:

unsigned getsumcarry(unsigned *carry, unsigned a, unsigned b)
{
	unsigned v;
	*carry = (v = (a + b)) < a;
	return v;
}


which compiles to:
18: 89 f0 mov %esi,%eax
1a: 01 d0 add %edx,%eax
1c: 0f 92 07 setb (%rdi)


But you're correct that carry and the other flags are the kryptonite of modern CPUs, and 386-era superstitious assembly often doesn't help. You'd better shove the flag into a register asap and pray the CPU is clever enough not to stall the pipeline when it sees it. If you branch on it, you fuck things up pretty much always. Why? You've killed out-of-order execution with a data-dependent branch.

If you want to go slightly faster, use a 27-30 bit radix and delayed carries (ie "normalize" the number only every now and then). In some cases (P4 NetBurst) it can be 2-5x faster than add/adc chains, if the CPU is too stupid to short-circuit those pairs in the opcode-folding pass of its JIT.

Delayed carries are often used on moderately superscalar P4-like CPUs that don't JIT but have deep pipelines (these days mostly ARM), because anything depending on an ALU flag will introduce a hard pipeline stall there; the CPU is simply too dumb to avoid it.

tl;dr: 386 asm programmers are surefire way to trigger full autism like this rant

Name: Anonymous 2015-11-23 11:47

>>54
S-senpai, can you push your secret radix 2^(10/3) version? I'm having problems substituting >>35-san's code with yours, as your digits are, um, too big.

Name: Anonymous 2015-11-23 12:35

>>71
char* a = addend, *b = addend2;

Check out this bell-end whom don't even know C.

Name: Andreas Wagner 2015-11-23 14:49

Name: Anonymous 2015-11-23 14:51

>>75
There is no secret version, the size of the digits is fine, it makes maximal use of CPU resources.

What needs to be done instead is implement functions that convert character strings into the optimized internal representation, and vice versa.

Name: Anonymous 2015-11-24 12:44

>>76
who Are you le quoting xD

Name: suigin 2015-12-06 4:03

I've updated the hebimath library with a gmp/libtommath style integer API that handles memory management, and I've implemented runtime CPUID dispatching for the assembly functions. I wanted to make the memory management as flexible as possible, so you could easily override it with different allocators and do so in a thread-safe manner. If you don't need memory management, you can still use the low-level assembly kernels directly.

Currently working on adding multiplication and division by a small constant so I can implement hebi_int <-> base 10 string conversion, which will be needed for future unit tests.

Please take a look!

https://github.com/suiginsoft/hebimath/
