2
I'm bad at math and English isn't my native language, so please bear with me. Thanks.

I'm running a prime number search script and writing the result to a SQLite3 database.

Now I'm looking for a way to shorten these prime numbers, because I don't want to be forced to store them as strings once they get very big/long. I don't want to lose precision, so it must be possible to recompute the original value from the shortened version.

I don't care if the database has human-readable data. I can make it human-readable when I fetch data from the database again.

I browsed https://oeis.org and https://primes.utm.edu to see how they organize their databases, but that only left me more confused. Is there a best practice for this? I'm really stuck.

  • 0
    How about simply compressing it with zip or rar?2017-01-02
  • 0
    @barak manos Basically yes, but not in my case. I want to be able to query the data through a relational database.2017-01-02
  • 0
    Do you want to get the n-th prime for random n or do you want to generate the primes sequentially?2017-01-02
  • 0
    @marty cohen I generate them sequentially. My script works with range entries like `1 to 100`.2017-01-02
  • 2
    You can save the differences between consecutive primes. So $2,3,5,7,11,13,17,19$ gets stored as $2,1,2,2,4,2,4,2$.2017-01-02
  • 0
    @Gerry Myerson : Oh sure! I didn't see that option while thinking about it (too much maybe). Will try this out. Thanks!2017-01-02
  • 0
    @NinjaCat By the way, these differences are very small in practice. For primes up to $10^9$, there won't be a difference larger than $300$ (this could inform what data type you use to store them), and the typical difference will be closer to $20$.2017-01-02

3 Answers 3

1

If you are generating them sequentially, a standard way is to store the difference between consecutive primes.
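As a minimal sketch of the difference idea (the function names `encode_gaps` and `decode_gaps` are made up for illustration, not taken from any library), the first prime is stored as-is and every later entry is the gap to the previous prime:

```python
from itertools import accumulate

def encode_gaps(primes):
    """Turn an ascending list of primes into the first prime followed by the gaps."""
    if not primes:
        return []
    return [primes[0]] + [b - a for a, b in zip(primes, primes[1:])]

def decode_gaps(gaps):
    """Recover the original primes by taking running sums of the stored gaps."""
    return list(accumulate(gaps))

primes = [2, 3, 5, 7, 11, 13, 17, 19]
gaps = encode_gaps(primes)
print(gaps)                         # [2, 1, 2, 2, 4, 2, 4, 2]
print(decode_gaps(gaps) == primes)  # True
```

The trade-off is that recovering one particular prime means summing all gaps before it, so this suits sequential reads better than random lookups.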

This can be done much more compactly (I did this many years ago) by using the fact that, for $n \ge 1$, there are only 8 possible primes between 30n and 30n+29 (they are 30n+1, 30n+7, 30n+11, 30n+13, 30n+17, 30n+19, 30n+23, 30n+29).

By storing 8 bits in one byte, each bit telling whether or not that particular increment is prime, the primes in 30n to 30n+29 can be represented by one byte.

Note that 30 = 2×3×5 and 8 = 1×2×4, which is the number of residues modulo 30 that are divisible by none of 2, 3, 5.

To go a little further, since 210 = 2×3×5×7 and 48 = 1×2×4×6, the primes from 210n to 210n+209 can be represented in 48 bits.

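For the 30 case, the increments between consecutive candidates repeat with period 8. Starting from a candidate of the form 30n+29, they are 2, 6, 4, 2, 4, 2, 4, 6.

The algorithm would be to generate 2, 3, 5, 7 initially. Then, starting from 7 and stepping by the repeating increments 4, 2, 4, 2, 4, 6, 2, 6, check if the corresponding bit is on and, if it is, output the value as a prime.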

To check numbers up to $m$, you have to compute the primes up to $\sqrt{m}$ and store them.
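Here is a rough sketch of the one-byte-per-30-numbers scheme. The names `RESIDUES`, `pack`, and `unpack` are illustrative, and plain trial division is only an assumed stand-in for whatever prime test the search script really uses. Unlike the description above, this version emits only 2, 3, 5 up front and reads 7 from the first byte.

```python
RESIDUES = (1, 7, 11, 13, 17, 19, 23, 29)  # residues mod 30 coprime to 30

def is_prime(n):
    """Plain trial division; an assumed stand-in for the real prime test."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def pack(blocks):
    """One byte per block of 30 numbers; bit k is set if 30*n + RESIDUES[k] is prime."""
    out = bytearray()
    for n in range(blocks):
        byte = 0
        for k, r in enumerate(RESIDUES):
            if is_prime(30 * n + r):
                byte |= 1 << k
        out.append(byte)
    return bytes(out)

def unpack(data):
    """Yield 2, 3, 5 (which the bitmap cannot represent), then every flagged candidate."""
    yield from (2, 3, 5)
    for n, byte in enumerate(data):
        for k, r in enumerate(RESIDUES):
            if byte & (1 << k):
                yield 30 * n + r

data = pack(4)             # the numbers 0..119 fit in 4 bytes
print(len(data))
print(list(unpack(data)))
```

In a database, each byte (or a run of bytes) could be stored with its block number `n` as the key, so a range query only has to fetch the blocks that overlap the range.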

1

You could try to save the database in binary format, I guess. Otherwise each digit of a number is stored as a character with a fixed size of presumably 1 byte, whereas even without improvements you could store any number up to 255 in a single byte in binary format.
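To make the size difference concrete, here is a small sketch comparing the decimal text of a number with its raw binary form, using Python's built-in `int.to_bytes` and `int.from_bytes`. The sample value is the millionth prime.

```python
n = 15485863  # the millionth prime

as_text = str(n).encode("ascii")                          # one byte per decimal digit
as_binary = n.to_bytes((n.bit_length() + 7) // 8, "big")  # minimal big-endian bytes

print(len(as_text), len(as_binary))          # 8 3
print(int.from_bytes(as_binary, "big") == n) # True
```

The saving is roughly a constant factor (about 2.4 decimal digits fit in one byte), so it reduces the size but does not change how it grows.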

Also, you may want to look at Huffman coding, but I doubt it would help in the case of primes.

  • 0
    Thanks for your suggestions. I'm reading through the Huffman coding link right now. About saving the database in binary format: I thought that when I create a SQLite3 database file, it's already a binary file. So you mean storing the values in their binary representation in a BLOB field?2017-01-02
  • 0
    I don't know about SQLite3 or BLOB fields, but in C or Python you can write a file in binary format, where the numbers are saved as numbers instead of characters, which reduces the size.2017-01-02
  • 0
    I see. I understood you correctly. Just wanted to be sure. Thank you!2017-01-02
1

Let's consider how much compression you can get by clever encoding of your primes to reduce the amount of memory or file space each one occupies.

Whatever encoding you choose, I assume you want it to work for any prime, so that no matter which prime you encode by this method, you can recover the original prime from the encoding later.

Suppose you had a function that could tell you the $n$th prime quickly enough that you could just encode the $n$th prime as the number $n$ and store it in that format. That's the densest possible encoding that you could have. Assuming you stored the number in decimal format, in order to store primes larger than $15\,485\,863$ (the millionth prime) you would need to use at least seven digits. That is, by this incredibly efficient compression scheme you would be able to store eight-digit primes such as $15\,485\,867$, and nine-digit primes below $179\,424\,673$, in just seven digits. In order to store primes from $179\,424\,673$ (the ten millionth prime) onward you would need to use at least eight digits, and eight digits would allow you to store nine-digit primes and some ten-digit primes.

In general, in order to encode a prime with numeric value $N$ in this fashion, the encoding itself will be about $N/\ln(N).$ That means to encode a prime with twelve decimal digits the encoding will be about $10^{11}/\ln(10^{11}) \approx 3.9\times10^9$ at least, that is, ten digits or more. The encoding for a large twelve-digit prime would be closer to $10^{12}/\ln(10^{12}) \approx 3.6\times10^{10},$ which has eleven digits.

To encode primes of up to twenty digits using this scheme, most of the numbers after encoding would be greater than $10^{19}/\ln(10^{19}) \approx 2.3\times10^{17},$ an eighteen-digit number.
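These estimates are easy to reproduce. The sketch below uses the prime number theorem approximation $N/\ln(N)$ for the index of a prime near $N$. The helper name `approx_index` is made up for illustration, and the result is only an estimate, not an exact prime count.

```python
import math

def approx_index(n):
    """Prime number theorem estimate N / ln(N) for the index of a prime near N."""
    return n / math.log(n)

for exp in (11, 12, 19):
    est = approx_index(10 ** exp)
    digits = len(str(int(est)))
    print(f"N = 10^{exp}: index is about {est:.1e}, which has {digits} digits")
```

In each case the index has only one or two digits fewer than the prime itself, which is why even this ideal encoding saves little.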

In short, unless you want to limit yourself to particular classes of primes (storing only Mersenne primes, for example), you're not going to be able to reduce the required storage space much by some clever function that encodes prime numbers using smaller numeric values.

What you can do is to choose a numeric representation that stores large numbers in fewer bytes of memory. SQLite 3 provides the INTEGER datatype, which stores an eight-byte signed integer with a maximum value of about $9\times10^{18}.$ That covers (approximately) the first $2\times10^{17}$ primes. Will your prime-number search script be searching for primes beyond that?
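If the script ever does go past that limit, one possible arrangement (only a sketch; the table names `small_primes` and `big_primes` and the helpers `store` and `load_all` are hypothetical) is to keep values that fit in INTEGER and fall back to a big-endian BLOB for anything larger. Python's `sqlite3` module raises `OverflowError` if you try to bind an integer above $2^{63}-1$, so the check has to happen before the insert.

```python
import sqlite3

INT64_MAX = 2 ** 63 - 1  # largest value SQLite's INTEGER type can hold

def store(conn, p):
    """Store p as INTEGER when it fits, otherwise as a big-endian BLOB."""
    if p <= INT64_MAX:
        conn.execute("INSERT INTO small_primes (value) VALUES (?)", (p,))
    else:
        blob = p.to_bytes((p.bit_length() + 7) // 8, "big")
        conn.execute("INSERT INTO big_primes (value) VALUES (?)", (blob,))

def load_all(conn):
    """Read both tables back as Python integers, in ascending order."""
    small = [row[0] for row in conn.execute("SELECT value FROM small_primes")]
    big = [int.from_bytes(row[0], "big")
           for row in conn.execute("SELECT value FROM big_primes")]
    return sorted(small + big)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE small_primes (value INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE big_primes (value BLOB PRIMARY KEY)")

for p in (2, 2 ** 61 - 1, 2 ** 89 - 1):  # 2^61 - 1 and 2^89 - 1 are Mersenne primes
    store(conn, p)

print(load_all(conn))
```

One caveat: SQLite compares BLOBs byte by byte, so BLOBs of different lengths do not sort in numeric order. Range queries on the BLOB table would need fixed-width padding or a separate length column.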

  • 0
    Thank you for this great explanation! I searched the web for "scientific number notation" and found a lot. I ended up trying [math.frexp()](https://docs.python.org/2/library/math.html#math.frexp), but I'm not sure if that's the way to go. Yes, my script doesn't have a limit. It's slow with very big numbers, but it can theoretically search forever. I'm aware of the fact that the system I'm running it on will limit it, but I just want to make it this way. It's a fun project.2017-01-02
  • 1
    Be aware that math.frexp() has _less_ precision than a 64-bit integer. There are pairs of consecutive primes you can distinguish using the integer but not using math.frexp(). If you want to be able theoretically to handle numbers much larger than $10^{19}$ you'll need to "roll your own" data type. That also means all the operations in your script have to be valid and exact for larger numbers, as well. A wiser course may be to accept the limitations of the 8-byte INTEGER type, since other limitations (running time, in particular) are likely to be what actually limits how many primes you find.2017-01-02
  • 0
    I see. That whole precision topic is still kind of a mystery to me. Will continue reading on the topics you mentioned tomorrow. Thank you and good night for now :)2017-01-02
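The precision point about math.frexp() in the comments above can be seen with a short experiment. `math.frexp` works on floats, which carry 53 bits of mantissa, so an integer that needs more bits is rounded before it is split. The value below is just a convenient odd number, not a prime.

```python
import math

n = 2 ** 53 + 1                    # needs 54 bits, one more than a float mantissa holds
mantissa, exponent = math.frexp(n) # n is converted to a float first, losing the last bit
restored = int(math.ldexp(mantissa, exponent))

print(n)
print(restored)
print(restored == n)               # False: the round trip is not exact
```

Near $2^{63}$ neighbouring floats are 1024 apart, so any two primes that round to the same float would become indistinguishable after such a round trip.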