How to find a checksum of the same checksum? (job-interview question)_问答_开发者

Devise a simple algorithm which creates a file which contains nothing but its own chec开发者_Go百科ksum.

Let's say it is CRC-32, so this file must be 4 bytes long.

There might be some smart mathematical way of finding it out (or proving that none exists), if you know how the algorithm works.

But since I'm lazy and CRC32 has only 2^32 values, I would brute force it. While waiting for the algorithm to go through all 2^32 values, I would use Google and Stack Overflow to find whether somebody has a solution to it.

In case of SHA-1, MD5 and other more-or-less cryptographically secure algorithms, I would get intimidated by the mathematicians who designed those algorithms and just give up.

EDIT 1: Brute forcing... This far I've found one; CC4FBB6A in big-endian encoding. There might still be more. I'm checking 4 different encodings: ASCII uppercase and lowercase, and binary big-endian and little-endian.

EDIT 2: Brute force done. Here are the results:

CC4FBB6A (big-endian)
FFFFFFFF (big-endian & little-endian)
32F3B737 (uppercase ASCII)

The code is here. On my overclocked C2Q6600 that takes about 1.5 hours to run. Now that program is single-threaded, but it would be easy to make it multi-threaded, which would give a nice linear scalability.

Aside from Jerry Coffin and Esko Luontola's good answers to an unusual problem, I'd like to add:

Mathematically, we're looking for X such that F(X) = X, where F is the checksum function, and X is the data itself. Since the checksum's output is of fixed size, and the input we are looking for is of the same size, there is no guarantee that such an X even exists! It could very well be that every input value of the fixed size is correlated with a different value of that size.

EDIT: Your question didn't specify the exact way the checksum is supposed to be formatted within the file, so I assumed you mean the byte-representation of the checksum. When strings and encodings and formatted-strings come to play, things become more complex.

Lacking any specific guidance to the contrary, I'd define the checksum of nonexistent data as a nonexistent checksum, so creating an empty file would fulfill the requirement.

Another typical method is a negative checksum -- i.e. after the data you write a value that makes the checksum of the whole file (including the checksum) come out to zero. In this case, you write a checksum of 0, and it all works out.

Brute force. This is Adler32, which I haven't implemented before, and didn't bother testing, so it's quite likely I've messed it up. I wouldn't expect a corrected version to run significantly slower, though, unless I've done something colossally wrong.

This assumes that the 32bit checksum value is written to the file little-endian (I didn't find a fixed point with it big-endian):

#include <iostream>
#include <stdint.h>
#include <iomanip>

const int modulus = 65521;

void checkAllAdlers(uint32_t sofar, int depth, uint32_t a, uint32_t b) {
    if (depth == 4) {
        if ((b << 16) + a == sofar) {
            std::cout << "Got a fixed point: 0x" << 
                std::hex << std::setw(8) << std::setfill('0') << 
                sofar << "\n";
        }
        return;
    }
    for (uint32_t i = 0; i < 256; ++i) {
        uint32_t newa = a + i;
        if (newa >= modulus) newa -= modulus;
        uint32_t newb = b + a;
        if (newb >= modulus) newb -= modulus;

        checkAllAdlers(sofar + (i << (depth*8)), depth + 1, newa, newb);
    }
    return;
}

int main() {
    checkAllAdlers(0, 0, 1, 0);
}

Output:

$ g++     adler32fp.cpp   -o adler32fp -O3 && time ./adler32fp
Got a fixed point: 0x03fb01fe

real    0m31.215s
user    0m30.326s
sys     0m0.015s

[Edit: several bugs fixed already, I have no confidence whatever in the correctness of this code ;-) Anyway, you get the idea: a 32 bit checksum which uses each byte of input only once is very cheap to brute force. Checksums are usually designed to be fast to compute, whereas hashes are usually much slower, even though they have superficially similar effects. If your checksum was "2 rounds of Adler32" (meaning that the target checksum was the result of computing the checksum and then computing the checksum of that checksum) then my recursive approach wouldn't help so much, there'd be proportionally less in common between inputs with a common prefix. MD5 has 4 rounds, SHA-512 has 80.]

Brute force it. CRC-32 gives you a string of length 8 containing digits and letters of A-F (in other words, it's a hexadecimal number). Try every combination, giving you 16⁸ = many possibilities. Then hash each possibility and see if it gives you the original string.

You can try optimizing it by assuming the solution will use each character no more than two or three times, this might make it finish faster.

If you have access to a CRC32 implementation, you can also try to break the algorithm and find a solution much faster, but I have no idea how you'd do this.