Given a 1TB data set on disk with around 1KB per data record, how can I find duplicates using 512MB RAM and infinite disk space?

Use a Bloom filter: a table of bits set by several simultaneous hashes. The data set is roughly 2^30 records (1TB at about 1KB each), and 512MB of RAM is 2^32 bits, so according to Wikipedia the optimal number of hash functions is k = ln(2) * m/n = ln(2) * 2^32 / 2^30 ≈ 2.77.
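To make the arithmetic concrete, here is a quick back-of-the-envelope check in Python (just a sketch; the 2^30 record count and 2^32-bit table size come straight from the problem statement above):

```python
import math

n = 2 ** 30   # ~1 billion records: 1TB of data at ~1KB per record
m = 2 ** 32   # 512MB of RAM expressed as bits

k_opt = math.log(2) * m / n
print(round(k_opt, 2))   # ~2.77, so round to 3 hash functions
```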

Round that to 3 hash functions. (Rounding up to 4 would actually give slightly more false positives: with m/n = 4 bits per record, the false-positive rate (1 - e^(-kn/m))^k works out to roughly 14.7% for k = 3 versus 16.0% for k = 4, so 3 is the better choice for this application.) This means you have a table of 512 megabytes, or 4 gigabits, and processing each record sets its three bits in that vast sea. If all three bits were already set, it's a potential match.
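Here is a minimal sketch of that bit table (my own illustration, not the answerer's code; the record_hashes helper that slices a SHA-256 digest into three positions is an assumption, any three reasonably independent hashes would do):

```python
import hashlib

M = 2 ** 32               # 4 gigabits = the 512MB bit table
K = 3                     # number of hash functions

bits = bytearray(M // 8)  # in-RAM table, all bits initially zero

def record_hashes(record: bytes):
    # Assumption: derive K roughly independent positions from slices of a SHA-256 digest.
    digest = hashlib.sha256(record).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(K)]

def add_and_check(record: bytes) -> bool:
    # Set the record's K bits; return True if all were already set (a potential match).
    positions = record_hashes(record)
    already_set = all(bits[p >> 3] & (1 << (p & 7)) for p in positions)
    for p in positions:
        bits[p >> 3] |= 1 << (p & 7)
    return already_set
```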

If it's a potential match, record the three hash values to one file; otherwise, record them to another file. Note the record index along with each match.
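A sketch of that record-keeping pass, reusing the record_hashes and add_and_check helpers from the previous snippet (the fixed 1KB record size and the file names are assumptions; the hashes are recomputed once per record purely to keep the sketch short):

```python
RECORD_SIZE = 1024  # assumed fixed-size 1KB records

with open("data.bin", "rb") as data, \
     open("potential_matches.txt", "w") as matches, \
     open("non_matches.txt", "w") as others:
    index = 0
    while True:
        record = data.read(RECORD_SIZE)
        if not record:
            break
        line = " ".join(str(h) for h in record_hashes(record))
        if add_and_check(record):
            # Note the record index along with each potential match.
            matches.write(f"{index} {line}\n")
        else:
            others.write(f"{line}\n")
        index += 1
```

Presumably a later pass re-reads only the records listed in the matches file and compares them directly, weeding out the Bloom filter's false positives; that is why the answer calls them potential matches.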

I'm late to the party, so unfortunately I doubt my solution will get much consideration, but I think it deserves to be offered.

