Given a 1TB data set on disk with around 1KB per data record, how can I find duplicates using 512MB RAM and infinite disk space?

Use a Bloom filter: a table of bits set by several simultaneous hashes. The data set is roughly 2^30 records (1TB at about 1KB each), and 512MB of RAM is 2^32 bits, so according to Wikipedia the optimal number of hash functions is k = ln(2) * m/n = ln(2) * 2^32 / 2^30 ≈ 2.77.
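To make the arithmetic concrete, here is a quick back-of-the-envelope check in Python (just a sketch; the 2^30 record count and 2^32-bit table size come straight from the problem statement above):

```python
import math

n = 2 ** 30   # ~1 billion records: 1TB of data at ~1KB per record
m = 2 ** 32   # 512MB of RAM expressed as bits

k_opt = math.log(2) * m / n
print(round(k_opt, 2))   # ~2.77, so round to 3 hash functions
```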

Round that to 3 hash functions. (Rounding up to 4 would actually give slightly more false positives: with m/n = 4 bits per record, the false-positive rate (1 - e^(-kn/m))^k works out to roughly 14.7% for k = 3 versus 16.0% for k = 4, so 3 is the better choice for this application.) This means you have a table of 512 megabytes, or 4 gigabits, and processing each record sets its three bits in that vast sea. If all three bits were already set, it's a potential match.
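Here is a minimal sketch of that bit table (my own illustration, not the answerer's code; the record_hashes helper that slices a SHA-256 digest into three positions is an assumption, any three reasonably independent hashes would do):

```python
import hashlib

M = 2 ** 32               # 4 gigabits = the 512MB bit table
K = 3                     # number of hash functions

bits = bytearray(M // 8)  # in-RAM table, all bits initially zero

def record_hashes(record: bytes):
    # Assumption: derive K roughly independent positions from slices of a SHA-256 digest.
    digest = hashlib.sha256(record).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(K)]

def add_and_check(record: bytes) -> bool:
    # Set the record's K bits; return True if all were already set (a potential match).
    positions = record_hashes(record)
    already_set = all(bits[p >> 3] & (1 << (p & 7)) for p in positions)
    for p in positions:
        bits[p >> 3] |= 1 << (p & 7)
    return already_set
```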

If it's a potential match, record the three hash values to one file; otherwise, record them to another file. Note the record index along with each match.
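A sketch of that record-keeping pass, reusing the record_hashes and add_and_check helpers from the previous snippet (the fixed 1KB record size and the file names are assumptions; the hashes are recomputed once per record purely to keep the sketch short):

```python
RECORD_SIZE = 1024  # assumed fixed-size 1KB records

with open("data.bin", "rb") as data, \
     open("potential_matches.txt", "w") as matches, \
     open("non_matches.txt", "w") as others:
    index = 0
    while True:
        record = data.read(RECORD_SIZE)
        if not record:
            break
        line = " ".join(str(h) for h in record_hashes(record))
        if add_and_check(record):
            # Note the record index along with each potential match.
            matches.write(f"{index} {line}\n")
        else:
            others.write(f"{line}\n")
        index += 1
```

Presumably a later pass re-reads only the records listed in the matches file and compares them directly, weeding out the Bloom filter's false positives; that is why the answer calls them potential matches.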

I'm late to the party, so unfortunately I doubt my solution will get much consideration, but I think it deserves to be offered.

