Print unique lines of a 10GB file?

I have a 10GB file with 200 million lines. I need to get the unique lines of this file. My code:

    while (<>) {
        chomp;
        $tmp{$_} = 1;
    }
    # print ...

I only have 2GB of memory.

How can I solve this problem? (perl) – John Water, asked Apr 5 at 2:32.

– Matthew Flaschen Apr 5 at 2:37

I don't know how many of the lines are duplicated. In top, this script takes 20G VIRT and 10G RES, and growing... – John Water Apr 5 at 2:40

If you can't store one copy of each line in memory in plaintext, you could try compressing them, or hashing them (which might result in errors!). If there are a lot of these guys, using a DBMS would be a lot slower, but would also get the job done. – bdares Apr 5 at 2:44.

If you don't care about preserving order, I bet the following is faster than the previously posted solutions (e.g. DBM::Deep): sort -u file.

That isn't saying much, since just about anything is faster than DBM::Deep. – tchrist Apr 5 at 13:25

What is the memory requirement for sort -u? Also, sorting files is O(N log N), while merely finding unique lines is O(N). This is MUCH slower than any other approach. – DVK Apr 5 at 15:23

@ikegami - Since when is sort -u doing 200M comparisons on a 200M-line file? Wouldn't it be more like 200M*31? – DVK Apr 5 at 19:34

@ikegami - actually, O(N log N) is 200M*31 and O(N) is 200M*c, and if your scanning algo is smart enough, c is way less than 31. – DVK Apr 5 at 19:37

@DVK, oops, I added instead of multiplied (previous comment deleted), but 27 times more checks in sort is still likely faster than a Perl solution. Where do you get "MUCH slower" from!?!? – ikegami Apr 5 at 19:38.

As I commented on David's answer, a database is the way to go, but a nice one might be DBM::Deep since it's pure Perl and easy to install and use; it's essentially a Perl hash tied to a file.

    use DBM::Deep;

    tie my %lines, 'DBM::Deep', 'data.db';
    while (<>) {
        chomp;
        $lines{$_} = 1;
    }

This is basically what you already had, but the hash is now a database tied to a file (here data.db) rather than kept in memory.

No, you don't want to use DBM::Deep. It is really, really, really, really slow. Just use DB_File and be done with it. There is no moral high ground to be seized by using something that's "pure Perl", you know. – tchrist Apr 5 at 13:25

No moral high ground intended, I really just know very little about databases, and DBM::Deep seems to do what I expect. I haven't used DB_File before.

Am I right in saying that it looks like it needs an external db server running? I guess that for a one-off perhaps I would trade some running speed for staying under the memory cap and not having to install an external db. For real projects I will keep DB_File in mind.

Thanks. – Joel Berger Apr 5 at 17:31

Reading some of the comments, I will just reiterate: the question wasn't about speed, it was about staying under 2GB of RAM. I am working under the assumption that the OP is cleaning up some old data and that this is just a one-off, in which case, just get it done.

If I am incorrect and this is a task to be performed repeatedly, then, by all means, we must consider speed. – Joel Berger Apr 5 at 17:35

No, you're not right, there is no server. It simply links against libdb (/usr/lib64/libdb-4.8.so on my system), an embedded database. Its bindings ship with the Perl core. DBM::Deep is not strictly required here because there are no nested data structures, and if thinking about the right tool saves us a couple of minutes of run-time, that's already a non-trivial win. Since thinking about this takes only a couple of seconds, why would one not consider speed?

– daxim 2 days ago

@daxim, I was envisioning a nightmare of configuration. If it's as easy as you say, then no problem. Like I said, I've only used DBM::Deep out of convenience.

– Joel Berger 2 days ago.
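
For reference, a minimal sketch of the tied-hash approach tchrist and daxim describe, using DB_File instead of DBM::Deep; the file name seen.db and reading from STDIN are assumptions for illustration, not from the thread:

    #!/usr/bin/env perl
    # Sketch only: the same idea as the DBM::Deep answer above, but the hash is
    # tied to an on-disk Berkeley DB file via the core DB_File module.
    use strict;
    use warnings;
    use Fcntl;      # O_RDWR, O_CREAT
    use DB_File;

    tie my %seen, 'DB_File', 'seen.db', O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "Cannot tie seen.db: $!";

    while ( my $line = <STDIN> ) {
        chomp $line;
        print "$line\n" unless exists $seen{$line};
        $seen{$line} = 1;
    }

    untie %seen;

Memory stays bounded because the keys live on disk, and as daxim notes, DB_File's bindings ship with core Perl.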

If you don't care about time/IO constraints, nor disk constraints (e.g. you have 10 more GB of space), you can do the following dumb algorithm:

1) Read the file (which sounds like it has 50-character lines). While scanning it, remember the longest line length $L.

2) Analyze the first 3 characters (if you know char #1 is identical - say "" - analyze the 3 characters at a position N that is likely to have more diverse ones).

3) For each line with 3 characters $XYZ, append that line to file 3char.$XYZ and keep a count, in a hash, of how many lines went into each file (see the sketch after this list).

4) When your entire file is split up that way, you should have a whole bunch (if the lines are A-Z only, then 26^3) of smaller files, and at most 4 files that are >2GB each.

5) Move the original file into a "Processed" directory.

6) For each of the large files (>2GB), choose the next 3 character positions, and repeat steps #1-#5, with the new files being 6char.$XYZABC.

7) Lather, rinse, repeat. You will end up with one of 2 options eventually:

8a) A bunch of smaller files each of which is under 2GB, all of which have mutually different strings, and each of which (due to its size) can be processed individually by the standard "stash into a hash" solution in your question.

8b) Or, most of the files are smaller, but you have exhausted all $L characters while repeating step 7 for the >2GB files, and you still have between 1 and 4 large files. Guess what - since those up-to-4 large files have identical characters within a file in positions 1..$L, they can ALSO be processed using the "stash into a hash" method in your question, since they are not going to contain more than a few distinct lines despite their size!

Please note that this may require - at the worst possible distribution - 10GB * $L / 3 of disk space, but will ONLY require 20GB of disk space if you change step #5 from "move" to "delete". Voila. Done.
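
A rough Perl sketch of the bucketing pass in step 3; the names input.file and 3char.* are illustrative, and real code would also have to respect the open-file-handle limit:

    #!/usr/bin/env perl
    # Sketch: append each line to a bucket file keyed on its first three
    # characters, counting how many lines each bucket receives.
    use strict;
    use warnings;

    my %fh;          # one output handle per 3-character prefix
    my %line_count;  # lines written to each bucket

    open my $in, '<', 'input.file' or die "Cannot open input.file: $!";
    while ( my $line = <$in> ) {
        my $prefix = substr $line, 0, 3;
        ( my $tag = $prefix ) =~ s/[^A-Za-z0-9]/_/g;   # keep bucket file names sane
        unless ( $fh{$tag} ) {
            open $fh{$tag}, '>', "3char.$tag" or die "Cannot open 3char.$tag: $!";
        }
        print { $fh{$tag} } $line;
        $line_count{$tag}++;
    }
    close $_ for values %fh;

Buckets that are small enough can then be deduplicated with the plain in-memory hash from the question, and oversized buckets get re-split on the next three character positions.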

As an alternate approach, consider hashing your lines. I'm not a hashing expert, but you should be able to compress a line into a much shorter hash. If you want to be fancy about this, you could do a frequency analysis on character sequences on the first pass, and then do compression/encoding this way.
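
As a sketch of that idea, one could remember a fixed-size digest of each line rather than the line itself; Digest::MD5 here is just one convenient choice, and since Perl's per-key overhead means 200 million entries may still not fit in 2GB, in practice this combines best with the splitting scheme above:

    #!/usr/bin/env perl
    # Sketch: store a 16-byte MD5 digest per line instead of the ~50-byte line.
    # A digest collision could wrongly drop a line, which is the "might result
    # in errors" trade-off mentioned in the comments above.
    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    my %seen;
    while ( my $line = <STDIN> ) {
        print $line unless $seen{ md5($line) }++;
    }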

In most cases, you could store the line as a key in a hash. However, when you get this big, this really isn't very efficient. In this case, you'd be better off using a database.

One thing to try is Berkeley DB (BDB), which used to be included with Unix. Now, it's apparently owned by Oracle. Perl can use the BerkeleyDB module to talk to a BDB database.

In fact, you can even tie a Perl hash to a BDB database. Once this is done, you can use normal Perl hashes to access and modify the database. BDB is pretty robust.

Bitcoin uses it, and so does SpamAssassin, so it is very possible that it can handle the type of database you have to create in order to find duplicate lines. If you already have BDB installed, writing a program to handle your task shouldn't take that long. If it doesn't work, you won't have wasted too much time on this.

The only other thing I can think of is using an SQL database, which would be slower and much more complex.

Addendum

Maybe I'm overthinking this... I decided to try a simple hash. Here's my program:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use feature qw(say);
    use autodie;

    use constant DIR       => "/usr/share/dict";
    use constant WORD_LIST => qw(words web2a propernames connectives);

    my %word_hash;
    for my $count (1..100) {
        for my $file (WORD_LIST) {
            open my $file_fh, "<", DIR . "/$file";
            while ( my $word = <$file_fh> ) {
                chomp $word;
                $word_hash{"$file-$word-$count"} = $word;
            }
        }
    }

The files read in contain a total of about 313,000 lines. I do this 100 times to get a hash with 31,300,000 keys in it. It is about as inefficient as it can be.

Each and every key will be unique. The amount of memory will be massive. Yet... It worked.

It took about 10 minutes to run despite the massive inefficiencies in the program, and it maxed out at around 6 gigabytes. However, most of that was in virtual memory. Strangely, even though it was running, gobbling memory, and taking 98% of the CPU, my system didn't really slow down all that much.

I guess the question really is: what type of performance are you expecting? If taking about 10 minutes to run isn't that much of an issue for you, and you don't expect this program to be used that often, then maybe go for simplicity and use a simple hash. I'm now downloading BDB from Oracle, compiling it, and installing it.

I'll try the same program using BDB and see what happens.

Using a BDB Database

After doing the work, I think that if you have MySQL installed, using Perl DBI would be easier. I had to:

- Download Berkeley DB from Oracle, and you need an Oracle account. I didn't remember my password, and told it to email me. Never got the email. I spent 10 minutes trying to remember my email address.
- Once downloaded, it has to be compiled. Found directions for compiling for the Mac, and it seemed pretty straightforward.
- Running CPAN crashed. Turns out that CPAN was looking for /usr/local/BerkeleyDB while it was installed as /usr/local/BerkeleyDB.5.3. Creating a link fixed the issue.

All told, it was about half an hour getting BerkeleyDB installed.

Once installed, modifying my program was fairly straightforward:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use feature qw(say);
    use autodie;

    use BerkeleyDB;

    use constant {
        DIR      => "/usr/share/dict",
        BDB_FILE => "bdb_file",
    };
    use constant WORD_LIST => qw(words web2a propernames connectives);

    unlink BDB_FILE if -f BDB_FILE;

    our %word_hash;
    tie %word_hash, "BerkeleyDB::Hash",
        -Filename => BDB_FILE,
        -Flags    => DB_CREATE
        or die qq(Cannot create BDB database file ") . BDB_FILE . qq("\n);

    for my $count (1..10) {
        for my $file (WORD_LIST) {
            open my $file_fh, "<", DIR . "/$file";
            while ( my $word = <$file_fh> ) {
                chomp $word;
                $word_hash{"$file-$word-$count"} = $word;
            }
        }
    }

All I had to do was add a few lines. Running the program was a disappointment. It wasn't faster, but much, much slower.

It took over 2 minutes while using a pure hash took a mere 13 seconds. However, it used a lot less memory. While the old program gobbled gigabytes, the BDB version barely used a megabyte.

Instead, it created a 20MB database file. But, in these days of VM and cheap memory, did it accomplish anything? In the old days before virtual memory and good memory handling, a program would crash your computer if it used all of the memory (and memory was measured in megabytes and not gigabytes).

Now, if your program wants more memory than is available, it simply is given virtual memory. So, in the end, using a Berkeley database is not a good solution. Whatever I saved in programming time by using tie was wasted with the installation process.

And, it was slow. Using BDB simply used a BDB file on disk instead of memory. A modern OS will do the same, and is faster.

Why do the work when the OS will handle it for you? The only reason to use a database is if your system really doesn't have the required resources. 200 million lines is a big file, but a modern OS will probably be okay with it.

If your system really doesn't have the resources, use an SQL database on another system, not a BDB database.

I might even think about using a DBM::Deep hash for this, it's easy to set up and use. – Joel Berger Apr 5 at 4:07

What's the speed of the BDB solution? – DVK Apr 5 at 15:27

@DVK At one time, BDB was a standard library found on all Unix systems, and in Perl 3.x days, the tie command could only tie your hash to a BDB database. It was how you handled really big hashes. However, 20,000,000 is a big number, and BDB might not be able to handle it.

The other solution would be to use a true SQL database. That's more complex. You could implement a BDB solution fairly quickly and see if it works or not before moving to a SQL solution.

– David W. Apr 5 at 20:36 Post Addendum: Can you also run using DBM::Deep? I am genuinely curious now after all the bashing it has taken.

I'm not married to it, but I have used it on occasion for simplicity. – Joel Berger Apr 5 at 21:40

@DVK Using BDB ended up being much slower. It took 1 minute 13 seconds to do 300K keys vs. 12 seconds for the standard Perl hash. However, it also took only a fraction of the memory, but in exchange, it also created a 21MB BDB file. I guess I should have realized that writing to the physical disk would slow BDB down. In the old days, when system memory was measured in tens of megabytes and not gigabytes, using BDB may have actually been faster. And, back before VM, using BDB wouldn't cause your computer to crash due to lack of memory. The result: use a simple Perl hash when possible. – David W. Apr 5 at 22:22.

You might consider calculating a hash code for each line, and keeping track of (hash, position) mappings. You wouldn't need a complicated hash function (or even a large hash) for this; in fact, "smaller" is better than "more unique", if the primary concern is memory usage. Even a CRC, or summing up the chars' codes, might do.

The point isn't to guarantee uniqueness at this stage -- it's just to narrow the candidate matches down from 200 million to a few dozen. For each line, calculate the hash and see if you already have a mapping. If you do, then for each position that maps to that hash, read the line at that position and see if the lines match.

If any of them do, skip that line. If none do, or you don't have any mappings for that hash, remember the (hash, position) and then print the line. Note, I'm saying "position", not "line number".

In order for this to work in less than a year, you'd almost certainly have to be able to seek right to a line rather than finding your way to line #1392499.
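
A sketch of that approach, assuming a cheap checksum such as String::CRC32 from CPAN and a seekable input file (the name input.file is made up):

    #!/usr/bin/env perl
    # Sketch: keep only (checksum => [byte offsets]) in memory and re-read the
    # file to confirm candidate matches. With 200 million mostly unique lines
    # the offset table itself can still grow large, so treat this as an outline.
    use strict;
    use warnings;
    use String::CRC32 qw(crc32);

    open my $in, '<', 'input.file' or die "Cannot open input.file: $!";
    my %offsets_for;    # crc => [ byte offsets of lines that hashed to it ]

    while (1) {
        my $pos  = tell $in;
        my $line = <$in>;
        last unless defined $line;

        my $crc = crc32($line);
        my $dup = 0;
        for my $old_pos ( @{ $offsets_for{$crc} || [] } ) {
            my $here = tell $in;        # remember where we are
            seek $in, $old_pos, 0;      # jump back to the candidate line
            my $old_line = <$in>;
            seek $in, $here, 0;         # and return
            if ( defined $old_line and $old_line eq $line ) {
                $dup = 1;
                last;
            }
        }
        next if $dup;

        push @{ $offsets_for{$crc} }, $pos;
        print $line;
    }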

You could break your file into ten 1GB files. Then read in one file at a time, sort the lines from that file, and write them back out after they are sorted. Open all ten of the files and merge them back into one file (making sure you merge them in the correct order). Then open an output file to save the unique lines.

Then read the merged file one line at a time, keeping the last line for comparison. If the last line and the current line are not a match, write out the last line and save the current line as the last line for comparison. Otherwise, get the next line from the merged file.

That will give you a file which has all of the unique lines. It may take a while to do this, but if you are limited on memory, then breaking the file up and working on parts of it will work. It may be possible to do the comparison when writing out the file, but that would be a bit more complicated.
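
A minimal sketch of that final pass, assuming the merged file is already sorted so duplicates sit next to each other (the names merged.file and unique.file are made up):

    #!/usr/bin/env perl
    # Sketch: walk the sorted, merged file once, printing a line only when it
    # differs from the previous one.
    use strict;
    use warnings;

    open my $in,  '<', 'merged.file' or die "Cannot open merged.file: $!";
    open my $out, '>', 'unique.file' or die "Cannot open unique.file: $!";

    my $last;
    while ( my $line = <$in> ) {
        print {$out} $line if !defined $last or $line ne $last;
        $last = $line;
    }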

Sorting files is O(N log N), while merely finding unique lines is O(N). This is MUCH slower than any other approach. – DVK Apr 5 at 15:24.

POSIX shell: sort | uniq. Done, let's go drink beers.

You aren't supposed to sort them. – tchrist Apr 5 at 13:26

I see no such requirement in the question or comments. – dbenhur Apr 5 at 15:06

Sorting files is O(N log N), while merely finding unique lines is O(N). This is MUCH slower than any other approach. Also, are you sure that sort and uniq won't run into memory constraints? – DVK Apr 5 at 15:25

Unix sort is actually pretty frickin' fast, implemented at a much lower level than the one HLL Perl code executes at. It also automatically handles the limited-memory issue, performing an external merge sort on partitions sized appropriately. @DVK log2(2e8) is ~28. If Unix sort is ~30 times faster in its inner loops than a Perl chomp-and-hash loop, sort wins. It is not uncommon to observe C code run 100+ times as fast as equivalent Perl. – dbenhur Apr 5 at 15:32

@dbenhur - Hm... What you say sounds plausible, but I would not trust it as a mental exercise in the absence of an actual benchmark. – DVK Apr 5 at 15:38.

If you have multiple processors, at least 15GB of free space, and fast enough storage, you could try this out. It will process the file in parallel:

    split --lines=100000 -d -a 4 input.file
    find . -name "x*" -print | xargs -n 1 -P10 -I SPLITTED_FILE sort -u -o unique.SPLITTED_FILE SPLITTED_FILE
    cat unique.* > output.file
    rm unique.* x*

(Note that duplicates which land in different chunks can still appear more than once in output.file; a final merge pass such as sort -m -u unique.* > output.file would remove those as well.)
