Most likely, the file does not have line terminators, so the reader just keeps growing its StringBuffer without bound until it runs out of memory. The solution would be to read a fixed number of bytes at a time, using the 'read' method of the reader, and then look for newlines (or other parsing tokens) within the smaller buffer(s).
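As a rough illustration of the fixed-size-read idea, here is a minimal sketch. The class name ChunkedLineCounter, the method countLines, and the 8192-character buffer size are assumptions made for this example and do not come from the original question. It only counts lines; real per-line processing would replace the counting logic.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: scans a stream in fixed-size chunks so memory use
// stays bounded by the buffer size, however long a single "line" is.
class ChunkedLineCounter {

    // Counts '\n'-terminated lines, plus a trailing unterminated line if any.
    // A lone '\r' is not treated as a terminator in this sketch.
    static long countLines(Reader in, int bufferSize) throws IOException {
        char[] buf = new char[bufferSize];
        long lines = 0;
        boolean pendingText = false; // chars seen since the last '\n'?
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            // Look for newlines inside the small buffer only.
            for (int i = 0; i < n; i++) {
                if (buf[i] == '\n') {
                    lines++;
                    pendingText = false;
                } else {
                    pendingText = true;
                }
            }
        }
        if (pendingText) {
            lines++; // last line had no terminator
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ChunkedLineCounter <file>");
            return;
        }
        try (Reader in = new InputStreamReader(
                new FileInputStream(args[0]), StandardCharsets.UTF_8)) {
            System.out.println(countLines(in, 8192) + " lines");
        }
    }
}
```

Because nothing is accumulated between reads, a file with no line terminators at all is still processed in constant memory.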
Absolutely correct – gd1 May 4 '11 at 22:40
This would probably be a good place for the NIO package; he'd need all of the performance he can get to process 40GB or so of text data. – Dataknife May 4 '11 at 23:36
It makes a lot of sense now, because I noticed that no matter what max heap size I set, the final output file size is always the same. So I suspect that there's one line somewhere that causes the trouble. I'm now checking it. Thanks a lot! – user431336 May 4 '11 at 23:42
@user431336: Also, don't forget to close your PrintStream... your example leaves it open when you terminate the method. – Dataknife May 4 '11 at 23:46
@Dataknife the PrintStream? I do close it once the loop terminates. – user431336 May 5 '11 at 2:44
Somewhere in the file this is probably the problem. Thanks a lot. – user431336 May 4 '11 at 23:43.
I have 3 theories:
1. The input file is not UTF-8 but some indeterminate binary format that results in extremely long lines when read as UTF-8.
2. The file contains some extremely long "lines" ... or no line breaks at all.
3. Something else is happening in code that you are not showing us; e.g. you are adding new elements to set.
To help diagnose this:
- Use some tool like od (on UNIX / LINUX) to confirm that the input file really contains valid line terminators; i.e. CR, NL, or CR NL.
- Use some tool to check that the file is valid UTF-8.
- Add a static line counter to your code, and when the application blows up with an OOME, print out the value of the line counter.
- Keep track of the longest line seen so far, and print that out as well when you get an OOME.
For the record, your slightly suboptimal use of trim will have no bearing on this issue.
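The line counter and longest-line tracking suggested above could look roughly like the sketch below. The class name LineStats and its fields are hypothetical, and the comment inside the loop stands in for the original processing code, which is not shown here. Catching OutOfMemoryError is done only to print the diagnostics before rethrowing; the print itself may fail if the heap is completely exhausted.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic wrapper around a readLine() loop.
class LineStats {
    static long lineCount = 0;   // lines read successfully so far
    static int longestLine = 0;  // length in chars of the longest line so far

    static void process(BufferedReader in) throws IOException {
        String line;
        try {
            while ((line = in.readLine()) != null) {
                lineCount++;
                if (line.length() > longestLine) {
                    longestLine = line.length();
                }
                // ... the original per-line processing would go here ...
            }
        } catch (OutOfMemoryError e) {
            // Report how far we got, then let the error propagate.
            System.err.println("OOME after " + lineCount
                    + " lines; longest line so far: " + longestLine + " chars");
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java LineStats <file>");
            return;
        }
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(args[0]), StandardCharsets.UTF_8))) {
            process(in);
        }
        System.out.println(lineCount + " lines, longest " + longestLine + " chars");
    }
}
```

If the error is reported after only a handful of lines, or the counter stops at the same value on every run, that points to one oversized "line" rather than a gradual leak.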
Thanks a lot for this great answer and excellent suggestions! – user431336 May 4 '11 at 23:50.
One possibility is that you are running out of heap space during a garbage collection. The HotSpot JVM uses a parallel collector by default, which means that your application can possibly allocate objects faster than the collector can reclaim them. I have been able to cause an OutOfMemoryError with supposedly only 10K live (small) objects, by rapidly allocating and discarding them.
You can try instead using the old (pre-1.5) serial collector with the option -XX:+UseSerialGC. There are several other "extended" options that you can use to tune collection.
You might want to try moving the String fields declaration out of the loop, since you are creating a new array on every iteration. You can just reuse the old one, right?
He's not creating anything. He's declaring a variable that holds a reference to an array of String objects (returned by split()). Since its required scope is only in the loop, it's perfectly fine to declare it there. – Brian Roach May 4 '11 at 22:42
The String is a local variable within the scope of the loop and any allocated memory for the array will be garbage collected by the JVM. – bstick12 May 4 '11 at 22:45