Apache POI Java Excel Performance for Large Spreadsheets?

I have also done processing on thousands of large Excel files, and in my experience POI is very fast. Loading those Excel files took about a minute in Excel itself as well, so I would agree that the problem lies outside the POI code.


I have a spreadsheet I'm trying to read with POI (I have both xls and xlsx formats), but in this case the problem is with the xls file. My spreadsheet has about 10,000 rows and 75 columns, and reading it in can take several minutes (though Excel opens the file in a few seconds). I'm using event-based reading rather than reading the whole file into memory.

The meat of my code is below. It's a bit messy right now, but it's really just a long switch statement that was mostly copied from the POI examples. Is it typical for POI performance using the event model to be so slow?

Is there anything I can do to speed this up? I think several minutes will be unacceptable for my application.

```java
POIFSFileSystem poifs = new POIFSFileSystem(fis);
InputStream din = poifs.createDocumentInputStream("Workbook");
try {
    HSSFRequest req = new HSSFRequest();
    listener = new FormatTrackingHSSFListener(new HSSFListener() {
        @Override
        public void processRecord(Record rec) {
            thisString = null;
            int sid = rec.getSid();
            switch (sid) {
            case SSTRecord.sid:
                strTable = (SSTRecord) rec;
                break;
            case LabelSSTRecord.sid:
                LabelSSTRecord labelSstRec = (LabelSSTRecord) rec;
                thisString = strTable.getString(labelSstRec.getSSTIndex()).getString();
                row = labelSstRec.getRow();
                col = labelSstRec.getColumn();
                break;
            case RKRecord.sid:
                RKRecord rrk = (RKRecord) rec;
                thisString = "";
                row = rrk.getRow();
                col = rrk.getColumn();
                break;
            case LabelRecord.sid:
                LabelRecord lrec = (LabelRecord) rec;
                thisString = lrec.getValue();
                row = lrec.getRow();
                col = lrec.getColumn();
                break;
            case BlankRecord.sid:
                BlankRecord blrec = (BlankRecord) rec;
                thisString = "";
                row = blrec.getRow();
                col = blrec.getColumn();
                break;
            case BoolErrRecord.sid:
                BoolErrRecord berec = (BoolErrRecord) rec;
                row = berec.getRow();
                col = berec.getColumn();
                byte errVal = berec.getErrorValue();
                thisString = errVal == 0
                        ? Boolean.toString(berec.getBooleanValue())
                        : ErrorConstants.getText(errVal);
                break;
            case FormulaRecord.sid:
                FormulaRecord frec = (FormulaRecord) rec;
                switch (frec.getCachedResultType()) {
                case Cell.CELL_TYPE_NUMERIC:
                    double num = frec.getValue();
                    if (Double.isNaN(num)) {
                        // Formula result is a string; it is stored in the next record
                        outputNextStringRecord = true;
                    } else {
                        thisString = formatNumericValue(frec, num);
                    }
                    break;
                case Cell.CELL_TYPE_BOOLEAN:
                    thisString = Boolean.toString(frec.getCachedBooleanValue());
                    break;
                case Cell.CELL_TYPE_ERROR:
                    thisString = HSSFErrorConstants.getText(frec.getCachedErrorValue());
                    break;
                case Cell.CELL_TYPE_STRING:
                    outputNextStringRecord = true;
                    break;
                }
                row = frec.getRow();
                col = frec.getColumn();
                break;
            case StringRecord.sid:
                if (outputNextStringRecord) {
                    // String for formula
                    StringRecord srec = (StringRecord) rec;
                    thisString = srec.getString();
                    outputNextStringRecord = false;
                }
                break;
            case NumberRecord.sid:
                NumberRecord numRec = (NumberRecord) rec;
                row = numRec.getRow();
                col = numRec.getColumn();
                thisString = formatNumericValue(numRec, numRec.getValue());
                break;
            case NoteRecord.sid:
                NoteRecord noteRec = (NoteRecord) rec;
                row = noteRec.getRow();
                col = noteRec.getColumn();
                thisString = "";
                break;
            case EOFRecord.sid:
                inSheet = false;
            }
            if (thisString != null) {
                // do something with the cell value
            }
        }
    });
    req.addListenerForAllRecords(listener);
    HSSFEventFactory factory = new HSSFEventFactory();
    factory.processEvents(req, din);
```

Tags: java, performance, excel, apache-poi. Asked May 13 '11 at 13:23 by Jeff Storey; edited Aug 4 '11 at 20:25 by Danilo Piazzalunga.

For me it is the opposite: writing a file with many rows uses only a few megabytes of heap for xls, but even 2 GB wasn't enough for xlsx. I need to fall back to xls if the output has more than X rows – rlovtang May 13 '11 at 13:45

@rlovtang, xlsx is even slower. I only meant that I am using POI rather than something like JExcel because I need to handle both xls and xlsx.

– Jeff Storey May 13 '11 at 13:46.

See my answer, you are correct. – Jeff Storey May 13 '11 at 14:33.

I did some more detailed profiling, and it looks like the problem is actually in code outside of POI. I had assumed POI was the bottleneck, but that was incorrect.
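One way to check this kind of thing (a sketch; the file handling mirrors the question's code, but the no-op listener pass is my own idea, and the class and variable names are made up): time the event pass with an empty listener. That measures POI's raw record-reading cost alone, so anything beyond it in the full run is your own processing.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.poi.hssf.eventusermodel.HSSFEventFactory;
import org.apache.poi.hssf.eventusermodel.HSSFListener;
import org.apache.poi.hssf.eventusermodel.HSSFRequest;
import org.apache.poi.hssf.record.Record;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class TimeEventPass {
    public static void main(String[] args) throws Exception {
        POIFSFileSystem poifs = new POIFSFileSystem(new FileInputStream(args[0]));
        InputStream din = poifs.createDocumentInputStream("Workbook");

        HSSFRequest req = new HSSFRequest();
        // No-op listener: every record is delivered but nothing is done with it,
        // so this pass measures only POI's parsing/dispatch time
        req.addListenerForAllRecords(new HSSFListener() {
            public void processRecord(Record rec) { }
        });

        long start = System.nanoTime();
        new HSSFEventFactory().processEvents(req, din);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("POI-only event pass: " + elapsedMs + " ms");

        din.close();
    }
}
```

If this pass finishes in seconds while the full run takes minutes, the time is going into the listener body, not POI.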

– David M Kershaw Jan 18 at 13:12

It was my own code. I was mistakenly doing some very inefficient parsing of text fields to check whether they contained date values (for cases where users had entered text dates into cells that did not have a date format). – Jeff Storey Jan 18 at 23:45.
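For reference, a cheaper way to probe text cells for dates (a sketch with hypothetical names; the thread does not show the original parsing code) is to pre-screen with a regex and reuse a single parser instance, instead of attempting a full parse, and catching the exception, on every cell:

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.regex.Pattern;

public class DateProbe {
    // Reuse one compiled pattern and one format instance across all cells,
    // rather than constructing them per call
    private static final Pattern LOOKS_LIKE_DATE =
            Pattern.compile("\\d{1,2}/\\d{1,2}/\\d{2,4}");
    private static final SimpleDateFormat FMT = new SimpleDateFormat("MM/dd/yyyy");

    /** Returns the parsed date, or null if the text does not look like a date. */
    public static Date tryParseDate(String text) {
        if (text == null || !LOOKS_LIKE_DATE.matcher(text).matches()) {
            return null; // cheap rejection: no exception thrown, no parse attempted
        }
        // ParsePosition-based parse returns null on failure instead of throwing
        return FMT.parse(text, new ParsePosition(0));
    }

    public static void main(String[] args) {
        System.out.println(tryParseDate("05/13/2011"));
        System.out.println(tryParseDate("hello"));
    }
}
```

The regex rejects the vast majority of non-date text immediately, which avoids the heavy cost of throwing and catching `ParseException` for every ordinary string cell.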

I would also try the streaming SXSSF API introduced in POI 3.8-beta3. It helped with memory issues on large spreadsheets with 1000+ columns.

