Python or Java for text processing (text mining, information retrieval, natural language processing)?

Both are good. Java has a lot of momentum in text processing: Stanford's text processing system, OpenNLP, UIMA, and GATE seem to be the big players (I know I am missing some).

You can run the StanfordNLP module on a large corpus after a few minutes of playing with it. However, it has major memory requirements (3 GB or so when I was using it).

NLTK, Gensim, Pattern, and many other Python modules are very good at text processing. Their memory usage and performance are very reasonable. Python scales up because text processing is a very easily scalable problem: documents can be handled independently of each other, so you can use multiprocessing very easily when parsing/tagging/chunking/extracting documents.
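To illustrate the multiprocessing point, here is a minimal sketch using only the standard library. The token counter is a stand-in for whatever real parsing/tagging/chunking step you would run per document; the names count_tokens, merge_counts, and count_corpus are made up for this example and do not come from NLTK or any other library.

```python
import re
from collections import Counter
from multiprocessing import Pool

# Very rough tokenizer; a real pipeline would call a tagger or parser here.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def count_tokens(text):
    """Process ONE document: lowercase it and count its tokens."""
    return Counter(TOKEN_RE.findall(text.lower()))


def merge_counts(counters):
    """Combine the per-document results into one corpus-wide Counter."""
    total = Counter()
    for counter in counters:
        total.update(counter)
    return total


def count_corpus(docs, workers=2):
    """Fan the documents out to worker processes, then merge the results.

    Each document is independent of the others, which is what makes
    Pool.map a good fit.
    """
    with Pool(processes=workers) as pool:
        per_doc = pool.map(count_tokens, docs, chunksize=64)
    return merge_counts(per_doc)
```

In a script you would call something like count_corpus(list_of_texts, workers=4) from under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that start workers by spawning a fresh interpreter. Whether the extra processes pay off depends on how expensive the per-document step is, so it is worth timing both the serial and the parallel version.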

Once you get your text into any sort of feature vector, you can use NumPy arrays, and we all know how great NumPy is. I learned with NLTK, and Python has helped me greatly in reducing development time, so I suggest you give that a shot first. They have a very helpful mailing list as well, which I suggest you join. If you have custom scripts, you might want to check how well they perform with PyPy.
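As a rough idea of what "text into a feature vector" can look like, here is a small bag-of-words sketch with NumPy. It assumes NumPy is installed; the helper names (build_vocabulary, vectorize, cosine_similarity) are invented for the example, and real projects would more likely lean on a library for this step.

```python
import re

import numpy as np

TOKEN_RE = re.compile(r"[a-z0-9']+")


def build_vocabulary(docs):
    """Map every distinct token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in TOKEN_RE.findall(doc.lower())})
    return {tok: col for col, tok in enumerate(tokens)}


def vectorize(docs, vocab):
    """Return a (documents x vocabulary) matrix of raw token counts."""
    matrix = np.zeros((len(docs), len(vocab)), dtype=np.float64)
    for row, doc in enumerate(docs):
        for tok in TOKEN_RE.findall(doc.lower()):
            col = vocab.get(tok)
            if col is not None:  # ignore tokens outside the vocabulary
                matrix[row, col] += 1.0
    return matrix


def cosine_similarity(matrix):
    """Pairwise cosine similarity between the rows of the matrix."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for empty documents
    unit = matrix / norms
    return unit @ unit.T
```

Once the documents are rows of an array, comparisons like the similarity above become a couple of vectorized operations instead of nested Python loops.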

I think this is a good overview of packages used for the NLP side of a project. Another thing to consider is the machine learning side, though I am only familiar with the Java libraries: WEKA, MALLET, and Apache Mahout.

– Thien Jun 11 at 21:44.

It's very difficult to answer questions like this without trying. So why don't you:

1. Figure out what would be a difficult operation.
2. Implement that (and I mean the simplest, quickest hack that you can make work).
3. Run it with a lot of data, and see how long it takes.
4. Figure out if it's too slow.

I've done this in the past, and it's really the way to see whether something performs well enough.
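The try-it-and-measure approach above could be sketched like this. The operation being timed (naive_word_count) and the synthetic corpus are placeholders; swap in your own quick hack and real data. time_operation is a made-up helper name, not part of any library.

```python
import time


def time_operation(operation, documents, repeats=3):
    """Run `operation` over every document and return the best wall-clock
    time (in seconds) out of `repeats` full passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in documents:
            operation(doc)
        best = min(best, time.perf_counter() - start)
    return best


def naive_word_count(text):
    """Placeholder for the 'difficult operation' you actually care about."""
    return len(text.split())


if __name__ == "__main__":
    # Synthetic data; use a realistic sample of your own corpus instead.
    corpus = ["the quick brown fox jumps over the lazy dog " * 200] * 2000
    elapsed = time_operation(naive_word_count, corpus)
    print(f"{len(corpus)} documents in {elapsed:.3f} s "
          f"({len(corpus) / elapsed:.0f} docs/s)")
```

Taking the best of a few passes reduces noise from whatever else the machine is doing. If the number you get is comfortably within what you need, the language question is probably settled for that operation.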

Just write it. The biggest flaw programmers have is premature optimization. Work on a project, write it out, and get it working. Then go back, fix the bugs, and make sure it is optimized.

A number of people will harp on about the speed of x vs. y and how y is better than x, but at the end of the day it's just a language. It's not what a language is but how it does it.

It's not the language you have to evaluate, but the frameworks and app servers for clustering, data storage/retrieval, etc. that are available for the language. You can use Jython to get all the Java enterprise technologies for a high-load system and still do the text parsing in Python.

I have never used Jython. I read that it is slower than Python. But I guess it might be made faster than Python by converting critical parts of the code into Java? Is that correct? – kga May 17 at 12:57.

@user757256: yes, Jython is slightly slower and much more memory-hungry. Hack together a prototype, benchmark it with CPython and Jython, then see what you can optimize. And again, make your choice only after evaluating which libraries you can use. – Denis Tulskiy May 17 at 13:28.
