Java Open Source Text Mining Frameworks [closed]

Although not a specialized text mining framework, Weka has a number of classifiers commonly used in text mining tasks, such as SVM, kNN and multinomial Naive Bayes, among others. It also has a few filters for working with textual data, like the StringToWordVector filter, which can perform a TF/IDF transformation. Check out the Weka wiki website for more information.
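To make the TF/IDF step concrete, here is a minimal plain-Java sketch of the weighting itself. It does not use Weka: the class TfIdfSketch and its methods are hypothetical names, the tokenizer is a naive assumption, and the formula (raw term count times ln(N/df)) is only one common variant, so Weka's filter may weight terms differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Weka: TF-IDF over naively tokenized documents. */
final class TfIdfSketch {

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns one term-to-weight map per document, with weight = tf * ln(N / df). */
    static List<Map<String, Double>> tfIdf(List<String> documents) {
        List<Map<String, Integer>> termCounts = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : documents) {
            Map<String, Integer> tf = new HashMap<>();
            for (String token : tokenize(doc)) {
                tf.merge(token, 1, Integer::sum);
            }
            termCounts.add(tf);
            // Each distinct term counts once per document for the document frequency.
            for (String term : tf.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        int n = documents.size();
        List<Map<String, Double>> weights = new ArrayList<>();
        for (Map<String, Integer> tf : termCounts) {
            Map<String, Double> w = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / docFreq.get(e.getKey()));
                w.put(e.getKey(), e.getValue() * idf);
            }
            weights.add(w);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat", "the dog sat down", "the cat ran");
        List<Map<String, Double>> weights = tfIdf(docs);
        for (int i = 0; i < docs.size(); i++) {
            System.out.println(docs.get(i) + " -> " + weights.get(i));
        }
    }
}
```

With this variant, a term that occurs in every document (here "the") gets weight 0, while terms confined to a single document get the highest weight.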

The problem is that I need to perform Named Entity Recognition (NER), and Weka does not provide a way to extract features from words, such as orthographic and morphological characteristics. But it would be nice if I could use Weka's methods for IR. – David Campos Feb 20 '10 at 18:59

I think the Wikipedia page on the topic has a few links to some packages for NER. Also, I just came across the UIMA project by Apache; perhaps you'll find it useful: incubator.apache.org/uima/index.html – Amro Feb 20 '10 at 20:16

Yes, I know UIMA. But it does not provide ML methods. It is a perfect solution for systems that do NER with dictionary-based approaches. I don't know how to integrate ML methods into UIMA. – David Campos Feb 20 '10 at 20:24
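For readers wondering what "orthographic and morphological characteristics" of a word look like in practice, here is a rough plain-Java sketch. It belongs to neither Weka nor UIMA; the class TokenFeatures and all feature names are hypothetical and chosen purely for illustration, and a real NER system would typically use many more features (context windows, prefixes, lexicon lookups).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical feature extractor; the feature names are illustrative only. */
final class TokenFeatures {

    /** Extracts a few simple orthographic and morphological features from one token. */
    static Map<String, String> extract(String token) {
        Map<String, String> f = new LinkedHashMap<>();
        String lower = token.toLowerCase();
        f.put("lower", lower);
        f.put("initCap", String.valueOf(!token.isEmpty() && Character.isUpperCase(token.charAt(0))));
        f.put("allCaps", String.valueOf(isAllCaps(token)));
        f.put("hasDigit", String.valueOf(token.chars().anyMatch(Character::isDigit)));
        f.put("hasHyphen", String.valueOf(token.indexOf('-') >= 0));
        f.put("shape", shape(token));
        // Crude morphological cue: the last three characters (or the whole token if shorter).
        f.put("suffix3", lower.length() > 3 ? lower.substring(lower.length() - 3) : lower);
        return f;
    }

    /** True if the token contains at least one letter and every letter is upper case. */
    static boolean isAllCaps(String token) {
        boolean sawLetter = false;
        for (char c : token.toCharArray()) {
            if (Character.isLetter(c)) {
                if (!Character.isUpperCase(c)) {
                    return false;
                }
                sawLetter = true;
            }
        }
        return sawLetter;
    }

    /** Word shape: upper -> 'A', lower -> 'a', digit -> '0', others kept; repeated symbols collapsed. */
    static String shape(String token) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            char s = Character.isUpperCase(c) ? 'A'
                    : Character.isLowerCase(c) ? 'a'
                    : Character.isDigit(c) ? '0' : c;
            if (sb.length() == 0 || sb.charAt(sb.length() - 1) != s) {
                sb.append(s);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract("IL-2"));
        System.out.println(extract("Obama"));
    }
}
```

Each token then becomes a small bag of name/value pairs that a classifier can consume once they are turned into numbers.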

I honestly think that the several answers presented here are very good. However, to fulfill my requirements I have chosen to use Apache UIMA with ClearTK. It supports several ML methods and I do not have any licensing problems.

Plus, I can write wrappers for other ML toolkits, and I get the benefits of the UIMA framework, which is very well organized and fast. Thank you all for your interesting answers. Best regards, ukrania.

Maybe have a look at Java Open Source NLP and Text Mining tools.

I've already seen this web site; it is really nice, thanks. But I was asking for feedback based on your experience. I've already tried some of them, but I don't know which one is the best, or even whether I have to use one, two or maybe more frameworks to accomplish my task. – David Campos Feb 20 '10 at 19:23

@ukrania Sorry, I'm not the right person then. Good luck. – Pascal Thivent Feb 20 '10 at 19:35

I've used LingPipe -- a suite of Java libraries for the linguistic analysis of human language -- for text mining (and other related) tasks. It is a very well documented software package, and the site contains several tutorials which thoroughly explain how to do a certain task with LingPipe, such as named entity recognition. There is also a newsgroup, wherein you can post any question you have about the software (or NLP related tasks), and have a prompt reply from the authors of the package themselves; and of course, a blog.

The source code is also very easy to follow and well documented which, for me, is always a big plus. As for machine learning algorithms, there are plenty, from Naïve Bayes to Conditional Random Fields. On the other hand, for dictionary matching, they have an ExactDictionaryChunker, which is an implementation of the Aho-Corasick algorithm (a very, very fast algorithm for this task).

In sum, I think it is one of the best NLP software packages for Java (I haven't used every single package that is out there, so I can't say it's the best), and I definitely recommend it for the task that you have at hand.
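As a side note on the Aho-Corasick algorithm mentioned in this answer, here is a minimal plain-Java sketch of the idea: build a trie of dictionary entries, add failure links, then scan the text once. This is not LingPipe's implementation or API; the class DictionaryMatcher and the "start-end:entry" output format are hypothetical, and matching is done on raw characters with no tokenization or case folding.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical, minimal Aho-Corasick matcher; not LingPipe's implementation. */
final class DictionaryMatcher {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>(); // entries that end at this node
        Node fail;                                      // longest proper suffix that is also in the trie
    }

    private final Node root = new Node();

    DictionaryMatcher(List<String> entries) {
        // 1. Build the trie of dictionary entries.
        for (String entry : entries) {
            if (entry.isEmpty()) {
                continue;
            }
            Node node = root;
            for (char c : entry.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(entry);
        }
        // 2. Breadth-first pass: set failure links and inherit outputs from them.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    /** Returns "start-end:entry" for every occurrence in the text (end index exclusive). */
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Follow failure links until some node accepts c, or we are back at the root.
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            Node target = node.next.get(c);
            node = (target != null) ? target : root;
            for (String entry : node.outputs) {
                hits.add((i + 1 - entry.length()) + "-" + (i + 1) + ":" + entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        DictionaryMatcher matcher = new DictionaryMatcher(Arrays.asList("he", "she", "his", "hers"));
        System.out.println(matcher.findAll("ushers"));
    }
}
```

For the classic example above, the matcher reports "she", "he" and "hers" in a single pass over "ushers", including the overlapping matches. That single pass, independent of the number of dictionary entries, is why the approach is so fast for dictionary-based NER.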

@JG Thanks for your advice :). I'm building my system for research. Would I have to pay anything, even if I make a commercial tool? What are the limitations? – David Campos Feb 20 '10 at 20:52

You may already know about GATE (gate.ac.uk/), but that's what we've used (at my day job) for lots of different text mining problems. It's pretty flexible and open.

@PSpeed Yes, I already know it. GATE is very similar to UIMA. Actually, GATE was the first one to emerge. However, I don't know if it is possible to use ML methods with GATE. Do you know something about that? – David Campos Feb 21 '10 at 21:46

I think GATE is more flexible too... we found UIMA to be very confining. I don't have specific experience with ML, but it just seemed like if someone was working on it then GATE would be a likely platform. It's where I might start if I were writing something like that... but I haven't searched for any specific projects. – PSpeed Feb 21 '10 at 22:53

Looks like there has been at least some work on ML in GATE: gate.ac.uk/gate/doc/plugins.html#Machine_Learning – PSpeed Feb 21 '10 at 22:55

I built a maximum entropy named entity recognizer for CoNLL data using OpenNLP MaxEnt (sourceforge.net/projects/maxent/) for a course once. It required a lot of data preprocessing with custom Perl scripts to get all the features extracted into nice, neat numerical vectors, though.
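The preprocessing described here, turning symbolic features into numerical vectors, can also be done in Java rather than Perl. The sketch below is one simple way to do it, under the assumption that each token is described by strings such as "suffix3=ama"; the class FeatureIndexer and those feature strings are hypothetical, and this is not the format or API of OpenNLP MaxEnt.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: maps string features such as "suffix3=ama" to stable integer indices. */
final class FeatureIndexer {

    private final Map<String, Integer> index = new LinkedHashMap<>();

    /**
     * Encodes a token's features as a sparse binary vector: the sorted, distinct
     * indices of the features that are present. Unseen features get the next free index.
     */
    int[] encode(List<String> features) {
        int[] ids = new int[features.size()];
        for (int i = 0; i < ids.length; i++) {
            String feature = features.get(i);
            Integer id = index.get(feature);
            if (id == null) {
                id = index.size();
                index.put(feature, id);
            }
            ids[i] = id;
        }
        return Arrays.stream(ids).distinct().sorted().toArray();
    }

    /** Number of distinct features seen so far, i.e. the dimensionality of the vectors. */
    int size() {
        return index.size();
    }

    public static void main(String[] args) {
        FeatureIndexer indexer = new FeatureIndexer();
        System.out.println(Arrays.toString(indexer.encode(Arrays.asList("initCap=true", "suffix3=ama"))));
        System.out.println(Arrays.toString(indexer.encode(Arrays.asList("suffix3=ama", "allCaps=true"))));
    }
}
```

The same indexer instance has to be reused for training and test data so that a given feature always maps to the same vector position.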

We use Lucene to process live streams from the internet. It has a native Java API: lucene.apache.org/java/docs/ You can then use Mahout, which is a collection of machine learning algorithms that operate on top of Lucene: lucene.apache.org/mahout

