Like this: use a list of stopwords, collect all words or phrases that are not in the stopwords, count the occurrences of each, and sort in descending order. The stopword list needs to contain all common English terms. It should also include punctuation, in which case you first need to preg_replace all punctuation into separate words, e.g. "Something, like this." -> "Something , like this . " Or you can just remove all punctuation:
    $content = strtolower($content);
    $content = preg_replace('/[^a-z\s]/', '', $content); // remove punctuation
    $content = preg_split('/\s+/', $content, -1, PREG_SPLIT_NO_EMPTY); // split into words
    $stopwords = 'the|and|is|your|me|for|where|etc...';
    $stopwords = explode('|', $stopwords);
    $stopwords = array_flip($stopwords); // words become keys, so isset() lookups are fast
    $minLength = 3; // assumed cutoff for words too short to keep; adjust as needed
    $result = array();
    $temp = array();
    foreach ($content as $s) {
        if (isset($stopwords[$s]) OR strlen($s) < $minLength) {
            // a stopword (or very short word) ends the current phrase
            if (sizeof($temp) > 0) {
                $result[] = implode(' ', $temp);
                $temp = array();
            }
        } else {
            $temp[] = $s;
        }
    }
    if (sizeof($temp) > 0) $result[] = implode(' ', $temp);
    $phrases = array_count_values($result);
    arsort($phrases);
Now you have an associative array of phrases, ordered by how often each one occurs in your input data.
How you do the matching is up to you, and it depends largely on the length of the strings in the input data. I would check whether any of the top 3 array keys for one item match any of the top 3 keys for any other item in the data. Items that share a key then form your groups.
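The matching idea above could be sketched roughly as follows. This is only an illustration: the function names topKeys and groupByTopKeys and the sample data are made up, and it assumes each item already has a phrase-count array like $phrases.

```php
<?php
// Hypothetical helper: the $n most frequent phrases of one item.
function topKeys(array $phrases, int $n = 3): array
{
    arsort($phrases);
    return array_slice(array_keys($phrases), 0, $n);
}

// Hypothetical helper: collect item ids under each of their top phrases,
// then keep only the phrases that are shared by at least two items.
function groupByTopKeys(array $items, int $n = 3): array
{
    $groups = array();
    foreach ($items as $id => $phrases) {
        foreach (topKeys($phrases, $n) as $key) {
            $groups[$key][] = $id;
        }
    }
    return array_filter($groups, function ($ids) {
        return count($ids) > 1;
    });
}

// Made-up sample data: item id => phrase counts.
$items = array(
    'a' => array('php' => 5, 'clustering' => 3, 'library' => 2, 'search' => 1),
    'b' => array('clustering' => 4, 'kmeans' => 2, 'python' => 1),
    'c' => array('wordnet' => 3, 'php' => 2, 'synonyms' => 1),
);
$groups = groupByTopKeys($items);
print_r($groups);
```

Here items a and c end up in a "php" group and items a and b in a "clustering" group, so one item can belong to several groups.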
Let me know if you have any trouble with this.
I forgot to mention to strtolower() first, though it should be obvious. – Alasdair Nov 15 at 4:07.
"... cluster them into meaningful groups" is a bit too vague; you'll need to be more specific. For starters, you could look into K-Means clustering. Have a look at this page and website: PHP/ir (Information Retrieval and other interesting topics).
EDIT: You could try some data mining yourself by cross-referencing search results with something like the Open Directory (dmoz) RDF data dump and then enumerating the matching categories.
EDIT2: And here is a dmoz/category question that also mentions "Faceted Search"! Dmoz/Monster algorithme to calculate count of each category and sub category?
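For the K-Means suggestion in this answer, a minimal sketch might look like the following. It assumes each search result has already been turned into a numeric vector (for example, term counts over a fixed vocabulary). The names sqDist and kMeans and the sample points are illustrative, and seeding with the first $k points is a simplification chosen to keep the run reproducible.

```php
<?php
// Squared Euclidean distance between two equal-length numeric vectors.
function sqDist(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $v) {
        $sum += ($v - $b[$i]) ** 2;
    }
    return $sum;
}

// Plain k-means; returns the cluster index assigned to each point.
function kMeans(array $points, int $k, int $maxIter = 100): array
{
    $points = array_values($points);
    $centroids = array_slice($points, 0, $k); // assumption: seed with first $k points
    $assign = array();
    for ($iter = 0; $iter < $maxIter; $iter++) {
        // Assignment step: attach every point to its nearest centroid.
        $new = array();
        foreach ($points as $p => $point) {
            $best = 0;
            foreach ($centroids as $c => $centroid) {
                if (sqDist($point, $centroid) < sqDist($point, $centroids[$best])) {
                    $best = $c;
                }
            }
            $new[$p] = $best;
        }
        if ($new === $assign) {
            break; // nothing changed, so the clustering has converged
        }
        $assign = $new;
        // Update step: move each centroid to the mean of its members.
        foreach ($centroids as $c => $centroid) {
            $members = array_keys($assign, $c, true);
            if (count($members) === 0) {
                continue; // leave the centroid of an empty cluster where it is
            }
            foreach (array_keys($centroid) as $d) {
                $sum = 0.0;
                foreach ($members as $p) {
                    $sum += $points[$p][$d];
                }
                $centroids[$c][$d] = $sum / count($members);
            }
        }
    }
    return $assign;
}

// Made-up 2-D points forming two obvious clusters.
$points = array(array(1, 1), array(9, 9), array(1, 2), array(8, 9), array(2, 1), array(9, 8));
$clusters = kMeans($points, 2);
print_r($clusters);
```

Note that k-means only produces unnamed clusters; labelling each one (for instance with its most frequent term) would be a separate step.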
Thanks, I had found that one already … While an interesting read and good example code, it's far from being a library. As for "meaningful groups", this Yippy search (mind what they call "clouds") illustrates what I'm trying to implement pretty well. – vzwick Nov 2 at 11:43
@vzwick: You mean... faceting? – netcoder Nov 8 at 1:50
@vzwick Ah, the example site explains all. The simple answer is no - you won't find a library to automagically do that for you. – zaf Nov 8 at 8:23
Added an edit for an idea to diy. – zaf Nov 8 at 8:30.
If you're doing this for English only, you could use WordNet: wordnet.princeton.edu/. It's a lexicon widely used in research which provides, among other things, sets of synonyms for English words. The shortest distance between two words could then serve as a similarity metric to do clustering yourself as zaf proposed.
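To illustrate the shortest-distance idea, here is a small breadth-first search over a word graph. The graph below is a toy adjacency list invented for this example, not real WordNet data, and shortestDistance is a hypothetical helper name; with WordNet you would build the graph from its synonym and hypernym links instead.

```php
<?php
// Breadth-first search: number of edges on the shortest path between two
// words in the graph, or null if they are not connected at all.
function shortestDistance(array $graph, string $from, string $to): ?int
{
    if ($from === $to) {
        return 0;
    }
    $seen = array($from => true);
    $queue = array(array($from, 0));
    while (count($queue) > 0) {
        list($word, $dist) = array_shift($queue);
        $neighbours = isset($graph[$word]) ? $graph[$word] : array();
        foreach ($neighbours as $next) {
            if ($next === $to) {
                return $dist + 1;
            }
            if (!isset($seen[$next])) {
                $seen[$next] = true;
                $queue[] = array($next, $dist + 1);
            }
        }
    }
    return null;
}

// Toy graph, invented for illustration only.
$graph = array(
    'car'        => array('automobile', 'vehicle'),
    'automobile' => array('car', 'vehicle'),
    'vehicle'    => array('car', 'automobile', 'bicycle'),
    'bicycle'    => array('vehicle'),
    'banana'     => array('fruit'),
    'fruit'      => array('banana'),
);

var_dump(shortestDistance($graph, 'car', 'automobile'));
var_dump(shortestDistance($graph, 'car', 'bicycle'));
var_dump(shortestDistance($graph, 'car', 'banana'));
```

A smaller distance means more closely related words; unconnected pairs (null here) can be treated as maximally distant when clustering.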
Apparently there is a PHP interface to WordNet here: foxsurfer.com/wordnet/. It came up in this question: How to use word Net with php, but I have not tried it. However, interfacing with a command line tool from PHP yourself is feasible as well.
You could also have a look at Programming Collective Intelligence (Chapter 3: Discovering Groups) by Toby Segaran, which goes through just this use case using Python. Still, you should be able to implement the same ideas in PHP once you understand how they work. Even though it is not PHP, the Carrot2 project offers several clustering engines and can be integrated with Solr.
This may be way off, but check out OpenCalais. They have a web service that lets you pass in a block of text and returns a parseable response listing the things it found in the text, such as places, people, facts, etc. You could use these categories to build your "clouds" and to choose which results to display. I've used this library a few times in PHP and it's always been quite easy to work with.
Again, this might not be relevant to what you're trying to do. Maybe you could post an example of what you're trying to accomplish?
If you can pre-define the filters for your faceted search (the named groups), then it will be much easier. Rather than relying on an algorithm that uses the current searcher's input and their particular results to generate the filter list, you would use an aggregate of the most commonly performed searches by all users and then tag results with them if they match. You would end up with a table (or something) of URLs in a many-to-many join to a table of tags, so each result URL could have several appropriate tags.
When the user searches, you simply match their search against the full index. But for the filters, you take the top results from among the current result set. I'll work on query examples if you want.
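One possible reading of the filter step, sketched in PHP with plain arrays standing in for the URL and tag tables: count the tags attached to the URLs in the current result set and offer the most frequent ones as filters. The function name topFilters, the example.com URLs, and the tags are all invented for illustration.

```php
<?php
// Counts tags across the URLs in the current result set and returns the
// $limit most frequent ones as tag => count.
function topFilters(array $resultUrls, array $urlTags, int $limit = 3): array
{
    $counts = array();
    foreach ($resultUrls as $url) {
        $tags = isset($urlTags[$url]) ? $urlTags[$url] : array();
        foreach ($tags as $tag) {
            $counts[$tag] = isset($counts[$tag]) ? $counts[$tag] + 1 : 1;
        }
    }
    arsort($counts);
    return array_slice($counts, 0, $limit, true);
}

// Stand-in for the many-to-many URL/tag join: url => list of tags.
$urlTags = array(
    'http://example.com/a' => array('php', 'clustering'),
    'http://example.com/b' => array('php', 'search'),
    'http://example.com/c' => array('python'),
    'http://example.com/d' => array('php', 'clustering'),
);

// URLs matched by the current search (made up).
$resultUrls = array('http://example.com/a', 'http://example.com/b', 'http://example.com/d');

$filters = topFilters($resultUrls, $urlTags, 2);
print_r($filters);
```

In a real setup the counting would more likely be a GROUP BY over the join table, restricted to the ids of the current results.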