This is actually a really complex question. The first decision you have to make is whether to lemmatize your input tokens (your words). If you do this, you dramatically decrease your type count, and your syntax parsing gets a lot less complicated.
However, it takes a lot of work to lemmatize a token. Now, in a computer language, this task gets greatly reduced, as most languages separate keywords or variable names with a well-defined set of symbols, like whitespace or a period or whatnot. The second crucial decision is what you're going to do with the data after the fact.
The "bag-of-words" method, in the binary form you've presented, ignores word order, which is completely fine if you're doing summarization of a text or maybe a Google-style search where you don't care where the words appear, as long as they appear. If, on the other hand, you're building something like a compiler or parser, order is very much important. You can use the token-vector approach (as in your second paragraph), or you can extend the bag-of-words approach such that each non-zero entry in the bag-of-words vector contains the linear index position of the token in the phrase.
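As a concrete sketch of the representations just described (the vocabulary and phrase are invented for illustration):

```python
vocab = ["the", "cat", "sat", "on", "mat"]
phrase = ["the", "cat", "sat", "on", "the", "mat"]

# Binary bag-of-words: order is discarded, only presence is kept.
bow = [1 if w in phrase else 0 for w in vocab]

# Token-vector: each position holds the vocabulary index of that token,
# so word order is preserved.
token_vec = [vocab.index(w) for w in phrase]

# Positional bag-of-words: each non-zero entry stores the (1-based)
# index of the token's first occurrence in the phrase.
pos_bow = [phrase.index(w) + 1 if w in phrase else 0 for w in vocab]

print(bow)        # [1, 1, 1, 1, 1]
print(token_vec)  # [0, 1, 2, 3, 0, 4]
print(pos_bow)    # [1, 2, 3, 4, 6]
```

Note that the positional variant can only record one position per word; repeated words (like "the" above) need a richer encoding.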
Finally, if you're going to be building parse trees, there are obvious reasons why you'd want to go with the token-vector approach, as it's a big hassle to maintain sub-phrase ids for every word in the bag-of-words vector, but very easy to make "sub-vectors" in a token-vector. In fact, Eric Brill used a token-id sequence for his part-of-speech tagger, which is really neat. Do you mind if I ask what specific task you're working on?
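To illustrate the "sub-vectors" point: with a token-vector, a sub-phrase for a parse tree is just a slice (the token ids below are invented for illustration):

```python
# Each entry is a vocabulary id for one token, in phrase order.
token_vec = [0, 1, 2, 3, 0, 4]

# A hypothetical constituent covering tokens 2-4 is a plain slice;
# a bag-of-words would instead need per-word sub-phrase ids.
sub_phrase = token_vec[1:4]
print(sub_phrase)  # [1, 2, 3]
```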
Thank you for a good start of an answer! : ) I will certainly check out the details of Brill's token-id sequence. About using the BOW representation with an integer to represent the token's linear index: do you really think this would work (give good performance) with an SVM classifier? – Sebastian Ganslandt Feb 26 '09 at 13:24
The specific task is an implementation of Nivre's linear-time, transition-based parsing algorithm together with the maximum entropy classifier of liblinear. – Sebastian Ganslandt Feb 26 '09 at 13:30
@sganslandt: for SVM classifiers, you might think about using n-grams (bigrams, trigrams, etc.) instead of tokens - this preserves local contextual order, but ignores global order. You can then use a regular old bag-of-words and still maintain some context information.
– Mike Feb 26 '09 at 16:24
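A minimal sketch of the n-gram idea from that comment: extract bigrams from a token sequence so local order survives even in a bag-of-words (the tokens are invented for illustration).

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
bigrams = ngrams(tokens, 2)
print(bigrams)
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```

Each bigram can then be treated as a single "word" in an ordinary bag-of-words vector.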
Binarization is the act of transforming colorful features of an entity into vectors of numbers, most often binary vectors, to make good examples for classifier algorithms. I have mostly come across numeric features that take values between 0 and 1 (not binary as you describe), representing the relevance of the particular feature in the vector (between 0% and 100%, where 1 represents 100%). A common example of this is tf-idf vectors: in the vector representing a document (or sentence), you have a value for each term in the entire vocabulary that indicates the relevance of that term for the represented document. As Mike already said in his reply, this is a complex problem in a wide field.
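A hedged sketch of the tf-idf weighting just mentioned, using the standard tf × log(N/df) formulation over a toy corpus (the documents are invented for illustration):

```python
import math

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]

def tf_idf(term, doc, corpus):
    # Term frequency: share of the document's tokens that are this term.
    tf = doc.count(term) / len(doc)
    # Document frequency: how many documents contain the term at all.
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

# "the" appears in every document, so its idf (and tf-idf) is zero;
# rarer terms like "cat" get a positive weight.
print(tf_idf("the", docs[0], docs))              # 0.0
print(round(tf_idf("cat", docs[0], docs), 3))
```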
In addition to his pointers, you might find it useful to look into some information retrieval techniques like the vector space model, vector space classification and latent semantic indexing as starting points. Also, the field of word sense disambiguation deals a lot with feature representation issues in NLP.
Not a direct answer, but it all depends on what you are trying to parse and then process. For general short human-phrase processing (e.g. IVT), another method is to use neural networks to learn the patterns. This can be very accurate for smallish vocabularies.
Binarization is used in image segmentation: converting a colour image to a binary image that is just black and white. The pixels are mapped to 0s and 1s according to which count as foreground and which as background, producing a black-and-white image. But I'm still confused by Otsu's method. I'm doing a small research project on it, but it seems confusing to me.
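For what it's worth, here is a minimal sketch of Otsu's method on a toy 8-level grayscale image: try every threshold and keep the one that maximizes the between-class variance (equivalently, minimizes the within-class variance). The pixel values are invented for illustration.

```python
pixels = [0, 0, 1, 1, 1, 6, 7, 7]

def otsu_threshold(pixels):
    best_t, best_var = 0, -1.0
    for t in range(1, max(pixels) + 1):
        bg = [p for p in pixels if p < t]   # background class
        fg = [p for p in pixels if p >= t]  # foreground class
        if not bg or not fg:
            continue
        # Class weights and means.
        w_bg, w_fg = len(bg) / len(pixels), len(fg) / len(pixels)
        mu_bg, mu_fg = sum(bg) / len(bg), sum(fg) / len(fg)
        # Between-class variance; Otsu picks the t that maximizes this.
        between_var = w_bg * w_fg * (mu_bg - mu_fg) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t

t = otsu_threshold(pixels)
binary = [1 if p >= t else 0 for p in pixels]
print(t, binary)  # 2 [0, 0, 0, 0, 0, 1, 1, 1]
```

Real implementations work on a 256-bin histogram instead of raw pixel lists, but the variance criterion is the same.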