Is POS tagging deterministic?

My best effort to understand uncovered this from someone not using the whole Brown corpus: note that words the tagger has not seen before, such as decried, receive a tag of None. So, I guess something that looks like ae1.111 must appear in the corpus file, but nothing like ae0.842. That's kind of weird, but that's the reasoning for giving the None tag.
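A toy sketch of the behavior described above (this is not NLTK's actual implementation, and the mini-corpus and tags below are made up for illustration): a unigram-style lookup tagger built from a tagged corpus returns None for any token it never saw during training.

```python
# Hypothetical tagged training data; tags follow the Brown convention
# (AT = article, NN = noun, CD = cardinal number).
tagged_corpus = [("the", "AT"), ("number", "NN"), ("ae1.111", "CD")]
model = dict(tagged_corpus)

def tag(words):
    # dict.get returns None for unseen words, which is exactly the
    # "tag of None" the answer describes.
    return [(w, model.get(w)) for w in words]

print(tag(["the", "ae1.111", "ae0.842"]))
# ae1.111 was in the training text; ae0.842 was not, so it gets None.
```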

Edit: I got super-curious, downloaded the Brown corpus myself, and plain-text-searched inside it. The number 111 appears in it 34 times, and the number 842 only 4 times. 842 appears only in the middle of dollar amounts or as the last 3 digits of a year, while 111 appears many times on its own as a page number. 775 also appears once as a page number. So, I'm going to make a conjecture: because of Benford's Law, you will end up matching numbers that start with 1s, 2s, and 3s much more often than numbers that start with 8s or 9s, since these are more often the page numbers of a random page that would be cited in a book.
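The conjecture can be sketched numerically. The leading-digit probability log10(1 + 1/d) is the standard statement of Benford's Law; the list of page numbers below is invented purely for illustration, not taken from the Brown corpus.

```python
import math
from collections import Counter

def benford_prob(d):
    # Benford's Law: P(leading digit = d) = log10(1 + 1/d)
    return math.log10(1 + 1 / d)

# Hypothetical sample of cited page numbers (made up for illustration).
pages = [111, 775, 12, 34, 204, 98, 150, 301, 23, 17]
counts = Counter(str(p)[0] for p in pages)

for d in range(1, 10):
    observed = counts.get(str(d), 0) / len(pages)
    print(f"digit {d}: observed {observed:.2f}, Benford {benford_prob(d):.3f}")
```

Under Benford's Law, about 30% of numbers start with 1 but only about 5% start with 8, which is consistent with 111 appearing far more often than 842.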

I'd be really interested in finding out if that's true (but not interested enough to do it myself, of course!).

+1 for the Benford's Law. And by the way, you are right: they are indeed page numbers. In my experiment, any number > 790 is given the None tag, and the others are given the NN tag.

:) Awesome catch! – Legend Jun 30 at 22:33

Haha. That makes my day; thanks for the followup comment! – Chris Cunningham Jul 1 at 17:48

It is "deterministic" in the sense that the same sentence is going to be tagged the same way using the same algorithm every time, but since your words aren't in nltk's data (in fact, aren't even real words in real sentences) it's going to use some algorithm to try to infer what the tags would be. That is going to mean that you can have different taggings when the words change (even if the change is a different number like you have) and that the taggings aren't going to make much sense anyway. Which makes me wonder why you're trying to use NLP for non-natural language constructs.

I love this response. It /is/ very questionable why NLTK is being used here. However, if it really is required, then a possible hack would be to define a custom corpus with all possible "words" and a custom chunker.

But this seems like way too much work for something that should (and could) probably be done a lot simpler. – inspectorG4dget Jun 30 at 22:02

+1 for a nice explanation. I am trying out a few exploratory techniques to figure out the best way to transform my data into some intermediate format that will aid in template extraction. This seems to work pretty well for my case (maybe not for every case).

– Legend Jun 30 at 22:19

Strictly speaking, NLTK's pos_tag is not required to be deterministic at all. Some machine learning algorithms flip coins internally. – larsmans Jul 3 at 15:55
