I think this is more of a clustering than a classification problem. Your entities are data points and the sameas data is a mapping of entities to clusters. In this case, clusters are the distinct 'things' your entities refer to.
You might want to take a look at semi-supervised clustering. A brief Google search turned up the paper "Active Semi-Supervision for Pairwise Constrained Clustering", which gives pseudocode for an algorithm that is incremental/active and uses supervision in the sense that it takes training data indicating which entities are or are not in the same cluster. You could derive this easily from your sameas data, assuming that, for example, uids 87w39423 and 4535k3l535 are definitely distinct things.
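To illustrate, here is a minimal sketch of deriving must-link/cannot-link constraint pairs from a sameas mapping. The dict shape (`uid -> thing_id`) and the function name are assumptions for the example, not anything from your data model:

```python
# Hypothetical sketch: derive pairwise constraints from a sameas mapping
# of entity uid -> the 'thing' it resolves to.
from itertools import combinations

def pairwise_constraints(sameas):
    """Return (must_link, cannot_link) lists of uid pairs."""
    must_link, cannot_link = [], []
    for a, b in combinations(sorted(sameas), 2):
        if sameas[a] == sameas[b]:
            must_link.append((a, b))   # same underlying thing
        else:
            cannot_link.append((a, b)) # definitely distinct things
    return must_link, cannot_link

sameas = {"87w39423": "thing1", "4535k3l535": "thing2", "abc123": "thing1"}
ml, cl = pairwise_constraints(sameas)
```

These two lists are exactly the form of supervision the pairwise-constrained clustering algorithm in that paper consumes.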
However, to get this to work you need to come up with a distance metric based on the features in the data. You have a lot of options here, for example you could use a simple Hamming distance on the features, but the choice of metric function here is a little bit arbitrary. I'm not aware of any good ways of choosing the metric, but perhaps you have already looked into this when you were considering nearest neighbour algorithms.
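As a concrete example of the simplest option, a (normalised) Hamming distance over categorical feature tuples is a few lines; the feature values here are made up for illustration:

```python
def hamming(x, y):
    """Fraction of feature positions at which two entities differ."""
    assert len(x) == len(y), "entities must share a feature schema"
    return sum(a != b for a, b in zip(x, y)) / len(x)

# two entities described by three categorical features;
# they differ in 1 of 3 positions
d = hamming(("apple", "red", "fruit"), ("apple", "green", "fruit"))
```

Whether equal weighting of features is sensible for your data is exactly the arbitrariness mentioned above; a weighted version is a natural next step.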
You can come up with confidence scores using the distance from the cluster centres. If you want an actual probability of membership, then you would want a probabilistic clustering model, like a Gaussian mixture model. There's quite a lot of software for Gaussian mixture modelling, though I don't know of any that is semi-supervised or incremental.
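A minimal sketch of the idea, assuming cluster centres are already known: score each cluster by a Gaussian likelihood of the point under that centre (equal cluster weights, a shared spherical variance) and normalise so the scores sum to one. A full mixture model would also fit the weights and covariances via EM; this only shows the distance-to-probability step:

```python
import math

def membership_probs(x, centres, var=1.0):
    """Soft cluster membership for point x given fixed cluster centres.

    Uses an unnormalised Gaussian likelihood exp(-d^2 / (2*var)) per
    centre, then normalises across clusters to get probabilities.
    """
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    lik = [math.exp(-sq_dist(x, c) / (2 * var)) for c in centres]
    total = sum(lik)
    return [l / total for l in lik]

# a point near the first of two centres gets most of the mass there
probs = membership_probs((0.9, 0.1), [(1.0, 0.0), (0.0, 1.0)])
```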
There may be other suitable approaches if the question you wanted to answer was something like "given an entity, which other entities are likely to refer to the same thing?", but I don't think that is what you are after.
This is an interesting idea. However, the reason why I still think this is a classification, and not a clustering problem, is that clustering generally attempts to fit a vector into a single cluster without any confidence score. I suppose you could use the distance metric to measure how far away a vector is from each cluster, but I'm not sure how that would translate into a probability.
– Cerin Nov 9 at 21:01
@Cerin So, is the problem that you want some kind of confidence score of membership, that you want an actual probability of membership, or that you want entities to have membership of multiple clusters? – StompChicken Nov 10 at 6:47
@Cerin As in, a given entity can genuinely belong to multiple 'things' (in terms of the ground truth)? – StompChicken Nov 10 at 9:56
Good question. The more I think about it, the more it seems that supervised clustering == classification. I suppose I'm looking for an entity to belong to multiple groups/things. For example, "my apple" would belong to one "thing" representing a specific apple belonging to me, but it might also belong to another "thing" representing "fruit", and other groups representing hypernyms/holonyms. – Cerin Nov 10 at 13:54
I suppose supervised clustering is quite like a generative classification algorithm (e.g. Naive Bayes), less so something discriminative like logistic regression. However, the above algorithm is semi-supervised, so it does train its model on unlabelled examples. – StompChicken Nov 10 at 14:14
You may want to take a look at this method: "Large Scale Online Learning of Image Similarity Through Ranking", Gal Chechik, Varun Sharma, Uri Shalit and Samy Bengio, Journal of Machine Learning Research (2010). (PDF and project homepage are available.)
More thoughts: What do you mean by 'entity'? Is the entity the thing referred to by 'obj_called'? Do you use the content of 'obj_called' to match different entities, e.g. 'John' is similar to 'John Doe'? Do you use proximity between sentences to indicate similar entities? What is the greater goal (task) of the mapping?
In my case, an "entity" isn't a single piece of text. It's more of a "virtual" object, similar to the concept of a group in a clustering algorithm. The greater goal is to better build a model from natural language by linking together attributes defined in separate sentences that refer to the same entity. I just want to better understand the subject of "entity recognition". – Cerin Nov 10 at 16:14