Building on Darin's solution, I would look for the generator meta tag for known blog editors and combine it with a lookup table of common hosting sites, i.e. wordpress.com, blogspot.com, livejournal.com, and so forth.
That should give you 80-95% accuracy in the near term, though it won't be robust enough for an ongoing process over an extended period of time. An extended solution is much harder, given the amorphous definition of the term "blog". In that case, you'll want to consider breaking the list down by hosting site and defining characteristics, and creating hard and fast rules on what constitutes a blog:

- Is it hosted by a blogging service provider?
- Is it listed in a blog aggregator, such as Technorati?
- Does it include blog-like features, such as user-generated articles, tags, and the ability to comment?
- Does it provide meta information that I can use to easily identify it as a blog?
- Does it otherwise identify itself as a blog, via the inclusion of the term "blog" or some other criteria?
I can easily see a neural network constructed to determine if a page is a blog or not, but that severely oversteps the bounds of your requirements. I'd say start simple, then extend your solution relative to the proposed lifetime of your system.
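For illustration only, here is a rough sketch of the generator-tag-plus-lookup-table idea in Python; the host list, the regexes, and the looks_like_blog helper are my own assumptions, not part of the answer:

    import re
    from urllib.parse import urlparse

    import requests  # assumed available; any HTTP client would do

    # Illustrative, non-exhaustive table of well-known blog hosts.
    KNOWN_BLOG_HOSTS = ("wordpress.com", "blogspot.com", "livejournal.com", "typepad.com")

    # Generator values that identify common blog engines.
    BLOG_GENERATORS = re.compile(r"wordpress|blogger|movable type|typepad", re.I)

    def looks_like_blog(url: str) -> bool:
        host = urlparse(url).netloc.lower()
        # Rule 1: hosted by a known blogging service provider.
        if any(host == h or host.endswith("." + h) for h in KNOWN_BLOG_HOSTS):
            return True
        html = requests.get(url, timeout=10).text
        # Rule 2: a <meta name="generator" ...> tag naming a blog engine.
        match = re.search(r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']([^"\']+)', html, re.I)
        if match and BLOG_GENERATORS.search(match.group(1)):
            return True
        # Rule 3: the site otherwise identifies itself as a blog in its host name.
        return "blog" in host

    print(looks_like_blog("https://example.wordpress.com/"))

Whatever rules you settle on, the lookup table catches the big hosts cheaply, and the page fetch only happens for everything else.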
I would look at the generator meta tag for known blog editors. For example, here's how it looks for WordPress:
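A typical WordPress page carries something along these lines in its head (the exact version number varies by install):

    <meta name="generator" content="WordPress 3.0.1" />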
+1. Simple, effective strategy that can be trivially extended for known blog hosts via DNS. – MrGomez Dec 4 '10 at 20:17

+1 And... with that strategy you can achieve 80% accuracy simply by collecting the big blog sites. I'm not completely sure, but I would assume that a dozen blog sites are responsible for 90% of blog content online, at least in the anglophone side of the interwebs. – Paul Sasik Dec 4 '10 at 20:21

Content is harder to determine than traffic, I'd imagine. A quick scan of Alexa reveals Facebook, Blogger and Twitter currently in the top 11 for traffic, which implies a lookup table will be pretty powerful in this case. (Thanks again for the help yesterday, by the way.) – MrGomez Dec 4 '10 at 20:32
Search on the word "blog". I bet that gives you an 80% (or much better) result. Here's the problem -- whatever approach you take, you will need to SLED different approaches one against another, to test them.
I can almost guarantee you that the heuristic that gives the best mix of being staggeringly simple and achieving an amazingly good hit rate is "searching on 'blog'". So whatever else you do, I would sled that one first, manually check a few hundred and see what your percentage is. If it's 90%+, take the cheque and go home to your wife and children for dinner.
If it's moderate, you now have a baseline. And you need that to begin work. You can think up heuristics all day - it's trivial - (check the provider, look for 'wordpress', check it over time for new entries, look for dates that appear by blocks of text, etc etc) But that's really just the easy part.
You have to sled them one against the other and see what happens. IT IS EXTREMELY IMPORTANT TO NOTE: these sort of things wildly depend on your feed of URLs, in the example at hand. There's no truly random feed of input in a situation like this.
YOU WILL FIND that one heuristic or another works better because of the nature of your input data. You know? You might find that searching on the word "tags" (to make a wild example) just absolutely cracks it and you're done.
You're not trying to find a really universal general smartass solution, what you want is a specific smartass solution that really seems to work with your data incoming. That's always how it is!
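As a minimal sketch of what sledding heuristics against each other might look like (the labeled sample and the specific heuristic functions below are assumptions for illustration, not part of the answer), score each rule against a few hundred hand-checked URLs from your real feed and keep whichever wins:

    # Assumed input: (url, page_html, is_blog) triples you labeled by hand.
    def h_keyword_blog(url, html):
        # The staggeringly simple baseline: does the word "blog" appear at all?
        return "blog" in url.lower() or "blog" in html.lower()

    def h_wordpress(url, html):
        return "wordpress" in html.lower()

    def h_tags(url, html):
        # The "wild example": look for tag markup.
        return 'rel="tag"' in html.lower()

    HEURISTICS = {"keyword 'blog'": h_keyword_blog, "wordpress": h_wordpress, "tags": h_tags}

    def sled(labeled_sample):
        # Report how often each heuristic agrees with the manual labels.
        for name, heuristic in HEURISTICS.items():
            hits = sum(heuristic(url, html) == is_blog for url, html, is_blog in labeled_sample)
            print(f"{name}: {hits / len(labeled_sample):.0%} agreement")

    # sled(my_hand_checked_sample)  # e.g. a few hundred URLs from your incoming feed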