I'm not sure how the big search engines do it, but one technique I've used is min-hashing (MinHash) over n-grams of the content. We did this for a crawler that kept finding broken sites linking to the same page through an effectively unlimited number of unique URLs. We needed a quick way to detect similar pages out of a very large set so that we could then apply more expensive duplicate-content checks.
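To make the idea concrete, here is a minimal sketch in Python. It is not the crawler's actual code: the function names, the choice of word 3-grams, the 128 hash functions, and the use of salted BLAKE2b as the hash family are all assumptions made for illustration.

```python
import hashlib

MAX_HASH = 2**64 - 1  # sentinel used when a document has no n-grams


def shingles(text, n=3):
    """Return the set of word n-grams in text (n=3 is an assumed default)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def seeded_hash(seed, shingle):
    """64-bit hash of a shingle; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_hashes=128, n=3):
    """For each hash function, keep the smallest hash over all n-grams."""
    grams = shingles(text, n)
    return [
        min((seeded_hash(seed, g) for g in grams), default=MAX_HASH)
        for seed in range(num_hashes)
    ]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching slots, which estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    page_a = "the quick brown fox jumps over the lazy dog near the river bank today"
    page_b = "the quick brown fox jumps over the lazy dog near the river bank tonight"
    sig_a = minhash_signature(page_a)
    sig_b = minhash_signature(page_b)
    print(estimated_similarity(sig_a, sig_b))
```

The signatures are small and fixed-size, so they can be compared (or bucketed) cheaply; only pairs whose estimated similarity is high would then go on to the more expensive duplicate-content checks.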
I can't really give you an answer, but what I can give you is a way to a solution: find the angle that you relate to or that piques your interest. A good paper is one that people get drawn into because it reaches them in some way. As for me, when I think of WWII, I think of the Holocaust and the effect it had on the survivors, their families, and those who stood by and did nothing until it was too late.