Extracting data from a web page?

If all you care about is textual similarity, you could just write a regex to strip out all the HTML tags of the form </?(every|single|valid|tag)[^>]*> (perhaps first removing all <script> tags), then mash all the content up into one very long paragraph. That wouldn't be a bad use of a regex at all; that's what they're there for. I might recommend docs.python.org/library/xml.dom.minidom.... , but imho the interface can be very awkward.

Also, you don't need access to the hierarchical structure, just the text. If you did, a parser would be better than a regex (which would then be a terrible idea).
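The regex approach described in this answer can be sketched in Python roughly as follows. This is only a minimal sketch, not a robust HTML cleaner: the function name strip_tags is hypothetical, and dropping <style> blocks and HTML comments as well as <script> blocks is an added assumption beyond what the answer describes.

```python
import html
import re

# Remove whole <script>/<style> elements first, so their contents
# (JavaScript, CSS) never end up in the extracted text.
SCRIPT_OR_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TAG = re.compile(r"<[^>]+>")       # any remaining opening or closing tag
WHITESPACE = re.compile(r"\s+")


def strip_tags(markup):
    """Return the text of `markup` as one long, whitespace-normalized paragraph."""
    text = SCRIPT_OR_STYLE.sub(" ", markup)
    text = COMMENT.sub(" ", text)
    text = TAG.sub(" ", text)      # a space keeps adjacent words apart
    text = html.unescape(text)     # turn &amp; etc. back into characters
    return WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    page = "<html><script>var x = 1;</script><p>Hello <b>world</b></p></html>"
    print(strip_tags(page))
```

Replacing each tag with a space instead of an empty string avoids gluing words together across block boundaries, at the cost of occasionally splitting a word that has inline markup in the middle. For a bag-of-words style similarity measure that trade-off is usually acceptable.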

I'll be doing this for thousands of docs, and my worry is that if I parse the data using a regex, JavaScript functions might appear in the output. One more thing: I'll be missing dynamic content, i.e. JavaScript-rendered data.

Thanks for answering :) – Aditya Apr 19 at 2:54

I believe the example algorithm I gave you will probably not cause JavaScript functions to appear as long as you aren't parsing the entire world-wide web. Also, you will be missing JavaScript-rendered content no matter what program you use, unless you are doing it via the web browser. – ninjagecko Apr 19 at 3:07

I would highly recommend this question's first answer in an effort to keep you away from parsing HTML with regular expressions. That answer does a far better job of illustrating why you shouldn't than I could, so I defer to that. You will also find that you should look into XML parsers instead of trying to "parse by hand" via a regex (which you'll read in the referenced question and its answers).
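For comparison, here is a minimal parser-based sketch. It uses Python's standard-library html.parser instead of an XML parser, which is a substitution on my part: the answer does not name a library, and real-world HTML is often not well-formed XML. The names TextExtractor and extract_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; etc. for us
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks with spaces, then collapse all whitespace runs.
        return " ".join(" ".join(self._chunks).split())


def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


if __name__ == "__main__":
    page = "<html><script>var x = '<b>';</script><p>Hello <b>world</b></p></html>"
    print(extract_text(page))
```

Like any approach that does not run a browser, this only sees the HTML as delivered, so JavaScript-rendered content will still be missing.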

I'll be doing this for thousands of docs, and my worry is that if I parse the data using a regex, JavaScript functions might appear in the output. One more thing: I'll be missing dynamic content, i.e. JavaScript-rendered data.

Thanks for answering :) – Aditya Apr 19 at 2:53

I am doing a school project that requires extracting data from web pages. To be precise, I need a library or open-source program to extract human-readable content from HTML/text data, something like the text content a web browser renders.

I know parsing HTML with regexes is the worst method to extract text from it. I need the text for computing similarity between text documents.
