Difficulty in extracting main content from a news web page?



I need to extract the main content (excluding links, advertisements, etc.) from a news web page. I have read about it on the web and learned that to do this I need to parse the HTML page and then select content from the HTML tags. I have written code that takes an HTML file as input and extracts the text from the page using HTMLEditorKit, available in javax.swing.

    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.util.ArrayList;
    import java.util.List;

    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML.Tag;
    import javax.swing.text.html.HTMLEditorKit.ParserCallback;
    import javax.swing.text.html.parser.ParserDelegator;

    public class HTMLUtils {
        private HTMLUtils() {}

        // Collects every text node of the document, in document order.
        public static List<String> extractText(Reader reader) throws IOException {
            final ArrayList<String> list = new ArrayList<String>();
            ParserDelegator parserDelegator = new ParserDelegator();
            ParserCallback parserCallback = new ParserCallback() {
                @Override
                public void handleText(final char[] data, final int pos) {
                    list.add(new String(data));
                }
                @Override
                public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) { }
                @Override
                public void handleEndTag(Tag t, final int pos) { }
                @Override
                public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { }
                @Override
                public void handleComment(final char[] data, final int pos) { }
                @Override
                public void handleError(final java.lang.String errMsg, final int pos) { }
            };
            // true = ignore the charset declared in the document
            parserDelegator.parse(reader, parserCallback, true);
            return list;
        }

        public static void main(String[] args) throws Exception {
            FileReader reader = new FileReader("C://Users//Mukul//Desktop//demo.html");
            List<String> lines = HTMLUtils.extractText(reader);
            for (String line : lines) {
                System.out.println(line);
            }
        }
    }

But my problem is that I am not able to figure out how to select only the main content of a web page, such as the article on a news page. Also, I want to know whether the way I am parsing is fine, or whether I should use an open source library such as Jsoup or JTidy for the same thing.

Please help me and correct me where I am going wrong.

Tags: java, html-parsing, text-extraction, web-mining

Asked Feb 17 at 16:48 by dark_shadow

Well, you have two problems. One is getting the page contents (syntactic, I guess), for which I would use the following idiom (not that there's something terribly wrong with the code you posted, it's just a bit too verbose for my taste):

    String text = new Scanner(new URL("file:///C:/Users/Mukul/Desktop/demo.html").openConnection().getInputStream()).useDelimiter("\\A").next();

The other is interpreting the String you just read (semantic).
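On the first (syntactic) part, here is a minimal sketch of the Scanner idiom as a self-contained class. The class name PageReader and the method readAll are made up for illustration, and UTF-8 is an assumption; a real page may declare a different encoding. The "\\A" delimiter matches only at the beginning of the input, so the Scanner hands back the entire stream as one token.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical helper, not part of any library.
class PageReader {
    // Reads the whole stream into a single String.
    // Assumes UTF-8; returns "" for an empty stream instead of throwing.
    static String readAll(InputStream in) {
        try (Scanner scanner = new Scanner(in, "UTF-8")) {
            scanner.useDelimiter("\\A"); // "\\A" = start of input, so there is only one token
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for url.openConnection().getInputStream()
        byte[] html = "<html><body><p>Hello</p></body></html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(html)));
    }
}
```

The hasNext() check is there because next() throws NoSuchElementException on empty input, and the try-with-resources block closes the underlying stream once the text has been read.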

I don't think there's a single right answer, but if it is one single page you want to parse every time, it should have some fixed layout. You will have to find some pattern that distinguishes the main content from advertisements, headers, links, etc., and then maybe you can extract it using regexes. Check this: http://code.google.com/p/boilerpipe
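To make the regex idea concrete, here is a rough sketch under strong assumptions: the page has a fixed layout, the article sits in a div with class "article-body" (a made-up marker, use whatever your page actually has), and that div contains no nested div. The class name FixedLayoutExtractor is also hypothetical. This is not how boilerpipe works; it is only the simplest form of "find a pattern and cut it out".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for one page with a known, fixed layout.
class FixedLayoutExtractor {
    // Assumed marker: <div class="article-body"> ... </div> with no nested <div> inside.
    private static final Pattern ARTICLE = Pattern.compile(
            "<div\\s+class=\"article-body\"[^>]*>(.*?)</div>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns the article text with tags stripped and whitespace collapsed,
    // or "" if the marker is not found.
    static String extractArticle(String html) {
        Matcher m = ARTICLE.matcher(html);
        if (!m.find()) {
            return "";
        }
        String withoutTags = TAG.matcher(m.group(1)).replaceAll(" ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<div class=\"nav\"><a href=\"/\">Home</a></div>"
                + "<div class=\"article-body\"><p>First paragraph.</p><p>Second one.</p></div>"
                + "<div class=\"ads\">Buy now</div>"
                + "</body></html>";
        System.out.println(extractArticle(html)); // prints "First paragraph. Second one."
    }
}
```

The lazy (.*?) stops at the first closing div, which is why a nested div inside the article would cut the result short. Text inside links within the article is kept, since only the tags themselves are stripped. If the layout varies between pages, a real HTML parser or a library built for boilerplate removal is the safer route.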

