Java - Best way to download a webpage's source html?

Personally, I'm very pleased with the Apache HTTP library (http://hc.apache.org/httpcomponents-client-ga/). If you're writing a web crawler, which I am also, you may greatly appreciate the control it gives you over things like cookies, client sharing, and the like.
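As a sketch of what that looks like, assuming HttpClient 4.x is on the classpath (the class name and User-Agent string here are made up for illustration, not part of the answer):

```java
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpClientExample {
    // Downloads a page's source as a String, decoded as UTF-8.
    public static String fetch(String address) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpGet request = new HttpGet(address);
            // Some sites (e.g. the 403 from imdb.com mentioned in the comments)
            // reject clients without a browser-like User-Agent header.
            request.setHeader("User-Agent", "Mozilla/5.0 (compatible; MyCrawler/1.0)");
            // The ResponseHandler overload consumes and releases the
            // connection automatically once the entity is read.
            return client.execute(request,
                    response -> EntityUtils.toString(response.getEntity(), "UTF-8"));
        }
    }
}
```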


It's not a rich crawler, but I want it to do what it does as quickly and reliably as possible. For the moment I just want it to download the page's source, and I don't need that much control, but if it doesn't add too much complexity to my code, I'd be more than happy to use it. What do you think?

– Alireza Noori May 2 at 21:32 Now that I've found out what the problem was, do you think I should go with this library or stick with my old code? – Alireza Noori May 3 at 6:02 Both solutions work, but there were some problems with the first one (for example, imdb.com would return a 403 error).

So I chose this one as the answer. It's very easy to use. – Alireza Noori May 13 at 10:51

I use commons-io: String html = IOUtils.toString(url.openStream(), "utf-8");
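For completeness, a self-contained sketch of that one-liner, assuming commons-io is on the classpath (the fetch helper and class name are mine, added so the stream gets closed explicitly):

```java
import java.io.InputStream;
import java.net.URL;
import org.apache.commons.io.IOUtils;

public class CommonsIoExample {
    // Reads the whole stream into a String using commons-io,
    // decoding as UTF-8, and closes the stream when done.
    public static String fetch(String address) throws Exception {
        URL url = new URL(address);
        try (InputStream in = url.openStream()) {
            return IOUtils.toString(in, "utf-8");
        }
    }
}
```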

Again the same problem. The returned string ends with ... and its length is 439802. – Alireza Noori May 2 at 19:28 @Alireza Noori give me the target url – Bozho May 2 at 19:29 Utilities.DownloadPage("stackoverflow.com").length() has returned 191760 and its last characters are: ... – Alireza Noori May 2 at 20:58 OK.

Problem solved! Eclipse was showing just half of my String, hence the problem! I wrote the output to a file and everything was fine!
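Writing the string to a file is a reliable way to check its real length outside the debugger's truncated string view; a minimal sketch (the class and method names here are made up):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DumpToFile {
    // Writes the downloaded source to a file so its true size can be
    // verified with a text editor or Files.size(), independent of how
    // much of the String the IDE chooses to display.
    public static Path dump(String html, String fileName) throws Exception {
        Path out = Paths.get(fileName);
        Files.write(out, html.getBytes(StandardCharsets.UTF_8));
        return out;
    }
}
```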

I couldn't find any way to check the full String at debug time! Awesome! Seriously!

The more time I spend with Java, the more I love .Net! – Alireza Noori May 2 at 6:01


I'm currently using the little piece of code below, but sometimes the result is just half of the page's source! I don't know what the problem is. Some people suggested that I should use Jsoup, but Jsoup's .get().html() also returns half of the page's source if it's too long.

Since I'm writing a crawler, it's very important that the method supports Unicode (UTF-8), and efficiency is also very important. I wanted to know the best modern way to do it, so I'm asking you guys since I'm new to Java.
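For reference, a plain-JDK way to download a page that loops until end-of-stream, so a short read can never silently truncate the result (the Utilities class name matches the one in the comments, but this implementation is a sketch, not necessarily the asker's code):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class Utilities {
    // Reads the entire response, looping until read() returns -1,
    // and decodes it as UTF-8.
    public static String downloadPage(String address) throws Exception {
        URL url = new URL(address);
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            char[] buf = new char[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```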

