Unable to scrape content from a website?

Yes, you are missing something very basic: it's XHTML, so you must register (and use!) the proper namespace before you can expect to get results.

    $xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');
    $path1 = "//x:body/x:table[4]/x:tbody/x:tr[3]/x:td[4]";
    $path2 = "//x:body/x:table[4]/x:tbody/x:tr[1]/x:td[4]";
    $item1 = $xpath->query($path1);
    $item2 = $xpath->query($path2);

Tomalak: when I modify my code as above it gives me the error "Parse error: syntax error, unexpected T_VARIABLE in C:\xampp\htdocs\rtu\rtu_results.php on line 24", where line 24 is the line $path1="//x:body/x:table[4]/x:tbody/x:tr[3]/x:td[4]"; I have scraped web pages like this before from my localhost but never needed a namespace. – lovesh May 29 at 16:49

@Lovesh that syntax error would indicate you're missing a ; on the previous line. – Marc B May 29 at 17:09

@Marc: You are right, I missed a semicolon. Thanks.

@lovesh: A little more independent thinking, please. ;-) I'm sure that's not the first time you've seen such an error. – Tomalak May 29 at 17:12

@lovesh: Please test with a simple "//x:table" as an XPath expression. If this gives you all tables in your document, then the namespace is working but your own XPath expression is wrong. If this is not working, then the namespace "http://www.w3.org/1999/xhtml" is not the right one and you must check against your XHTML document what namespace it is actually using. – Tomalak May 29 at 17:41

The XHTML may not have the correct xmlns. Check the value of $page->documentElement->namespaceURI and, if it is not null, pass that value into registerNamespace(). – cbuckley May 297 at 11:54
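To make the two checks from the comments above concrete, here is a minimal sketch. The sample markup and the prefix 'x' are assumptions for illustration only; the real page would come from the asker's cURL/Tidy step. It reads the namespace from the root element, registers it if there is one, and then runs the simple "//x:table" test.

```php
<?php
// Hypothetical, minimal XHTML document standing in for the tidied page.
$content = '<html xmlns="http://www.w3.org/1999/xhtml">'
         . '<head><title>Demo</title></head>'
         . '<body><table><tr><td>first</td><td>second</td></tr></table></body>'
         . '</html>';

$page = new DOMDocument();
$page->loadXML($content);   // XML parser, so element namespaces are preserved

$xpath = new DOMXPath($page);

// Use whatever namespace the document actually declares on its root element.
$ns = $page->documentElement->namespaceURI;
if ($ns !== null) {
    $xpath->registerNamespace('x', $ns);
    $tables = $xpath->query('//x:table');   // prefixed test expression
} else {
    $tables = $xpath->query('//table');     // no namespace: plain names work
}

echo $tables->length, "\n";
```

If the count is zero for a document that clearly contains tables, the namespace (or the way the document was loaded) is the first thing to check, before debugging the longer path.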

It seems that the problem is somehow related to XPath and namespaces. The PHP manual has an interesting user comment: "If you've registered your namespaces, loaded your XHTML, etc., into your XPath's DOMDocument object and still can't get it to work, check to make sure you haven't used the DOMDocument's loadHTML() or loadHTMLFile() function. For XHTML always use the XML versions, otherwise your XPath will never, ever work."

Your code uses loadHTML():

    $content = getXHTML($content); // this is a tidy function to convert bad html to xhtml
    $page->loadHTML($content);     // its okay till here when I echo $page->saveHTML the page is displayed

HTML is not namespace-aware, so loadHTML() might not set the namespaces on the elements of the document object even though the original document (or the XHTML output by Tidy) had them. Because you use Tidy to convert the document to XHTML, I guess you could safely use loadXML() without running into parsing errors. Note that it requires the input to be well-formed XML.

Also, it might not be aware of HTML predefined entities like &nbsp;, and if that is the case, it can't replace the entities with their correct character values. If such a problem arises, try setting different options for loadXML().

Recommended in private e-mail. Should have followed it up here, but thanks for adding the user comment. – cbuckley May 31 at 11:32

Thanks for this. You are right, using loadXML is giving errors: "Entity 'nbsp' not defined in Entity, line: 212 in filename on line 10", where line 10 is the line with loadXML. I tried using options for loadXML like $page->loadXML($content, LIBXML_NOENT); for substituting entities but the errors remain. Can you tell me which option or combination of options can make this work? – lovesh May 31 at 14:06

@lovesh: Sorry, I'm not familiar with those options. Another possibility for fixing entity problems is to check whether Tidy can do the entity replacement. – jasso Jun 1 at 15:01
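One possible workaround for the "Entity 'nbsp' not defined" error, sketched here as an assumption rather than something proposed in the thread: rewrite the named entity as its numeric character reference before calling loadXML(). Numeric references need no DTD, so the XML parser accepts them. The sample markup is hypothetical, and only &nbsp; is handled; other named HTML entities would need the same treatment.

```php
<?php
// Hypothetical XHTML fragment containing a named HTML entity.
$content = '<html xmlns="http://www.w3.org/1999/xhtml">'
         . '<body><p>a&nbsp;b</p></body></html>';

// &#160; is the numeric character reference for the no-break space (U+00A0).
$content = str_replace('&nbsp;', '&#160;', $content);

$page = new DOMDocument();
$loaded = $page->loadXML($content);

// The paragraph text now contains a real U+00A0 character between "a" and "b".
$text = $page->getElementsByTagName('p')->item(0)->textContent;
```

Doing the replacement on the string keeps the rest of the pipeline (Tidy, loadXML(), XPath) unchanged.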

I have heard that Firefox adds a tbody element if one isn't present. In addition to or independently of @Tomalak's advice, try the XPath expressions with the /tbody location step removed. Also, use a tool such as the XPath Visualizer to construct correct XPath expressions and see immediately what they are selecting.

Dimitre Novatchev: I tried your suggestion but it is giving the error "Parse error: syntax error, unexpected T_VARIABLE in C:\xampp\htdocs\rtu\rtu_results.php on line 27", where line 27 is $path1="//body/table[4]/tr[3]/td[4]"; – lovesh May 29 at 16:56

@Dimitre Novatchev: I tried the XPath with Google Chrome but I get the same error. – lovesh May 29 at 16:58

@lovesh: The XPath expression is syntactically correct -- the syntax error should be in your PHP statement. – Dimitre Novatchev May 29 at 17:00

@Dimitre Novatchev: I think the PHP is correct too, because I have scraped content from other pages in the same way, but I did it from localhost, that is, my own web server. I used to save pages to my disk first. Do you have any other suggestions? Could you forward my question to someone who can help? – lovesh May 29 at 17:06

@lovesh: Why should you post the question again? Better edit it and add new, relevant information. For example, provide a sample XML file -- as minimal as possible. Then many people will be able to help. – Dimitre Novatchev May 29 at 18:00

This question reminds me that the solution to a problem often lies in simplicity rather than complication. I was trying namespaces, error corrections, etc., but the solution just demanded close inspection of the code. The problem with my code was the order of loadHTML() and the XPath initialization.

Initially the order was

    $xpath = new DOMXPath($page);
    $page->loadHTML($content);

so I was actually initializing the XPath on an empty document. After reversing the order, first loading the DOM with the HTML and then initializing the XPath, I was able to get the desired results. Also, as suggested, I removed the tbody element from the XPath, since Firefox inserts it automatically. So the correct XPaths are

    $path1 = "//body/table[4]/tr[3]/td[4]";
    $path2 = "//body/table[4]/tr[1]/td[4]";

Thanks to everyone for their suggestions and for bearing with this.
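For anyone who wants to see the working order in one piece, here is a minimal sketch. The sample table is made up for illustration and is not the page from the question. It loads the markup first and only then creates the DOMXPath object. The descendant step (//tr) is my own addition so the expression matches whether or not a tbody element is present.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$content = '<html><body><table>'
         . '<tr><td>r1c1</td><td>r1c2</td></tr>'
         . '<tr><td>r2c1</td><td>r2c2</td></tr>'
         . '</table></body></html>';

$page = new DOMDocument();
$page->loadHTML($content);      // 1. load the markup first
$xpath = new DOMXPath($page);   // 2. then bind the XPath object to the populated document

// First cell of the second row of the first table.
$cells = $xpath->query('//table[1]//tr[2]/td[1]');
$value = $cells->length > 0 ? $cells->item(0)->nodeValue : null;

echo $value, "\n";
```

Checking the length before calling item(0) avoids a null dereference when the path selects nothing, which is exactly the situation that was hard to diagnose here.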

(Try the following both in combination with and separately from the other answers, as they are other possible caveats.) If your XPath isn't working, try applying just parts of it to make sure you are indeed following the right path. So do something like:

    $path1 = "//body";
    $item1 = $xpath->query($path1);
    foreach ($item1 as $t) {
        // to see the full XML of the returned node, as the nodeValue may be empty
        echo $t->ownerDocument->saveXML($t);
    }

Then keep extending your XPath toward the location you want.

Also, if you find that nodeValue and textContent of your nodes are empty, you should make sure that you are loading into the DOMDocument with the correct encoding (e.g. if the cURL response returns UTF-8, you'll need to pass 'UTF-8' as the second parameter when constructing your DOMDocument).

I tried your suggestion but it is not showing any output. Now I am absolutely sure what the problem is: $xpath->query($path1); is not getting the XPath. Can you imagine why? – lovesh May 30 at 16:45

The DOMDocument is being loaded properly, as I have checked with $page->saveHTML(). It is displaying the page in the browser. – lovesh May 30 at 22:28

How about, instead of using XPaths for testing, you check the element returned by $page->getElementsByTagName('body')->item(0)? You can keep following the path in the same way by chaining those methods. – cbuckley May 30 at 23:28

How do I find the encoding of the cURL response? – lovesh May 31 at 20:39

It will (hopefully) be in the Content-Type response header. You'll need to do something like curl_setopt($ch, CURLOPT_HEADER, 1); and then split the headers from the body with list($header, $body) = explode("\r\n\r\n", $content, 2);. Have a look at sitepoint.com/forums/php-34/… for more info. – cbuckley May 31 at 22:11
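The header/body split from the last comment can be sketched roughly as follows. The raw response string is a hypothetical stand-in for what curl_exec() returns when CURLOPT_HEADER is enabled, and the regular expression for pulling out the charset is my own assumption, not something taken from the thread.

```php
<?php
// Hypothetical raw response: headers, a blank line, then the body.
$content = "HTTP/1.1 200 OK\r\n"
         . "Content-Type: text/html; charset=UTF-8\r\n"
         . "\r\n"
         . "<html><body>ok</body></html>";

// Split once, at the first blank line separating headers from body.
list($header, $body) = explode("\r\n\r\n", $content, 2);

// Look for a charset parameter on the Content-Type header line.
$charset = null;
if (preg_match('/^Content-Type:.*?charset=([\w-]+)/mi', $header, $m)) {
    $charset = $m[1];
}

echo $charset, "\n";
```

Note that this simple split assumes a single header block; a response that went through redirects can carry more than one, and a server may also omit the charset entirely, in which case $charset stays null.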
