DOM and XPath query for HTML parsing?

The simple answer is: foreach ($tags as $tag) echo $dom->saveXML($tag); If you want the a tags with their HTML unstripped, the XPath would be //a[@class="articleDesc"]. That assumes the a tags have that class attribute.
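As a rough illustration of the answer above, here is a minimal sketch using DOMDocument and DOMXPath. The sample markup and the variable names ($html, $results) are assumptions made for the example, not taken from the asker's page.

```php
<?php
// Hypothetical sample markup; the real page structure is an assumption.
$html = '<html><body><div>'
      . '<a class="articleDesc" href="/one">First <b>story</b></a>'
      . '<a class="other" href="/two">Skip me</a>'
      . '</div></body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// The attribute test goes inside a predicate: [@class="..."]
$tags = $xpath->query('//a[@class="articleDesc"]');

$results = [];
foreach ($tags as $tag) {
    // saveXML() with a node argument serializes that node with its markup intact
    $results[] = $dom->saveXML($tag);
}
print_r($results);
```

Only the first link matches here, and its inner b tag is kept in the output string.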

Thank you very much! – Tadej Magajna Nov 26 at 21:49.

Try using php.net/manual/en/simplexmlelement.asxml... Or, as an alternative: function getNodeInnerHTML(DOMNode $oNode) { $oDom = new DOMDocument(); foreach ($oNode->childNodes as $oChild) { $oDom->appendChild($oDom->importNode($oChild, true)); } return $oDom->saveHTML(); }
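For reference, a variant of the inner-HTML idea can be sketched as follows. It is not the function above: instead of copying children into a temporary document, it serializes each child with saveHTML() on the owning document. The function name getNodeInnerHTMLSketch and the sample markup are hypothetical.

```php
<?php
// Sketch: concatenate the serialized children of a node (its "inner HTML").
function getNodeInnerHTMLSketch(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {   // the property is childNodes (plural)
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Hypothetical sample markup for the usage example.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<html><body><div class="articleDesc">Intro <em>text</em></div></body></html>');
libxml_clear_errors();

$div = $dom->getElementsByTagName('div')->item(0);
$inner = getNodeInnerHTMLSketch($div);
echo $inner, "\n";
```

This returns the contents of the div without the div tag itself; to keep the outer tag, serialize the node directly instead.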

Meh, that would work in a way, but the perfect way for me would be to go from 'examplesite.com' => '//div/a[@class="articleDesc"]/@href' to a list of unstripped HTML strings for the matching elements. I wonder how I'd do that. – Tadej Magajna Nov 21 at 11:14

I might get you wrong here, but doesn't that just require you to get the innerHTML, using one of the functions above, of the parent element matching your XPath? – Sjaak Trekhaak Nov 21 at 13:03

I think not. The inner HTML of the parent element matching the XPath would return all the HTML inside it. However, I'd like to get all the div tags that have the class articleDesc, for instance... – Tadej Magajna Nov 22 at 16:56

So echo getNodeInnerHTML($tag) is not what you were looking for? If so, I'm having trouble understanding exactly what you want. Is it possible to show an example of your input and the desired output? – Sjaak Trekhaak Nov 23 at 11:26

This should load all of the inner tags as well. While it's not DOM, the two are interchangeable, and later you can use dom_import_simplexml to bring it back into DOM.

$xml = simplexml_load_string($html); $tags = $xml->xpath('//body/div[@class="articleDesc"]');
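A minimal sketch of the SimpleXML route, assuming the input is well-formed XML (the sample markup below is hypothetical). Note that xpath() returns an array of SimpleXMLElement objects, and dom_import_simplexml() takes a single element, so the sketch converts one match rather than the whole array.

```php
<?php
// Hypothetical, well-formed sample input.
$html = '<html><body>'
      . '<div class="articleDesc"><p>Summary <b>one</b></p></div>'
      . '<div class="footer">ignored</div>'
      . '</body></html>';

$xml = simplexml_load_string($html);
if ($xml === false) {
    exit("Input is not well-formed XML\n");
}

$tags = $xml->xpath('//body/div[@class="articleDesc"]');

$fragments = [];
foreach ($tags as $tag) {
    // asXML() on an element returns its markup, including inner tags
    $fragments[] = $tag->asXML();
}
print_r($fragments);

// Optional: bring one match back into DOM
$domNode = dom_import_simplexml($tags[0]);
```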

Gives an error. XPath doesn't work with $xml. If I try $xml = dom_import_simplexml($xml) prior to the second line, it doesn't work either. – Tadej Magajna Nov 25 at 20:06

The exact error would be helpful. The first line imports the $html string into SimpleXML; if it's not a string, try simplexml_load_file instead. The second line is copied directly from yours but converted for SimpleXML. Admittedly I have not run it myself, but this is the same code I use at work, and it works for me there. dom_import_simplexml($tags) should only be used after the SimpleXML has been loaded, and assuming you have something you want to do with it in DOM; otherwise it is not necessary. I just included it in case you wanted to switch back to DOM after loading the results. – showerhead Nov 25 at 22:45

simplexml_load_string($html) returns false, and after I put that into xpath() it breaks, of course. It also gives a lot of warnings like: Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 36: parser error : Opening and ending tag mismatch: META line 8 and HEAD in /usr/share/nginx/html/synd/robots/robot.php on line 25. I know the HTML may not be perfect, which may be why SimpleXML returns false, but it is a proper HTML web page which gets rendered in the browser. – Tadej Magajna Nov 26 at 0:16

From the sounds of it, your HTML isn't well formed. That is not necessary for it to show up properly in the browser, but it is necessary if you wish to run a strict XML parser such as SimpleXML on it. Try closing your meta and head tags and try again. Meta tags are self-closing, so just add a forward slash to the end of them; that's easy enough to forget. Once your HTML is well formed it should work. – showerhead Nov 26 at 2:00

Try using php.net/tidy to clean up your markup first. – elazar Nov 26 at 3:53
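If fixing the markup by hand or installing tidy is not an option, one alternative (not the one suggested in this thread) is to let DOMDocument::loadHTML() parse the page, since it tolerates markup that is not well formed, and then hand the result to SimpleXML with simplexml_import_dom(). The sketch below assumes a hypothetical input with an unclosed meta tag, similar to the one in the warning above.

```php
<?php
// Hypothetical malformed input: the <meta> tag is never closed.
$html = '<html><head><meta charset="utf-8"><title>t</title></head>'
      . '<body><div class="articleDesc">Story <i>text</i></div></body></html>';

libxml_use_internal_errors(true);        // keep parser warnings out of the output

// The strict XML parser rejects this input.
$strict = simplexml_load_string($html);  // false here

// The HTML parser is lenient and builds a usable tree.
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Switch to SimpleXML and run the same kind of XPath query.
$xml  = simplexml_import_dom($dom);
$tags = $xml->xpath('//div[@class="articleDesc"]');

$out = [];
foreach ($tags as $tag) {
    $out[] = $tag->asXML();
}
print_r($out);
```

The same tree can also be queried directly with DOMXPath, which avoids the round trip through SimpleXML.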

You could use this awesome spider framework (in Python) Scrapy.

