How can I scrape a website with invalid HTML?

DOM handles broken HTML fine if you use loadHTML or loadHTMLFile:

    $dom = new DOMDocument;
    libxml_use_internal_errors(TRUE);
    $dom->loadHTMLFile('courseschedules.njit.edu/index.aspx?seme...');
    libxml_clear_errors();
    $xPath = new DOMXPath($dom);
    $links = $xPath->query('//div[@class="courseList_section"]//a');
    foreach ($links as $link) {
        printf("%s (%s)\n", $link->nodeValue, $link->getAttribute('href'));
    }

will output

    ACCT - Accounting (index.aspx?semester=2010f&subjectID=ACCT)
    AD - Art and Design (index.aspx?semester=2010f&subjectID=AD)
    ARCH - Architecture (index.aspx?semester=2010f&subjectID=ARCH)
    ... many more ...
    TRAN - Transportation Engr (index.aspx?semester=2010f&subjectID=TRAN)
    TUTR - Tutoring (index.aspx?semester=2010f&subjectID=TUTR)
    URB - Urban Systems (index.aspx?semester=2010f&subjectID=URB)

Using echo $dom->saveXML($link), PHP_EOL; in the foreach loop will output the full outerHTML of the links.
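
For reference, that variant is just the same setup with a different loop body:

    foreach ($links as $link) {
        // saveXML() on a single node serializes that node and its children,
        // i.e. the link's outerHTML
        echo $dom->saveXML($link), PHP_EOL;
    }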


This does a little better than Simple HTML DOM Parser but if you count the results, it only gives 107 of the 123 links. – Telanor Oct 8 '10 at 23:33

@Telanor updated. The XPath now searches for all links inside divs with the class courseList_section instead of for all links inside spans inside divs. I am pretty sure you could have fixed that easily yourself though. Also possible: '//a[ancestor::div[@class="courseList_section"]]' – Gordon Oct 9 '10 at 8:36

You're right, it does work now. I'm still not sure how I didn't already try this. That's actually the same XPath query I was using locally after running Tidy. – Telanor Oct 9 '10 at 18:45

If you know the errors, you could apply some regular expressions to fix them specifically. While this ad-hoc solution might seem dirty, it may actually be better: if the HTML really is malformed, it can be hard to infer the correct interpretation automatically. EDIT: Actually, it might be simpler to extract the needed information with regular expressions directly, since the page has many errors that would be hard, or at least tedious, to fix.
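
As a rough illustration of the extraction route, here is a minimal sketch using preg_match_all. The pattern assumes the links of interest look like the index.aspx?... anchors shown in the answer above, which may not hold for the real markup:

    // Fetch the raw (possibly malformed) HTML; the URL is truncated here,
    // as in the answer above, and would need the full address.
    $html = file_get_contents('courseschedules.njit.edu/index.aspx?seme...');

    // Grab the href and link text of every <a> whose href points at index.aspx.
    // 'si' = dot matches newlines, case-insensitive. This never parses the
    // document structure, so broken markup elsewhere does not matter.
    preg_match_all(
        '~<a[^>]+href="(index\.aspx[^"]*)"[^>]*>(.*?)</a>~si',
        $html,
        $matches,
        PREG_SET_ORDER
    );

    foreach ($matches as $m) {
        printf("%s (%s)\n", strip_tags($m[2]), html_entity_decode($m[1]));
    }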

It seems dirty because it's difficult to maintain. – TrueWill Sep 9 at 18:16

Tidy is the only sane way I know of fixing broken markup.
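
A minimal sketch, assuming PHP's tidy extension is installed ($html stands in for the raw markup fetched from the page):

    // Repair the broken markup before handing it to a parser
    $clean = tidy_repair_string($html, array(
        'output-xhtml' => true,   // emit well-formed XHTML
        'wrap'         => 0,      // no line wrapping
    ), 'utf8');

    // The repaired string now loads without libxml warnings
    $dom = new DOMDocument;
    $dom->loadHTML($clean);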

Consider using a real browser or the WebBrowser control. I tested with iMacros and the web scraping works well. Test macro for the first two links:

    VERSION BUILD=7050962
    URL GOTO=courseschedules.njit.edu/index.aspx?seme...
    'Get text
    'TAG POS=2 TYPE=A FORM=ID:form1 ATTR=TXT:*-* EXTRACT=TXT
    'Get link first entry
    TAG POS=2 TYPE=A FORM=ID:form1 ATTR=TXT:*-* EXTRACT=HREF
    'Get link second entry
    TAG POS=3 TYPE=A FORM=ID:form1 ATTR=TXT:*-* EXTRACT=HREF

You can move between the entries by incrementing the POS= value.

Another simple way to solve the problem could be passing the site you are trying to scrape through a mobile browser adapter such as Google's mobilizer for complicated websites. This will correct the invalid HTML and let you use the Simple HTML DOM Parser package, but it might not work if you need some of the information that is stripped out of the site. The link to this adapter is below.

I use this for sites on which the information is poorly formatted, or when I need a way to simplify the formatting so that it is easy to parse. The HTML returned by the Google mobilizer is simpler and much easier to process. google.com/gwt/n.
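
A rough sketch of the idea, assuming (not confirmed by the answer) that the mobilizer endpoint accepts the target page as a u= query parameter, and that simple_html_dom.php from the Simple HTML DOM Parser package is on the include path:

    include 'simple_html_dom.php';

    // Target URL truncated as in the answers above; pass it to the mobilizer
    // via a hypothetical u= parameter.
    $target    = 'courseschedules.njit.edu/index.aspx?seme...';
    $mobilized = 'http://www.google.com/gwt/n?u=' . urlencode($target);

    // The mobilizer returns simplified, corrected HTML that
    // Simple HTML DOM Parser can handle.
    $html = file_get_html($mobilized);
    foreach ($html->find('a') as $a) {
        echo $a->plaintext, ' (', $a->href, ")\n";
    }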

