Once you have this HTML fragment, use a regex to replace each <br> followed by an optional newline with a single newline, then split on runs of multiple newlines. This should give you individual paragraphs, which you can then process manually.
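A minimal sketch of that approach, assuming the fragment uses <br>, <br/> or <br /> tags. The function name split_paragraphs and the sample data are made up for illustration:

```python
import re

def split_paragraphs(fragment):
    """Split an HTML fragment into paragraphs at runs of <br> tags."""
    # Replace each <br>, <br/> or <br /> plus an optional newline with one newline.
    text = re.sub(r"<br\s*/?>\n?", "\n", fragment, flags=re.IGNORECASE)
    # Two or more consecutive newlines mark a paragraph boundary.
    parts = re.split(r"\n{2,}", text)
    # Drop empty pieces and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]

# Hypothetical sample data, not taken from a real document.
fragment = (
    "Company A<br />\n123 Main St.<br />\nSuite 101<br />\n"
    "Someplace, NY 1234<br />\n<br />\n<br />\n<br />\n"
    "Company B<br />\n456 Main St.<br />\nSomeplace, NY 1234<br />\n"
    "<br />\n<br />\n<br />\n"
)

for paragraph in split_paragraphs(fragment):
    # Each paragraph is one record; its lines are separated by single newlines.
    print(paragraph.split("\n"))
```

If the fragment contains other markup, as the real document apparently does, this only handles the <br> tags; the remaining tags would still need to be stripped or parsed separately.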
Thanks for the answer, but unfortunately it's not as simple as just using a regex. I've simplified the above document to better illustrate my question. The actual document has a jumble of HTML formatting tags and the like. – jamieb Feb 21 '10 at 7:46

But you don't care about the document, just the part separated by <br> tags. Use BeautifulSoup to extract that part first. – Ignacio Vazquez-Abrams Feb 21 '10 at 7:50

I'm not sure why someone downvoted your answer; I appreciate the help. I will try a couple of ideas based on your suggestion. I was just hoping that BeautifulSoup would have eliminated the need for manual parsing. Thank you. – jamieb Feb 21 '10 at 7:58

BeautifulSoup is good for the tags that deal with structure and style, but <br> doesn't fall into either of those. – Ignacio Vazquez-Abrams Feb 21 '10 at 8:01

While I probably would have preferred to work with Michal's answer, I didn't see it until after I completed my project. I was able to do what I needed using your suggestion. Thank you. – jamieb Feb 21 '10 at 16:07
You can do a little manipulation before anything else. For example, strip out all newlines, then substitute two or more consecutive occurrences of <br /> with some other delimiter such as |. After that you can get your fields.

    import re

    html = """
    Company A<br />
    123 Main St.<br />
    Suite 101<br />
    Someplace, NY 1234<br />
    <br />
    <br />
    <br />
    Company B<br />
    456 Main St.<br />
    Someplace, NY 1234<br />
    <br />
    <br />
    <br />
    """

    # Join everything onto one line so consecutive <br /> tags are adjacent.
    newhtml = html.replace("\n", "")
    # Match runs of two or more <br /> tags; these mark the end of a record.
    pat = re.compile("(<br />){2,}", re.DOTALL | re.M)
    print(pat.sub("|", newhtml))

output

    $ ./python.py
    Company A<br />123 Main St.<br />Suite 101<br />Someplace, NY 1234|Company B<br />456 Main St.<br />Someplace, NY 1234|

Now the company records are separated by pipes.
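To get from the pipe-delimited string to individual fields, one option is to split on the pipes and then on the single <br /> tags that remain inside each record. This is only a sketch: the name to_records is hypothetical, and the input assumes the remaining <br /> tags are still present in the string.

```python
import re

# Hypothetical input: a pipe-delimited string in which single <br /> tags
# still separate the lines of each record (an assumption).
joined = (
    "Company A<br />123 Main St.<br />Suite 101<br />Someplace, NY 1234|"
    "Company B<br />456 Main St.<br />Someplace, NY 1234|"
)

def to_records(joined):
    """Turn the pipe-delimited string into a list of field lists."""
    records = []
    for chunk in joined.split("|"):
        if not chunk:
            continue  # skip the empty piece after the trailing pipe
        # Split each record on the remaining single <br /> tags.
        fields = [f.strip() for f in re.split(r"<br\s*/?>", chunk) if f.strip()]
        records.append(fields)
    return records

for record in to_records(joined):
    print(record)
```

Note that this breaks if an address itself contains a | character, so a delimiter that cannot occur in the data may be a safer choice.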
Perhaps you could use this function:

    def partition_by(pred, iterable):
        """Yield lists of consecutive items for which pred gives the same result."""
        current = None
        current_flag = None
        chunk = []
        for item in iterable:
            if current is None:
                # First item: start the first chunk.
                current = item
                current_flag = pred(current)
                chunk = [current]
            elif pred(item) == current_flag:
                chunk.append(item)
            else:
                # Predicate result flipped: emit the chunk and start a new one.
                yield chunk
                current = item
                current_flag = not current_flag
                chunk = [current]
        if len(chunk) > 0:
            yield chunk

Add something to check for being a <br /> tag or newline:

    def is_br(bs):
        try:
            return bs.name == u'br'
        except AttributeError:
            # Plain strings have no .name attribute.
            return False

    def is_br_or_nl(bs):
        return is_br(bs) or u'\n' == bs

(Or whatever else is more appropriate... I'm not that good with BeautifulSoup.)

Then use partition_by(is_br_or_nl, cs) to yield (for cs set to BeautifulSoup.BeautifulSoup(your_example_html).childGenerator())

    [u'Company A'],
    [<br />],
    [u'\n123 Main St. '],
    [<br />],
    [u'\nSuite 101'],
    [<br />],
    [u'\nSomeplace, NY 1234'],
    [<br />, u'\n', <br />, u'\n', <br />, u'\n', <br />],
    [u'\nCompany B'],
    [<br />],
    [u'\n456 Main St.'],
    [<br />],
    [u'\nSomeplace, NY 1234'],
    [<br />, u'\n', <br />, u'\n', <br />, u'\n', <br />]

That should be easy enough to process.

To generalise this, you'd probably have to write a predicate to check whether its argument is something you care about. Then you could use it with partition_by to have everything else lumped together. Note that the things you care about are lumped together as well: you basically have to process every item of every second list produced by the resulting generator, starting with the first one that includes things you care about.
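The standard library's itertools.groupby performs a similar grouping of consecutive items by a key function, so it can stand in for partition_by. The sketch below is an illustration under assumptions, not the answer's own method: plain strings stand in for BeautifulSoup nodes, the literal string "<br />" plays the role of a tag, and the names BR, is_separator and group_records are hypothetical. It also assumes that a separator run of more than one item ends a record.

```python
from itertools import groupby

BR = "<br />"  # stand-in for a BeautifulSoup <br /> tag (assumption)

def is_separator(node):
    """True for the stand-in <br /> marker or a bare newline."""
    return node == BR or node == "\n"

def group_records(nodes):
    """Collect text nodes into records, ending a record at a long separator run."""
    records, current = [], []
    for is_sep, run in groupby(nodes, key=is_separator):
        run = list(run)
        if not is_sep:
            # Text we care about: keep it, minus surrounding whitespace.
            current.extend(text.strip() for text in run)
        elif len(run) > 1 and current:
            # Several separators in a row: assumed to mark the end of a record.
            records.append(current)
            current = []
    if current:
        records.append(current)
    return records

# Hypothetical node sequence mirroring the shape of the output above.
nodes = [
    "Company A", BR, "\n123 Main St. ", BR, "\nSuite 101", BR,
    "\nSomeplace, NY 1234", BR, "\n", BR, "\n", BR, "\n", BR,
    "\nCompany B", BR, "\n456 Main St.", BR,
    "\nSomeplace, NY 1234", BR, "\n", BR, "\n", BR, "\n", BR,
]

for record in group_records(nodes):
    print(record)
```

With real BeautifulSoup nodes, is_separator would need to inspect the tag name the way is_br does, rather than compare against a string.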