Irregular String Parsing on Python?

Use the regular expression module re. For example, if you have the third |-delimited field of your sample record in a variable s, then you can do

    match = re.match(r"^(?P<title>[^#]*) #(?P<num>[0-9]+) \((?P<special>.*)\)$", s)
    title = match.group('title')
    issue = match.group('num')
    special = match.group('special')

If a field is missing, the pattern does not match, re.match returns None, and the last three lines raise an AttributeError.

Adapt the RE until it parses everything you want.
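
A minimal sketch of the approach above, assuming the record is a `|`-delimited line whose third field is the title (the sample line is the one quoted elsewhere in this thread). The helper name `parse_title_field` is made up for illustration. It checks the match for `None` before reading the groups, so a line that does not fit the pattern is reported rather than raising.

```python
import re

# Pattern for one title field: "<title> #<issue> (<special>)".
TITLE_RE = re.compile(r"^(?P<title>[^#]*) #(?P<num>[0-9]+) \((?P<special>.*)\)$")


def parse_title_field(record):
    """Split a '|'-delimited record and parse its third field.

    Returns a dict with 'title', 'num' and 'special', or None when the
    record has fewer than three fields or the third field does not fit.
    """
    fields = record.split("|")
    if len(fields) < 3:
        return None
    match = TITLE_RE.match(fields[2])
    if match is None:
        return None
    return match.groupdict()


record = "10/12/11|10/12/11|Stan Lee's Traveler #12 (10 Copy Incentive Cover)"
print(parse_title_field(record))
print(parse_title_field("10/12/11|10/12/11|A Title Without An Issue"))
```

All captured values are strings; convert `num` with `int()` if your database column is numeric.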


Parsing the title is the hard part; it sounds like you can handle the dates etc. yourself. The problem is that no single rule can parse every title: there are many rules, and you can only guess which one works on a particular title. I usually handle this by creating a list of rules, ordered from most specific to most general, and trying them one by one until one matches. To write such rules you can use the re module or even pyparsing.

The general idea goes like this:

    import re

    class CantParse(Exception):
        pass

    # one rule to parse one kind of title
    def title_with_special(title):
        """ accepts only a title of the form <title> #<issue> (<special>) """
        m = re.match(r"[^#]*#(\d+) \(([^)]+)\)", title)
        if m:
            return m.group(1), m.group(2)
        else:
            raise CantParse(title)

    def parse_extra(title, rules):
        """ tries to parse extra information from a title using the rules """
        for rule in rules:
            try:
                return rule(title)
            except CantParse:
                pass

        # nothing matched
        raise CantParse(title)

    # let's try this out
    rules = [title_with_special]  # list of rules to apply, add more functions here
    titles = ["Stan Lee's Traveler #12 (10 Copy Incentive Cover)",
              "Betrayal Of The Planet Of The Apes #1 (Of 4)(25 Copy Incentive Cover) )"]

    for title in titles:
        try:
            issue, special = parse_extra(title, rules)
            print("Parsed %s to issue=%s special='%s'" % (title, issue, special))
        except CantParse:
            print("No matching rule for %s" % title)

As you can see, the first title is parsed correctly, but not the 2nd: the rule stops after its first parenthesized group, "(Of 4)", and ignores the rest. You'll have to write a bunch of rules that account for every possible title format in your data.
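
As a sketch of what "add more functions here" might look like, the block below orders three rules from most specific to most general. The rules `title_with_several_specials` and `title_with_issue_only` are hypothetical additions, not part of the answer, and every rule here returns the specials as a list (an assumption made so that all rules share one return shape).

```python
import re


class CantParse(Exception):
    """Raised by a rule that does not recognise the title."""


ISSUE_RE = re.compile(r"[^#]*#(\d+)")    # "<title> #<issue>"
PAREN_RE = re.compile(r"\(([^()]+)\)")   # one "(...)" group


def title_with_several_specials(title):
    """Hypothetical rule: an issue number followed by two or more (...) groups."""
    m = ISSUE_RE.match(title)
    if not m:
        raise CantParse(title)
    # Search for parenthesized groups only after the issue number.
    specials = PAREN_RE.findall(title, m.end())
    if len(specials) < 2:
        raise CantParse(title)
    return m.group(1), specials


def title_with_one_special(title):
    """Same pattern as the rule in the answer, with the special wrapped in a list."""
    m = re.match(r"[^#]*#(\d+) \(([^)]+)\)", title)
    if not m:
        raise CantParse(title)
    return m.group(1), [m.group(2)]


def title_with_issue_only(title):
    """Hypothetical fallback: an issue number and no specials."""
    m = ISSUE_RE.match(title)
    if not m:
        raise CantParse(title)
    return m.group(1), []


# Most specific rule first, most general last.
RULES = [title_with_several_specials, title_with_one_special, title_with_issue_only]


def parse_extra(title, rules=RULES):
    """Return the result of the first rule that accepts the title."""
    for rule in rules:
        try:
            return rule(title)
        except CantParse:
            pass
    raise CantParse(title)


for t in ["Stan Lee's Traveler #12 (10 Copy Incentive Cover)",
          "Betrayal Of The Planet Of The Apes #1 (Of 4)(25 Copy Incentive Cover) )"]:
    print(parse_extra(t))
```

The order matters: if the fallback rule came first, it would accept every title with an issue number and the more specific rules would never run.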

Thanks! This gets me to identify the issue numbers/special, but how do I then go on to record them in the loop? book.issue_num = ? – Alxjrvs Oct 8 at 18:00

@Alxjrvs: there is still much to do to parse all your title formats. The rule function simply returns (issue_num, special) pairs so you use that. – Jochen Ritzel Oct 8 at 22:40

Thanks. I am a little lost - how would I go about calling that in the script? I am not sure what is holding the values of the issue number and the special. – Alxjrvs Oct 8 at 23:26

Nevermind my last comment - a transcription error on my part made me go briefly stupid. Works great! – Alxjrvs Oct 8 at 23:34
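
For readers with the same question as in the comments, here is one way the returned pair could be recorded. This is only a sketch: the `Book` class and the `load_books` helper are hypothetical stand-ins, since the thread never shows the asker's real data structure.

```python
import re


class CantParse(Exception):
    """Raised when a rule does not recognise a title."""


class Book(object):
    """Hypothetical record; stands in for whatever the script stores."""

    def __init__(self, title):
        self.title = title
        self.issue_num = None
        self.special = None


def title_with_special(title):
    """One rule; returns an (issue_num, special) pair or raises CantParse."""
    m = re.match(r"[^#]*#(\d+) \(([^)]+)\)", title)
    if m:
        return m.group(1), m.group(2)
    raise CantParse(title)


def load_books(titles):
    """Build one Book per title, filling in the parsed values when a rule matches."""
    books = []
    for title in titles:
        book = Book(title)
        try:
            # The rule returns a pair, so unpack it into two names.
            issue_num, special = title_with_special(title)
        except CantParse:
            pass  # no rule matched: leave both attributes as None
        else:
            book.issue_num = int(issue_num)
            book.special = special
        books.append(book)
    return books


for b in load_books(["Stan Lee's Traveler #12 (10 Copy Incentive Cover)", "Untitled"]):
    print("%s | %s | %s" % (b.title, b.issue_num, b.special))
```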

Regular expressions are the way to go. But if you feel uncomfortable writing them, you can try a small parser that I wrote (https://github.com/hgrecco/stringparser).

It translates a string format (PEP 3101) to a regular expression. In your case, you would do the following:

    >>> from stringparser import Parser
    >>> p = Parser(r"{date:s}\|{date2:s}\|{title:s}#{issue:d} \({special:s}\)")
    >>> x = p("10/12/11|10/12/11|Stan Lee's Traveler #12 (10 Copy Incentive Cover)")
    OrderedDict([('date', '10/12/11'), ('date2', '10/12/11'), ('title', "Stan Lee's Traveler "), ('issue', 12), ('special', '10 Copy Incentive Cover')])
    >>> x.issue
    12

The output in this case is an (ordered) dictionary.

This will work for any simple case, and you might tweak it to catch multiple issues or multiple (). One more thing: notice that in the current version you need to manually escape regex characters (e.g. if you want to find |, you need to type \|). I am planning to change this soon.
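
To illustrate the general idea of turning a format string into a regular expression, here is a small standard-library sketch. It is not the stringparser implementation: the names `compile_format` and `parse` are invented for this example, only the `s` and `d` format types are handled, and every field must be named.

```python
import re
from string import Formatter

# Regex fragment and converter for each supported format type (assumed subset).
_TYPES = {
    "s": (r".+?", str),
    "d": (r"[-+]?\d+", int),
}


def compile_format(fmt):
    """Turn a '{name:type}' format string into (compiled regex, converters)."""
    parts = []
    converters = {}
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion).
    for literal, name, spec, _conversion in Formatter().parse(fmt):
        parts.append(re.escape(literal))  # literal text is escaped for the regex
        if name is not None:
            fragment, convert = _TYPES[spec or "s"]
            parts.append("(?P<%s>%s)" % (name, fragment))
            converters[name] = convert
    return re.compile("^" + "".join(parts) + "$"), converters


def parse(fmt, text):
    """Return a dict of converted field values, or None if text does not fit."""
    regex, converters = compile_format(fmt)
    match = regex.match(text)
    if match is None:
        return None
    return {name: converters[name](value)
            for name, value in match.groupdict().items()}


fmt = "{date}|{date2}|{title} #{issue:d} ({special})"
print(parse(fmt, "10/12/11|10/12/11|Stan Lee's Traveler #12 (10 Copy Incentive Cover)"))
```

Because this sketch passes the literal text through `re.escape`, the format string here is written without backslashes; that is a choice made for the example, not a statement about how the library behaves.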

I am not sure how to do this kind of rigorous parsing. What would be a good way to start cracking this problem? My knowledge of if/else statements quickly fell apart for the more complicated lines.

How can I efficiently and (if possible) pythonically parse these lines and subdivide them so I can later slot them into the correct place in my database?

