If you want to get both Chinese phrases when there are two of them (as in adult and aircraft), you'll need to work harder. The code below is for Python 3.x.

    # coding: utf8
    import re

    s = """“作為”(act) ,用於罪行或民事過失時,包括一連串作為、任何違法的不作為和一連串違法的不作為;
    “行政上訴委員會”(Administrative Appeals Board) 指根據《行政上訴委員會條例》(第442章)設立的行政上訴委員會;(由1994年第6號第32條增補)
    “成人”、“成年人”(adult)* 指年滿18歲的人; (由1990年第32號第6條修訂)
    “飛機”、“航空器”(aircraft) 指任何可憑空氣的反作用而在大氣中獲得支承力的機器;
    “外籍人士”(alien) 指並非中國公民的人; (由1998年第26號第4條增補)
    “修訂”(amend) 包括廢除、增補或更改,亦指同時進行,或以同一條例或文書進行上述全部或其中任何事項; (由1993年第89號第3條修訂)
    “可逮捕的罪行”(arrestable offence) 指由法律規限固定刑罰的罪行,或根據、憑藉法例對犯者可處超過12個月監禁的罪行,亦指犯任何這類罪行的企圖; (由1971年第30號第2條增補)
    “《基本法》”(Basic Law) 指《中華人民共和國香港特別行政區基本法》; (由1998年第26號第4條增補)
    “行政長官”(Chief Executive) 指─"""

    # Group 1: first quoted term; group 2: optional second term after "、";
    # group 3: the English gloss in parentheses.
    for zh1, zh2, en in re.findall(r"“([^”]*)”(?:、“([^”]*)”)?\((.*?)\)", s):
        print(ascii((zh1, zh2, en)))

resulting in:

    ('\u4f5c\u70ba', '', 'act')
    ('\u884c\u653f\u4e0a\u8a34\u59d4\u54e1\u6703', '', 'Administrative Appeals Board')
    ('\u6210\u4eba', '\u6210\u5e74\u4eba', 'adult')
    ('\u98db\u6a5f', '\u822a\u7a7a\u5668', 'aircraft')
    ('\u5916\u7c4d\u4eba\u58eb', '', 'alien')
    ('\u4fee\u8a02', '', 'amend')
    ('\u53ef\u902e\u6355\u7684\u7f6a\u884c', '', 'arrestable offence')
    ('\u300a\u57fa\u672c\u6cd5\u300b', '', 'Basic Law')
    ('\u884c\u653f\u9577\u5b98', '', 'Chief Executive')
Yes, it works. Thank you very much. – Walapa Jul 13 '10 at 9:37
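You want to use the groups feature of regular expressions:

    import re

    # The group names `chinese` and `english` are illustrative placeholders.
    # This pattern expects plain ASCII double quotes around the Chinese term.
    myRegExp = re.compile(r'"(?P<chinese>.*?)".*?\((?P<english>.*?)\)')
    myRegExp.findall(YourStringHere)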
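As a rough sketch of how named groups could be applied to the curly-quoted text from the question: the group names `zh` and `en`, the variable names, and the shortened two-entry sample string are all assumptions made for illustration, not part of the original answer.

```python
import re

# Hypothetical two-entry sample in the same shape as the question's text.
sample = "“作為”(act) ,用於罪行或民事過失時; “外籍人士”(alien) 指並非中國公民的人;"

# Named groups: `zh` for the quoted Chinese term, `en` for the English gloss.
# Both names are illustrative choices.
entry_re = re.compile(r"“(?P<zh>[^”]+)”\((?P<en>[^)]+)\)")

# groupdict() returns the named groups of each match as a dictionary.
entries = [m.groupdict() for m in entry_re.finditer(sample)]
for entry in entries:
    print(entry["zh"], entry["en"])
```

Using `finditer` with `groupdict()` keeps the field names attached to each value, which may be easier to read than the positional tuples that `findall` returns.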
    >>> import re
    >>> s = u"""“作為”(act) ,用於罪行或民事過失時,包括一連串作為、任何違法的不作為和一連串違法的不作為;
    ... “行政上訴委員會”(Administrative Appeals Board) 指根據《行政上訴委員會條例》(第442章)設立的行政上訴委員會;(由1994年第6號第32條增補)
    ... “成人”、“成年人”(adult)* 指年滿18歲的人; (由1990年第32號第6條修訂)
    ... “飛機”、“航空器”(aircraft) 指任何可憑空氣的反作用而在大氣中獲得支承力的機器;
    ... “外籍人士”(alien) 指並非中國公民的人; (由1998年第26號第4條增補)
    ... “修訂”(amend) 包括廢除、增補或更改,亦指同時進行,或以同一條例或文書進行上述全部或其中任何事項; (由1993年第89號第3條修訂)
    ... “可逮捕的罪行”(arrestable offence) 指由法律規限固定刑罰的罪行,或根據、憑藉法例對犯者可處超過12個月監禁的罪行,亦指犯任何這類罪行的企圖; (由1971年第30號第2條增補)
    ... “《基本法》”(Basic Law) 指《中華人民共和國香港特別行政區基本法》; (由1998年第26號第4條增補)
    ... “行政長官”(Chief Executive) 指─"""
    >>> for x, y in re.findall(u"“(.*?)”\((.*?)\)", s):
    ...     print x, y
    ...
    作為 act
    行政上訴委員會 Administrative Appeals Board
    成人”、“成年人 adult
    飛機”、“航空器 aircraft
    外籍人士 alien
    修訂 amend
    可逮捕的罪行 arrestable offence
    《基本法》 Basic Law
    行政長官 Chief Executive

If you want to use this in a program, you should use # -*- coding: utf-8 -*- at the top of the file, so the “ and ” are interpreted correctly.
I prefer a greedy pattern u'“([^”]+)”\\(([^)]+)\\)'. – KennyTM Jul 13 '10 at 6:46

I don't want ”、“ between Chinese words, thank you very much – Walapa Jul 13 '10 at 9:35
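To make the difference between the two patterns in these comments concrete, here is a small Python 3 sketch. The shortened sample string and the variable names `lazy` and `negated` are assumptions for illustration only.

```python
import re

# Hypothetical two-entry sample: one entry with two terms, one with a single term.
sample = "“成人”、“成年人”(adult)* 指年滿18歲的人; “修訂”(amend) 包括廢除、增補或更改;"

# Lazy dot: the first group may run across ”、“ until it finds a ” followed by "(".
lazy = re.findall(r"“(.*?)”\((.*?)\)", sample)

# Negated character class: the first group can never contain a closing quote,
# so only the term directly before the parenthesis is captured.
negated = re.findall(r"“([^”]+)”\(([^)]+)\)", sample)

print(lazy)
print(negated)
```

With the negated class, the first term of a two-term entry is dropped rather than merged into the second, so neither pattern alone returns both terms separately.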
To match definitions with multiple terms, use multiple regexes: one to collect each definition and one to split it into its terms.

    # Assume Python 3.x. Use u'...' instead of '...' for Python 2.x.
    import re

    # Collects one or more quoted terms (optionally separated by "、")
    # followed by the English gloss in parentheses.
    collector_re = re.compile('((?:“[^”]+”、?)+)\\(([^)]+)\\)')
    # Extracts each individual quoted term from the collected run.
    splitter_re = re.compile('“([^”]+)”')

    def find_all_definitions(text):
        def_pairs = collector_re.finditer(text)
        for match in def_pairs:
            (chinese, english) = match.groups()
            terms = splitter_re.findall(chinese)
            yield (terms, english)

Usage:

    text = '''“作為”(act) ,用於罪行或民事過失時,包括一連串作為、任何違法的不作為和一連串違法的不作為;
    “行政上訴委員會”(Administrative Appeals Board) 指根據《行政上訴委員會條例》(第442章)設立的行政上訴委員會;(由1994年第6號第32條增補)
    “成人”、“成年人”(adult)* 指年滿18歲的人; (由1990年第32號第6條修訂)
    “飛機”、“航空器”(aircraft) 指任何可憑空氣的反作用而在大氣中獲得支承力的機器;
    “外籍人士”(alien) 指並非中國公民的人; (由1998年第26號第4條增補)
    “修訂”(amend) 包括廢除、增補或更改,亦指同時進行,或以同一條例或文書進行上述全部或其中任何事項; (由1993年第89號第3條修訂)
    “可逮捕的罪行”(arrestable offence) 指由法律規限固定刑罰的罪行,或根據、憑藉法例對犯者可處超過12個月監禁的罪行,亦指犯任何這類罪行的企圖; (由1971年第30號第2條增補)
    “《基本法》”(Basic Law) 指《中華人民共和國香港特別行政區基本法》; (由1998年第26號第4條增補)
    “行政長官”(Chief Executive) 指─'''

    for terms, english in find_all_definitions(text):
        print(', '.join(terms), "\t", english)