Best way to convert a Unicode URL to ASCII (UTF-8 percent-escaped) in Python?

Code:

    import urlparse, urllib

    def fixurl(url):
        # turn string into unicode
        if not isinstance(url, unicode):
            url = url.decode('utf8')

        # parse it
        parsed = urlparse.urlsplit(url)

        # divide the netloc further
        # (rpartition, so that a netloc without '@' leaves userpass empty)
        userpass, at, hostport = parsed.netloc.rpartition('@')
        user, colon1, pass_ = userpass.partition(':')
        host, colon2, port = hostport.partition(':')

        # encode each component
        scheme = parsed.scheme.encode('utf8')
        user = urllib.quote(user.encode('utf8'))
        colon1 = colon1.encode('utf8')
        pass_ = urllib.quote(pass_.encode('utf8'))
        at = at.encode('utf8')
        host = host.encode('idna')
        colon2 = colon2.encode('utf8')
        port = port.encode('utf8')
        path = '/'.join(  # could be encoded slashes!
            urllib.quote(urllib.unquote(pce).encode('utf8'), '')
            for pce in parsed.path.split('/')
        )
        query = urllib.quote(urllib.unquote(parsed.query).encode('utf8'), '=&?/')
        fragment = urllib.quote(urllib.unquote(parsed.fragment).encode('utf8'))

        # put it back together
        netloc = ''.join((user, colon1, pass_, at, host, colon2, port))
        return urlparse.urlunsplit((scheme, netloc, path, query, fragment))

    print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5')
    print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/%2F')
    print fixurl(u'http://Åsa:abc123@➡.ws:81/admin')

Output:

    http://xn--hgi.ws/%E2%99%A5
    http://xn--hgi.ws/%E2%99%A5/%2F
    http://%C3%85sa:abc123@xn--hgi.ws:81/admin

Read more: urllib.quote(), urlparse.urlparse(), urlparse.urlunparse(), urlparse.urlsplit(), urlparse.urlunsplit()

Edits: Fixed the case of already quoted characters in the string. Changed urlparse/urlunparse to urlsplit/urlunsplit. Don't encode user and port information with the hostname. (Thanks Jehiah.)
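For anyone on Python 3, where urlparse and urllib.quote live in urllib.parse and strings are already Unicode, the same idea might look roughly like the sketch below. The name fixurl_py3 is made up for illustration, and this is an untuned port that assumes a str input and a host without IPv6 brackets, not a drop-in replacement for the function above.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

def fixurl_py3(url):
    """Hypothetical Python 3 port of fixurl(); takes and returns str."""
    parsed = urlsplit(url)

    # Split the netloc into user info and host/port.
    # Assumption: no IPv6 literal, so ':' only separates host and port.
    userpass, at, hostport = parsed.netloc.rpartition('@')
    user, colon1, pass_ = userpass.partition(':')
    host, colon2, port = hostport.partition(':')

    # Percent-encode the user info; IDNA-encode only the host name.
    user = quote(user)
    pass_ = quote(pass_)
    host = host.encode('idna').decode('ascii')

    # Re-quote each path segment on its own so '%2F' stays distinct from '/'.
    path = '/'.join(quote(unquote(seg), safe='')
                    for seg in parsed.path.split('/'))
    query = quote(unquote(parsed.query), safe='=&?/')
    fragment = quote(unquote(parsed.fragment))

    netloc = ''.join((user, colon1, pass_, at, host, colon2, port))
    return urlunsplit((parsed.scheme, netloc, path, query, fragment))
```

In Python 3, quote() encodes str input as UTF-8 by default, so the explicit .encode('utf8') calls from the Python 2 version are not needed here.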

Nice solution, thanks. Good call about using urlparse/urlunparse, and noticing the case of already-quoted chars in the input. But I'm unsure why you need the split('/') logic, because urllib.quote() already considers slashes safe. See also my new, doctested, and somewhat more complete solution below. – benhoyt Apr 30 '09 at 2:39

The problem is that '/' is considered a path separator, while '%2F' is not. If I just unquote the string, they become one and the same. Maybe it would be better to never unquote the path at all, and encode all existing '%' as '%25'? – MizardX Apr 30 '09 at 7:52

netloc != domain, so you should parse the domain out from user:pass@domain:port first, then convert to IDNA. – Jehiah May 21 '10 at 21:32
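To make the '/' versus '%2F' point concrete, here is a small sketch using Python 3's urllib.parse (an assumption; the thread itself uses Python 2's urllib). The sample path is invented for illustration.

```python
from urllib.parse import quote, unquote

path = '/a%2Fb/c'

# Unquoting the whole path turns the escaped slash into a real separator,
# so the two-segment path silently becomes a three-segment one.
whole = quote(unquote(path), safe='/')

# Unquoting and re-quoting one segment at a time keeps the slash escaped.
per_segment = '/'.join(quote(unquote(seg), safe='')
                       for seg in path.split('/'))

print(whole)        # /a/b/c
print(per_segment)  # /a%2Fb/c
```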

There's some RFC 3986 URL parsing work underway (e.g. as part of the Summer of Code), but nothing in the standard library yet AFAIK, and nothing much on the URI encoding side of things either, again AFAIK. So you might as well go with MizardX's elegant approach.

See bugs.python.org/issue1712522 for the current state of affairs. – mdorseif Oct 24 '10 at 16:32

You might use urlparse.urlsplit instead, but otherwise you seem to have a very straightforward solution there.

    protocol, domain, path, query, fragment = urlparse.urlsplit(url)

(You can access the domain and port separately through the returned value's named properties, but as port syntax is always ASCII, it is unaffected by the IDNA encoding process.)
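As a rough illustration of those named properties, the sketch below uses Python 3's urllib.parse (an assumption; in Python 2 the same attributes are on urlparse.urlsplit results). The example URL is made up, and note that this version drops the user info when rebuilding the netloc.

```python
from urllib.parse import urlsplit

# 'b\u00fccher' is 'bücher'; the URL is an invented example.
parts = urlsplit('http://user:secret@b\u00fccher.example:81/admin')

# hostname and port are parsed out of the netloc for us.
host = parts.hostname   # host name only, lower-cased, without user info
port = parts.port       # integer port, or None if absent

# Only the host name goes through IDNA; the port is re-attached afterwards.
ascii_host = host.encode('idna').decode('ascii')
netloc = ascii_host if port is None else '%s:%d' % (ascii_host, port)

print(netloc)  # xn--bcher-kva.example:81
```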

Okay, with these comments and some bug-fixing in my own code (it didn't handle fragments at all), I've come up with the following canonurl() function, which returns a canonical, ASCII form of the URL:

    import re
    import urllib
    import urlparse

    def canonurl(url):
        r"""Return the canonical, ASCII-encoded form of a UTF-8 encoded URL, or ''
        if the URL looks invalid.

        >>> canonurl('    ')
        ''
        >>> canonurl('www.google.com')
        'http://www.google.com/'
        >>> canonurl('bad-utf8.com/path\xff/file')
        ''
        >>> canonurl('svn://blah.com/path/file')
        'svn://blah.com/path/file'
        >>> canonurl('1234://badscheme.com')
        ''
        >>> canonurl('bad$scheme://google.com')
        ''
        >>> canonurl('site.badtopleveldomain')
        ''
        >>> canonurl('site.com:badport')
        ''
        >>> canonurl('http://123.24.8.240/blah')
        'http://123.24.8.240/blah'
        >>> canonurl('http://123.24.8.240:1234/blah?q#f')
        'http://123.24.8.240:1234/blah?q#f'
        >>> canonurl('\xe2\x9e\xa1.ws')  # tinyarro.ws
        'http://xn--hgi.ws/'
        >>> canonurl('  http://www.google.com:80/path/file;params?query#fragment  ')
        'http://www.google.com:80/path/file;params?query#fragment'
        >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5')
        'http://xn--hgi.ws/%E2%99%A5'
        >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/pa%2Fth')
        'http://xn--hgi.ws/%E2%99%A5/pa/th'
        >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/pa%2Fth;par%2Fams?que%2Fry=a&b=c')
        'http://xn--hgi.ws/%E2%99%A5/pa/th;par/ams?que/ry=a&b=c'
        >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5?\xe2\x99\xa5#\xe2\x99\xa5')
        'http://xn--hgi.ws/%E2%99%A5?%E2%99%A5#%E2%99%A5'
        >>> canonurl('http://\xe2\x9e\xa1.ws/%e2%99%a5?%E2%99%A5#%E2%99%A5')
        'http://xn--hgi.ws/%E2%99%A5?%E2%99%A5#%E2%99%A5'
        >>> canonurl('http://badutf8pcokay.com/%FF?%FE#%FF')
        'http://badutf8pcokay.com/%FF?%FE#%FF'
        >>> len(canonurl('http://google.com/' + 'a' * 16384))
        4096
        """
        # strip spaces at the ends and ensure it's prefixed with 'scheme://'
        url = url.strip()
        if not url:
            return ''
        if not urlparse.urlsplit(url).scheme:
            url = 'http://' + url

        # turn it into Unicode
        try:
            url = unicode(url, 'utf-8')
        except UnicodeDecodeError:
            return ''  # bad UTF-8 chars in URL

        # parse the URL into its components
        parsed = urlparse.urlsplit(url)
        scheme, netloc, path, query, fragment = parsed

        # ensure scheme is a letter followed by letters, digits, and '+-.' chars
        if not re.match(r'[a-z][-+.a-z0-9]*$', scheme, flags=re.I):
            return ''
        scheme = str(scheme)

        # ensure domain and port are valid, eg: sub.domain.<1-to-6-TLD-chars>[:port]
        match = re.match(r'(.+\.[a-z0-9]{1,6})(:\d{1,5})?$', netloc, flags=re.I)
        if not match:
            return ''
        domain, port = match.groups()
        netloc = domain + (port if port else '')
        netloc = netloc.encode('idna')

        # ensure path is valid and convert Unicode chars to %-encoded
        if not path:
            path = '/'  # eg: 'http://google.com' -> 'http://google.com/'
        path = urllib.quote(urllib.unquote(path.encode('utf-8')), safe='/;')

        # ensure query is valid
        query = urllib.quote(urllib.unquote(query.encode('utf-8')), safe='=&?/')

        # ensure fragment is valid
        fragment = urllib.quote(urllib.unquote(fragment.encode('utf-8')))

        # piece it all back together, truncating it to a maximum of 4KB
        url = urlparse.urlunsplit((scheme, netloc, path, query, fragment))
        return url[:4096]

    if __name__ == '__main__':
        import doctest
        doctest.testmod()

Just cutting it off at 4096 characters could leave partial quoted characters. You could use the regular expression r'%.?$' to match any trailing partial escapes. – MizardX Apr 30 '09 at 12:50.
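A minimal sketch of that suggestion, assuming the URL is already percent-encoded ASCII; truncate_url and its limit parameter are hypothetical names, not part of canonurl():

```python
import re

def truncate_url(url, limit=4096):
    """Cut url to at most `limit` characters without leaving a
    dangling '%' or '%X' at the end."""
    cut = url[:limit]
    # '%.?$' matches a trailing '%' or '%' plus one character,
    # i.e. an escape that the cut left incomplete.
    return re.sub(r'%.?$', '', cut)

print(truncate_url('abc%E2%99%A5', 5))  # abc
```

This only guards against a split %XX escape. A multi-byte UTF-8 character such as %E2%99%A5 can still be cut between two complete escapes, which would need extra handling.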

The code given by MizardX isn't 100% correct. This example won't work: example.com/folder/?page=2

Check out django.utils.encoding.iri_to_uri() to convert Unicode URLs to ASCII URLs: docs.djangoproject.com/en/dev/ref/unicode.

