unicode - Converting domain names to IDN in Python


I have a long list of domain names that I need to generate some reports on. The list contains some IDN domains, and although I know how to convert them in Python on the command line:

    >>> domain = u"pfarmerü.com"
    >>> domain
    u'pfarmer\xfc.com'
    >>> domain.encode("idna")
    'xn--pfarmer-t2a.com'
    >>>

I'm struggling to get it to work in a small script that reads the data from a text file.

    #!/usr/bin/python

    import sys

    infile = open(sys.argv[1])

    for line in infile:
        print line,
        domain = unicode(line.strip())
        print type(domain)
        print "idn:", domain.encode("idna")
        print

I get the following output:

    $ ./idn.py ./test
    pfarmer.com
    <type 'unicode'>
    idn: pfarmer.com

    pfarmerü.com
    Traceback (most recent call last):
      File "./idn.py", line 9, in <module>
        domain = unicode(line.strip())
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 7: ordinal not in range(128)

I have also tried:

    #!/usr/bin/python

    import sys
    import codecs

    infile = codecs.open(sys.argv[1], "r", "utf8")

    for line in infile:
        print line,
        domain = line.strip()
        print type(domain)
        print "idn:", domain.encode("idna")
        print

Which gave me:

    $ ./idn.py ./test
    Traceback (most recent call last):
      File "./idn.py", line 8, in <module>
        for line in infile:
      File "/usr/lib/python2.6/codecs.py", line 679, in next
        return self.reader.next()
      File "/usr/lib/python2.6/codecs.py", line 610, in next
        line = self.readline()
      File "/usr/lib/python2.6/codecs.py", line 525, in readline
        data = self.read(readsize, firstline=True)
      File "/usr/lib/python2.6/codecs.py", line 472, in read
        newchars, decodedbytes = self.decode(data, self.errors)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-5: unsupported Unicode code range

Here is my test data file:

    pfarmer.com
    pfarmerü.com

I'm well aware that I need to understand Unicode now.

Thanks,

Peter

You need to know in which encoding your file was saved. This would be something like 'utf-8' (which is not Unicode) or 'iso-8859-1' or 'cp1252' or the like.
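The first traceback in the question reports byte 0xfc at position 7. That is how "ü" is stored in ISO-8859-1 and cp1252, whereas UTF-8 stores it as the two bytes 0xc3 0xbc, so the test file is probably not UTF-8. When the encoding is unknown, one rough approach is to try a few candidates in order. The sketch below is written for Python 3 (the thread itself uses Python 2), and `guess_encoding` is a hypothetical helper name, not a library function. It only reports the first candidate that decodes without error, which does not prove that the guess is right.

```python
def guess_encoding(raw, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that decodes `raw` without error.

    Hypothetical helper: strict encodings such as UTF-8 should come first,
    because single-byte encodings accept almost any input.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# "pfarmer\u00fc.com" is "pfarmerü.com" written with an escape sequence.
latin1_bytes = "pfarmer\u00fc.com".encode("iso-8859-1")  # holds the byte 0xfc
utf8_bytes = "pfarmer\u00fc.com".encode("utf-8")         # holds 0xc3 0xbc

print(guess_encoding(latin1_bytes))
print(guess_encoding(utf8_bytes))
```

ISO-8859-1 is left out of the default candidates on purpose: it maps every possible byte to a character, so it would never fail and could hide a wrong guess.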

Then you can do (assuming 'utf-8'):

    infile = open(sys.argv[1])

    for line in infile:
        print line,
        domain = line.strip().decode('utf-8')
        print type(domain)
        print "idn:", domain.encode("idna")
        print
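For readers on Python 3, roughly the same loop might look like the sketch below. It assumes the file's encoding is passed in explicitly, and the names `to_idna` and `convert_file` are made up for illustration. In Python 3 the file object decodes each line on read, so no separate `decode` call is needed.

```python
import io


def to_idna(lines):
    """Yield (domain, idn) pairs for each non-empty line of decoded text."""
    for line in lines:
        domain = line.strip()
        if not domain:
            continue
        # str.encode("idna") returns ASCII bytes; decode them to get a string.
        yield domain, domain.encode("idna").decode("ascii")


def convert_file(path, encoding="utf-8"):
    """Read `path` using `encoding` and return a list of (domain, idn) pairs."""
    with io.open(path, encoding=encoding) as infile:
        return list(to_idna(infile))


# In-memory demonstration with the two domains from the question.
for domain, idn in to_idna(["pfarmer.com\n", "pfarmer\u00fc.com\n"]):
    print("idn:", idn)
```

For a file saved as ISO-8859-1, the call would be `convert_file(sys.argv[1], "iso-8859-1")` after importing `sys`.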

You convert encoded strings to unicode with decode. You convert unicode to an encoded string with encode. If you try to encode something that is already encoded, Python tries to decode it first, using the default codec 'ascii', which fails for non-ASCII values.
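The two directions can be illustrated with a short sketch. It is written for Python 3, where encoded data is `bytes` and text is `str`, so the implicit ASCII decoding described above does not happen. The explicit `raw.decode("ascii")` call stands in for what Python 2 does behind the scenes. The sample bytes assume an ISO-8859-1 file, as the 0xfc byte in the question's traceback suggests.

```python
raw = b"pfarmer\xfc.com"          # encoded bytes, as read from the file

text = raw.decode("iso-8859-1")   # decode: bytes -> text (Unicode)
idn = text.encode("idna")         # encode: text -> bytes, via the IDNA codec
print(idn)                        # b'xn--pfarmer-t2a.com'

# Decoding the same bytes as ASCII fails, like the first traceback.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    print("decode failed at position", exc.start, "-", exc.reason)
```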

