unicode - Converting domain names to IDN in Python
I have a long list of domain names that I need to generate some reports on. The list contains some IDN domains, and although I know how to convert them in Python on the command line:
    >>> domain = u"pfarmerü.com"
    >>> domain
    u'pfarmer\xfc.com'
    >>> domain.encode("idna")
    'xn--pfarmer-t2a.com'
    >>>
I'm struggling to get it to work in a small script that reads the data from a text file.
    #!/usr/bin/python

    import sys

    infile = open(sys.argv[1])

    for line in infile:
        print line,
        domain = unicode(line.strip())
        print type(domain)
        print "idn:", domain.encode("idna")
        print
I get the following output:
    $ ./idn.py ./test
    pfarmer.com
    <type 'unicode'>
    idn: pfarmer.com

    pfarmerü.com
    Traceback (most recent call last):
      File "./idn.py", line 9, in <module>
        domain = unicode(line.strip())
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 7: ordinal not in range(128)
I have also tried:
    #!/usr/bin/python

    import sys
    import codecs

    infile = codecs.open(sys.argv[1], "r", "utf8")

    for line in infile:
        print line,
        domain = line.strip()
        print type(domain)
        print "idn:", domain.encode("idna")
        print
Which gave me:
    $ ./idn.py ./test
    Traceback (most recent call last):
      File "./idn.py", line 8, in <module>
        for line in infile:
      File "/usr/lib/python2.6/codecs.py", line 679, in next
        return self.reader.next()
      File "/usr/lib/python2.6/codecs.py", line 610, in next
        line = self.readline()
      File "/usr/lib/python2.6/codecs.py", line 525, in readline
        data = self.read(readsize, firstline=True)
      File "/usr/lib/python2.6/codecs.py", line 472, in read
        newchars, decodedbytes = self.decode(data, self.errors)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-5: unsupported Unicode code range
Here is my test data file:
    pfarmer.com
    pfarmerü.com
I'm very aware of my need to understand Unicode now.
Thanks,
Peter
You need to know in which encoding the file was saved. That will be something like 'utf-8' (which is not the same thing as Unicode), 'iso-8859-1', 'cp1252' or the like.
Then you can do this (assuming 'utf-8'):
    import sys

    infile = open(sys.argv[1])

    for line in infile:
        print line,
        # Decode the raw bytes; replace 'utf-8' with the file's actual encoding.
        domain = line.strip().decode('utf-8')
        print type(domain)
        print "idn:", domain.encode("idna")
        print
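Judging by the byte 0xfc in the first traceback, the test file is probably not UTF-8 at all: 0xfc is ü in ISO-8859-1 and cp1252, and it never occurs in valid UTF-8. Below is a minimal sketch under that assumption. The function name `to_idna` and the default encoding are illustrative choices, not part of the original scripts, and `io.open` is used so that the same code should run on Python 2.6+ and Python 3.

```python
import io


def to_idna(path, encoding="iso-8859-1"):
    """Return (domain, idna_name) pairs for each non-empty line of the file.

    The default encoding is an assumption based on the 0xfc byte in the
    traceback; pass the real encoding of your file if it differs.
    """
    results = []
    # io.open decodes the bytes while reading, so each line is Unicode text.
    with io.open(path, "r", encoding=encoding) as infile:
        for line in infile:
            domain = line.strip()
            if not domain:
                continue  # skip blank lines
            # The "idna" codec returns ASCII bytes; decode them for display.
            idna_name = domain.encode("idna").decode("ascii")
            results.append((domain, idna_name))
    return results
```

Calling `to_idna(sys.argv[1])` and printing each pair would then stand in for the loop in the script.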
You convert encoded strings to unicode with decode. You convert unicode to encoded strings with encode. If you try to encode something that is already encoded, Python tries to decode it first, using the default codec 'ascii', which fails for non-ASCII values.
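The two directions can be sketched with the domain from the question. This is only an illustration: the bytes literal stands in for a line read from the file, and ISO-8859-1 is an assumption drawn from the 0xfc byte in the traceback. The explicit `decode("ascii")` at the end mimics the implicit decode Python 2 performs when `encode` is called on a byte string.

```python
# Raw bytes as they would come from an ISO-8859-1 encoded file (assumed).
raw = b"pfarmer\xfc.com"

# decode: encoded bytes -> Unicode text
text = raw.decode("iso-8859-1")

# encode: Unicode text -> encoded bytes (here the ASCII-only IDNA form)
idna_name = text.encode("idna")

# Decoding with the 'ascii' codec fails on the non-ASCII byte, which is
# the same failure the first script ran into.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```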