How to download any(!) webpage with the correct charset in Python?

Question

When you download a page with urllib or urllib2, you can check whether the server sent a charset parameter in the Content-Type header:

import urllib2
fp = urllib2.urlopen(request)
charset = fp.headers.getparam('charset')  # None if no charset was sent

You can use BeautifulSoup to locate a meta element in the HTML:

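import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(data)
# Guard against meta elements that have no http-equiv attribute at all
meta = soup.findAll('meta', {'http-equiv': lambda v: v is not None and v.lower() == 'content-type'})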

If neither is available, browsers typically fall back to user configuration, combined with auto-detection. As rajax proposes, you could use the chardet module. If you have user configuration available telling you that the page should be Chinese (say), you may be able to do better.
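The lookup order described above (HTTP header first, then the meta element, then a fallback) could be sketched roughly as follows. This is only an illustration under some assumptions: it uses the Python 3 standard library rather than urllib2 and BeautifulSoup, the function names are made up for this example, a simple regular expression stands in for a real HTML parser, and a fixed default stands in for chardet or user configuration.

```python
import re
from email.message import Message

# Rough approximation: matches <meta charset="..."> as well as the
# charset=... part inside <meta http-equiv="Content-Type" content="...">.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


def charset_from_meta(body, scan_bytes=4096):
    """Look for a charset declaration in the first bytes of the page."""
    match = META_CHARSET_RE.search(body[:scan_bytes])
    return match.group(1).decode('ascii').lower() if match else None


def guess_charset(content_type, body, default='utf-8'):
    """Hypothetical helper: header, then meta element, then a default."""
    return (charset_from_header(content_type)
            or charset_from_meta(body)
            or default)


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"></head></html>'
print(guess_charset('text/html', page))
```

If chardet is installed, the default could instead come from its detection result, or from a user setting such as an expected language.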
