How do you install lxml on OS X Leopard without using MacPorts or Fink?

Thanks to @jessenoller on Twitter I have an answer that fits my needs: you can compile lxml with static dependencies, which avoids touching the libxml2 that ships with OS X. Here’s what worked for me:
cd /tmp
curl -O http://lxml.de/files/lxml-3.6.0.tgz
tar -xzvf lxml-3.6.0.tgz
cd lxml-3.6.0
python setup.py build --static-deps --libxml2-version=2.7.3 --libxslt-version=1.1.24
sudo python …

How to Pretty Print HTML to a file, with indentation

I ended up using BeautifulSoup directly; it is what lxml.html.soupparser uses for parsing HTML. BeautifulSoup has a prettify method that does exactly what its name says: it re-indents the HTML. BeautifulSoup will NOT fix the HTML, so broken code remains broken. But in this case, since the code is …
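A minimal sketch of the prettify approach, assuming BeautifulSoup 4 (the bs4 package) and the built-in "html.parser" backend. The sample markup and the temporary output path are illustrative only and are not from the original answer.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Illustrative markup; any HTML string works here.
html = "<html><body><div><p>Hello</p><p>World</p></div></body></html>"

# Parse, then ask BeautifulSoup for an indented rendering.
soup = BeautifulSoup(html, "html.parser")
pretty = soup.prettify()

# Write the indented HTML to a file (a temporary location in this sketch).
out_path = os.path.join(tempfile.mkdtemp(), "pretty.html")
with open(out_path, "w", encoding="utf-8") as fh:
    fh.write(pretty)

print(pretty)
```

prettify() puts each tag and each string on its own line and indents by nesting depth, so the result is meant for reading rather than for byte-exact round-tripping.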

How to use regular expression in lxml xpath?

You can do this (although you don’t need regular expressions for the example). lxml supports regular expressions through the EXSLT extension functions (see the lxml docs for the XPath class; it also works for the xpath() method):
doc.xpath("//a[re:match(text(), 'some text')]", namespaces={"re": "http://exslt.org/regular-expressions"})
Note that you need to give the namespace mapping, so that it …
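As a self-contained illustration, the sketch below runs an EXSLT regular expression inside an lxml XPath predicate. It uses re:test (the boolean variant) rather than re:match, and the sample document is made up for the example.

```python
from lxml import etree

# EXSLT regular-expressions namespace; the "re" prefix is an arbitrary choice.
REGEXP_NS = "http://exslt.org/regular-expressions"
NSMAP = {"re": REGEXP_NS}

# Made-up sample document.
doc = etree.fromstring(
    "<root><a>some text here</a><a>other</a><a>Some Text too</a></root>"
)

# Case-sensitive test: only the first <a> starts with "some text".
hits = doc.xpath("//a[re:test(text(), '^some text')]", namespaces=NSMAP)

# The optional third argument carries flags; 'i' makes the match case-insensitive.
ci_hits = doc.xpath("//a[re:test(text(), '^some text', 'i')]", namespaces=NSMAP)

print([a.text for a in hits])
print([a.text for a in ci_hits])
```

Without the namespaces mapping, the re: prefix is undefined and lxml raises an XPath evaluation error.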

Parsing XML containing a default namespace to get an element value using lxml

This is a common error when dealing with XML that has a default namespace. Your XML declares a default namespace, that is, a namespace without a prefix, here:
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
Note that it is not only the element where the default namespace is declared that belongs to that namespace: all descendant elements inherit the ancestor’s default namespace implicitly, unless otherwise specified (using explicit namespace prefix …
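A short sketch of the usual workaround, assuming a small sitemap index invented for the example: bind the default namespace URI to a prefix of your choosing (here "sm", an arbitrary name) and use that prefix in the XPath.

```python
from lxml import etree

# Invented sample data; only the namespace URI comes from the text above.
xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://example.com/sitemap1.xml</loc></sitemap>
  <sitemap><loc>http://example.com/sitemap2.xml</loc></sitemap>
</sitemapindex>"""

root = etree.fromstring(xml)

# An unprefixed name in XPath means "no namespace", so this finds nothing.
unprefixed = root.xpath("//loc")

# Map a prefix to the default namespace URI and use it on every step.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
locs = root.xpath("//sm:sitemap/sm:loc/text()", namespaces=ns)

print(unprefixed)
print(locs)
```

The prefix used in the query does not have to match anything in the document; only the URI it maps to matters.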

python – lxml: enforcing a specific order for attributes

It looks like lxml serializes attributes in the order you set them:
>>> from lxml import etree as ET
>>> x = ET.Element("x")
>>> x.set('a', '1')
>>> x.set('b', '2')
>>> ET.tostring(x)
'<x a="1" b="2"/>'
>>> y = ET.Element("y")
>>> y.set('b', '2')
>>> y.set('a', '1')
>>> ET.tostring(y)
'<y b="2" a="1"/>'
Note that when you pass attributes using …
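The same observation can be wrapped in a small helper. This is only a sketch: element_with_ordered_attrs is a hypothetical name, not part of lxml, and it relies on the set-order behaviour described above. Under Python 3, tostring() returns bytes.

```python
from lxml import etree


def element_with_ordered_attrs(tag, pairs):
    """Hypothetical helper: create an element and set attributes one by one,
    in the order given, so they serialize in that order."""
    elem = etree.Element(tag)
    for name, value in pairs:
        elem.set(name, value)
    return elem


x = element_with_ordered_attrs("x", [("a", "1"), ("b", "2")])
y = element_with_ordered_attrs("y", [("b", "2"), ("a", "1")])

print(etree.tostring(x))  # b'<x a="1" b="2"/>'
print(etree.tostring(y))  # b'<y b="2" a="1"/>'
```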

Beautiful Soup and Table Scraping – lxml vs html parser

Short answer: if you already installed lxml, just use it.
html.parser – BeautifulSoup(markup, "html.parser"). Advantages: batteries included, decent speed, lenient (as of Python 2.7.3 and 3.2.2). Disadvantages: not very lenient (before Python 2.7.3 or 3.2.2).
lxml – BeautifulSoup(markup, "lxml"). Advantages: very fast, lenient. Disadvantages: external C dependency.
html5lib – BeautifulSoup(markup, "html5lib"). Advantages: extremely lenient, parses …
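One way to act on that advice is to try lxml first and fall back to the built-in parser. The sketch below assumes BeautifulSoup 4; make_soup and the sample table are hypothetical and exist only for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) using the first
    parser in `preferred` that is actually installed."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # bs4 raises this when the requested parser is not available.
            continue
    raise RuntimeError("no usable parser found")


# Illustrative table markup.
markup = (
    "<table>"
    "<tr><th>a</th><th>b</th></tr>"
    "<tr><td>1</td><td>2</td></tr>"
    "</table>"
)

soup, parser_used = make_soup(markup)

# Collect the text of every cell, row by row.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in soup.find_all("tr")
]

print(parser_used)
print(rows)
```

Since html.parser ships with Python, the fallback always succeeds; the parsers can still build different trees from malformed markup, so it helps to record which one was used.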