lxml – Make Me Engineer

How do you install lxml on OS X Leopard without using MacPorts or Fink?

June 14, 2023 by Tarik

Thanks to @jessenoller on Twitter, I have an answer that fits my needs: you can compile lxml with static dependencies, which avoids messing with the libxml2 that ships with OS X. Here’s what worked for me:

    cd /tmp
    curl -O http://lxml.de/files/lxml-3.6.0.tgz
    tar -xzvf lxml-3.6.0.tgz
    cd lxml-3.6.0
    python setup.py build --static-deps --libxml2-version=2.7.3 --libxslt-version=1.1.24
    sudo python … Read more

How to Pretty Print HTML to a file, with indentation

June 8, 2023 by Tarik

I ended up using BeautifulSoup directly. It is what lxml.html.soupparser uses for parsing HTML. BeautifulSoup has a prettify method that does exactly what it says: it prettifies the HTML with proper indentation. BeautifulSoup will NOT fix the HTML, so broken code remains broken. But in this case, since the code is … Read more
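A minimal sketch of that approach, assuming the bs4 package is installed. The helper name pretty_print_html, the sample markup, and the output location are made up for illustration, and the standard-library "html.parser" backend is chosen here only to keep the example free of extra dependencies:

```python
import os
import tempfile

from bs4 import BeautifulSoup


def pretty_print_html(html: str, out_path: str) -> str:
    """Indent `html` with BeautifulSoup and write the result to `out_path`."""
    soup = BeautifulSoup(html, "html.parser")
    pretty = soup.prettify()  # one tag or text node per line, nested by depth
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(pretty)
    return pretty


# Hypothetical input and output path, for illustration only.
out_path = os.path.join(tempfile.gettempdir(), "pretty.html")
pretty = pretty_print_html("<html><body><p>Hello</p></body></html>", out_path)
print(pretty)
```

prettify puts every tag and text node on its own line, so it changes whitespace inside inline elements; that is fine for reading the markup but may alter how the page renders.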

Installing lxml module in python

June 7, 2023 by Tarik

Just do:

    sudo apt-get install python-lxml

For Python 2 (e.g., required by Inkscape):

    sudo apt-get install python2-lxml

If you are planning to install from source, then albertov’s answer will help. But unless there is a reason to, don’t; just install it from the repository.

Parse SGML with Open Arbitrary Tags in Python 3

May 30, 2023 by Tarik

If you can find an SGML DTD for the documents that you work with, a solution could be to use the osx SGML to XML converter from the OpenSP SGML toolkit to turn the documents into XML. Here is a simple example. Let’s say that we have the following SGML document (company.sgml; with a root … Read more

Parsing broken XML with lxml.etree.iterparse

May 30, 2023 by Tarik

Edit: This is an older answer and I would have done it differently today. And I’m not just referring to the dumb snark … since then BeautifulSoup4 is available and it’s really quite nice. I recommend that to anyone who stumbles over here. The currently accepted answer is, well, not what one should do. The … Read more

How to use regular expression in lxml xpath?

May 29, 2023 by Tarik

You can do this (although you don’t need regular expressions for the example). lxml supports regular expressions from the EXSLT extension functions (see the lxml docs for the XPath class, but it also works for the xpath() method):

    doc.xpath("//a[re:match(text(), 'some text')]", namespaces={"re": "http://exslt.org/regular-expressions"})

Note that you need to give the namespace mapping, so that it … Read more
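Here is a small self-contained sketch of the same idea. The sample document and the pattern are invented for illustration, and it uses re:test, the EXSLT function that returns a boolean, rather than the re:match call from the excerpt, since a boolean is what a predicate needs:

```python
from lxml import etree

# Namespace URI of the EXSLT regular-expressions module.
EXSLT_RE = "http://exslt.org/regular-expressions"

# Hypothetical sample document.
doc = etree.fromstring(
    "<root><a>some text</a><a>other</a><a>Some Text 2</a></root>"
)

# The third argument "i" makes the match case-insensitive.
matches = doc.xpath(
    "//a[re:test(text(), '^some text', 'i')]",
    namespaces={"re": EXSLT_RE},
)
print([a.text for a in matches])
```

Without the namespaces mapping, lxml cannot resolve the re: prefix and the xpath() call raises an error.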

parsing xml containing default namespace to get an element value using lxml

May 18, 2023 by Tarik

This is a common error when dealing with XML that has a default namespace. Your XML has a default namespace, that is, a namespace declared without a prefix, here: <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> Note that it is not only the element where the default namespace is declared that is in that namespace; all descendant elements implicitly inherit the ancestor’s default namespace, unless otherwise specified (using explicit namespace prefix … Read more
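To make the inheritance point concrete, here is a short sketch with lxml. The sitemap content and the "sm" prefix are made up for illustration; any prefix works as long as it maps to the namespace URI declared in the document:

```python
from lxml import etree

# Hypothetical sitemap index using the default namespace from the excerpt.
xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://example.com/sitemap1.xml</loc></sitemap>
  <sitemap><loc>http://example.com/sitemap2.xml</loc></sitemap>
</sitemapindex>"""

root = etree.fromstring(xml)

# An unprefixed name in XPath means "no namespace", so this finds nothing.
unprefixed = root.xpath("//loc")

# Map an arbitrary prefix to the default namespace URI and use it in the path.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
locs = root.xpath("//sm:loc/text()", namespaces=ns)
print(unprefixed, locs)
```

The loc elements carry no xmlns attribute of their own, yet they only match once the prefix is used, because they inherit the namespace declared on sitemapindex.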

python – lxml: enforcing a specific order for attributes

May 16, 2023 by Tarik

It looks like lxml serializes attributes in the order you set them:

    >>> from lxml import etree as ET
    >>> x = ET.Element("x")
    >>> x.set('a', '1')
    >>> x.set('b', '2')
    >>> ET.tostring(x)
    '<x a="1" b="2"/>'
    >>> y = ET.Element("y")
    >>> y.set('b', '2')
    >>> y.set('a', '1')
    >>> ET.tostring(y)
    '<y b="2" a="1"/>'

Note that when you pass attributes using … Read more

Beautiful Soup and Table Scraping – lxml vs html parser

May 15, 2023 by Tarik

Short answer: if you already installed lxml, just use it.

html.parser – BeautifulSoup(markup, "html.parser")
Advantages: Batteries included, Decent speed, Lenient (as of Python 2.7.3 and 3.2.)
Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)

lxml – BeautifulSoup(markup, "lxml")
Advantages: Very fast, Lenient
Disadvantages: External C dependency

html5lib – BeautifulSoup(markup, "html5lib")
Advantages: Extremely lenient, Parses … Read more

Why is lxml.etree.iterparse() eating up all my memory?

May 13, 2023 by Tarik

As iterparse iterates over the entire file, a tree is built and no elements are freed. The advantage of doing this is that the elements remember who their parent is, and you can form XPaths that refer to ancestor elements. The disadvantage is that it can consume a lot of memory. In order to free … Read more
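The excerpt is cut off, so what follows is not necessarily the author’s fix. It is a sketch of one widely used way to keep memory flat with iterparse: clear each element once it has been processed and drop the already-handled siblings from the parent. The function name sum_items, the "item" tag, and the generated data are made up for illustration:

```python
import io

from lxml import etree


def sum_items(source, tag="item"):
    """Sum the integer text of every <item>, freeing elements as we go."""
    total = 0
    for _event, elem in etree.iterparse(source, events=("end",), tag=tag):
        total += int(elem.text)
        # Drop the element's own text, attributes and children.
        elem.clear()
        # Remove already-processed siblings so the parent does not keep
        # accumulating empty elements.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    return total


# Hypothetical input: a root element holding 1000 small <item> elements.
data = b"<root>" + b"".join(b"<item>%d</item>" % i for i in range(1000)) + b"</root>"
total = sum_items(io.BytesIO(data))
print(total)
```

The trade-off is the one described above: once elements are cleared and removed, XPath expressions that reach back to earlier siblings no longer find them, so read everything you need from an element before clearing it.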