BeautifulSoup: Get the contents of a specific table

This is not the specific code you need, just a demo of how to work with BeautifulSoup. It finds the table whose id is "Table1" and gets all of its tr elements.

    import urllib2
    from bs4 import BeautifulSoup

    # url is the address of the page to scrape, defined elsewhere
    html = urllib2.urlopen(url).read()
    bs = BeautifulSoup(html)
    table = bs.find(lambda tag: tag.name == 'table' and tag.has_attr('id') and tag['id'] == "Table1")
    rows = table.findAll(lambda tag: tag.name == 'tr')
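On Python 3 with the bs4 package, the same lookup can be written with keyword filters instead of lambdas. This is only a minimal sketch: the inline HTML and its cell values are made up for illustration, and it assumes bs4 is installed and that the built-in "html.parser" backend is acceptable.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; in practice this would be the downloaded page.
html = """
<table id="Other"><tr><td>skip me</td></tr></table>
<table id="Table1">
  <tr><td>a</td><td>b</td></tr>
  <tr><td>c</td><td>d</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keyword form of the lambda filter: match a <table> whose id is "Table1".
table = soup.find("table", id="Table1")

# find_all is the bs4 spelling of findAll; it returns every <tr> under the table.
rows = table.find_all("tr")

# Collect the text of each cell, row by row.
cells = [[td.get_text() for td in row.find_all("td")] for row in rows]
print(cells)
```

If no table matches, find returns None, so real code should check the result before calling find_all on it.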

How to handle IncompleteRead in Python

The link you included in your question is simply a wrapper that executes urllib's read() function and catches any incomplete read exceptions for you. If you don't want to implement the entire patch, you could just wrap the read in a try/except block wherever you read your links. For example:

    try:
        page = urllib2.urlopen(urls).read()
    except httplib.IncompleteRead, …
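The except clause above is cut off in the source, so the following is not the original answer's handler, just one possible way to finish the idea on Python 3, where httplib is named http.client. It assumes that keeping whatever bytes arrived is acceptable for your use case. The read_all helper and the FakeResponse class are hypothetical names invented for this sketch; FakeResponse stands in for the object returned by urlopen so that the example runs without a network connection.

```python
import http.client  # Python 3 name for the httplib module


def read_all(response):
    """Return the response body, falling back to the partial data on a short read."""
    try:
        return response.read()
    except http.client.IncompleteRead as e:
        # e.partial holds the bytes that arrived before the connection ended.
        return e.partial


class FakeResponse:
    """Hypothetical stand-in for urlopen(url) that always fails mid-read."""

    def read(self):
        raise http.client.IncompleteRead(b"<html>partial", 100)


page = read_all(FakeResponse())
print(page)
```

The partial body may be truncated HTML, so a retry or a logged warning may be a better choice than silently parsing it.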

Don’t put html, head and body tags automatically, beautifulsoup

    In [35]: import bs4 as bs
    In [36]: bs.BeautifulSoup('<h1>FOO</h1>', "html.parser")
    Out[36]: <h1>FOO</h1>

This parses the HTML with Python's builtin HTML parser. Quoting the docs:

    Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn't even bother to add an <html> tag.

Alternatively, you …

Extract content within a tag with BeautifulSoup

The contents attribute works well for extracting text from <tag>text</tag>. It returns a list of the tag's children, so the text is the first item of that list.

<td>My home address</td> example:

    s = "<td>My home address</td>"
    soup = BeautifulSoup(s)
    td = soup.find('td')  # <td>My home address</td>
    td.contents           # list with one item: My home address

<td><b>Address:</b></td> example:

    s = "<td><b>Address:</b></td>"
    soup = BeautifulSoup(s)
    td = soup.find('td').find('b')  # <b>Address:</b>
    td.contents                     # list with one item: Address:
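When a tag mixes nested tags and text, contents holds more than one item, and it can help to compare it with .string and get_text(). The sketch below assumes Python 3 with bs4; the sample cell is invented for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<td><b>Address:</b> 12 Main St</td>", "html.parser")
td = soup.find("td")

# .contents is a list of direct children: here a <b> Tag and a text node.
print(td.contents)

# .string returns the text when a tag has a single child...
print(td.b.string)

# ...and None when the tag has more than one child, as <td> does here.
print(td.string)

# get_text() concatenates all text nested anywhere inside the tag.
print(td.get_text())
```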

Beautifulsoup – nextSibling

The problem is that you have found a NavigableString, not the <td>. Also, nextSibling will find the next NavigableString or Tag, so even if you had the <td> it wouldn't work the way you expect. This is what you want:

    address = soup.find(text="Address:")
    b_tag = address.parent
    td_tag = b_tag.parent
    next_td_tag = td_tag.findNext('td')
    print next_td_tag.contents[0]

Or …
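The difference between the plain sibling attribute and a tag-aware search is easy to see on a small document. The following sketch assumes Python 3 and a recent bs4 (the string= argument and the snake_case method names are the bs4 spellings); the table row is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical row; the newline between the two cells is itself a text node.
html = """<table><tr>
<td><b>Address:</b></td>
<td>12 Main St</td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")

label = soup.find(string="Address:")  # the NavigableString inside <b>
first_td = label.find_parent("td")    # climb to the enclosing <td>

# next_sibling returns the whitespace text node, not the next <td>.
print(repr(first_td.next_sibling))

# find_next_sibling skips text nodes and returns the next matching Tag.
value_td = first_td.find_next_sibling("td")
print(value_td.get_text())
```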

How to change tag name with BeautifulSoup?

I don't know how you're accessing tag but the following works for me:

    import BeautifulSoup

    if __name__ == "__main__":
        data = """
    <html>
    <h2 class="someclass">some title</h2>
    <ul>
       <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
       <li>Aliquam tincidunt mauris eu risus.</li>
       <li>Vestibulum auctor dapibus neque.</li>
    </ul>
    </html>
    """
        soup = BeautifulSoup.BeautifulSoup(data)
        h2 = soup.find('h2')
        h2.name = "h1"
        print …
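The same rename also works with the bs4 package on Python 3. This is a reduced sketch that keeps only the heading from the sample above and assumes bs4 is installed.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<h2 class="someclass">some title</h2>', "html.parser")

h2 = soup.find("h2")
h2.name = "h1"  # assigning to .name renames the tag in place

# Attributes and children are untouched; only the tag name changes.
print(soup)
```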

Matching partial ids in BeautifulSoup

You can pass a function to findAll:

    >>> print soupHandler.findAll('div', id=lambda x: x and x.startswith('post-'))
    [<div id="post-45">…</div>, <div id="post-334">…</div>]

Or a regular expression:

    >>> print soupHandler.findAll('div', id=re.compile('^post-'))
    [<div id="post-45">…</div>, <div id="post-334">…</div>]
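Both filters can be checked against a small document. The sketch below assumes Python 3 with bs4 and uses made-up div contents. The third variant, a CSS attribute selector passed to select, is an addition not taken from the answer above, and it relies on the selector support that ships with bs4.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical page fragment with matching and non-matching divs.
html = """
<div id="post-45">first</div>
<div id="sidebar">ignored</div>
<div id="post-334">second</div>
<div>no id</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Function filter: x is the id value, or None when the div has no id.
by_func = soup.find_all("div", id=lambda x: x and x.startswith("post-"))

# Regular expression filter, anchored at the start of the id.
by_regex = soup.find_all("div", id=re.compile("^post-"))

# CSS selector: id attribute begins with "post-".
by_css = soup.select('div[id^="post-"]')

ids = [div["id"] for div in by_func]
print(ids)
```

The `x and ...` guard in the lambda matters: without it, the div with no id would raise an AttributeError on None.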