beautifulsoup – Make Me Engineer

BeautifulSoup: Get the contents of a specific table

June 14, 2023 by Tarik

This is not the specific code you need, just a demo of how to work with BeautifulSoup. It finds the table whose id is "Table1" and gets all of its tr elements.

    import urllib2  # Python 2 module, as in the original answer
    from bs4 import BeautifulSoup

    html = urllib2.urlopen(url).read()
    bs = BeautifulSoup(html)
    table = bs.find(lambda tag: tag.name == 'table' and tag.has_attr('id') and tag['id'] == "Table1")
    rows = table.findAll(lambda tag: tag.name == 'tr')
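The same lookup can also be written with keyword filters instead of lambdas. The following is a minimal Python 3 sketch, assuming bs4 is installed; the sample markup and the names HTML and cells are made up for illustration, and only the id "Table1" comes from the post.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the id "Table1" comes from the post.
HTML = """
<table id="Other"><tr><td>skip</td></tr></table>
<table id="Table1">
  <tr><td>a</td><td>b</td></tr>
  <tr><td>c</td><td>d</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Keyword arguments filter on attributes, so no lambda is needed here.
table = soup.find("table", id="Table1")

# find() returns None when nothing matches, so guard before using the result.
rows = table.find_all("tr") if table is not None else []

# Collect the text of each cell, row by row.
cells = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows]
print(cells)
```

The None check matters on real pages, where the table may be missing or rendered later by JavaScript.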

How to handle IncompleteRead: in python

June 14, 2023 by Tarik

The link you included in your question is simply a wrapper around urllib's read() function that catches any incomplete read exceptions for you. If you don't want to implement the entire patch, you could just wrap the read in a try/except block wherever you read your links. For example:

    try:
        page = urllib2.urlopen(urls).read()
    except httplib.IncompleteRead, …
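In Python 3, httplib is named http.client. The sketch below shows one possible way to handle the exception, which is to keep whatever bytes arrived before the connection dropped. It is not necessarily what the truncated snippet above goes on to do. The helper read_allowing_partial and the stand-in flaky_read are hypothetical names used only for illustration.

```python
import http.client


def read_allowing_partial(read):
    """Call read() and fall back to the partial body on IncompleteRead.

    `read` is any zero-argument callable, e.g. a response object's read method.
    """
    try:
        return read()
    except http.client.IncompleteRead as exc:
        # exc.partial holds the bytes received before the read was cut short.
        return exc.partial


def flaky_read():
    # Stand-in for a real network read that ends early (illustrative only).
    raise http.client.IncompleteRead(b"<html>partial")


page = read_allowing_partial(flaky_read)
print(page)
```

Whether a partial body is acceptable depends on the use case; for some scrapers, retrying the request may be the safer choice.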

Don’t put html, head and body tags automatically, beautifulsoup

June 13, 2023 by Tarik

    In [35]: import bs4 as bs
    In [36]: bs.BeautifulSoup('<h1>FOO</h1>', "html.parser")
    Out[36]: <h1>FOO</h1>

This parses the HTML with Python's built-in HTML parser. Quoting the docs:

    Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn't even bother to add an <html> tag.

Alternatively, you …

Extract content within a tag with BeautifulSoup

June 12, 2023 by Tarik

The contents attribute works well for extracting text from <tag>text</tag>. It returns a list of the tag's children, so the text is the first element of that list.

<td>My home address</td> example:

    s = "<td>My home address</td>"
    soup = BeautifulSoup(s)
    td = soup.find('td')   # <td>My home address</td>
    td.contents            # ['My home address']

<td><b>Address:</b></td> example:

    s = "<td><b>Address:</b></td>"
    soup = BeautifulSoup(s)
    td = soup.find('td').find('b')   # <b>Address:</b>
    td.contents                      # ['Address:']
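When a tag mixes nested tags and plain text, it can help to compare contents with two related accessors, get_text() and string. This is a small sketch assuming bs4 with the built-in parser; the sample cell and its address text are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical cell mixing a nested tag and plain text.
soup = BeautifulSoup("<td><b>Address:</b> 12 Example Road</td>", "html.parser")
td = soup.find("td")

# .contents lists the direct children: a <b> tag, then a text node.
children = td.contents

# .get_text() joins the text of all descendants into one string.
text = td.get_text()

# .string is only set when a tag has a single child; here that holds for <b>.
label = td.find("b").string

print(len(children), repr(text), repr(label))
```

For the td itself, string is None because the tag has more than one child, which is a common source of confusion.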

Beautifulsoup – nextSibling

June 10, 2023 by Tarik

The problem is that you have found a NavigableString, not the <td>. Also, nextSibling will find the next NavigableString or Tag, so even if you had the <td> it wouldn't work the way you expect. This is what you want:

    address = soup.find(text="Address:")
    b_tag = address.parent
    td_tag = b_tag.parent
    next_td_tag = td_tag.findNext('td')
    print(next_td_tag.contents[0])

Or …
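The difference between the raw next sibling and the next sibling tag can be seen in a short sketch. It uses find_next_sibling, a related bs4 method, rather than the findNext call from the answer, and the table markup is an assumed example.

```python
from bs4 import BeautifulSoup

# Assumed markup: a label cell, a newline, then the value cell.
HTML = "<table><tr><td><b>Address:</b></td>\n<td>My home address</td></tr></table>"
soup = BeautifulSoup(HTML, "html.parser")

# Locate the <b> tag by its text, then step up to the enclosing <td>.
label_td = soup.find("b", string="Address:").parent

# next_sibling returns whatever node comes next, here the newline text node.
raw_next = label_td.next_sibling

# find_next_sibling("td") skips text nodes and returns the next <td> tag.
value_td = label_td.find_next_sibling("td")

print(repr(raw_next), value_td.get_text())
```

Whitespace between tags is the usual reason a plain sibling lookup returns a string instead of the expected element.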

Python BeautifulSoup give multiple tags to findAll

June 9, 2023 by Tarik

You can pass a list to find any of the given tags:

    tags = soup.find_all(['hr', 'strong'])
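As a quick check of what the list form returns, the sketch below runs it on a made-up fragment. Matches come back in document order, not grouped by tag name. The markup is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment containing both kinds of tags plus unrelated ones.
HTML = "<p>one</p><hr/><p><strong>two</strong> and <em>three</em></p><hr/>"
soup = BeautifulSoup(HTML, "html.parser")

# A list matches any of the listed tag names.
tags = soup.find_all(["hr", "strong"])

# Record the tag names in the order they were found.
names = [tag.name for tag in tags]
print(names)
```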

How to scrape dynamic webpages by Python

June 3, 2023 by Tarik

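You can use Selenium, as in the sample below:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get('http://example.com')
    # Selenium 4 locator style; older releases used find_element_by_class_name
    element = driver.find_element(By.CLASS_NAME, "yourClassName")  # or find by text, etc.
    element.click()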

How to change tag name with BeautifulSoup?

May 29, 2023 by Tarik

I don't know how you're accessing the tag, but the following works for me:

    import BeautifulSoup

    if __name__ == "__main__":
        data = """
        <html>
        <h2 class="someclass">some title</h2>
        <ul>
           <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
           <li>Aliquam tincidunt mauris eu risus.</li>
           <li>Vestibulum auctor dapibus neque.</li>
        </ul>
        </html>
        """
        soup = BeautifulSoup.BeautifulSoup(data)
        h2 = soup.find('h2')
        h2.name = "h1"
        print …
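The snippet above uses the older BeautifulSoup 3 import. A minimal equivalent for bs4 on Python 3 might look like the sketch below, which keeps only the h2 element from the sample markup; it assumes the built-in "html.parser".

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<h2 class="someclass">some title</h2>', "html.parser")

# Assigning to .name renames the tag in place; attributes and children are kept.
h2 = soup.find("h2")
h2.name = "h1"

result = str(soup)
print(result)
```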

BeautifulSoup – modifying all links in a piece of HTML?

May 29, 2023 by Tarik

Maybe something like this would work? (I don't have a Python interpreter in front of me, unfortunately.)

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<p>Blah blah blah <a href="http://google.com">Google</a></p>')
    for a in soup.findAll('a'):
        a['href'] = a['href'].replace("google", "mysite")
    result = str(soup)
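One caveat with the loop above is that a['href'] raises KeyError for an anchor that has no href attribute. A hedged variant, assuming bs4 with the built-in parser, filters on href=True first. The second anchor in the sample markup is invented to show the case.

```python
from bs4 import BeautifulSoup

# The second <a> has no href; it is an invented example of a named anchor.
HTML = '<p>Blah blah blah <a href="http://google.com">Google</a> <a name="top">anchor</a></p>'
soup = BeautifulSoup(HTML, "html.parser")

# href=True matches only <a> tags that actually carry an href attribute.
for a in soup.find_all("a", href=True):
    a["href"] = a["href"].replace("google", "mysite")

result = str(soup)
print(result)
```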

Matching partial ids in BeautifulSoup

May 27, 2023 by Tarik

You can pass a function to findAll:

    >>> print(soupHandler.findAll('div', id=lambda x: x and x.startswith('post-')))
    [<div id="post-45">...</div>, <div id="post-334">...</div>]

Or a regular expression:

    >>> print(soupHandler.findAll('div', id=re.compile('^post-')))
    [<div id="post-45">...</div>, <div id="post-334">...</div>]
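Both filters can be tried side by side on a small document. The sketch below assumes bs4 on Python 3; the div markup is made up, reusing the ids from the output shown above plus two non-matching divs.

```python
import re

from bs4 import BeautifulSoup

# Invented markup: two matching ids, one non-matching id, one div without an id.
HTML = (
    '<div id="post-45">a</div><div id="sidebar">b</div>'
    '<div id="post-334">c</div><div>d</div>'
)
soup = BeautifulSoup(HTML, "html.parser")

# The function receives the id value, or None when the attribute is absent,
# which is why the "x and ..." guard is needed.
by_function = soup.find_all("div", id=lambda x: x and x.startswith("post-"))

# A compiled pattern is matched against the id value with search().
by_regex = soup.find_all("div", id=re.compile(r"^post-"))

ids = [div["id"] for div in by_regex]
print(ids)
```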