Nokogiri, open-uri, and Unicode Characters

Summary: When feeding UTF-8 to Nokogiri through open-uri, use open(…).read and pass the resulting string to Nokogiri. Analysis: If I fetch the page using curl, the headers correctly show Content-Type: text/html; charset=UTF-8 and the file content includes valid UTF-8, e.g. “Genealogía de Jesucristo”. But even with a magic comment in the Ruby file and setting … Read more

How to download any(!) webpage with correct charset in Python?

When you download a file with urllib or urllib2, you can find out whether a charset header was transmitted: fp = urllib2.urlopen(request); charset = fp.headers.getparam('charset'). You can use BeautifulSoup to locate a meta element in the HTML: soup = BeautifulSoup.BeautifulSoup(data); meta = soup.findAll('meta', {'http-equiv': lambda v: v.lower() == 'content-type'}). If neither is available, browsers typically fall back to user … Read more
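
A minimal Python 3 sketch of this header-then-meta fallback, assuming urllib.request and BeautifulSoup 4 as stand-ins for the older urllib2 and BeautifulSoup 3 names used above:

    import urllib.request
    from bs4 import BeautifulSoup

    def fetch_decoded(url):
        # 1. Prefer the charset declared in the Content-Type header, if any.
        resp = urllib.request.urlopen(url)
        data = resp.read()
        charset = resp.headers.get_content_charset()  # None if no charset was transmitted
        if charset:
            return data.decode(charset)
        # 2. Fall back to a <meta http-equiv="Content-Type"> tag in the HTML.
        soup = BeautifulSoup(data, "html.parser")
        meta = soup.find("meta", attrs={"http-equiv": lambda v: v and v.lower() == "content-type"})
        content = meta.get("content", "") if meta else ""
        if "charset=" in content.lower():
            return data.decode(content.lower().split("charset=")[-1].strip())
        # 3. Otherwise decode with a default, much as browsers ultimately do.
        return data.decode("utf-8", errors="replace")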

Scraping contents of multi web pages of a website using BeautifulSoup and Selenium

If you want to get the last page number from the link above (which is 499) so you can page through the results, you can use either Selenium or BeautifulSoup as follows. Selenium: from selenium import webdriver; driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe'); url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"; driver.get(url); element = driver.find_element_by_xpath("//div[@class='row pagination']//p/span[contains(.,'Reviews on Reliance Jio')]"); driver.execute_script("return arguments[0].scrollIntoView(true);", element); print(driver.find_element_by_xpath("//ul[@class='pagination table']/li/ul[@class='pages table']//li[last()]/a").get_attribute("innerHTML")) … Read more
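
The excerpt is cut off before the BeautifulSoup variant; a minimal sketch of what that approach might look like, reusing the class names from the XPaths above (the markup assumption is mine, not the original answer's):

    import requests
    from bs4 import BeautifulSoup

    url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # Mirrors //ul[@class='pagination table']/li/ul[@class='pages table']//li[last()]/a
    pages = soup.select_one("ul.pagination.table ul.pages.table")
    if pages is not None:
        last_li = pages.find_all("li")[-1]  # assumes the last <li> holds the last page link
        print(last_li.find("a").get_text())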

Looping over urls to do the same thing

PhantomJS is asynchronous. By calling page.open() multiple times in a loop, you rush the execution of the callbacks: you overwrite the current request with a new one before it has finished, and that request is in turn overwritten. You need to execute the requests one after the other, for example like this: page.open(url, function () { waitFor(function() … Read more

Screen Scraping from a web page with a lot of JavaScript [closed]

You may consider using HtmlUnit. It's a Java class library made to automate browsing without having to control a real browser, and it integrates the Mozilla Rhino JavaScript engine to process the JavaScript on the pages it loads. There's also a JRuby wrapper for it, named Celerity. Its JavaScript support is not really perfect right now, but … Read more

Scrape a dynamic website

This is a difficult problem because you either have to reverse-engineer the JavaScript on a per-site basis, or implement a JavaScript engine and run the scripts (which has its own difficulties and pitfalls). It's a heavyweight solution, but I've seen people do this with Greasemonkey scripts: let Firefox render everything and … Read more
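
In that spirit (let a real browser render everything, then read the resulting DOM), a minimal sketch using Selenium, which also appears earlier in this digest; the URL and the fixed sleep are placeholders, not part of the original answer:

    import time
    from selenium import webdriver

    driver = webdriver.Firefox()                   # a real browser executes the page's JavaScript
    driver.get("http://example.com/dynamic-page")  # placeholder URL
    time.sleep(5)                                  # crude wait; a real scraper would poll for a known element
    html = driver.page_source                      # the DOM after the scripts have run
    driver.quit()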