Nokogiri, open-uri, and Unicode Characters

Summary: When feeding UTF-8 to Nokogiri through open-uri, use open(…).read and pass the resulting string to Nokogiri. Analysis: If I fetch the page using curl, the headers correctly show Content-Type: text/html; charset=UTF-8 and the file content includes valid UTF-8, e.g. “Genealogía de Jesucristo”. But even with a magic comment in the Ruby file and setting … Read more

How to download any(!) webpage with correct charset in Python?

When you download a file with urllib or urllib2, you can find out whether a charset header was transmitted: fp = urllib2.urlopen(request); charset = fp.headers.getparam('charset'). You can use BeautifulSoup to locate a meta element in the HTML: soup = BeautifulSoup.BeautifulSoup(data); meta = soup.findAll('meta', {'http-equiv': lambda v: v.lower() == 'content-type'}). If neither is available, browsers typically fall back to user … Read more
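
A minimal Python 3 sketch of this header-then-meta fallback, assuming urllib.request and BeautifulSoup 4 as stand-ins for the older urllib2 and BeautifulSoup 3 names used above:

    import urllib.request
    from bs4 import BeautifulSoup

    def fetch_decoded(url):
        # 1. Prefer the charset declared in the Content-Type header, if any.
        resp = urllib.request.urlopen(url)
        data = resp.read()
        charset = resp.headers.get_content_charset()  # None if no charset was transmitted
        if charset:
            return data.decode(charset)
        # 2. Fall back to a <meta http-equiv="Content-Type"> tag in the HTML.
        soup = BeautifulSoup(data, "html.parser")
        meta = soup.find("meta", attrs={"http-equiv": lambda v: v and v.lower() == "content-type"})
        content = meta.get("content", "") if meta else ""
        if "charset=" in content.lower():
            return data.decode(content.lower().split("charset=")[-1].strip())
        # 3. Otherwise decode with a default, much as browsers ultimately do.
        return data.decode("utf-8", errors="replace")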

Scraping contents of multi web pages of a website using BeautifulSoup and Selenium

If you want to get the last page number from the link above (which is 499) so you can page through the results, you can use either Selenium or BeautifulSoup as follows. Selenium: from selenium import webdriver; driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe'); url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"; driver.get(url); element = driver.find_element_by_xpath("//div[@class='row pagination']//p/span[contains(.,'Reviews on Reliance Jio')]"); driver.execute_script("return arguments[0].scrollIntoView(true);", element); print(driver.find_element_by_xpath("//ul[@class='pagination table']/li/ul[@class='pages table']//li[last()]/a").get_attribute("innerHTML")) … Read more
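
The excerpt is cut off before the BeautifulSoup variant; a minimal sketch of what that approach might look like, reusing the class names from the XPaths above (the markup assumption is mine, not the original answer's):

    import requests
    from bs4 import BeautifulSoup

    url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # Mirrors //ul[@class='pagination table']/li/ul[@class='pages table']//li[last()]/a
    pages = soup.select_one("ul.pagination.table ul.pages.table")
    if pages is not None:
        last_li = pages.find_all("li")[-1]  # assumes the last <li> holds the last page link
        print(last_li.find("a").get_text())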

Looping over urls to do the same thing

PhantomJS is asynchronous. By calling page.open() multiple times in a loop, you rush the execution of the callbacks: you overwrite the current request with a new one before it has finished, and that request is in turn overwritten. You need to execute the requests one after the other, for example like this: page.open(url, function () { waitFor(function() … Read more

Screen Scraping from a web page with a lot of JavaScript [closed]

You may consider using HtmlUnit. It's a Java class library made to automate browsing without having to control a real browser, and it integrates the Mozilla Rhino JavaScript engine to process the JavaScript on the pages it loads. There's also a JRuby wrapper for it, named Celerity. Its JavaScript support is not really perfect right now, but … Read more

Scrape a dynamic website

This is a difficult problem because you either have to reverse-engineer the JavaScript on a per-site basis, or implement a JavaScript engine and run the scripts (which has its own difficulties and pitfalls). It's a heavyweight solution, but I've seen people do this with Greasemonkey scripts: let Firefox render everything and … Read more
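
In that spirit (let a real browser render everything, then read the resulting DOM), a minimal sketch using Selenium, which also appears earlier in this digest; the URL and the fixed sleep are placeholders, not part of the original answer:

    import time
    from selenium import webdriver

    driver = webdriver.Firefox()                   # a real browser executes the page's JavaScript
    driver.get("http://example.com/dynamic-page")  # placeholder URL
    time.sleep(5)                                  # crude wait; a real scraper would poll for a known element
    html = driver.page_source                      # the DOM after the scripts have run
    driver.quit()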