How to use Python requests to fake a browser visit, a.k.a. generate a User-Agent?

Provide a User-Agent header:

```python
import requests

url = "http://www.ichangtou.com/#company:data_000008.html"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
print(response.content)
```

FYI, here is a list of User-Agent strings for different browsers: List of all Browsers. As a side note, there is a pretty useful third-party package called …
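The same idea works without third-party packages: the standard library's `urllib.request` also lets you replace the default `Python-urllib/3.x` identifier. A minimal sketch (the URL is just a placeholder, and `browser_request` is a hypothetical helper name):

```python
import urllib.request

# A desktop-browser User-Agent string; any current browser string works.
UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
      "AppleWebKit/537.36 (KHTML, like Gecko) "
      "Chrome/39.0.2171.95 Safari/537.36")

def browser_request(url):
    # Build a Request whose User-Agent header replaces the default
    # "Python-urllib/3.x" identifier that many sites block.
    return urllib.request.Request(url, headers={"User-Agent": UA})

req = browser_request("http://www.ichangtou.com/")
# urllib normalizes stored header names to capitalized form.
print(req.get_header("User-agent"))
```

Note that building the `Request` object does not open a connection, so the header can be inspected before anything is sent.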

Scraping html tables into R data frames using the XML package

…or a shorter try:

```r
library(XML)
library(RCurl)
library(rlist)

theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team", .opts = list(ssl.verifypeer = FALSE))
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

# the picked table is the longest one on the page
tables[[which.max(n.rows)]]
```
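The "keep the table with the most rows" heuristic at the end is language-agnostic. As a rough stdlib Python analogue (using `html.parser` instead of R's XML package, fed a toy inline sample rather than a live page, and ignoring nested tables):

```python
from html.parser import HTMLParser

class TableRowCounter(HTMLParser):
    """Counts <tr> rows per <table> so the longest table can be picked."""
    def __init__(self):
        super().__init__()
        self.row_counts = []   # one entry per <table> encountered
        self.in_table = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.in_table += 1
            self.row_counts.append(0)
        elif tag == "tr" and self.in_table:
            self.row_counts[-1] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.in_table:
            self.in_table -= 1

html = ("<table><tr><td>a</td></tr></table>"
        "<table><tr><td>1</td></tr><tr><td>2</td></tr><tr><td>3</td></tr></table>")
parser = TableRowCounter()
parser.feed(html)
# Index of the longest table, mirroring which.max(n.rows) in the R snippet.
longest = max(range(len(parser.row_counts)), key=parser.row_counts.__getitem__)
print(longest, parser.row_counts)  # 1 [1, 3]
```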

How can I efficiently parse HTML with Java?

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after. Its party trick is a CSS selector syntax to find elements, e.g.:

```java
String html = "<html><head><title>First parse</title></head>"
            + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links …
```

selenium with scrapy for dynamic page

It really depends on how you need to scrape the site, and how and what data you want to get. Here's an example of how you can follow pagination on eBay using Scrapy + Selenium:

```python
import scrapy
from selenium import webdriver

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver = …
```
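Stripped of the Selenium specifics, the pagination logic in such a spider usually boils down to "parse the page, then click next until there is no next". A dependency-free sketch of that control flow, with a stubbed driver standing in for the real webdriver (all names here are hypothetical, not Selenium's API):

```python
class FakeDriver:
    """Stub standing in for selenium.webdriver: serves canned pages."""
    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def page_source(self):
        return self.pages[self.index]

    def has_next(self):
        # A real driver would look for a "next" link/button instead.
        return self.index + 1 < len(self.pages)

    def click_next(self):
        self.index += 1

def scrape_all_pages(driver):
    # The pattern used with a real driver: collect the current page,
    # then advance while a "next" control exists.
    results = []
    while True:
        results.append(driver.page_source())
        if not driver.has_next():
            break
        driver.click_next()
    return results

pages = ["<p>page 1</p>", "<p>page 2</p>", "<p>page 3</p>"]
print(scrape_all_pages(FakeDriver(pages)))
```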

Difference between text and innerHTML using Selenium

To start with, text is a property, whereas innerHTML is an attribute. Fundamentally, there are some differences between a property and an attribute.

get_attribute("innerHTML")

get_attribute("innerHTML") gets the innerHTML of the element. This method will first try to return the value of a property with the given name. If a property with that name doesn't …
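The practical difference is easy to see on a small fragment: innerHTML keeps the child markup, while text flattens it to the rendered characters. A stdlib-only illustration of that distinction (not Selenium itself; the regex is a crude stand-in for how a browser computes visible text):

```python
import re

# What get_attribute("innerHTML") would return for the element:
inner_html = "Hello <b>brave</b> new <i>world</i>"

def visible_text(markup):
    # Roughly what the `text` property yields: child tags stripped,
    # only the character data kept.
    return re.sub(r"<[^>]+>", "", markup)

print(visible_text(inner_html))  # Hello brave new world
```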

retrieve links from web page using python and BeautifulSoup [closed]

Here's a short snippet using the SoupStrainer class in BeautifulSoup:

```python
import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
```

The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Edit: Note that I used the SoupStrainer class …
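If bs4 isn't available, the same href harvesting can be sketched with the standard library's `html.parser` (fed an inline sample here rather than a fetched page):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, like the SoupStrainer loop."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

html = '<p><a href="/a">A</a> text <a name="x">no href</a> <a href="/b">B</a></p>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)  # ['/a', '/b']
```

The `has_attr('href')` guard in the bs4 version corresponds to the `is not None` check here: anchors without an href (e.g. named anchors) are skipped.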

WebDriverWait not working as expected

Since you wait for the element and then move forward to invoke the click() method, instead of using the presence_of_element_located() method you need to use element_to_be_clickable(), as follows:

```python
try:
    myElem = WebDriverWait(self.browser, delay).until(EC.element_to_be_clickable((By.XPATH, xpath)))
```

Update: As per your counter question in the comments, here are the details of the three methods: …
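Conceptually, `WebDriverWait(...).until(condition)` is just a poll loop with a timeout: the expected condition is re-evaluated until it returns something truthy or time runs out. A dependency-free sketch of that mechanic (hypothetical names, not the Selenium API itself):

```python
import time

def wait_until(condition, timeout=5.0, poll=0.1):
    # Repeatedly evaluate `condition` until it returns a truthy value,
    # mirroring how WebDriverWait retries an expected condition.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("condition not met within %.1fs" % timeout)

# A condition that becomes truthy on the third poll, like an element
# that turns clickable once the page finishes rendering.
state = {"calls": 0}
def clickable():
    state["calls"] += 1
    return "element" if state["calls"] >= 3 else False

print(wait_until(clickable, timeout=2.0, poll=0.01))  # element
```

This is why element_to_be_clickable() matters: the condition must keep returning falsy until the element is actually interactable, not merely present in the DOM.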

How to download and save all PDFs from a dynamic web page?

You have to make a POST HTTP request with the appropriate JSON parameters. Once you get the response, you have to parse the two fields objectId and nombreFichero and use them to build the right links to the PDFs. The following should work:

```python
import os
import json
import requests

url = "https://bancaonline.bankinter.com/publico/rs/documentacionPrix/list"
base = "https://bancaonline.bankinter.com/publico/DocumentacionPrixGet?doc={}&nameDoc={}"

payload = {"cod_categoria": 2, "cod_familia": 3, "divisaDestino": None, "vencimiento": None, "edadActuarial": …
```
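The request itself can be assembled and inspected before anything goes over the wire. A stdlib sketch of the same POST-with-JSON pattern (the payload below is an abbreviated, illustrative subset, not the full field list):

```python
import json
import urllib.request

url = "https://bancaonline.bankinter.com/publico/rs/documentacionPrix/list"
payload = {"cod_categoria": 2, "cod_familia": 3}  # illustrative subset

body = json.dumps(payload).encode("utf-8")
req = urllib.request.Request(
    url,
    data=body,
    headers={"Content-Type": "application/json"},
)

# With a data body attached, urllib issues a POST automatically.
print(req.get_method())                 # POST
print(req.get_header("Content-type"))   # application/json
```

Checking `get_method()` and the content type this way is a cheap sanity check that the server will actually receive JSON in a POST body rather than form-encoded GET parameters.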

Error!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)