How to use Python requests to fake a browser visit, a.k.a. generate a User-Agent?

Provide a User-Agent header:

```python
import requests

url = "http://www.ichangtou.com/#company:data_000008.html"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
print(response.content)
```

FYI, here is a list of User-Agent strings for different browsers: List of all Browsers. As a side note, there is a pretty useful third-party package called …
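The same idea works without third-party packages: the standard library's `urllib.request` also lets you replace the default `Python-urllib/3.x` identifier. A minimal sketch (the URL is just a placeholder, and `browser_request` is a hypothetical helper name):

```python
import urllib.request

# A desktop-browser User-Agent string; any current browser string works.
UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
      "AppleWebKit/537.36 (KHTML, like Gecko) "
      "Chrome/39.0.2171.95 Safari/537.36")

def browser_request(url):
    # Build a Request whose User-Agent header replaces the default
    # "Python-urllib/3.x" identifier that many sites block.
    return urllib.request.Request(url, headers={"User-Agent": UA})

req = browser_request("http://www.ichangtou.com/")
# urllib normalizes stored header names to capitalized form.
print(req.get_header("User-agent"))
```

Note that building the `Request` object does not open a connection, so the header can be inspected before anything is sent.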

Scraping html tables into R data frames using the XML package

…or a shorter try:

```r
library(XML)
library(RCurl)
library(rlist)

theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team", .opts = list(ssl.verifypeer = FALSE))
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

# the picked table is the longest one on the page
tables[[which.max(n.rows)]]
```
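The "keep the table with the most rows" heuristic at the end is language-agnostic. As a rough stdlib Python analogue (using `html.parser` instead of R's XML package, fed a toy inline sample rather than a live page, and ignoring nested tables):

```python
from html.parser import HTMLParser

class TableRowCounter(HTMLParser):
    """Counts <tr> rows per <table> so the longest table can be picked."""
    def __init__(self):
        super().__init__()
        self.row_counts = []   # one entry per <table> encountered
        self.in_table = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.in_table += 1
            self.row_counts.append(0)
        elif tag == "tr" and self.in_table:
            self.row_counts[-1] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.in_table:
            self.in_table -= 1

html = ("<table><tr><td>a</td></tr></table>"
        "<table><tr><td>1</td></tr><tr><td>2</td></tr><tr><td>3</td></tr></table>")
parser = TableRowCounter()
parser.feed(html)
# Index of the longest table, mirroring which.max(n.rows) in the R snippet.
longest = max(range(len(parser.row_counts)), key=parser.row_counts.__getitem__)
print(longest, parser.row_counts)  # 1 [1, 3]
```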

How can I efficiently parse HTML with Java?

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after. Its party trick is a CSS selector syntax to find elements, e.g.:

```java
String html = "<html><head><title>First parse</title></head>"
            + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links …
```

selenium with scrapy for dynamic page

It really depends on how you need to scrape the site, and how and what data you want to get. Here's an example of how you can follow pagination on eBay using Scrapy + Selenium:

```python
import scrapy
from selenium import webdriver

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver = …
```
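Stripped of the Selenium specifics, the pagination logic in such a spider usually boils down to "parse the page, then click next until there is no next". A dependency-free sketch of that control flow, with a stubbed driver standing in for the real webdriver (all names here are hypothetical, not Selenium's API):

```python
class FakeDriver:
    """Stub standing in for selenium.webdriver: serves canned pages."""
    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def page_source(self):
        return self.pages[self.index]

    def has_next(self):
        # A real driver would look for a "next" link/button instead.
        return self.index + 1 < len(self.pages)

    def click_next(self):
        self.index += 1

def scrape_all_pages(driver):
    # The pattern used with a real driver: collect the current page,
    # then advance while a "next" control exists.
    results = []
    while True:
        results.append(driver.page_source())
        if not driver.has_next():
            break
        driver.click_next()
    return results

pages = ["<p>page 1</p>", "<p>page 2</p>", "<p>page 3</p>"]
print(scrape_all_pages(FakeDriver(pages)))
```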

Difference between text and innerHTML using Selenium

To start with, text is a property, whereas innerHTML is an attribute. Fundamentally, there are some differences between a property and an attribute.

get_attribute("innerHTML")

get_attribute("innerHTML") gets the innerHTML of the element. This method will first try to return the value of a property with the given name. If a property with that name doesn't …
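The practical difference is easy to see on a small fragment: innerHTML keeps the child markup, while text flattens it to the rendered characters. A stdlib-only illustration of that distinction (not Selenium itself; the regex is a crude stand-in for how a browser computes visible text):

```python
import re

# What get_attribute("innerHTML") would return for the element:
inner_html = "Hello <b>brave</b> new <i>world</i>"

def visible_text(markup):
    # Roughly what the `text` property yields: child tags stripped,
    # only the character data kept.
    return re.sub(r"<[^>]+>", "", markup)

print(visible_text(inner_html))  # Hello brave new world
```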

retrieve links from web page using python and BeautifulSoup [closed]

Here's a short snippet using the SoupStrainer class in BeautifulSoup:

```python
import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
```

The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Edit: Note that I used the SoupStrainer class …
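If bs4 isn't available, the same href harvesting can be sketched with the standard library's `html.parser` (fed an inline sample here rather than a fetched page):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, like the SoupStrainer loop."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

html = '<p><a href="/a">A</a> text <a name="x">no href</a> <a href="/b">B</a></p>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)  # ['/a', '/b']
```

The `has_attr('href')` guard in the bs4 version corresponds to the `is not None` check here: anchors without an href (e.g. named anchors) are skipped.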

WebDriverWait not working as expected

Since you wait for the element and then move forward to invoke the click() method, instead of using the presence_of_element_located() method you need to use element_to_be_clickable(), as follows:

```python
try:
    myElem = WebDriverWait(self.browser, delay).until(EC.element_to_be_clickable((By.XPATH, xpath)))
```

Update: As per your counter question in the comments, here are the details of the three methods: …
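Conceptually, `WebDriverWait(...).until(condition)` is just a poll loop with a timeout: the expected condition is re-evaluated until it returns something truthy or time runs out. A dependency-free sketch of that mechanic (hypothetical names, not the Selenium API itself):

```python
import time

def wait_until(condition, timeout=5.0, poll=0.1):
    # Repeatedly evaluate `condition` until it returns a truthy value,
    # mirroring how WebDriverWait retries an expected condition.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("condition not met within %.1fs" % timeout)

# A condition that becomes truthy on the third poll, like an element
# that turns clickable once the page finishes rendering.
state = {"calls": 0}
def clickable():
    state["calls"] += 1
    return "element" if state["calls"] >= 3 else False

print(wait_until(clickable, timeout=2.0, poll=0.01))  # element
```

This is why element_to_be_clickable() matters: the condition must keep returning falsy until the element is actually interactable, not merely present in the DOM.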

How to download and save all PDFs from a dynamic web page?

You have to make a POST HTTP request with the appropriate JSON parameters. Once you get the response, you have to parse the two fields objectId and nombreFichero and use them to build the right links to the PDFs. The following should work:

```python
import os
import json
import requests

url = "https://bancaonline.bankinter.com/publico/rs/documentacionPrix/list"
base = "https://bancaonline.bankinter.com/publico/DocumentacionPrixGet?doc={}&nameDoc={}"

payload = {"cod_categoria": 2, "cod_familia": 3, "divisaDestino": None, "vencimiento": None, "edadActuarial": …
```
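The request itself can be assembled and inspected before anything goes over the wire. A stdlib sketch of the same POST-with-JSON pattern (the payload below is an abbreviated, illustrative subset, not the full field list):

```python
import json
import urllib.request

url = "https://bancaonline.bankinter.com/publico/rs/documentacionPrix/list"
payload = {"cod_categoria": 2, "cod_familia": 3}  # illustrative subset

body = json.dumps(payload).encode("utf-8")
req = urllib.request.Request(
    url,
    data=body,
    headers={"Content-Type": "application/json"},
)

# With a data body attached, urllib issues a POST automatically.
print(req.get_method())                 # POST
print(req.get_header("Content-type"))   # application/json
```

Checking `get_method()` and the content type this way is a cheap sanity check that the server will actually receive JSON in a POST body rather than form-encoded GET parameters.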

Error!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)