Web scraping in PHP

I recommend simple_html_dom for this; it makes the task very easy. Here is a working example that pulls the title and the first image:

    <?php
    require 'simple_html_dom.php';

    // Fetch and parse the page
    $html = file_get_html('http://www.google.com/');

    // Index 0 returns the first matching element
    $title = $html->find('title', 0);
    $image = $html->find('img', 0);

    echo $title->plaintext."<br>\n";
    echo $image->src;
    ?>

Here is a second example that will do the …

Problems submitting a login form with Jsoup

Besides the username, password and cookies, the site requires two additional values for the login: VIEWSTATE and EVENTVALIDATION. You can get them from the response to the first GET request, like this:

    Document doc = loginForm.parse();
    Element e = doc.select("input[id=__VIEWSTATE]").first();
    String viewState = e.attr("value");
    e = doc.select("input[id=__EVENTVALIDATION]").first();
    String eventValidation = e.attr("value");

And …
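To illustrate the same idea outside Jsoup, here is a minimal Python sketch that pulls the two hidden fields out of a login page using only the standard library. The sample HTML, the field values, and the helper names `HiddenFieldParser` and `extract_hidden_fields` are assumptions made for illustration; a real page would be the body of the first GET response.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect the value of every <input> whose id is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        field_id = attributes.get("id")
        if field_id in self.wanted:
            # A missing or empty value attribute becomes an empty string
            self.values[field_id] = attributes.get("value") or ""


def extract_hidden_fields(html, wanted=("__VIEWSTATE", "__EVENTVALIDATION")):
    """Return a dict mapping each wanted input id to its value."""
    parser = HiddenFieldParser(wanted)
    parser.feed(html)
    return parser.values


# Hypothetical login page; the values are placeholders
SAMPLE_LOGIN_PAGE = """
<form method="post">
  <input type="hidden" id="__VIEWSTATE" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" id="__EVENTVALIDATION" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" id="username" name="username" />
</form>
"""

fields = extract_hidden_fields(SAMPLE_LOGIN_PAGE)
print(fields)
```

The extracted values would then be sent back along with the username, password and cookies in the login POST.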

Detect when a web page is loaded without using sleep

Try the conventional method:

    Set objIE = CreateObject("InternetExplorer.Application")
    objIE.Visible = True
    objIE.Navigate "https://www.yahoo.com/"
    ' ReadyState 4 means the document has finished loading
    Do While objIE.ReadyState <> 4
        WScript.Sleep 10
    Loop
    ' your code here
    ' …

UPD: this one should check for errors:

    Set objIE = CreateObject("InternetExplorer.Application")
    objIE.Visible = True
    objIE.Navigate "https://www.yahoo.com/"
    On Error Resume Next
    Do
        If objIE.ReadyState = 4 Then
            If Err = …
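The polling pattern above can be sketched in Python as a generic helper that also gives up after a timeout, so a page that never finishes loading does not hang the script. This is only an analogue of the VBScript loop, not the Internet Explorer API: `wait_until` and `FakeBrowser` are hypothetical names, and the fake browser simply reports ready state 4 after a few polls.

```python
import time


def wait_until(is_ready, timeout=10.0, interval=0.01):
    """Poll is_ready() until it returns True or `timeout` seconds pass.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)  # short pause between polls


class FakeBrowser:
    """Hypothetical stand-in that becomes ready after a few polls."""

    def __init__(self, polls_until_ready):
        self._remaining = polls_until_ready

    @property
    def ready_state(self):
        if self._remaining > 0:
            self._remaining -= 1
            return 1  # still loading
        return 4      # complete


browser = FakeBrowser(polls_until_ready=3)
loaded = wait_until(lambda: browser.ready_state == 4, timeout=2.0)
never_loaded = wait_until(lambda: False, timeout=0.05)
print(loaded, never_loaded)
```

Checking the return value lets the caller distinguish a loaded page from a timeout.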

“SSL: certificate_verify_failed” error when scraping https://www.thenewboston.com/

The problem is not in your code but in the website you are trying to access. The SSL Labs analysis of the site notes: "This server’s certificate chain is incomplete. Grade capped to B." This means that the server configuration is wrong and that not only Python but several others will have …
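One possible client-side workaround for an incomplete chain, sketched here under assumptions, is to keep verification enabled and additionally trust the missing intermediate certificate. The helper name `build_context` and the file name in the comment are hypothetical; the intermediate certificate would have to be obtained separately from the issuing CA.

```python
import ssl


def build_context(extra_ca_file=None):
    """Return a verifying SSL context, optionally trusting extra CA certs.

    extra_ca_file is a path to a PEM file holding the intermediate
    certificate(s) the server fails to send (hypothetical input).
    """
    context = ssl.create_default_context()
    if extra_ca_file is not None:
        context.load_verify_locations(cafile=extra_ca_file)
    return context


# Without an extra file this is just the default verifying context
context = build_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)

# With the intermediate saved locally (assumed file name):
# context = build_context("intermediate.pem")
```

This keeps certificate and hostname checks on, unlike disabling verification altogether.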

Reading dynamically generated web pages using python

You need a JavaScript engine to parse and run the JavaScript code inside the page. There are a number of headless browsers that can help you:

http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/

BeautifulSoup webscraping find_all( ): finding exact match

In BeautifulSoup 4, the class attribute (and several other attributes, such as accesskey and the headers attribute on table cell elements) is treated as a set; you match against individual elements listed in the attribute. This follows the HTML standard. As such, you cannot limit the search to just one class. You’ll have to use …
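One way to get an exact match, sketched here as an assumption and not necessarily the approach the answer goes on to describe, is a predicate that compares the whole set of classes. The helper `has_exact_classes` is a hypothetical name; plain dicts stand in for tags so the block runs without Beautiful Soup installed.

```python
def has_exact_classes(*wanted):
    """Build a predicate that is True only when a tag's classes are
    exactly `wanted`, with no extra classes and none missing."""
    wanted_set = set(wanted)

    def predicate(tag):
        classes = tag.get("class") or []
        if isinstance(classes, str):
            # Tolerate a raw space-separated attribute string
            classes = classes.split()
        return set(classes) == wanted_set

    return predicate


only_product = has_exact_classes("product")

# Dicts stand in for tags here; each offers .get("class") like a tag does
tags = [
    {"class": ["product"]},
    {"class": ["product", "special"]},
    {"id": "no-class"},
]
matches = [tag for tag in tags if only_product(tag)]
print(matches)
```

With Beautiful Soup available, the same predicate could be passed as `soup.find_all(has_exact_classes("product"))`, since `find_all` accepts a function that is called with each tag.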

‘list’ object has no attribute ‘get_attribute’ while iterating through WebElements

Let us see what’s happening in your code. Without any visibility into the relevant HTML, it seems the following line returns two WebElements in the list find_href, which are in turn appended to the list all_trails:

    find_href = browser.find_elements_by_xpath('//div[@class="text truncate trail-name"]/a[1]')

Hence, when we print the list all_trails, both WebElements are …
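The situation described above can be reproduced without a browser. In this sketch, `FakeWebElement` is a hypothetical stand-in for a Selenium WebElement and the href values are made up; it shows why appending the list raises the error, and one way to avoid it by extending the outer list instead.

```python
class FakeWebElement:
    """Hypothetical stand-in for a Selenium WebElement."""

    def __init__(self, href):
        self._attributes = {"href": href}

    def get_attribute(self, name):
        return self._attributes.get(name)


# Stand-in for the list that find_elements_by_xpath(...) returns
find_href = [FakeWebElement("/trail/a"), FakeWebElement("/trail/b")]

# append() stores the whole list as ONE item: all_trails == [[el, el]]
all_trails = []
all_trails.append(find_href)

error_message = None
try:
    all_trails[0].get_attribute("href")  # the item is a list, not an element
except AttributeError as exc:
    error_message = str(exc)

# extend() adds the elements themselves, so each item has get_attribute
flat_trails = []
flat_trails.extend(find_href)
hrefs = [element.get_attribute("href") for element in flat_trails]

print(error_message)
print(hrefs)
```

An alternative is to keep the nested list and loop over it twice, once for each inner list and once for the elements inside it.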

Web scraping program cannot find element which I can see in the browser

The element you’re interested in is dynamically generated, after the initial page load, which means that your browser executed JavaScript, made other network requests, etc. in order to build the page. Requests is just an HTTP library, and as such will not do those things. You could use a tool like Selenium, or perhaps even …