web-scraping – Page 8 – Make Me Engineer

Web scraping in PHP

July 20, 2022 by Tarik

I recommend you consider simple_html_dom for this. It will make it very easy. Here is a working example of how to pull the title, and first image.

<?php
require 'simple_html_dom.php';

$html = file_get_html('http://www.google.com/');
$title = $html->find('title', 0);
$image = $html->find('img', 0);

echo $title->plaintext."<br>\n";
echo $image->src;
?>

Here is a second example that will do the … Read more

Problems submitting a login form with Jsoup

July 18, 2022 by Tarik

Besides the username, password and the cookies, the site requires two additional values for the login: VIEWSTATE and EVENTVALIDATION. You can get them from the response of the first GET request, like this:

Document doc = loginForm.parse();
Element e = doc.select("input[id=__VIEWSTATE]").first();
String viewState = e.attr("value");
e = doc.select("input[id=__EVENTVALIDATION]").first();
String eventValidation = e.attr("value");

And … Read more
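The same idea can be sketched in Python using only the standard library: read the hidden inputs out of the login page returned by the first GET request, then send them back together with the credentials. The field names `username` and `password`, the sample HTML, and the helper `build_login_payload` are assumptions made for illustration; a real site will use its own names.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Attribute values can be None when the attribute has no value.
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_login_payload(login_page_html, username, password):
    """Merge the hidden form fields with the credentials (hypothetical helper)."""
    parser = HiddenFieldParser()
    parser.feed(login_page_html)
    payload = dict(parser.fields)    # e.g. __VIEWSTATE, __EVENTVALIDATION
    payload["username"] = username   # assumed field name
    payload["password"] = password   # assumed field name
    return payload


# Made-up stand-in for the HTML returned by the first GET request.
SAMPLE_LOGIN_PAGE = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123">
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789">
  <input type="text" name="username">
</form>
"""

payload = build_login_payload(SAMPLE_LOGIN_PAGE, "alice", "secret")
encoded = urlencode(payload)  # form-encoded body for the login POST
```

The hidden values change between page loads, so they have to be read fresh from the same session that submits the form.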

Detect when a web page is loaded without using sleep

July 17, 2022 by Tarik

Try the conventional method:

Set objIE = CreateObject("InternetExplorer.Application")
objIE.Visible = True
objIE.Navigate "https://www.yahoo.com/"
Do While objIE.ReadyState <> 4
    WScript.Sleep 10
Loop
' your code here
' …

UPD: this one should check for errors:

Set objIE = CreateObject("InternetExplorer.Application")
objIE.Visible = True
objIE.Navigate "https://www.yahoo.com/"
On Error Resume Next
Do
    If objIE.ReadyState = 4 Then
        If Err = … Read more
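The polling pattern above (check the ready state, pause briefly, check again) can be sketched in Python as well. This is not the original VBScript. The helper `wait_until` and the class `FakeBrowser` are hypothetical names introduced here, and the sketch adds a timeout so the loop cannot spin forever if the page never finishes loading.

```python
import time


def wait_until(is_ready, timeout=10.0, interval=0.01):
    """Poll is_ready() until it returns True or `timeout` seconds elapse.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)  # short pause between checks, like WScript.Sleep 10


class FakeBrowser:
    """Hypothetical stand-in whose ready_state becomes 4 after a few polls."""

    def __init__(self, polls_until_loaded):
        self._remaining = polls_until_loaded

    @property
    def ready_state(self):
        if self._remaining > 0:
            self._remaining -= 1
            return 1  # still loading
        return 4      # complete


browser = FakeBrowser(polls_until_loaded=3)
loaded = wait_until(lambda: browser.ready_state == 4, timeout=2.0)
never = wait_until(lambda: False, timeout=0.05)  # demonstrates the timeout path
```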

“SSL: certificate_verify_failed” error when scraping https://www.thenewboston.com/

July 16, 2022 by Tarik

The problem is not in your code but in the web site you are trying to access. When looking at the analysis by SSLLabs you will note: “This server’s certificate chain is incomplete. Grade capped to B.” This means that the server configuration is wrong and that not only Python but several others will have … Read more
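An incomplete chain usually means the server does not send an intermediate certificate that the client needs in order to reach a trusted root. The excerpt is cut off before it gives a remedy, so the following is only one possible client-side workaround, not necessarily the one the post goes on to describe: keep verification switched on and supply the missing intermediate in an extra CA file. The helper name `make_verifying_context` and the `extra_ca_file` parameter are assumptions for illustration.

```python
import ssl


def make_verifying_context(extra_ca_file=None):
    """Build an SSL context that still verifies certificates.

    extra_ca_file: optional path to a PEM file containing the intermediate
    certificate(s) the server fails to send (hypothetical parameter).
    """
    context = ssl.create_default_context()  # verification is on by default
    if extra_ca_file is not None:
        # Add the extra certificates on top of the system trust store.
        context.load_verify_locations(cafile=extra_ca_file)
    return context


context = make_verifying_context()
```

The resulting context can be passed to `urllib.request.urlopen(url, context=context)`. Disabling verification also makes the error go away, but it removes the protection that HTTPS is meant to provide.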

Reading dynamically generated web pages using python

July 16, 2022 by Tarik

You need a JavaScript engine to parse and run the JavaScript code inside the page. There are a number of headless browsers that can help you:

http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/

BeautifulSoup webscraping find_all( ): finding exact match

July 16, 2022 by Tarik

In BeautifulSoup 4, the class attribute (and several other attributes, such as accesskey and the headers attribute on table cell elements) is treated as a set; you match against individual elements listed in the attribute. This follows the HTML standard. As such, you cannot limit the search to just one class. You’ll have to use … Read more
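To make the difference concrete, here is a small sketch that uses only Python's standard library rather than BeautifulSoup itself. It compares set-style matching (the class is one of the listed values) with an exact comparison of the whole class list. The sample markup and the class names `product` and `special` are made up for illustration.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Record the class attribute of every start tag as a list of tokens."""

    def __init__(self):
        super().__init__()
        self.class_lists = []

    def handle_starttag(self, tag, attrs):
        value = dict(attrs).get("class")
        if value:
            # A class attribute is a whitespace-separated list of values.
            self.class_lists.append(value.split())


SAMPLE = """
<div class="product">A</div>
<div class="product special">B</div>
<div class="special product">C</div>
"""

collector = ClassCollector()
collector.feed(SAMPLE)

# Set-style matching: any element that lists "product" among its classes.
membership = [c for c in collector.class_lists if "product" in c]

# Exact matching: the class list is exactly ["product"] and nothing else.
exact = [c for c in collector.class_lists if c == ["product"]]
```

In BeautifulSoup, one way to express the same exact comparison is a function filter such as `soup.find_all(lambda tag: tag.get("class") == ["product"])`, since the library exposes the `class` attribute as a list.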

Python – Download Images from google Image search?

July 14, 2022 by Tarik

I have modified my code. Now the code can download 100 images for a given query, and the images are downloaded at full resolution, that is, the original images are fetched. I am downloading the images using urllib2 and Beautiful Soup.

from bs4 import BeautifulSoup
import requests
import re
import urllib2
import os
import cookielib
import json
… Read more

‘list’ object has no attribute ‘get_attribute’ while iterating through WebElements

July 12, 2022 by Tarik

Let us see what’s happening in your code: without any visibility into the relevant HTML, it seems the following line returns two WebElements into the list find_href, which are in turn appended to the list all_trails:

find_href = browser.find_elements_by_xpath('//div[@class="text truncate trail-name"]/a[1]')

Hence when we print the list all_trails, both the WebElements are … Read more
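The error in the title can be reproduced without a browser. In this sketch `FakeElement` is a hypothetical stand-in for a Selenium WebElement and the URLs are made up. The point is only that `get_attribute` belongs to each element and not to the list that holds them, so the list has to be iterated.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement."""

    def __init__(self, href):
        self._attrs = {"href": href}

    def get_attribute(self, name):
        return self._attrs.get(name)


# A find_elements_* call returns a list, even when only one element matches.
find_href = [
    FakeElement("https://example.com/trail/1"),
    FakeElement("https://example.com/trail/2"),
]

# Calling the method on the list itself raises AttributeError.
try:
    find_href.get_attribute("href")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Calling it on each element of the list works.
hrefs = [element.get_attribute("href") for element in find_href]
```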

How to import a table from web page (with “div class”) to excel?

July 12, 2022 by Tarik

You don’t need a browser to be opened. You can do this with XHR. The URL I am using can be found in the network tab via F12 (Dev tools). If you search that tab after making your request, you will find that URL, and the response has a layout such as: image link: https://i.stack.imgur.com/C8oLj.png … Read more

Web scraping program cannot find element which I can see in the browser

July 12, 2022 by Tarik

The element you’re interested in is dynamically generated, after the initial page load, which means that your browser executed JavaScript, made other network requests, etc. in order to build the page. Requests is just an HTTP library, and as such will not do those things. You could use a tool like Selenium, or perhaps even … Read more
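A quick way to confirm this diagnosis is to check whether the element exists in the raw HTML at all, since that is all an HTTP library receives. The sketch below uses only Python's standard library. The sample page and the ids `app` and `price` are made up for illustration.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect the id attribute of every element present in the static HTML."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id:
            self.ids.add(element_id)


def ids_in_static_html(html):
    collector = IdCollector()
    collector.feed(html)
    return collector.ids


# Made-up page: the "price" element only exists after the script has run.
STATIC_HTML = """
<html><body>
  <div id="app"></div>
  <script>
    document.getElementById("app").innerHTML = '<span id="price">42</span>';
  </script>
</body></html>
"""

ids = ids_in_static_html(STATIC_HTML)
found_in_static_html = "price" in ids  # False: the parser never runs the script
```

If the element is missing from the static HTML but visible in the browser, it was most likely built by JavaScript after the initial page load.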