screen-scraping – Page 2 – Make Me Engineer

file_get_contents() give me 403 Forbidden

November 21, 2022 by Tarik

This is not a problem in your script; it's a feature of your partner's web server security. It's hard to say exactly what's blocking you; most likely it's some sort of block against scraping. If your partner has access to the web server's setup, that might help pinpoint the cause. What you could do is to “fake … Read more

How I can get web page’s content and save it into the string variable

November 8, 2022 by Tarik

You can use the WebClient class:
using System.Net;
WebClient client = new WebClient();
string downloadString = client.DownloadString("http://www.gooogle.com");
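For comparison, the same idea can be sketched in Python with only the standard library. This is an illustration, not part of the original answer: the function name `fetch_text`, the `timeout` default, and the UTF-8 fallback are assumptions made for the example.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=10.0):
    """Download the resource at `url` and return its body as a string.

    `fetch_text` is a hypothetical helper name used for this sketch.
    """
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header if there is one;
        # falling back to UTF-8 is an assumption, not something the server promises.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_text("http://example.com/")` (a placeholder URL) would return the page source as a single string, much like `DownloadString` does in the C# snippet.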

web scraping dynamic content with python

October 11, 2022 by Tarik

Instead of trying to reverse engineer it, you can use ghost.py to directly interact with JavaScript on the page. If you run the following query in a Chrome console, you'll see it returns everything you want. document.getElementsByClassName('inline-text-org'); Returns [<div class="inline-text-org" title="University of Manchester">University of Manchester</div>, <div class="inline-text-org" title="University of California Irvine">University of California …</div> etc… … Read more

Simple Screen Scraping using jQuery

October 8, 2022 by Tarik

Use $.ajax to load the other page into a variable, then create a temporary element and use .html() to set the contents to the value returned. Loop through the element’s children of nodeType 1 and keep their first children’s nodeValues. If the external page is not on your web server you will need to proxy … Read more

Scraping ajax pages using python

October 1, 2022 by Tarik

First of all, the Scrapy docs are available at https://scrapy.readthedocs.org/en/latest/. As for handling AJAX while web scraping, the idea is rather simple: open the browser developer tools and switch to the network tab; go to the target site; click the submit button and see what XHR request is going to the server; simulate this XHR request in your spider. Also see: … Read more
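The last step, simulating the XHR request, might look roughly like the sketch below. It uses Python's standard library rather than a Scrapy spider, so it is a stand-in for the approach and not the answer's own code. The helper name `build_xhr_request`, the endpoint, the form fields, and the header values are all hypothetical; the real ones have to be copied from what the network tab shows.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_xhr_request(url, form_fields):
    """Build (but do not send) a POST request shaped like a browser XHR.

    Everything here is illustrative: copy the real URL, fields, and headers
    from the request observed in the developer tools.
    """
    body = urlencode(form_fields).encode("ascii")
    headers = {
        # Many JavaScript libraries send this header with AJAX calls,
        # and some servers check for it.
        "X-Requested-With": "XMLHttpRequest",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Accept": "application/json, text/javascript, */*; q=0.01",
    }
    return Request(url, data=body, headers=headers, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would send it. In a Scrapy spider the same URL, body, and headers would go into the request the spider yields.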

Download image file from the HTML page source

September 4, 2022 by Tarik

Here is some code to download all the images from the supplied URL, and save them in the specified output folder. You can modify it to your own needs. """ dumpimages.py Downloads all the images on the supplied URL, and saves them to the specified output file ("/test/" by default) Usage: python dumpimages.py http://example.com/ [output] … Read more
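The excerpt cuts off before the script body, so the following is not the original dumpimages.py. It is a minimal sketch of one piece such a script needs: finding the image URLs in a page's HTML source and resolving them against the page URL. The names `ImageSrcParser` and `find_image_urls` are made up for this example, and only `<img src="...">` tags are considered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcParser(HTMLParser):
    """Collect absolute URLs from the src attribute of <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the URL of the page.
                self.image_urls.append(urljoin(self.base_url, src))


def find_image_urls(html, base_url):
    """Return the image URLs found in `html`, in document order."""
    parser = ImageSrcParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.image_urls
```

Each returned URL could then be downloaded and written into the output folder, which is the part the original script handles.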

How to scroll down with Phantomjs to load dynamic content

July 25, 2022 by Tarik

Found a way to do it and tried to adapt it to your situation. I didn't test the best way of finding the bottom of the page because I had a different context, but check the solution below. The thing here is that you have to wait a little for the page to load and JavaScript … Read more

HTML Scraping in Php [duplicate]

July 20, 2022 by Tarik

I would recommend PHP Simple HTML DOM Parser after you have scraped the HTML from the page. It supports invalid HTML, and provides a very easy way to handle HTML elements.

What’s a good tool to screen-scrape with Javascript support? [closed]

July 20, 2022 by Tarik

You could use Selenium or Watir to drive a real browser. There are also some JavaScript-based headless browsers: PhantomJS is a headless WebKit browser. pjscrape is a scraping framework based on PhantomJS and jQuery. CasperJS is a navigation scripting & testing utility based on PhantomJS, if you need to do a little more than point … Read more

How do I prevent site scraping? [closed]

July 8, 2022 by Tarik

Note: Since the complete version of this answer exceeds Stack Overflow’s length limit, you’ll need to head to GitHub to read the extended version, with more tips and details. In order to hinder scraping (also known as Webscraping, Screenscraping, Web data mining, Web harvesting, or Web data extraction), it helps to know how these scrapers … Read more