Fetch a Wikipedia article with Python

To change the user agent, use urllib2, which supersedes urllib in the Python 2 standard library (in Python 3 the equivalent module is urllib.request). Straight from the examples:

    import urllib2

    # Build an opener and replace its default headers with a custom User-agent
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
    page = infile.read()
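The urllib2 module exists only in Python 2. As a rough sketch of the same idea in Python 3, the standard-library urllib.request module can send a custom User-Agent header. The helper names build_request and fetch_article, the https scheme, and the timeout value are assumptions made for this illustration and are not part of the original answer.

```python
import urllib.parse
import urllib.request

# Assumed values for illustration; pick a User-Agent that identifies your script.
USER_AGENT = "Mozilla/5.0"
BASE_URL = "https://en.wikipedia.org/w/index.php"


def build_request(title):
    """Build a request for the printable version of an article (hypothetical helper)."""
    query = urllib.parse.urlencode({"title": title, "printable": "yes"})
    return urllib.request.Request(
        f"{BASE_URL}?{query}",
        headers={"User-Agent": USER_AGENT},
    )


def fetch_article(title, timeout=10):
    """Download the page and return its raw bytes (performs a network call)."""
    with urllib.request.urlopen(build_request(title), timeout=timeout) as response:
        return response.read()


# Example (requires network access):
# page = fetch_article("Albert_Einstein")
```

Splitting request construction from the download keeps the header logic checkable without touching the network.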

How to get the Infobox data from Wikipedia?

Use the MediaWiki API through this Python library: https://github.com/siznax/wptools

Usage:

    import wptools

    so = wptools.page('Stack Overflow').get_parse()
    infobox = so.data['infobox']
    print(infobox)

Output:

    {'alexa': '{{Increase}} 34 ( {{as of|2019|12|15|lc|=|y}} )',
     'author': '[[Jeff Atwood]] and [[Joel Spolsky]]',
     'caption': 'Screenshot of Stack Overflow in February 2017',
     'commercial': 'Yes',
     'content_license': '[[Creative Commons license|CC-BY-SA]] 4.0',
     'current_status': 'Online',
     'language': 'English, Spanish, Russian, …
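The infobox values in that output are raw wikitext, so links still appear as [[target]] or [[target|label]]. As a minimal sketch, assuming you only want the visible link text, a small regular expression can unwrap them. The function name strip_wikilinks is hypothetical and not part of wptools, and templates such as {{Increase}} are deliberately left untouched.

```python
import re

# Matches [[target]] and [[target|label]]; group 1 is the text a reader would see.
WIKILINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def strip_wikilinks(value):
    """Replace wiki links in a wikitext string with their visible text (hypothetical helper)."""
    return WIKILINK.sub(r"\1", value)


# Values copied from the infobox output shown above.
print(strip_wikilinks("[[Jeff Atwood]] and [[Joel Spolsky]]"))
print(strip_wikilinks("[[Creative Commons license|CC-BY-SA]] 4.0"))
```

This handles simple links only; nested markup or templates would need a proper wikitext parser.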

Extract the first paragraph from a Wikipedia article (Python)

I wrote a Python library that aims to make this very easy. Check it out on GitHub. To install it, run

    $ pip install wikipedia

Then, to get the first paragraph of an article, use the wikipedia.summary function:

    >>> import wikipedia
    >>> print(wikipedia.summary("Albert Einstein", sentences=2))

This prints

    Albert Einstein (/ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪnʃtaɪn] …

Is there a Wikipedia API just for retrieving the content summary?

There’s a way to get the entire “introduction section” without any HTML parsing! Similar to AnthonyS’s answer, but with an additional explaintext parameter, you can get the introduction section as plain text.

Query

Getting Stack Overflow’s introduction in plain text.

Using the page title:

    https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles=Stack%20Overflow

Or use pageids:

    https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&pageids=21721040

JSON Response (warnings stripped)

    { "query": …
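The query above can be assembled and read from Python with the standard library alone. The following is a sketch under stated assumptions: the helper names are hypothetical, the flags exintro and explaintext are sent as exintro=1 and explaintext=1 rather than bare names, and the response is assumed to follow the default JSON layout (query, then pages keyed by page ID, each with an extract field), since the response shown in the excerpt is cut off.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title):
    """Build the extracts query URL for a page title (hypothetical helper)."""
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exintro": 1,        # only the introduction section
        "explaintext": 1,    # plain text instead of HTML
        "redirects": 1,
        "titles": title,
    }
    # quote_via=quote encodes spaces as %20, matching the URL in the text above.
    return API_URL + "?" + urllib.parse.urlencode(params, quote_via=urllib.parse.quote)


def extract_intro(response):
    """Return the extract of the first page in a decoded response (assumed layout)."""
    pages = response["query"]["pages"]
    page = next(iter(pages.values()))
    return page.get("extract", "")


def fetch_intro(title, user_agent="intro-example/0.1"):
    """Fetch and decode the introduction (performs a network call; user_agent is a placeholder)."""
    request = urllib.request.Request(
        build_extract_url(title), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_intro(json.load(response))


# Example (requires network access):
# print(fetch_intro("Stack Overflow"))
```

Taking the first entry of pages avoids needing to know the page ID in advance when querying by title.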

How to extract information from a Wikipedia infobox?

The wrong way: trying to parse HTML

Use (cURL/jQuery/file_get_contents/requests/wget/more jQuery) to fetch the HTML of the article, then use a DOM parser to extract table.infobox tr[3] td, or use a regex. This is actually a really bad idea most of the time. Wikipedia’s HTML code is not particularly parsing-friendly (especially infoboxes which are …
