wikipedia – Make Me Engineer

How to get plain text out of Wikipedia

June 13, 2023 by Tarik

Here are a few possible approaches; use whichever works for you. All my code examples below use requests for HTTP requests to the API; you can install it with pip install requests if you have pip. They also all use the MediaWiki API, and two use the query endpoint; follow those links if you … Read more

Fetch a Wikipedia article with Python

May 12, 2023 by Tarik

In Python 2, you need to use urllib2, which supersedes urllib in the standard library, in order to change the user agent (in Python 3, the same functionality lives in urllib.request). Straight from the examples:

    import urllib2
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
    page = infile.read()
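For Python 3, where urllib2 no longer exists, a rough equivalent of the urllib2 snippet might look like the sketch below. The helper names build_request and fetch_page are hypothetical, and the URL is switched to https as an assumption; the User-Agent value is the one from the excerpt.

```python
import urllib.parse
import urllib.request


def build_request(title, user_agent="Mozilla/5.0"):
    """Build a Request for the printable version of a Wikipedia article.

    build_request is a hypothetical helper name, not part of any library.
    """
    query = urllib.parse.urlencode({"title": title, "printable": "yes"})
    url = "https://en.wikipedia.org/w/index.php?" + query
    # Passing headers here replaces the opener.addheaders step from urllib2.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_page(title):
    """Download the article HTML as text (needs network access)."""
    with urllib.request.urlopen(build_request(title)) as response:
        return response.read().decode("utf-8")


# Example usage (performs a real HTTP request):
# page = fetch_page("Albert_Einstein")
```

Splitting the request construction from the download keeps the header logic easy to check without touching the network.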

How to get the Infobox data from Wikipedia?

May 3, 2023 by Tarik

Use the MediaWiki API through this Python library: https://github.com/siznax/wptools

Usage:

    import wptools
    so = wptools.page('Stack Overflow').get_parse()
    infobox = so.data['infobox']
    print(infobox)

Output:

    {'alexa': '{{Increase}} 34 ( {{as of|2019|12|15|lc|=|y}} )',
     'author': '[[Jeff Atwood]] and [[Joel Spolsky]]',
     'caption': 'Screenshot of Stack Overflow in February 2017',
     'commercial': 'Yes',
     'content_license': '[[Creative Commons license|CC-BY-SA]] 4.0',
     'current_status': 'Online',
     'language': 'English, Spanish, Russian, … Read more

Extract the first paragraph from a Wikipedia article (Python)

May 2, 2023 by Tarik

I wrote a Python library that aims to make this very easy. Check it out on GitHub. To install it, run

    $ pip install wikipedia

Then to get the first paragraph of an article, just use the wikipedia.summary function:

    >>> import wikipedia
    >>> print(wikipedia.summary("Albert Einstein", sentences=2))

prints

    Albert Einstein (/ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪnʃtaɪn] … Read more

Is there a Wikipedia API just for retrieving the content summary?

July 23, 2022 by Tarik

There’s a way to get the entire “introduction section” without any HTML parsing! Similar to AnthonyS’s answer, but with an additional explaintext parameter, you can get the introduction section in plain text.

Query

Getting Stack Overflow’s introduction in plain text:

Using the page title:

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles=Stack%20Overflow

Or use pageids:

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&pageids=21721040

JSON Response (warnings stripped)

{ "query": … Read more
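As a minimal sketch, the title-based query above could be assembled and its response unpacked in Python roughly as follows. The function names and the User-Agent string are hypothetical, and extract_intro assumes the default JSON layout in which results sit under query.pages keyed by page ID, each with an extract field. The flag parameters are sent as exintro=1 and explaintext=1; as far as I know, MediaWiki treats a boolean parameter as set whenever it is present.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_intro_url(title):
    """Return the API URL for the plain-text introduction of one page."""
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exintro": 1,       # flag: only the section before the first heading
        "explaintext": 1,   # flag: plain text instead of HTML
        "redirects": 1,
        "titles": title,
    }
    # quote_via=quote encodes spaces as %20, matching the URL in the post.
    return API_URL + "?" + urllib.parse.urlencode(
        params, quote_via=urllib.parse.quote
    )


def extract_intro(response):
    """Pull the first available extract out of a decoded JSON response.

    Assumes pages are keyed by page ID under response["query"]["pages"].
    """
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        if "extract" in page:
            return page["extract"]
    return None


def fetch_intro(title):
    """Fetch and return the introduction text (needs network access)."""
    request = urllib.request.Request(
        build_intro_url(title),
        headers={"User-Agent": "intro-example/0.1"},  # hypothetical value
    )
    with urllib.request.urlopen(request) as response:
        return extract_intro(json.load(response))


# Example usage (performs a real HTTP request):
# print(fetch_intro("Stack Overflow"))
```

Looping over the pages instead of indexing by a known ID means the same helper works for both the titles and the pageids form of the query.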

How to extract information from a Wikipedia infobox?

May 20, 2022 by Tarik

The wrong way: trying to parse HTML

Use (cURL/jQuery/file_get_contents/requests/wget/more jQuery) to fetch the article’s HTML, then use a DOM parser to extract table.infobox tr[3] td, or use a regex. This is actually a really bad idea most of the time. Wikipedia’s HTML code is not particularly parsing-friendly (especially infoboxes which are … Read more

Is there a Wikipedia API?

May 16, 2022 by Tarik

MediaWiki’s API is running on Wikipedia (docs). You can also use the Special:Export feature to dump data and parse it yourself. More information.
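As an illustration of the Special:Export route, here is a small sketch that builds an export URL for one page and pulls the raw wikitext out of the returned XML. The function names are hypothetical, and the sketch assumes the export document stores the page source in a text element whose XML namespace may vary between MediaWiki versions, which is why a namespace wildcard is used (this needs Python 3.8 or later).

```python
import urllib.parse
import xml.etree.ElementTree as ET


def build_export_url(title):
    """Return the Special:Export URL for a single page title."""
    # Wikipedia page URLs use underscores in place of spaces.
    slug = urllib.parse.quote(title.replace(" ", "_"))
    return "https://en.wikipedia.org/wiki/Special:Export/" + slug


def wikitext_from_export(xml_text):
    """Return the wikitext of the first revision found, or None.

    "{*}text" matches a <text> element in any XML namespace, so the code
    does not depend on a specific export schema version.
    """
    root = ET.fromstring(xml_text)
    node = root.find(".//{*}text")
    return node.text if node is not None else None


# Example usage: download build_export_url("Stack Overflow") with any HTTP
# client, then pass the response body to wikitext_from_export().
```

What comes back is wikitext rather than plain text, so templates and link markup still have to be parsed separately.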