maurobio maurobio - 4 months ago 23
Python Question

Parse Wikipedia Wikitext Template Named Parameters to Extract Data from Taxobox

Using Python, I am attempting to extract data from the several "fields" of a Wikipedia Taxobox (an infobox which is usually displayed for each animal or plant species page, see for example here:

The solution provided here (How to use Wikipedia API to get section of sidebar?) is interesting but not useful in my case, since I am interested in data from a lower taxonomic category (species).

What I want is a way (as pythonic as possible) to access every field in a Taxobox and then get the data (as a dictionary, perhaps) of interest.

Thanks in advance for any assistance.

EDIT: Here is another good solution that should be what I need, but unfortunately it is a set of command-line tools (which also depend on other command-line tools available only on Linux) rather than a Python library.

EDIT2: wptools is a (python 2,3) library now.


This is a significant rewrite that includes a more proper parser to match the template's closing double braces '}}'. It also makes it easier to request different template names and includes a main() to allow testing from the shell / command line.

import sys
import re
import requests
import json

wikiApiRoot = ''

# returns the position past the requested token or end of string if not found
def FindToken(text, token, start=0):
    pos = text.find(token, start)
    if -1 == pos:
        nextTokenPos = len(text)
    else:
        nextTokenPos = pos
    return nextTokenPos + len(token)

# Get the contents of the template as text
def GetTemplateText(wikitext, templateName):
    templateTag = '{{' + templateName

    startPos = FindToken(wikitext, templateTag)
    if (len(wikitext) <= startPos):
        # Template not found
        return None

    openCount = 1
    curPos = startPos
    nextOpenPos = FindToken(wikitext, '{{', curPos)
    nextClosePos = FindToken(wikitext, '}}', curPos)

    # scan for template's matching close braces
    while 0 < openCount:
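        if nextOpenPos < nextClosePos:
            # a nested template opens before the next close
            openCount += 1
            curPos = nextOpenPos
            nextOpenPos = FindToken(wikitext, '{{', curPos)
        else:
            # a close comes first; it ends the innermost open template
            openCount -= 1
            curPos = nextClosePos
            nextClosePos = FindToken(wikitext, '}}', curPos)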

    templateText = wikitext[startPos:curPos-2]
    return templateText

def GetTemplateDict(title, templateName='Taxobox'):
    templateDict = None

    # Get data from Wikipedia:

    resp = requests.get(wikiApiRoot + '?action=query&prop=revisions&' +
        'rvprop=content&rvsection=0&format=json&redirects&titles=' +
        title)

    # Get the response text into a JSON object:

    rjson = json.loads(resp.text)

    # Pull out the text for the revision:

    # list() keeps this working on Python 3, where dict.values() is a view
    wikitext = list(rjson['query']['pages'].values())[0]['revisions'][0]['*']

    # Parse the text for the template

    templateText = GetTemplateText(wikitext, templateName)

    if templateText:

        # Parse templateText to get named properties

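        # NOTE: the pattern below is an assumed reconstruction; it expects
        # one "| name = value" pair per line.
        templateItemIter = re.finditer(
            r'\|\s*(\w+)\s*=[ \t]*([^\n]*)',
            templateText)
        templateList = [item.groups() for item in templateItemIter]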
        templateDict = dict(templateList)

    return templateDict

def main():
    import argparse
    import pprint

    parser = argparse.ArgumentParser()
    parser.add_argument('title', nargs='?', default='Okapia_johnstoni', help='title of the desired article')
    parser.add_argument('template', nargs='?', default='Taxobox', help='name of the desired template')
    args = parser.parse_args()

    templateDict = GetTemplateDict(args.title, args.template)
    pprint.pprint(templateDict)

if __name__ == "__main__":
    main()

GetTemplateDict returns a dictionary of the page's taxobox entries. For the Okapi page, this includes:

  • binomial
  • binomial_authority
  • classis
  • familia
  • genus
  • genus_authority
  • image
  • image_caption
  • ordo
  • phylum
  • regnum
  • species
  • status
  • status_ref
  • status_system
  • trend

I expect the actual items to vary by page.
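
Because the set of keys differs between pages, it may help to read fields through a small wrapper rather than indexing the dictionary directly. The sketch below is only an illustration: pick_fields is a hypothetical helper, not part of the code above, and the sample dictionary is made-up input standing in for whatever GetTemplateDict returns (including None when the template is not found).

```python
def pick_fields(template_dict, wanted, default=None):
    """Return only the wanted keys, using `default` for any that are missing.

    Hypothetical helper; `template_dict` may be None, which is what
    GetTemplateDict returns when the template is not found.
    """
    template_dict = template_dict or {}
    return {key: template_dict.get(key, default) for key in wanted}


# Made-up sample input, standing in for a real GetTemplateDict result.
sample = {'genus': "''Okapia''", 'familia': '[[Giraffidae]]'}

picked = pick_fields(sample, ['familia', 'genus', 'subspecies'])
missing = pick_fields(None, ['familia'])
```

Here picked holds the two fields that exist plus subspecies mapped to None, so the calling code does not need a try/except around every lookup.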

The dictionary values are Wikipedia's decorated text:

>>> taxoDict['familia'] 

So additional parsing or filtering may be desired or required.
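
As a rough idea of what that filtering could look like, here is a minimal sketch that strips a few common kinds of wiki markup from a value. strip_wiki_markup is a hypothetical helper, the input strings are illustrative rather than copied from a real page, and the regular expressions cover only simple links, quote-based bold/italic and ref tags; nested templates and other markup are left untouched.

```python
import re


def strip_wiki_markup(value):
    """Return a plainer version of a template value (hypothetical helper)."""
    # Drop self-closing <ref ... /> tags, then paired <ref>...</ref> footnotes.
    value = re.sub(r'<ref[^>]*?/>', '', value)
    value = re.sub(r'<ref[^>]*>.*?</ref>', '', value, flags=re.DOTALL)
    # [[Target|label]] -> label, [[Target]] -> Target
    value = re.sub(r'\[\[(?:[^\]|]*\|)?([^\]]*)\]\]', r'\1', value)
    # Remove runs of two or more apostrophes (wiki italic/bold markers).
    value = re.sub(r"'{2,}", '', value)
    return value.strip()


# Illustrative inputs only.
plain_link = strip_wiki_markup('[[Giraffidae]]')
piped_link = strip_wiki_markup('[[Philip Sclater|P.L. Sclater]], 1901')
italic = strip_wiki_markup("''Okapia''")
with_ref = strip_wiki_markup('EN<ref name="iucn">some citation</ref>')
```

For anything beyond these simple cases, a dedicated wikitext parser is likely a better fit than hand-written regular expressions.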