Souvik Saha - 5 months ago
JSON Question

Converting the output of MediaWiki to plain text

I'm using the MediaWiki API to look up the search term Tiger with this request:

https://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Tiger&format=json&exintro=1


Response:

{"batchcomplete":"","query":{"pages":{"9796":{"pageid":9796,"ns":0,"title":"Tiger","extract":"<p>The <b>tiger</b> (<i>Panthera tigris</i>) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.</p>\n<p></p>"}}}}


How do I get the output as plain text, like this?


The tiger (Panthera tigris) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.


Could someone also tell me how to store the result in a text file? I'm a beginner, so please be nice. I need this for a project I'm doing in Bash on a Raspberry Pi 2 running Raspbian.

Answer

It's usually recommended to use a JSON parser for handling JSON. One that I like is jq:

% jq -r '.query.pages[].extract' file
<p>The <b>tiger</b> (<i>Panthera tigris</i>) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.</p>
<p></p>

To remove the HTML tags you can do something like:

... | sed 's/<[^>]*>//g'

This removes HTML tags that do not span multiple lines:

% jq -r '.query.pages[].extract' file | sed 's/<[^>]*>//g'
The tiger (Panthera tigris) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.

Here, file is the file the JSON is stored in, e.g.:

curl -so - 'https://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Tiger&format=json&exintro=1' > file
jq '...' file

or

jq '...' <(curl -so - 'https://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Tiger&format=json&exintro=1')
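The question also asks how to store the result in a text file. A minimal sketch of the whole pipeline with the output redirected to a file could look like this. The file names response.json and tiger.txt are placeholders chosen for illustration, and the sample response is written locally only so the sketch runs offline; normally response.json would come from the curl command above. The extra sed expression '/^$/d' is an addition that drops the empty line left behind by the trailing <p></p>.

```shell
#!/bin/bash
# Sample API response (copied from the question) so the sketch runs offline.
# In practice, create response.json with the curl command shown above.
cat > response.json <<'EOF'
{"batchcomplete":"","query":{"pages":{"9796":{"pageid":9796,"ns":0,"title":"Tiger","extract":"<p>The <b>tiger</b> (<i>Panthera tigris</i>) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.</p>\n<p></p>"}}}}
EOF

# 1. jq pulls out the HTML extract of every page in the response.
# 2. sed strips the HTML tags, then deletes lines that end up empty.
# 3. '>' writes the result to tiger.txt (use '>>' to append instead).
jq -r '.query.pages[].extract' response.json \
  | sed -e 's/<[^>]*>//g' -e '/^$/d' > tiger.txt

# Show what was stored.
cat tiger.txt
```

Using > overwrites tiger.txt on every run; >> would append, which may be handy when collecting extracts for several search terms in one file.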

You can install jq with:

sudo apt-get install jq

For your example input you can also use grep with -P (PCRE), but using a proper JSON parser as above is recommended:

grep -oP '(?<=extract":").*?(?=(?<!\\)")' file 
<p>The <b>tiger</b> (<i>Panthera tigris</i>) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.</p>\n<p></p>
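Note that the grep output still contains the HTML tags and the JSON escape \n as two literal characters. A rough sketch of cleaning that up follows, assuming GNU sed (as shipped with Raspbian), where \n in the replacement means a newline. The file name response.json and the variable plain are placeholders chosen for illustration. Other JSON escapes such as \" or \uXXXX would stay undecoded, which is one more reason to prefer jq.

```shell
#!/bin/bash
# Sample API response (copied from the question) so the sketch runs offline.
cat > response.json <<'EOF'
{"batchcomplete":"","query":{"pages":{"9796":{"pageid":9796,"ns":0,"title":"Tiger","extract":"<p>The <b>tiger</b> (<i>Panthera tigris</i>) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.</p>\n<p></p>"}}}}
EOF

# First sed: turn the literal two-character sequence \n into a real newline
#            (GNU sed behaviour).
# Second sed: strip HTML tags, then delete lines that end up empty.
# Two sed processes are used so the second one sees the new lines separately.
plain=$(grep -oP '(?<=extract":").*?(?=(?<!\\)")' response.json \
  | sed 's/\\n/\n/g' \
  | sed -e 's/<[^>]*>//g' -e '/^$/d')

echo "$plain"
```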