Samrat Hasan - 20 days ago
Java Question

Extract noun mentions from Wikipedia articles

I want to extract the noun mentions (anchor text and the corresponding linked articles), which are the in-links found in Wikipedia articles. For example, in the article https://en.wikipedia.org/wiki/Stack_Overflow, "Stack Exchange Network" is an in-link, so I treat it as a mention, and I want to extract its anchor text together with the linked Wikipedia article, en.wikipedia.org/wiki/Stack_Exchange, to construct a dataset for my experiment.

I can already extract the in-links from Wikipedia articles and check whether each one is a noun or noun phrase using the Stanford CoreNLP library, so that part is not a problem. What I also need is the anchor text, which is not directly available in the Wikipedia dump. Since the dump is large, is there an efficient way to extract it? Any suggestions would be appreciated.

Answer

You can use the wikixmlj library to parse Wikipedia articles and extract the anchor text of the in-links. Because you have to process millions of Wikipedia articles and also run Stanford CoreNLP on them to check for noun phrases, it is worth writing a multi-threaded program. To handle each parsed page, override the following method:

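@Override
public void process(WikiPage page) {
    // Called once for every page the parser reads from the dump.
    String title = page.getTitle().trim();
    // Redirect pages contain no article text, so only handle real articles.
    if (page.getRedirectPage() == null) {
        /* write your code here */
    }
}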
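The part marked "write your code here" is where the anchor text would be pulled out. As a rough sketch, assuming you have the page's raw wikitext as a String (how you get it from WikiPage depends on your wikixmlj version), internal links can be matched with a regular expression, since they are written as [[Target]] or [[Target|anchor text]]. The class and method names below (AnchorExtractor, Mention, extract) are made up for illustration and are not part of wikixmlj. Skipping every target that contains a colon is a crude way to drop File:, Category: and interlanguage links, and it will also drop real articles whose titles contain a colon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of wikixmlj: pulls (anchor text, target) pairs out of wikitext.
public class AnchorExtractor {

    /** One mention: the visible anchor text and the title of the linked article. */
    public static final class Mention {
        public final String anchorText;
        public final String targetTitle;

        Mention(String anchorText, String targetTitle) {
            this.anchorText = anchorText;
            this.targetTitle = targetTitle;
        }

        @Override
        public String toString() {
            return anchorText + " -> " + targetTitle;
        }
    }

    // Matches [[Target]] and [[Target|anchor]]; group 1 = target, group 2 = optional anchor.
    private static final Pattern WIKI_LINK =
            Pattern.compile("\\[\\[([^\\[\\]|]+)(?:\\|([^\\[\\]|]*))?\\]\\]");

    public static List<Mention> extract(String wikiText) {
        List<Mention> mentions = new ArrayList<>();
        Matcher m = WIKI_LINK.matcher(wikiText);
        while (m.find()) {
            String rawTarget = m.group(1).trim();
            // Crude namespace filter (assumption): skip File:, Category:, language links, etc.
            if (rawTarget.isEmpty() || rawTarget.contains(":")) {
                continue;
            }
            // Drop a "#Section" fragment so the target is just the article title.
            int hash = rawTarget.indexOf('#');
            String target = (hash >= 0 ? rawTarget.substring(0, hash) : rawTarget).trim();
            if (target.isEmpty()) {
                continue; // link to a section of the same page
            }
            // Without an explicit anchor, the link text is the target as written.
            String anchor = m.group(2);
            if (anchor == null || anchor.trim().isEmpty()) {
                anchor = rawTarget;
            }
            mentions.add(new Mention(anchor.trim(), target));
        }
        return mentions;
    }
}
```

Each Mention's anchorText can then be passed to Stanford CoreNLP to decide whether it is a noun or noun phrase before you write it to the dataset.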

You can see a complete example in this GitHub repository to get a comprehensive idea of the extraction process. The program there also extracts common noun mentions from Wikipedia articles, which I believe is what you are looking for.
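For the multi-threaded part mentioned in this answer, one possible shape is a fixed-size thread pool that applies a per-page function to a batch of page texts. This is only a sketch under assumptions: ParallelPageProcessor and processAll are hypothetical names, the worker function stands in for whatever you do per page (link extraction plus the CoreNLP check), and a real run over the full dump would feed pages in bounded batches rather than holding everything in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: runs a per-page worker over a batch of page texts in parallel.
public class ParallelPageProcessor {

    public static <R> List<R> processAll(List<String> pageTexts,
                                         Function<String, R> worker,
                                         int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per page; futures keep the input order.
            List<Future<R>> futures = new ArrayList<>();
            for (String text : pageTexts) {
                Callable<R> task = () -> worker.apply(text);
                futures.add(pool.submit(task));
            }
            // Collect results in the same order the pages were given.
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while processing pages", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A page task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The worker you pass in should be safe to call from several threads at once, so check how the NLP pipeline you use behaves under concurrent calls before sharing one instance.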