not_a_robot not_a_robot - 1 month ago 9
Python Question

Traversing/navigating downloaded nltk subpackages?

For a particular script I'm running, I need to have the following packages from nltk installed:

req_modules = ['punkt', 'stopwords', 'averaged_perceptron_tagger', 'maxent_ne_chunker']


I know I can check whether stopwords is downloaded, like this:

import nltk
import os

if 'stopwords' in os.listdir(nltk.data.find('corpora')):
    print(True)
else:
    print(False)


For me, since I've used stopwords before, this works. However, I want to be able to programmatically check if the other three modules are installed, eventually using something like:

if not all(m in os.listdir(nltk.data.find('models')) for m in ['punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker']):
    # download the ones that aren't already downloaded
    pass


They are all labeled as modules in the downloader accessed at nltk.download(). This should be an easy lookup, so I tried something like this to get all downloaded subpackages in one list:

all_downloaded = os.listdir(nltk.data.find("corpora")) + os.listdir(nltk.data.find("models"))


But I get the error LookupError: Resource 'models' not found. How can I search the 'models' tab in nltk.data just like I can search 'corpora'? I assume the naming conventions for finding these resources are the same, since "corpora" is the name of the tab seen in the downloader below.

[Image: screenshot of the NLTK downloader]

Edit:

Taking into account the suggestion below, I tried the code below, but still get an ImportError, even though I have exception-handling. What is going on there?

req_modules = {'from nltk import punkt': 'punkt', 'from nltk.corpus import stopwords': 'stopwords',
               'from nltk import pos_tag': 'averaged_perceptron_tagger',
               'from nltk import ne_chunk': 'maxent_ne_chunker',
               'from nltk.stem.porter import PorterStemmer': 'porter_test'}

for m in req_modules:
    try:
        print("Trying: %s" % m)
        exec(m)
    except LookupError or ImportError:
        print("Tried: %s. Resource '%s' was not available and is being downloaded.\n" % (m, req_modules[m]))
        nltk.download(req_modules[m])
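
A note on the handler above, offered as a sketch rather than a diagnosis of the exact traceback: in Python, the expression LookupError or ImportError is evaluated as an ordinary boolean expression and yields LookupError, so that except clause would not catch an ImportError. Listing the exception types in a tuple catches either one. The helper name try_import is hypothetical and only serves the illustration.

```python
# `LookupError or ImportError` is evaluated before the except clause uses it;
# a class object is truthy, so the expression is simply LookupError.
assert (LookupError or ImportError) is LookupError


def try_import(statement):
    """Run an import statement (hypothetical helper for illustration).

    Returns None on success, or the name of the exception class on failure.
    """
    try:
        exec(statement, {})
    except (LookupError, ImportError) as err:  # a tuple catches either type
        return type(err).__name__
    return None


print(try_import("import os"))                    # None
print(try_import("from os import no_such_name"))  # ImportError
```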


Edit 2:

I got it to work, never mind. I changed
from nltk import porter_test
to
from nltk.stem.porter import PorterStemmer
and things work smoothly!

Answer

Looks like you are confusing nltk modules with the files in the nltk_data directory, which the modules use. When you install nltk, you get all of its packages. Various modules and functions require data files, which you fetch into nltk_data with the downloader. (Some of them are in the category "Models", which you may be confusing with "modules".) To figure out which data file to check for, you could run the corresponding function without an nltk_data folder and inspect the error message. For example:

>>> nltk.ne_chunk("anything")
Traceback (most recent call last):
...
raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource
  'chunkers/maxent_ne_chunker/PY3/english_ace_multiclass.pickle'   
  not found.  Please use the NLTK Downloader to obtain the 
  ...
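
Once the path is known from the error message, it can be passed to nltk.data.find, which raises LookupError when the resource is absent. The following is a minimal sketch of that check. Only the chunkers/maxent_ne_chunker prefix comes from the error above; the other three paths are assumptions that follow the same category/name pattern and may differ between NLTK versions. The names RESOURCE_PATHS and missing_resources, and the find parameter, are hypothetical.

```python
# Download name -> path inside nltk_data. Only the chunker entry is taken
# from the error message above; the other paths are assumptions.
RESOURCE_PATHS = {
    "punkt": "tokenizers/punkt",
    "stopwords": "corpora/stopwords",
    "averaged_perceptron_tagger": "taggers/averaged_perceptron_tagger",
    "maxent_ne_chunker": "chunkers/maxent_ne_chunker",
}


def missing_resources(paths, find=None):
    """Return the download names whose data path cannot be found.

    `find` defaults to nltk.data.find; any callable that raises
    LookupError for a missing path can be substituted.
    """
    if find is None:
        import nltk  # imported lazily so the helper itself does not need NLTK
        find = nltk.data.find
    missing = []
    for name, path in paths.items():
        try:
            find(path)
        except LookupError:
            missing.append(name)
    return missing
```

Each name returned by missing_resources(RESOURCE_PATHS) could then be passed to nltk.download.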

But if it were me, I would not mess with the data files directly. Instead, just try out the service you want and see if it raises an error:

 try:
     nltk.ne_chunk([])
 except LookupError:
     nltk.download("maxent_ne_chunker")
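
The same try-and-download pattern can be extended to all four resources from the question. This is only a sketch under stated assumptions: ensure_available and default_probes are hypothetical names, the probe calls are guesses at functions that need each resource, and the resource names are the ones from the question, which newer NLTK releases may have renamed.

```python
def ensure_available(probes, download):
    """Call each probe; on LookupError, download the named resource.

    `probes` maps a download name to a zero-argument callable that uses
    the resource. Returns the list of names that had to be downloaded.
    """
    fetched = []
    for resource, probe in probes.items():
        try:
            probe()
        except LookupError:
            download(resource)
            fetched.append(resource)
    return fetched


def default_probes():
    """Probes for the question's resources (assumed pairings, not verified)."""
    import nltk  # imported lazily; only needed when the probes are built
    return {
        "punkt": lambda: nltk.word_tokenize("A test."),
        "stopwords": lambda: nltk.corpus.stopwords.words("english"),
        "averaged_perceptron_tagger": lambda: nltk.pos_tag(["test"]),
        "maxent_ne_chunker": lambda: nltk.ne_chunk([]),
    }
```

With NLTK installed, ensure_available(default_probes(), nltk.download) would run each probe once and fetch whatever raised a LookupError.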