Python Question

Docker NLTK Download

I am building a docker container using the following Dockerfile:

FROM ubuntu:14.04

RUN apt-get update

RUN apt-get install -y python python-dev python-pip

ADD . /app

RUN apt-get install -y python-scipy

RUN pip install -r /app/requirements.txt



CMD python

Everything goes well until I run the image and get the following error:

Resource u'tokenizers/punkt/english.pickle' not found. Please
use the NLTK Downloader to obtain the resource: >>>
Searched in:
- '/root/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- u''

I have had this problem before, and it has been discussed elsewhere, but I am not sure how to approach it using Docker. I have tried:

CMD python
CMD import nltk

as well as:

CMD python -m nltk.downloader -d /usr/share/nltk_data popular

But I am still getting the error.

Answer Source

In your Dockerfile, try adding this instead:

RUN python -m nltk.downloader punkt

This runs the command while the image is being built and installs the requested files to //nltk_data/, so they are already part of the image when a container starts.
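The download helps because the resource then presumably sits in one of the directories listed under "Searched in" in the error message. As a rough illustration only, here is a simplified model of that kind of search-path lookup. It is not NLTK's actual implementation, and `find_resource` is a hypothetical helper; temporary directories stand in for the real nltk_data locations.

```python
import os
import tempfile


def find_resource(resource, search_dirs):
    """Return the first existing path for `resource` under `search_dirs`, or None.

    Hypothetical helper: a simplified model of a search-path lookup,
    not NLTK's real code.
    """
    for base in search_dirs:
        candidate = os.path.join(base, *resource.split("/"))
        if os.path.exists(candidate):
            return candidate
    return None


# Temporary directories stand in for the nltk_data search locations.
with tempfile.TemporaryDirectory() as empty_dir, \
        tempfile.TemporaryDirectory() as data_dir:
    resource = "tokenizers/punkt/english.pickle"
    search_dirs = [empty_dir, data_dir]

    # Nothing has been downloaded yet, so the lookup fails.
    missing = find_resource(resource, search_dirs)

    # Simulate a completed download into one of the searched directories.
    target = os.path.join(data_dir, "tokenizers", "punkt", "english.pickle")
    os.makedirs(os.path.dirname(target))
    open(target, "wb").close()

    # The same lookup now succeeds.
    found = find_resource(resource, search_dirs)

print(missing)            # None
print(found is not None)  # True
```

In the real image, the same idea applies: the files must exist in a searched directory before the interpreter starts, which is why the download has to happen at build time.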

The problem is most likely related to using CMD vs. RUN in the Dockerfile. Documentation for CMD:

The main purpose of a CMD is to provide defaults for an executing container.

CMD therefore takes effect at docker run <image>, not during the build. In addition, only the last CMD in a Dockerfile is used, so the other CMD lines were probably overridden by the final CMD python line.
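To make the "last CMD wins" behaviour concrete, here is a toy sketch that scans Dockerfile text and reports which CMD would end up as the container default. It is not Docker's parser: `effective_cmd` is a hypothetical helper that ignores line continuations, the JSON-array form, and multi-stage builds.

```python
def effective_cmd(dockerfile_text):
    """Return the arguments of the last CMD instruction, or None if there is none.

    Toy model only: no line continuations, no JSON-array form,
    no multi-stage builds.
    """
    last = None
    for line in dockerfile_text.splitlines():
        parts = line.strip().split(None, 1)
        # Dockerfile instructions are case-insensitive.
        if parts and parts[0].upper() == "CMD":
            last = parts[1] if len(parts) > 1 else ""
    return last


dockerfile = """\
FROM ubuntu:14.04
RUN apt-get install -y python python-dev python-pip
CMD python -m nltk.downloader -d /usr/share/nltk_data popular
CMD python
"""

print(effective_cmd(dockerfile))  # python
```

Under this model the downloader line is never the default command. Even if it were the only CMD, it would run when the container starts rather than when the image is built, and it would take the place of the interactive interpreter.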

