student student - 1 month ago 7
Python Question

Why do I get "string indices must be integers, not str" when working with several files?

I am extracting text from a directory full of PDFs. For this task I am using Python's textract module:

In:

for filename in glob.glob(os.path.join(input_directory, '*.pdf')):
    parsed = process(filename, method='tesseract', language='spa')


Out:

---> 31 get_ipython().magic(u'time transform_files(input_d, out_d)')

/usr/local/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in magic(self, arg_s)
2156 magic_name, _, magic_arg_s = arg_s.partition(' ')
2157 magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2158 return self.run_line_magic(magic_name, magic_arg_s)
2159
2160 #-------------------------------------------------------------------------

/usr/local/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in run_line_magic(self, magic_name, line)
2077 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
2078 with self.builtin_trap:
-> 2079 result = fn(*args,**kwargs)
2080 return result
2081

<decorator-gen-59> in time(self, line, cell, local_ns)

/usr/local/lib/python2.7/site-packages/IPython/core/magic.pyc in <lambda>(f, *a, **k)
186 # but it's overkill for just that one bit of state.
187 def magic_deco(arg):
--> 188 call = lambda f, *a, **k: f(*a, **k)
189
190 if callable(arg):

/usr/local/lib/python2.7/site-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)
1174 if mode=='eval':
1175 st = clock2()
-> 1176 out = eval(code, glob, local_ns)
1177 end = clock2()
1178 else:

<timed eval> in <module>()

<ipython-input-11-ddedab540f65> in transform_files(input_directory, output_directory)
12
13 filename = os.path.basename(filename)
---> 14 texts = parsed['content']
15 all_texts[filename] = texts
16

TypeError: string indices must be integers, not str


I do not know why this is happening since, as the documentation states, filename must be a path, and it actually is a path. I also tried a test with a single file, as follows:

path = '/pathTo/PDF_FILE.pdf/'
text_ocr = textract.process(path, method='tesseract', language = 'spa')


And everything goes well. So my question is: why am I getting TypeError: string indices must be integers, not str, and how do I apply process to filename correctly?

UPDATE

I also tried to place the content into a dict:

parsed = process(filename ,method='tesseract', language = 'spa', encoding='utf8')
parsed = {"content": parsed}
filename = os.path.basename(filename)

Answer

You seem to be chasing a lot of red herrings about the types of variables (e.g. filename) that are not at all related to your exception. At the bottom of your traceback, Python tells you exactly where the exception is happening:

<ipython-input-11-ddedab540f65> in transform_files(input_directory, output_directory)
     12 
     13         filename = os.path.basename(filename)
---> 14         texts = parsed['content']
     15         all_texts[filename] = texts
     16 

TypeError: string indices must be integers, not str

From the exception message, we can infer that parsed is a string, not a dictionary that has a 'content' key. Looking at the earlier lines in your code, the parsed variable comes from the call to process. The documentation for textract that you linked to doesn't give me any reason to expect process to return anything other than a string. Here's the basic example they give, right up at the top of their page:

import textract
text = textract.process('path/to/file.extension')

The variable name text sure suggests that you get a string back!
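
If you want to confirm this yourself, a quick check along these lines may help. It is only a sketch: the string assigned to parsed is a made-up stand-in for whatever process returns, and describe and content_lookup_error are hypothetical helper names, not part of textract.

```python
# Made-up stand-in for the value returned by process(); not real OCR output.
parsed = "texto de ejemplo"


def describe(value):
    """Return the type name of value, a cheap check to run before indexing it."""
    return type(value).__name__


def content_lookup_error(value):
    """Try value['content']; return the TypeError message, or None if the lookup works."""
    try:
        value['content']
    except TypeError as exc:
        return str(exc)
    return None


print(describe(parsed))                            # type of the stand-in value
print(content_lookup_error(parsed))                # message raised by indexing a string with a key
print(content_lookup_error({'content': parsed}))   # a dict with that key raises nothing
```

Printing type(parsed) right after the call to process in your own loop would tell you the same thing about the real return value.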

So I think you just need to rewrite your loop:

for filename in glob.glob(os.path.join(input_directory, '*.pdf')):    
    texts = process(filename, method='tesseract', language='spa')
    filename = os.path.basename(filename)
    all_texts[filename] = texts
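
For completeness, here is one way the whole thing might look as a function. This is a sketch, not your original transform_files: collect_texts and its extract parameter are names I made up, and extract is assumed to be any callable that takes a path and returns the extracted text, which keeps the loop easy to try out without running OCR.

```python
import glob
import os


def collect_texts(input_directory, extract):
    """Map each PDF's base name to whatever extract(path) returns.

    extract is a hypothetical parameter: any callable taking a file path and
    returning the extracted text. With textract it could wrap the call from
    the question, e.g.
        lambda path: textract.process(path, method='tesseract', language='spa')
    """
    all_texts = {}
    pattern = os.path.join(input_directory, '*.pdf')
    for path in sorted(glob.glob(pattern)):
        # Store the returned value directly; there is no 'content' key to look up.
        all_texts[os.path.basename(path)] = extract(path)
    return all_texts
```

Keeping the full path in its own variable, instead of reassigning filename inside the loop, also makes it harder to mix up the path with the base name later on.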