user61629 user61629 - 1 year ago 227
Python Question

Creating a BytesIO object

I am working on a Scrapy spider, trying to extract the text from multiple PDF files in a directory, using slate. I have no interest in saving the actual PDFs to disk, so I've been advised to look into the io.BytesIO subclass at

However, I'm not sure how to pass the PDF body to the BytesIO class and then pass the in-memory PDF to slate to get the text. So far I have:

    class Ove_Spider(BaseSpider):

        name = "ove"

        allowed_domains = ['']
        start_urls = ['myurl/hgh/']

        def parse(self, response):
            for a in response.xpath('//a[@href]/@href'):
                link = a.extract()
                if link.endswith('.pdf'):
                    link = urlparse.urljoin(base_url, link)
                    yield Request(link, callback=self.save_pdf)

        def save_pdf(self, response):

            in_memory_pdf = BytesIO() # Trying to read in PDF which is in response body

I'm getting:
TypeError: integer argument expected, got 'str'

How can I get this working?

Answer

When you call read() on a BytesIO object, you are supposed to pass the number of bytes to read, not the data itself; that is why you get a TypeError. You want to initialize the buffer, not read into it.
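
A minimal sketch of the difference between reading and initializing, using a made-up byte string as a stand-in for response.body (the variable names here are illustrative, not from the question):

```python
from io import BytesIO

# Hypothetical stand-in for response.body; a real PDF would be much longer.
data = b"%PDF-1.4 example bytes"

# read() takes an optional byte count, so handing it the payload fails.
try:
    BytesIO().read(data)
    read_error = None
except TypeError as exc:
    read_error = exc

# Passing the payload to the constructor fills the buffer instead.
buffer = BytesIO(data)
first_four = buffer.read(4)  # the first 4 bytes
rest = buffer.read()         # everything that remains
```

After this runs, read_error holds the TypeError, and the buffer behaves like a file opened in binary mode.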

In Python 2, just initialize BytesIO as:

 in_memory_pdf = BytesIO(response.body)

In Python 3, you cannot use BytesIO with a string because it expects bytes. The error message shows that response.body is of type str: we have to encode it.

 in_memory_pdf = BytesIO(bytes(response.body,'ascii'))

But since a PDF can contain binary data, I suppose that response.body would be bytes, not str. In that case, the simple in_memory_pdf = BytesIO(response.body) works.
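
If you are unsure which type you will receive, the two cases can be folded into one small helper. This is only a sketch: to_buffer is a hypothetical name, and the 'ascii' default simply mirrors the encoding used above, so a str containing non-ASCII characters would raise UnicodeEncodeError.

```python
from io import BytesIO

def to_buffer(body, encoding="ascii"):
    """Hypothetical helper: wrap a response body in an in-memory file."""
    # A str has to be encoded first, because BytesIO only accepts bytes.
    if isinstance(body, str):
        body = body.encode(encoding)
    return BytesIO(body)

# Both calls return a BytesIO positioned at the start of the data.
from_bytes = to_buffer(b"\x00\xff binary payload")
from_text = to_buffer("plain text payload")
```

The resulting object can then be handed to any code that expects a file opened in binary mode.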
