I am working on a Scrapy spider, trying to extract the text from multiple PDF files in a directory, using slate. I have no interest in saving the actual PDF to disk, so I've been advised to look into the io.BytesIO class at https://docs.python.org/2/library/io.html#buffered-streams.
However, I'm not sure how to pass the PDF body to the BytesIO class and then pass the in-memory PDF to slate to get the text. So far I have:
    name = "ove"
    allowed_domains = ['myurl.com']
    start_urls = ['myurl/hgh/']

    def parse(self, response):
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            link = urlparse.urljoin(base_url, link)
            yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        in_memory_pdf = BytesIO()
        in_memory_pdf.read(response.body)  # Trying to read in PDF which is in response body
Running this fails with:
    TypeError: integer argument expected, got 'str'
When you call in_memory_pdf.read(response.body), read() expects the number of bytes to read, not the data itself. You want to initialize the buffer with the data, not read into it.
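To make the difference between the constructor and read() concrete, here is a minimal sketch. The pdf_bytes value is a made-up stand-in for response.body, not real PDF content.

```python
from io import BytesIO

# Hypothetical stand-in for response.body; a real PDF body is much longer.
pdf_bytes = b"%PDF-1.4 example payload"

# Passing the data to the constructor is what fills the buffer.
in_memory_pdf = BytesIO(pdf_bytes)

# read() takes an optional byte count, not the data itself.
header = in_memory_pdf.read(8)  # the first 8 bytes
rest = in_memory_pdf.read()     # everything after the current position

# Rewind so that a parser handed this buffer starts from the beginning.
in_memory_pdf.seek(0)
```

After the seek(0), the buffer behaves like a freshly opened binary file, which is what a library expecting a file object would need.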
In Python 2, just initialize the buffer with the body:
    in_memory_pdf = BytesIO(response.body)
In Python 3, you cannot use BytesIO with a string because it expects bytes. The error message shows that response.body is of type str, so it would have to be encoded first:
    in_memory_pdf = BytesIO(bytes(response.body, 'ascii'))
But since a PDF can contain binary data, I suppose that in Python 3 response.body would be bytes rather than str. In that case, the simple in_memory_pdf = BytesIO(response.body) works.
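Putting the cases above together, the following is a rough sketch rather than a tested spider. The helper names to_pdf_buffer and extract_text are hypothetical, and the parser is passed in as an argument so the snippet runs without slate installed.

```python
from io import BytesIO


def to_pdf_buffer(body):
    """Wrap a response body in an in-memory, file-like object.

    Hypothetical helper: handles both the bytes case and the
    text case discussed above.
    """
    if isinstance(body, bytes):
        # Covers Python 2 str and Python 3 bytes.
        return BytesIO(body)
    # Text body: encode as in the answer; this raises on non-ASCII characters.
    return BytesIO(body.encode("ascii"))


def extract_text(body, parse_pdf):
    """Pass the in-memory PDF to any parser that accepts a file object."""
    return parse_pdf(to_pdf_buffer(body))
```

Inside save_pdf, this could be called as extract_text(response.body, slate.PDF), assuming slate.PDF accepts a file-like object the way it accepts an opened file in slate's README. That usage is not verified here.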