Winklevoss333 - 3 months ago
Python Question

Clean unicode before adding to Dictionary

When parsing a page, I am pulling:

'label_value': [u'\n\t\t\t\t\t\t\t\t\t\tabc123\n\t\t\t\t\t\t\t\t\t']}


My goal is to pull just the relevant "abc123" from that xpath when it writes to the CSV. Currently, because of the "\n\t" characters in the string, nothing is written. Looking around, I found several ways to accomplish this, but I have been unable to place any of them correctly in my own code and get it to run.

I've been experimenting with regex and .translate() to remove the \n\t characters so the value can be written cleanly to a CSV. I didn't have much success with regex because the values are extracted as lists, so I resorted to .translate().

Below, I added my code for defining the xpaths and for the actual page parsing. There is a step in between that kicks off the spider and parses an initial page, but I didn't find it relevant to this question, so I omitted it from the code.

Of the sections below, where would I want to add this code? Would it be when I define the label_value's xpath, in the initial spider, or when I'm actually extracting it to my ResultsDict?

label_value = './/*[@class="lorem-ipsum"]'


instead use...

label_value = './/*[@class="lorem-ipsum"]'.translate(None, '\t\n ')


or...

def parsepage(self, response):
    time.sleep(2)
    self.driver.get(response.url)
    selectable_page = Selector(text=self.driver.page_source)
    ResultsDict = scraperpageitems()
    ResultsDict['label_value'] = selectable_page.xpath(label_value).extract()


instead use...

ResultsDict['label_value'] = selectable_page.xpath(label_value).extract().translate(None, '\t\n ')

Jan
Answer

Aren't you simply looking for strip()?
Consider this example (see it working on ideone.com):

label_value = '''


                                abc123


'''
print(label_value)
print(label_value.strip())


For the record, this did the trick:

[x.strip() for x in selectable_page.xpath(label_value).extract()]
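
As a rough illustration of why the list comprehension works where calling .translate() on the result did not: judging by the output shown in the question, extract() returns a list of strings, so the cleanup has to be applied to each element and not to the list itself. The sketch below uses a hard-coded list in place of the real selector output; the names extracted, cleaned, and first_value are made up for this example and are not part of the original code.

```python
# Stand-in for selectable_page.xpath(label_value).extract():
# a list of unicode strings, copied from the output in the question.
extracted = [u'\n\t\t\t\t\t\t\t\t\t\tabc123\n\t\t\t\t\t\t\t\t\t']

# A list has no .strip() or .translate() method, so clean each element.
# strip() with no arguments removes leading and trailing whitespace,
# which includes the \n and \t characters.
cleaned = [x.strip() for x in extracted]
print(cleaned)

# If only a single value is expected per field, take the first match
# and fall back to an empty string when nothing was extracted.
first_value = cleaned[0] if cleaned else u''
print(first_value)
```

In the spider, the same comprehension would go on the line that assigns ResultsDict['label_value'], since that is where the list is produced. Note that strip() only trims the ends of each string, so whitespace in the middle of a value is left alone.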