Cosimo Curiale - 2 months ago
Python Question

How to save multiple outputs to multiple files, each named after a title taken from an object, in Python?

I'm scraping an RSS feed from a website (http://www.gfrvitale.altervista.org/index.php/autismo-in?format=feed&type=rss).
I have written a script to extract and clean the text of each feed item. My main problem is saving the text of each item to a separate file; I also need to name each file after the title extracted from its item.
My code is:

for item in myFeed["items"]:
    time_structure = item["published_parsed"]
    dt = datetime.fromtimestamp(mktime(time_structure))

    if dt > t:

        link = item["link"]
        response = requests.get(link)
        doc = Document(response.text)
        doc.summary(html_partial=False)

        # extracting text
        h = html2text.HTML2Text()

        # converting
        h.ignore_links = True  # ignore links
        h.skip_internal_links = True  # skip internal links
        h.inline_links = True
        h.ignore_images = True  # ignore image links
        h.ignore_emphasis = True
        h.ignore_anchors = True
        h.ignore_tables = True

        testo = h.handle(doc.summary())  # extracted text

        s = doc.title() + "." + " " + testo  # content to write to the final file

        tit = item["title"]

        # save each file with its proper title
        with codecs.open("testo_%s", %tit "w", encoding="utf-8") as f:
            f.write(s)
            f.close()


The error is:

  File "<ipython-input-57-cd683dec157f>", line 34
    with codecs.open("testo_%s", %tit "w", encoding="utf-8") as f:
                                 ^
SyntaxError: invalid syntax

Answer

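The comma is in the wrong place: it must come after %tit, so that the % operator sits directly between the format string and its argument.

It should be: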

# save each file with its proper title
with codecs.open("testo_%s" % tit, "w", encoding="utf-8") as f:
     f.write(s)  # no f.close() needed: the with block closes the file
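
To make the comma placement concrete, here is a minimal sketch of what the % operator does before the result is passed to open. The title below is a made-up value for illustration, not one taken from the feed.

```python
# Hypothetical title, used only for illustration
tit = "Autismo in famiglia"

# "%" formats the string first; the result is one single string,
# which then becomes the first positional argument of open()
filename = "testo_%s" % tit
print(filename)  # testo_Autismo in famiglia

# str.format is an equivalent way to build the same name
filename_alt = "testo_{}".format(tit)
```

With the comma before %tit, Python sees a stray % at the start of a new argument, which is why it reports a SyntaxError.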

However, if the title contains characters that are not allowed in a file name, opening the file will raise an error (e.g. [Errno 22]).

You can try this code:

...
tit = item["title"]
# Strip a few problematic characters. Not the best way, but it could help for now
# (it would be better to keep a list of characters to remove).
tit = tit.replace(' ', '').replace("'", "").replace('?', '')

with codecs.open("testo_%s" % tit, "w", encoding="utf-8") as f:
     f.write(s)  # the with block closes the file
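
The "list of stop characters" idea mentioned in the comment could look roughly like this. It is only a sketch: safe_filename is a hypothetical helper name, and the character set is an assumption based on what Windows rejects in file names plus the path separators, so adjust it for your platform.

```python
import re

# Assumed set of characters to replace: those Windows rejects in file
# names, the path separators, and ASCII control characters
INVALID_CHARS = r'[<>:"/\\|?*\x00-\x1f]'

def safe_filename(title, replacement="_"):
    """Hypothetical helper: return title with invalid characters replaced."""
    cleaned = re.sub(INVALID_CHARS, replacement, title)
    # Leading/trailing spaces and dots also cause trouble on some systems
    cleaned = cleaned.strip(" .")
    # Fall back to a fixed name if nothing usable is left
    return cleaned or "untitled"
```

You would then build the name with "testo_%s" % safe_filename(tit). Replacing characters rather than deleting them keeps the title readable, at the cost of a few underscores.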

Another way, using nltk:

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')  # keep only runs of word characters
tit = item["title"]
tit = tokenizer.tokenize(tit)  # list of words, punctuation and spaces dropped
tit = ''.join(tit)
with codecs.open("testo_%s" % tit, "w", encoding="utf-8") as f:
     f.write(s)  # the with block closes the file
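
If you would rather not add nltk as a dependency just for this, the standard re module can do the same kind of filtering. This is a sketch with a made-up title; for ordinary titles I would expect it to give the same words as the tokenizer above, but I have not compared them on every input.

```python
import re

# Hypothetical title, used only for illustration
tit = "What's autism, really?"

# Keep only runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", tit)

compact = "".join(words)    # same joining as the nltk version
readable = "_".join(words)  # alternative that keeps word boundaries visible
```

Either string can then be used in "testo_%s" % compact. Note that two different titles can collapse to the same name once punctuation is removed, so one file may overwrite another.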