Konstantin Rusanov - 3 months ago
HTML Question

Python BeautifulSoup replace img src

I'm trying to parse HTML content from a site and change the a href and img src values. The a href is changed successfully, but the img src is not.

The value changes in the variable, but not in the HTML (post_content):

<p><img alt="alt text" src="https://lifehacker.ru/wp-content/uploads/2016/08/15120903sa_d2__1471520915-630x523.jpg" title="Title"/></p>


But I expect the http://site.ru... URL instead:

<p><img alt="alt text" src="http://site.ru/wp-content/uploads/2016/08/15120903sa_d2__1471520915-630x523.jpg" title="Title"/></p>


My code

if "app-store" not in url:
    r = requests.get("https://lifehacker.ru/2016/08/23/kak-vybrat-trimmer/")
    soup = BeautifulSoup(r.content)

    post_content = soup.find("div", {"class", "post-content"})
    for tag in post_content():
        for attribute in ["class", "id", "style", "height", "width", "sizes"]:
            del tag[attribute]

    for a in post_content.find_all('a'):
        a['href'] = a['href'].replace("https://lifehacker.ru", "http://site.ru")

    for img in post_content.find_all('img'):
        img_urls = img['src']
        if "https:" not in img_urls:
            img_urls = "http:{}".format(img_urls)
        thumb_url = img_urls.split('/')
        urllib.urlretrieve(img_urls, "/Users/kr/PycharmProjects/education_py/{}/{}".format(folder_name, thumb_url[-1]))

        file_url = "/Users/kr/PycharmProjects/education_py/{}/{}".format(folder_name, thumb_url[-1])
        data = {
            'name': '{}'.format(thumb_url[-1]),
            'type': 'image/jpeg',
        }

        with open(file_url, 'rb') as img:
            data['bits'] = xmlrpc_client.Binary(img.read())

        response = client.call(media.UploadFile(data))

        attachment_url = response['url']

        img_urls = img_urls.replace(img_urls, attachment_url)

    [s.extract() for s in post_content('script')]
    post_content_insert = bleach.clean(post_content)
    post_content_insert = post_content_insert.replace('&lt;', '<')
    post_content_insert = post_content_insert.replace('&gt;', '>')

    print post_content_insert

Answer

Looks like you're never assigning img_urls back to img['src']. Try doing that at the end of the block.

        img_urls = img_urls.replace(img_urls, attachment_url)
        img['src'] = img_urls
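To see why the write-back is needed: img['src'] hands you a string, and rebinding the local name img_urls afterwards never touches the tag. Here is a minimal sketch of that behaviour. As an assumption, a plain dict stands in for the BeautifulSoup tag so the snippet runs without bs4 (a tag supports the same img['src'] read and write syntax), and both URLs are made up for illustration.

```python
# A plain dict stands in for the <img> tag (assumption, for illustration only).
img = {"src": "https://lifehacker.ru/wp-content/uploads/2016/08/photo.jpg"}

# Hypothetical URL, standing in for response['url'] from the upload call.
attachment_url = "http://site.ru/wp-content/uploads/2016/08/photo.jpg"

img_urls = img["src"]        # img_urls now refers to the src string
img_urls = attachment_url    # rebinds the local name only; the tag is untouched
src_before = img["src"]      # still the lifehacker.ru URL

img["src"] = img_urls        # writes the new value back into the tag
src_after = img["src"]       # now the site.ru URL
```

Since img_urls.replace(img_urls, attachment_url) replaces the whole string, it gives the same result as attachment_url itself, so img['src'] = attachment_url should work just as well.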

... But first, you need to change your with statement so that it uses a name other than img for the file object. Right now the file object shadows the DOM element, so you can no longer access the tag through img.

        with open(file_url, 'rb') as some_file:
            data['bits'] = xmlrpc_client.Binary(some_file.read())
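The shadowing can be reproduced in isolation. This is only a sketch under a few assumptions: an in-memory io.BytesIO replaces open(file_url, 'rb') so no file on disk is needed, a plain dict stands in for the BeautifulSoup tag, and the URLs are made up.

```python
import io

# Stand-in for the <img> tag (assumption, for illustration only).
img = {"src": "https://lifehacker.ru/photo.jpg"}

# Reusing the name: "as img" rebinds img to the file object,
# and the name keeps pointing at it after the block ends.
with io.BytesIO(b"fake image bytes") as img:
    bits = img.read()
shadowed_type = type(img).__name__

# Writing back now fails, because img is no longer the tag.
try:
    img["src"] = "http://site.ru/photo.jpg"
    write_failed = False
except TypeError:
    write_failed = True

# Using a different name for the file keeps the tag reachable.
img = {"src": "https://lifehacker.ru/photo.jpg"}
with io.BytesIO(b"fake image bytes") as some_file:
    bits = some_file.read()
img["src"] = "http://site.ru/photo.jpg"
```

A with block does not create a new scope in Python, which is why the name bound by "as" stays rebound after the block.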