Vincent Hahn - 1 year ago
HTML Question

Scrape Twitter Embedded URL via Python

I am trying to extract URLs embedded in a Call-To-Action button within videos on Twitter. An example:

Twitter Video

When utilising Chrome Inspect, I can relatively easily spot what I'm after:

[Screenshot: Chrome Inspect with the target link highlighted]

Now I'm trying to scrape that highlighted link in Python.
I couldn't find a way to get it from the Twitter API, so I switched to BeautifulSoup. But when I search the page for links, the one I want does not appear:

In[23]: url = ""
In[24]: resp = requests.get(url).content
In[25]: soup = BeautifulSoup(resp, 'lxml')
In[26]: soup.find_all('a')
[<a href="" target="_blank">@unibet</a>,
<a class="download-btn" id="app-download"><img id="whiteLogo"

Any idea what I could do to extract that embedded URL? Any help is much appreciated!

Answer Source

The data is created dynamically via an AJAX request, so it is not in the HTML that requests downloads. You can pull the URL of the XML from the original page's meta tag with name="twitter:amplify:vmap", then request that data, which is XML like:

<?xml version="1.0" encoding="utf-8"?>
<vmap:VMAP xmlns:esi="" xmlns:tw="" xmlns:vmap="" xmlns:xsi="" xsi:noNamespaceSchemaLocation="vast3.xsd">
<tw:content contentId="745543706946658305" ownerId="143820595" stitched="false">
<tw:cta_watch_now url=";affiliateId=52&amp;affId=5211000020&amp;adID=LINC_E2_T9&amp;unibetTarget=/luckisnocoincidence"/>
<tw:videoVariant content_type="application/x-mpegURL" url=";hmac=cb919c7cbe840ad38f8892f430695245991b19022d3359a68f724754171a5874"/>
<tw:videoVariant bit_rate="320000" content_type="video/mp4" url=";hmac=0dc8d5a53cba3228ad6b01d766bf0ad0b8c8504b9cba5db93dd62e379cdad9dc"/>
<tw:videoVariant content_type="application/dash+xml" url=";hmac=74a2b83bdc0020957b7d8603a66ae514425e25c05b546108d7667fe7345afbfb"/>
<tw:videoVariant bit_rate="2176000" content_type="video/mp4" url=";hmac=5207d3904cb34b9fc21a584e2f47247e6e0f9a97cacb0ae5721b5f1fd9167916"/>
<tw:videoVariant bit_rate="832000" content_type="video/mp4" url=";hmac=fd736bdd53b487f2a881b583cd2e39610365d82970a9a0ed6c695c5eb44476b2"/>
<!-- We only support linear start (preroll) for now -->
<vmap:AdBreak breakId="preroll3" breakType="linear" timeOffset="start">
<vmap:AdSource allowMultipleAds="false" followRedirects="false" id="0">

So we just need to pull the url from that:

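from bs4 import BeautifulSoup
import requests

url = ""
resp = requests.get(url).content
soup = BeautifulSoup(resp, 'lxml')

# The meta tag's content attribute holds the URL of the VMAP XML.
# The attribute value is quoted because it contains colons.
xml = soup.select_one('meta[name="twitter:amplify:vmap"]')["content"]
# Fetch that URL and parse the response with the XML parser.
soup2 = BeautifulSoup(requests.get(xml).content, "xml")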
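
The listing above stops before the final lookup. As a rough sketch of that last step, assuming the response looks like the VMAP excerpt shown earlier, the following reads the url attribute of the cta_watch_now element. It uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs offline. The names find_cta_url, local_name and SAMPLE_VMAP are hypothetical, and the namespace URIs and the example.com link are placeholders, not the real values.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a leading '{namespace}' part from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def find_cta_url(vmap_xml):
    """Return the url attribute of the first cta_watch_now element, or None."""
    root = ET.fromstring(vmap_xml)
    for element in root.iter():
        # Match on the local name so the real namespace URI need not be known.
        if local_name(element.tag) == "cta_watch_now":
            return element.get("url")
    return None


# Trimmed-down stand-in for the VMAP document. The namespace URIs and the
# example.com link are placeholders (assumptions), not the real values.
SAMPLE_VMAP = """
<vmap:VMAP xmlns:tw="urn:example:tw" xmlns:vmap="urn:example:vmap">
  <tw:content contentId="745543706946658305" ownerId="143820595" stitched="false">
    <tw:cta_watch_now url="https://example.com/landing?affiliateId=52&amp;adID=LINC_E2_T9"/>
  </tw:content>
</vmap:VMAP>
""".strip()

print(find_cta_url(SAMPLE_VMAP))
```

ET.fromstring also accepts bytes, so the body of the second request could probably be passed to find_cta_url directly. Note that the parser decodes &amp; in the attribute to a plain &.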


That then gives us the link: