Bread Bread - 2 months ago 10
Python Question

Web scraping Weibo follower count using Python

Hi, I am a beginner in Python, and I am trying to get the number of followers for some Weibo accounts. I tried the Weibo API, but I could not get information for these accounts (they are not my accounts and I don't have their credentials). From what I have read, Weibo requires developers to submit their application for review before it grants access to more of the API (including follower counts).

So I decided to try web scraping instead of the Weibo API. However, I don't have much idea how to do that. I know I could use libraries like json and requests to get the content from the website, but I am stuck on extracting the content.



from json import loads
import requests
username_weibo = ['kupono','xxx','etc']

def get_weibo_followers(username):
    output = ['Followers']
    for user in username:
        # .content is bytes, so it has to be decoded (not encoded) to get text
        r = requests.get('https://www.weibo.com/'+user).content
        html = r.decode('utf-8')

    return r





I tried printing the result of the code so far, and what I got was a messy jumble of words and characters. In addition, there are many FM.view calls in the page source, which confuse me.

This is what I have done so far, but I have no idea how to continue. Could anyone help? Thank you.

Answer

Hi, I am a beginner in Python and English :). I was doing the same thing and got it working yesterday. The Weibo pages you see are built by scripts running in your browser. You can extract everything from script calls like "FM.view( ...." with the re library.
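As a rough illustration of that idea, here is a minimal sketch that pulls the FM.view payloads out of a page with re and parses them with json. The sample page, the "domid" values, and the assumption that each payload is a JSON object with an "html" field are hypothetical and only for illustration; the real page source may be laid out differently.

```python
import json
import re

# Hypothetical page source for illustration only; real Weibo pages may differ.
SAMPLE_PAGE = (
    '<html><body>'
    '<script>FM.view({"ns":"pl.header.head.index","domid":"Pl_Header__1",'
    '"html":"<div class=\\"name\\">kupono<\\/div>"})</script>'
    '<script>FM.view({"ns":"pl.content.counter","domid":"Pl_Counter__2",'
    '"html":"<strong>42<\\/strong>"})</script>'
    '</body></html>'
)

# Capture whatever sits between "FM.view(" and ")</script>".
FM_VIEW_RE = re.compile(r'<script>FM\.view\((.*?)\)</script>', re.S)


def extract_fm_views(page_source):
    """Return every FM.view(...) payload that parses as JSON."""
    views = []
    for match in FM_VIEW_RE.finditer(page_source):
        try:
            views.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            # Skip payloads that are not valid JSON.
            continue
    return views


views = extract_fm_views(SAMPLE_PAGE)
# Map each block's (assumed) "domid" to its decoded HTML fragment.
fragments = {view["domid"]: view.get("html", "") for view in views}
```

Parsing the payload with json.loads also undoes the escaping inside the "html" string, so each fragment comes out as ordinary HTML that can be searched further.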

After login, you can do this:

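import re
from urllib import parse

import requests

# Assumption: in real use, session is the requests.Session you logged in with
session = requests.Session()

# xxxxxxxxx is the account's ID
response = session.get('http://weibo.com/u/xxxxxxxxx')
# URL-decode the page source
html_raw_data = parse.unquote(response.content.decode())
# Backslashes are escaped twice; remove them to get the raw HTML
html_data = re.sub(r'\\', r'', html_raw_data)
# This pattern has one capture group, so group(1) here is the page_id
follows_fans_articles_data = re.search(r'\[\'page_id\'\]=\'(\d+)',html_data,re.M)
# Original note: group(1) follows number, (2) fans number, (3) articles number.
# That needs a pattern with three capture groups, which is not shown here.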
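
The comment above mentions three groups (follows, fans, articles), which needs a pattern with three capture groups. Here is a minimal sketch of such a pattern. The sample markup (counts wrapped in strong tags, in that order) and the class names are assumptions made for illustration, so check the decoded html_data and adjust the pattern to what is actually there.

```python
import re

# Hypothetical fragment, assumed to be already URL-decoded and unescaped.
SAMPLE_HTML = (
    '<td><strong class="W_f18">120</strong><span class="S_txt2">Following</span></td>'
    '<td><strong class="W_f18">3456</strong><span class="S_txt2">Followers</span></td>'
    '<td><strong class="W_f18">789</strong><span class="S_txt2">Posts</span></td>'
)

# Three capture groups: the first three numbers wrapped in <strong> tags.
COUNTS_RE = re.compile(
    r'<strong[^>]*>(\d+)</strong>.*?'
    r'<strong[^>]*>(\d+)</strong>.*?'
    r'<strong[^>]*>(\d+)</strong>',
    re.S,
)


def parse_counts(html_data):
    """Return follows, fans and articles counts, or None if nothing matches."""
    match = COUNTS_RE.search(html_data)
    if match is None:
        return None
    follows, fans, articles = (int(group) for group in match.groups())
    return {"follows": follows, "fans": fans, "articles": articles}


counts = parse_counts(SAMPLE_HTML)
```

Returning None when nothing matches makes it easy to notice when the page layout is different from what the pattern expects, for example when the session is not logged in.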