Ke Tian Ke Tian - 6 months ago 33
HTML Question

Python to parse html data and store into the postgresql database

a
This is trouble me for two days, I am new one to python, I want to Parse the html data as the following link:http://movie.walkerplus.com/list/2015/12/

and then store the data into the postgresql database named movie_db, and there is table named films which is created by the following command:

CREATE TABLE films (
title varchar(128) NOT NULL,
description varchar(256) NOT NULL,
directors varchar(128)[],
roles varchar(128)[]
);


Some part of html code is shown as following
.....

</div>, <div class="movie">
<h3><a href="/mv59421/">REVIVAL これが日本の総合格闘技だ</a></h3>
<div class="movieInner">
<div class="info">
<div class="mitai">
<a href="javascript:void(0);" onclick="add_content('59421');"><img
alt="見たい映画に登録" height="19"
src="/mypage/images/btn_outregisted_off.gif" style="float:left;"
width="130">
<p>1人</p>
</img></a></div>
<div class="displaynone" id="add_hidden_59421"> =&gt;
<a href="/mypage/view_mypage.cgi?
ctl=registed&amp;content_id=59421">
<span id="add_result_59421"></span></a>
</div>
<p class="clearboth">総合格闘技団体「DEEP」と「パンクラス」の対抗戦を追ったドキュ
メンタリー</p>
<dl class="directorList">
<dt>監督</dt>
<dd>
<a href="/person/306686/" title="">佐伯繁</a>
</dd>
</dl>
<dl class="roleList">
<dt>出演キャスト</dt>
<dd>
<a href="/person/306691/" title="">大塚隆史</a>
<a href="/person/306692/" title="">北田俊亮</a>
<a href="/person/306693/" title="">堀内佑馬</a>
</dd>
</dl>
</div>
</div>
.......
.


I mainly extracted four data:

1)the movie tile is included in this html:

<h3><a href="/mv59421/">REVIVAL これが日本の総合格闘技だ</a></h3>

2) the description about this movie(you should notice,some movies have
this tag, but have not text))

<p class="clearboth">総合格闘技団体「DEEP」と「パンクラス」の対抗戦を追ったド
キュ メンタリー</p>



3) the director (you should notice,some movies have not this tag)

<a href="/person/306686/" title="">佐伯繁</a>

3) the roles ((you should notice,some movies have not this tag, and movie has not just one actor))

<a href="/person/306691/" title="">大塚隆史</a>
<a href="/person/306692/" title="">北田俊亮</a>
<a href="/person/306693/" title="">堀内佑馬</a>


My code is displayed as following, I have parsed data, but I don't know how to insert the data into the database table films. I know simply insert the data: insert table name values (....). When encounter this issue(list or dict data), I don't know how to insert or store the data. Maybe the code is difficult to understand, I will edit again, thanks!

from bs4 import BeautifulSoup
from urllib.request import url open
import pandas as pd
import psycopg2
from sqlalchemy import create_engine


html = urlopen("http://movie.walkerplus.com/list/2015/12/")
bsObj_movie = BeautifulSoup(html, "html.parser")
movies = bsObj_movie.findAll("div", {"class": "movie"})


conn = psycopg2.connect(database="movie_db", user="postgres",
password="1234tk", host="127.0.0.1", port="5432")
print("Opened database successfully")
cur = conn.cursor()
t = []
des = []
dirt = []
r = []

def insert_data():
cur.execute("INSERT INTO movie VALUES('dq','dd',{'x','d','qw'})")
conn.commit()
print("insert data successfully!")
for movie in movies:
movie_title = movie.find({"h3"})
try:
movie_description = movie.findAll("p", {"class": "clearboth"})
except AttributeError as e:
movie_description = None
try:
movie_director = movie.find("dl", {"class":
"directorList"}).find("a")
except AttributeError as e:
movie_director = None
try:
movie_role = movie.find("dl", {"class": "roleList"})
except AttributeError as e:
movie_role = None

for title in movie_title:
t.append(title.get_text())

if movie_description is None:
print("None1")

else:

for description in movie_description:
if description.get_text() is '':
des.append(None)
# print("Des: ", None)
else:
des.append(description.get_text())
# print("desc:", description.get_text())
if movie_director is None:
dirt.append(None)
# print("director: ", None)
else:
for director in movie_director:
dirt.append(director)
# print("director: ", director)
if movie_role is None:
r.append(None)
# print("roles: ", None)
else:
r1=[]
for role in movie_role.findAll('a'):
r1.append(role.get_text())
r.append(r1)
sql = "INSERT INTO films (title, description, directors, roles) VALUES
(%s, %s, %s, %s);"
for obj in zip(t, des, dirt, r):
cur.execute(cur.mogrify(sql, obj))
conn.commit()


There is error:

psycopg2.DataError: malformed array literal: "サム・メンデス"

LINE 1: ...ームズ・ボンドの戦いを描く『007』シリーズ第24作', 'サム・メ...
^
DETAIL: Array value must start with "{" or dimension information.

Answer

I know this error. It means you are trying to insert string values into array columns. You can verify the SQL as below.

sql2 = cur.mogrify(SQL, obj)
print sql2

Your directors and roles fetched from html are list of strings. So after zip function the obj contains dir and roles as strings.

For your case you are trying to insert only 1 row. So there is probably no need to zip.

I am not familiar with this API you used, but can you try to print the values received from html before inserting? I can provide you the exact SQL required.