tumbleweed tumbleweed - 27 days ago 10
Python Question

How to reindex malformed columns retrived from pandas read_html?

I am retrieving some content from a website which has several tables with the same number of columns, with pandas

. When I read a single link that actually has several tables with the same number of columns, pandas effectively read all the tables as one (something like a flat/normalized table). However, I am interested in do the same for a list of links from a website (i.e. a single flat table for several links), so I tried the following:

In:

import multiprocessing
def process(url):
df_url = pd.read_html(url)
df = pd.concat(df_url, ignore_index=False)
return df_url

links = ['link1.com','link2.com','link3.com',...,'linkN.com']

pool = multiprocessing.Pool(processes=6)
df = pool.map(process, links)
df


Nevertheless, I guess I am not specifiying corecctly to
read_html()
which are the columns, so I am getting this malformed list of lists:

Out:

[[ Form Disponibility \
0 290090 01780-500-01) Unavailable - no product available for release.

Relation \
0 API manufacturer discontinued the material.

drawbacks
0 Shortage of an active ingredient. ],
[ Form \
0 5 mEq/mL; 20 mL vial (RMB 0409-6043-01)

Disponibility \
0 Next delivery and estimated recovery TBD

Relation drawbacks
0 : Manufacturing delay Removed ],
[ Form \
0 AGRYLIN® (anagrelide hydrochloride) Dosage For...

Disponibility Relation \
0 Available NaN

drawbacks
0 NaN ,
Form Disponibility \
0 1mg 100 (RMB 00172-5240-60) Product is currently available.
1 0.5 mg 10 (RMB 00172-5241-60) Product is currently available.

Relation drawbacks
0 NaN Removed
1 NaN Removed ],
[ Form \
0 ERWINAZE 10,000 IU lyophilized powder supplied...
1 ERWINAZE 10,000 IU lyophilized powder supplied...

Disponibility \
0 for new rs may be lim...
1 Product for new orders may be lim...

Relation \
0 Please see Dear Healthcare Provider Letter, wh...
1 Please see Dear Healthcare Provider Letter, wh...

drawbacks
0 Removed
1 Removed ],
[ Form \
0 0.4 mg/mL, 1 mL single-dose vial, package of 2...
1 1 mg/mL, 1 mL single-dose vial, package of 25 ...

Disponibility \
0 Available for RMB 00517-0401-25.
1 Available

Relation \
0 American Regent is currently releasing the 0.4...
1 American Regent is currently releasing the 1mg...

drawbacks
0 Demand increase for the drug
1 Removed ,
Form \
0 0.1 mg/mL; 10 mL Luer-Jet Prefilled Syringe (N...

Disponibility Relation \
0 Product available NaN

drawbacks
0 Demand increase for the drug ,
Form \
0 0.1 mg/mL; 10 mL Ansyr syringe (RMB 0409-1630-10)
1 0.05 mg/mL; 5 mL Ansyr syringe (RMB 0409-9630-05)
2 0.1 mg/mL; 5 mL Lifeshield syringe (RMB 0409-4...
3 0.1 mg/mL; 10 mL Lifeshield syringe (RMB 0409-...

Disponibility \
0 Next delivery: Late October. Estimated recover...
1 Next delivery: TBD Estimated recovery: TBD
2 Available
3 Available

Relation \
0 Please check with your wholesaler for availabl...
1 Please check with your wholesaler for availabl...
2 Shortage per Manufacturer: Available
3 Shortage per Manufacturer: Available

drawbacks
0 Removed
1 Removed
2 Removed
3 Removed ]]


So my question which parameter should I move in order to get a flat pandas dataframe from the above nested list?. I tried to
header=0
,
index_col=0
,
match='"columns"'
, none of them worked or do I need to do the flatting when I create the pandas dataframe with
pd.Dataframe()
?. My main objective is to have a pandas dataframe like with this columns:

form, Disponibility, Relation, drawbacks
1
2
...
n

Answer

IIUC you can do it this way:

first you want to return concatenated DF, instead of list of DFs (as read_html returns a list of DFs):

def process(url):
    return pd.concat(pd.read_html(url), ignore_index=False) 

and then concatenate them for all URLs:

df = pd.concat(pool.map(process, links), ignore_index=True)
Comments