chilliq chilliq - 5 days ago 6
Python Question

Extracting URL parameters into Pandas DataFrame

There is a list containing URL adresses with parameters:

http://example.com/?param1=apple&param2=tomato&param3=carrot
http://sample.com/?param1=banana&param3=potato&param4=berry
http://example.org/?param2=apple&param3=tomato&param4=carrot


Each URL may contain any of 4 parameters.

I want to extract URL parameters and add them into Pandas DataFrame. The DataFrame should have a URL column and 4 columns with parameters. If a parameter is not present in the URL, the cell is empty:

URL param1 param2 param3 param4
... apple tomato carrot
... banana potato berry
... apple tomato carrot


I was planning to use python built-in urlparse module, which allows to extract parameters easily:

import urlparse
url = 'http://example.com/?param1=apple&param2=tomato&param3=carrot'
par = urlparse.parse_qs(urlparse.urlparse(url).query)
print par['param1'], par['param2']

Out: ['apple'] ['tomato']


With urlparse I can get the list of parameters in URLs:

import pandas as pd

urls = ['http://example.com/?param1=apple&param2=tomato&param3=carrot',
'http://sample.com/?param1=banana&param3=potato&param4=berry',
'http://example.org/?param2=apple&param3=tomato&param4=carrot']

df = pd.DataFrame(urls, columns=['url'])
params = [urlparse.parse_qs(urlparse.urlparse(url).query) for url in urls]
print params

Out: [{'param1': ['apple'], 'param2': ['tomato'], 'param3': ['carrot']},
{'param1': ['banana'], 'param3': ['potato'], 'param4': ['berry']},
{'param2': ['apple'], 'param3': ['tomato'], 'param4': ['carrot']}]
...


I don't know how to add extracted parameters into the DataFrame. Maybe there is a better way of doing it? The original file is ~1m URLs.

Answer

You can use a dictionary comprehension to extract the data in the parameters per parameter. I'm not sure if you wanted the final values in list form. If not, it would be easy to extract it.

>>> pd.DataFrame({p: [d.get(p) for d in params] 
                  for p in ['param1', 'param2', 'param3', 'param4']})
     param1    param2    param3    param4
0   [apple]  [tomato]  [carrot]      None
1  [banana]      None  [potato]   [berry]
2      None   [apple]  [tomato]  [carrot]

or...

>>> pd.DataFrame({p: [d[p][0] if p in d else None for d in params] 
                  for p in ['param1', 'param2', 'param3', 'param4']})
   param1  param2  param3  param4
0   apple  tomato  carrot    None
1  banana    None  potato   berry
2    None   apple  tomato  carrot
Comments