flybonzai - 7 months ago
Python Question

Is it better to use globals or pass in arguments to map functions?

I'm using pyspark to do some processing of server logs, and I'm quite new to functional programming concepts. I have a lookup table that I'm using in my function to select from a number of options like so:

user_agent_vals = {
    'CanvasAPI': 'api',
    'candroid': 'mobile_app_android',
    'iCanvas': 'mobile_app_ios',
    'CanvasKit': 'mobile_app_ios',
    'Windows NT': 'desktop',
    'MacBook': 'desktop',
    'iPhone': 'mobile',
    'iPod Touch': 'mobile',
    'iPad': 'mobile',
    'iOS': 'mobile',
    'CrOS': 'desktop',
    'Android': 'mobile',
    'Linux': 'desktop',
    'Mac OS': 'desktop',
    'Macintosh': 'desktop'
}

def parse_requests(line):
    """
    Expects an input list, which is then mapped to the correct fieldnames in
    a dict.

    :param line: A list of values.
    :return: A list containing the values for writing to a file.
    """
    values = dict(zip(requests_fieldnames, line))
    print(values)
    values['request_timestamp'] = values['request_timestamp'].split('-')[1]
    found = False
    for key, value in user_agent_vals.items():
        if key in values['user_agent']:
            found = True
            values['user_agent'] = value
    if not found:
        values['user_agent'] = 'other_unknown'
    return [
        values['user_id'],
        values['context_id'],
        values['request_timestamp'],
        values['user_agent']
    ]


I don't want to re-define the dictionary every time I call the function (which will be millions of times), but it seems somehow 'dirty' to just use Python's LEGB lookup to let it find the dictionary in the module namespace. Should I pass in an argument (and if so, how?) to the map function that calls parse_requests, or what would be the best practice way to handle this?

For reference, here is my map call:

parsed_data = course_data.map(parse_requests)

Answer

It is a convention to use all upper case for such global "constants":

USER_AGENT_VALS

For example, with its default settings, pylint only accepts all-upper-case names for module-level variables (functions and classes have their own naming rules).
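
As a minimal sketch of that convention, the lookup table can live at module level under an upper-case name and be read directly inside the function. The helper name classify_user_agent and the shortened table are assumptions for illustration, not part of the original code. Unlike the loop in the question, this version returns on the first match.

```python
# Module-level "constant": built once at import time and never reassigned.
USER_AGENT_VALS = {
    'CanvasAPI': 'api',
    'iPad': 'mobile',
    'Windows NT': 'desktop',
}


def classify_user_agent(user_agent):
    """Hypothetical helper: map a raw user-agent string to a category."""
    # The global is only looked up on each call, not rebuilt.
    for key, value in USER_AGENT_VALS.items():
        if key in user_agent:
            return value  # first match wins (an assumption of this sketch)
    return 'other_unknown'


print(classify_user_agent('Mozilla/5.0 (Windows NT 10.0; Win64; x64)'))
print(classify_user_agent('curl/7.50.1'))
```

The dictionary is created once when the module is imported, so calling the function millions of times does not redefine it.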

Alternatively, you can supply user_agent_vals as a second argument:

def parse_requests(line, user_agent_vals):

Call with:

parse_requests(line, user_agent_vals)
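
Here is a sketch of what the two-argument version might look like. It uses a simplified two-field row instead of the full requests_fieldnames mapping; the row layout and the sample tables are assumptions for illustration.

```python
def parse_requests(line, user_agent_vals):
    """Simplified stand-in: line is assumed to be [user_id, user_agent]."""
    user_id, user_agent = line
    label = 'other_unknown'
    for key, value in user_agent_vals.items():
        if key in user_agent:
            label = value
            break  # stop at the first matching key
    return [user_id, label]


user_agent_vals = {'iPad': 'mobile', 'Linux': 'desktop'}
line = ['user_1', 'Mozilla/5.0 (X11; Linux x86_64)']

result = parse_requests(line, user_agent_vals)
# A different table can be passed in without touching the function.
result_small = parse_requests(line, {'iPad': 'mobile'})
print(result, result_small)
```

Because the table is passed explicitly, the function no longer depends on module state, which makes it easy to exercise with a different table in a test.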

You can "freeze" an argument to a function with functools.partial():

from functools import partial

parse_requests_for_map = partial(parse_requests, user_agent_vals=user_agent_vals)

Now, you can use it with map:

parsed_data = course_data.map(parse_requests_for_map)
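
To see the effect without a Spark cluster, the sketch below applies the same idea with Python's built-in map over a plain list, which stands in for course_data here. The one-field rows and the function name label_agent are assumptions for illustration.

```python
from functools import partial


def label_agent(user_agent, user_agent_vals):
    """Hypothetical one-field version of parse_requests."""
    for key, value in user_agent_vals.items():
        if key in user_agent:
            return value
    return 'other_unknown'


user_agent_vals = {'Android': 'mobile', 'Macintosh': 'desktop'}

# partial() returns a new callable with user_agent_vals already bound,
# so it needs only the single argument that map supplies.
label_for_map = partial(label_agent, user_agent_vals=user_agent_vals)

agents = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X)',
    'Dalvik/2.1 (Android 7.0)',
    'wget',
]
labels = list(map(label_for_map, agents))
print(labels)
```

The dictionary is built once and bound once; each call made by map only passes the existing reference along.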