alecxe - 4 years ago
Python Question

Prevent missing comma between list items bug

The Story:

When a list with string items is defined on multiple lines, it is often easy to forget a comma between list items, like in this example case:

test = [
"item1"
"item2"
]


The list test would now have a single item, "item1item2".

Quite often the problem appears after rearranging the items in a list.

Sample Stack Overflow posts having the issue:



The Question:

Is there a way to, using preferably static code analysis, issue a warning in cases like this to spot the problem as early as possible?

Answer Source

These are only possible approaches, since I'm not really adept at static analysis.

With tokenize:

I recently fiddled around with tokenizing Python code, and I believe it provides the information needed to perform this kind of check once sufficient logic is added. For your given list, the tokens generated with python -m tokenize list1.py are as follows:

python -m tokenize list1.py 

1,0-1,4:    NAME    'test'
1,5-1,6:    OP  '='
1,7-1,8:    OP  '['
1,8-1,9:    NL  '\n'
2,1-2,8:    STRING  '"item1"'
2,8-2,9:    NL  '\n'
3,1-3,8:    STRING  '"item2"'
3,8-3,9:    NL  '\n'
4,0-4,1:    OP  ']'
4,1-4,2:    NEWLINE '\n'
5,0-5,0:    ENDMARKER   ''

This of course is the 'problematic' case where the contents are going to get concatenated. In the case where a , is present, the output slightly changes to reflect this (added only tokens for the list body):

1,7-1,8:    OP  '['
1,8-1,9:    NL  '\n'
2,1-2,8:    STRING  '"item1"'
2,8-2,9:    OP  ','
2,9-2,10:   NL  '\n'
3,1-3,8:    STRING  '"item2"'
3,8-3,9:    NL  '\n'
4,0-4,1:    OP  ']'

Now we have the additional OP ',' token signifying the presence of a second element.

Given this information, we could use the really handy function generate_tokens in the tokenize module. tokenize.generate_tokens() (tokenize.tokenize() in Py3, which expects a readline that returns bytes) takes a single argument, readline, a function that returns the next line of a file-like object (relevant answer). It returns a generator that yields a 5-tuple (a named tuple in Python 3) for each token: the token type, the token string, the (row, column) start and end positions, and the source line on which the token was found.
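To make the readline argument concrete, here is a minimal sketch for Python 3 that feeds a source string to tokenize.generate_tokens() through io.StringIO and collects the STRING tokens. The helper name string_tokens and the variable source are made up for this illustration and are not part of the tokenize module.

```python
import io
import tokenize

# The problematic list from the question, as a source string.
source = 'test = [\n    "item1"\n    "item2"\n]\n'

def string_tokens(code):
    """Return (line number, token text) for every STRING token in code."""
    # generate_tokens() wants a callable that returns one line per call.
    readline = io.StringIO(code).readline
    return [(tok.start[0], tok.string)
            for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.STRING]

print(string_tokens(source))  # [(2, '"item1"'), (3, '"item2"')]
```

Comparing against tokenize.STRING rather than a number keeps the check independent of the Python version.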

Using this information, one could theoretically loop through a file and, whenever an OP ',' is missing between two items inside a list initialization (whose beginning is detected by checking that the tokens NAME, OP '=' and OP '[' exist on the same line), issue a warning for the lines on which it was detected.

The good thing about this approach is that it is pretty straightforward to generalize. To cover all cases where string literal concatenation takes place (namely, inside the 'grouping' operators (), {}, []), you simply check that the token is of type = 51 (or 53 for Python 3) or that any of (, [, { exists on the same line (these are coarse, off-the-top-of-my-head suggestions at the moment).

Now, I'm not really sure how other people go about this sort of problem, but it seems like something you could look into. All the necessary information is offered by tokenize; the logic to detect the problem is the only thing missing.

Implementation Note: These values (for example, for type) do differ between versions and are subject to change, so it is something one should be aware of. One could possibly work around this by only working with the named constants for the tokens, though.
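The idea described in this section could be sketched roughly as follows for Python 3. This is only one possible reading of it, not a finished linter: the function name find_implicit_concat is hypothetical, and the rule it applies (a string literal that follows another string literal from an earlier line, inside brackets, with only newlines or comments in between) is an assumption that will also flag deliberate multi-line concatenation.

```python
import io
import tokenize

OPENERS = {"(", "[", "{"}
CLOSERS = {")", "]", "}"}
# Tokens that can sit between two literals without separating them.
SKIP = (tokenize.NL, tokenize.COMMENT)

def find_implicit_concat(code):
    """Return line numbers where a string literal directly follows a
    string literal from an earlier line, inside (), [] or {}."""
    hits = []
    depth = 0      # how many brackets are currently open
    prev = None    # last token that was not a newline or comment
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type in SKIP:
            continue
        if tok.type == tokenize.OP:
            if tok.string in OPENERS:
                depth += 1
            elif tok.string in CLOSERS:
                depth -= 1
        if (depth > 0
                and tok.type == tokenize.STRING
                and prev is not None
                and prev.type == tokenize.STRING
                and tok.start[0] > prev.end[0]):
            hits.append(tok.start[0])
        prev = tok
    return hits

bad = 'test = [\n    "item1"\n    "item2",\n    "item3"\n]\n'
print(find_implicit_concat(bad))  # [3]
```

Only the named constants tokenize.OP, tokenize.STRING, tokenize.NL and tokenize.COMMENT are used, so the numeric token values never appear in the code.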


With parser and ast:

Another possible solution, which would probably be more tedious, could involve the parser and ast modules. The concatenation of strings is actually performed during the creation of the Abstract Syntax Tree, so you could alternatively detect it there.

I don't really want to dump the full output of the parser and ast methods that I'm going to mention but, just to make sure we're on the same page, I'm going to be using the following list initialization statement:

l_init = """
test = [
    "item1"
    "item2",
    "item3"
]
"""

In order to get the parse tree generated, use p = parser.suite(l_init). After this is done, you can get a view of it with p.tolist() (output is too large to add it). What you notice is that there will be three entries for the three different str objects item1, item2, item3.

On the other hand, when the AST is created with node = ast.parse(l_init) and viewed with ast.dump(node) there are only two entries: one for the concatenated strs item1item2 and one for the other entry item3.

So, this is another possible way to do it, but it is way more tedious. I'm not sure whether line information is available, and you have to deal with two different modules. Just keep it in the back of your mind in case you want to play around with internal objects.
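For what it's worth, the ast half of this can be tried on its own. The sketch below uses only the ast module (the parser module was deprecated in Python 3.9 and removed in 3.10) and reuses the l_init statement from above. The names tree, list_node, values and linenos are made up for the illustration. In CPython the nodes do appear to carry a lineno attribute, which points at the first line of the concatenated literal.

```python
import ast

l_init = '''
test = [
    "item1"
    "item2",
    "item3"
]
'''

tree = ast.parse(l_init)
# The first statement is the assignment; its value is the list node.
list_node = tree.body[0].value

# literal_eval accepts a node and turns each element back into a str.
values = [ast.literal_eval(elt) for elt in list_node.elts]
linenos = [elt.lineno for elt in list_node.elts]

print(values)   # ['item1item2', 'item3']
print(linenos)  # [3, 5]
```

The AST alone cannot tell a missing comma from an intended concatenation, so a check built on it would still need the token-level information to decide what to report.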


Note: Of course, I might be completely over-analyzing it and a simpler 'check for white-space or newline' solution as you guys suggested would suffice. :-)
