LouYu LouYu - 4 months ago 8
Python Question

Extracting comments from Python Source Code

I'm trying to write a program that extracts the comments from code that a user enters. I tried to use regex, but found it difficult to write.

Then I found a post here. The answer suggests using tokenize.generate_tokens to analyze the grammar, but the documentation says:

The generate_tokens() generator requires one argument, readline, which must be a callable object which provides the same interface as the readline() method of built-in file objects (see section File Objects).

But a string object does not have a readline method.

Then I found another post here, suggesting the use of StringIO to get a readline method. So I wrote the following code:

import tokenize
import io
import StringIO

def extract(code):
    res = []
    comment = None
    stringio = StringIO.StringIO(code)
    for toktype, tokval, begin, end, line in tokenize.generate_tokens(stringio):
        # print(toknum,tokval)
        if toktype != tokenize.COMMENT:
            res.append((toktype, tokval))
        else:
            print tokenize.untokenize(toktype)
    return tokenize.untokenize(res)

And entered the following code:
extract('a = 1+2#A Comment')

But got:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "ext.py", line 10, in extract
    for toktype, tokval, begin, end, line in tokenize.generate_tokens(stringio):
  File "C:\Python27\lib\tokenize.py", line 294, in generate_tokens
    line = readline()
AttributeError: StringIO instance has no __call__ method

I know I can write a new class, but is there any better solution?

Jim Jim

Answer for function extract (Update):

You've created an object with StringIO that provides the interface, but you haven't passed that interface (readline) to tokenize.generate_tokens; instead, you passed the whole object (stringio).

Additionally, your else clause is going to raise a TypeError, because untokenize expects an iterable of tokens as input, not a bare token type. With the following changes, your function works fine:

import StringIO
import tokenize

def extract(code):
    res = []
    comment = None
    stringio = StringIO.StringIO(code)
    # pass in stringio.readline to generate_tokens
    for toktype, tokval, begin, end, line in tokenize.generate_tokens(stringio.readline):
        if toktype != tokenize.COMMENT:
            res.append((toktype, tokval))
        else:
            # wrap the (toktype, tokval) tuple in a list
            print tokenize.untokenize([(toktype, tokval)])
    return tokenize.untokenize(res)

Supplied with input of the form expr = extract('a=1+2#A comment') the function will print out the comment and retain the expression in expr:

In [10]: expr = extract('a=1+2#A comment')
#A comment

In [11]: expr
Out[11]: 'a =1 +2 '

Furthermore, as I mention later, in Python 3 StringIO lives in the io module (io.StringIO), so the separate StringIO import is thankfully not required there.
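For reference, here is a minimal Python 3 sketch of the same idea, assuming io.StringIO stands in for StringIO.StringIO. The name split_comments is hypothetical, and unlike the function above it collects the comments in a list instead of printing them:

```python
import io
import tokenize


def split_comments(code):
    """Return (code_without_comments, comments) for a source string."""
    kept = []
    comments = []
    # generate_tokens wants the readline method, not the StringIO object
    readline = io.StringIO(code).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.COMMENT:
            comments.append(tok.string)
        else:
            kept.append((tok.type, tok.string))
    # With (type, string) pairs, untokenize rebuilds the code with its own spacing
    return tokenize.untokenize(kept), comments


stripped, found = split_comments('a = 1+2#A comment\n')
print(found)
print(repr(stripped))
```

As in the Python 2 session above, the spacing of the rebuilt code may differ from the input, because only the token types and strings are kept.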

Answer for more general cases (extracting from modules, functions):


The documentation specifies that one needs to provide a callable which exposes the same interface as the readline() method of built-in file objects. This is a hint: create an object that provides that method.

In the case of a module, we can simply open the module's source as a normal file and pass in its readline method. This is the key: the argument you pass is the bound method readline itself, not the file object and not the result of calling readline().
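The distinction can be checked directly. This is a small sketch in Python 3 syntax (so io.StringIO rather than StringIO.StringIO); the variable names are only illustrative:

```python
import io
import tokenize

# A file-like object wrapping the source text
buffer = io.StringIO('a = 1+2#A Comment\n')

# The object itself cannot be called, but its bound readline method can
object_is_callable = callable(buffer)
method_is_callable = callable(buffer.readline)
print(object_is_callable, method_is_callable)

# Pass the method (no parentheses) so the tokenizer can call it once per line
comment_tokens = [
    tok.string
    for tok in tokenize.generate_tokens(buffer.readline)
    if tok.type == tokenize.COMMENT
]
print(comment_tokens)
```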

Given a small scrpt.py file with:

# My amazing foo function.
def foo():
    """ docstring """
    # I will print
    print "Hello"
    return 0   # Return the value

# Maaaaaaain
if __name__ == "__main__":
    # this is main
    print "Main" 

We will open it as we do all files:

fileObj = open('scrpt.py', 'r')

This file object now has a method called readline (because it is a file object) which we can safely pass to tokenize.generate_tokens and create a generator.

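tokenize.generate_tokens yields one 5-tuple per token (token type, token string, start position, end position and the source line); in Python 3 these are named tuples. Python 3 still provides generate_tokens for a readline that returns str, while tokenize.tokenize there expects a readline that returns bytes. Here's a small demo: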

for toktype, tok, start, end, line in tokenize.generate_tokens(fileObj.readline):
    # we can also use token.tok_name[toktype] (from the token module)
    # instead of hard-coding 'COMMENT'
    if toktype == tokenize.COMMENT:
        print 'COMMENT' + " " + tok

Notice how we pass the fileObj.readline method to it. This will now print:

COMMENT # My amazing foo function.
COMMENT # I will print
COMMENT # Return the value
COMMENT # Maaaaaaain
COMMENT # this is main 

So all comments are detected, regardless of position. Docstrings are of course excluded, since they are tokenized as STRING tokens rather than COMMENT tokens.
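The start position in each token tuple also tells you where a comment sits. Below is a sketch in Python 3 syntax, assuming a throwaway copy of part of scrpt.py written to a temporary file; comments_in_file is a hypothetical helper name:

```python
import os
import tempfile
import tokenize

SOURCE = '''# My amazing foo function.
def foo():
    """ docstring """
    # I will print
    print("Hello")
    return 0   # Return the value
'''


def comments_in_file(path):
    """Yield (line_number, comment_text) for each comment in a Python file."""
    # tokenize.open decodes the file using its declared source encoding
    with tokenize.open(path) as handle:
        for tok in tokenize.generate_tokens(handle.readline):
            if tok.type == tokenize.COMMENT:
                # tok.start is a (row, column) pair; rows start at 1
                yield tok.start[0], tok.string


# Write the sample to a temporary file so the sketch is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as tmp:
    tmp.write(SOURCE)
try:
    file_comments = list(comments_in_file(tmp.name))
finally:
    os.remove(tmp.name)

for lineno, text in file_comments:
    print(lineno, text)
```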


You could achieve a similar result without open, although I really can't think of a case that needs it. Nonetheless, I'll present another way of doing it for completeness' sake. In this scenario you'll need two additional modules, inspect and StringIO (io.StringIO in Python 3).

Let's say you have the following function:

def bar():
    # I am bar
    print "I really am bar"
    # bar bar bar baaaar
    # (bar)
    return "Bar"

You need a file-like object which has a readline method to use it with tokenize. Well, you can create a file-like object from a str using StringIO.StringIO, and you can get a str representing the source of the function with inspect.getsource(func). In code:

funcText = inspect.getsource(bar)
funcFile = StringIO.StringIO(funcText)

Now we have a file-like object representing the function which has the wanted readline method. We can just re-use the loop we previously performed replacing fileObj.readline with funcFile.readline. The output we get now is of similar nature:

COMMENT # I am bar
COMMENT # bar bar bar baaaar
COMMENT # (bar)
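Put together, the inspect route might look like the sketch below (Python 3 syntax, hypothetical helper name comments_of). It assumes the function's source file is available on disk, which inspect.getsource needs; textwrap.dedent is used as the target only because its source ships with the standard library:

```python
import inspect
import io
import textwrap
import tokenize


def comments_of(func):
    """Return the comment strings found in the source of func."""
    # Dedent so that methods and nested functions tokenize from column 0
    source = textwrap.dedent(inspect.getsource(func))
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type == tokenize.COMMENT
    ]


# Any function whose source can be located works here
func_comments = comments_of(textwrap.dedent)
for text in func_comments:
    print(text)
```

Functions typed at the interactive prompt or built with exec may have no retrievable source, in which case inspect.getsource raises an error instead.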

As an aside, if you really want to create a custom way of doing this with re, take a look at the source of the tokenize.py module. It defines certain patterns for comments (r'#[^\r\n]*'), names, et cetera, loops through the lines with readline and searches within each line for those patterns. Thankfully, it's not too complex after you look at it for a while :-).
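One caveat if you go the re route: applied to raw text, the comment pattern on its own cannot tell a real comment from a # inside a string literal, which is presumably why tokenize.py combines it with patterns for strings, names and so on. A small Python 3 sketch of the difference (the sample line is made up):

```python
import io
import re
import tokenize

# The comment pattern quoted above, used on its own
COMMENT_RE = re.compile(r'#[^\r\n]*')

line = 'tag = "#not a comment"  # real comment\n'

# The bare regex starts matching at the first '#', which is inside the string
regex_hits = COMMENT_RE.findall(line)

# The tokenizer sees the string literal as one STRING token and skips it
token_hits = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(line).readline)
    if tok.type == tokenize.COMMENT
]

print(regex_hits)
print(token_hits)
```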