GT96 - 2 months ago
Python Question

Text Parser in Python

I have to write code that reads data from a text file with a specific format. It is similar to a comma-separated values (CSV) file that stores tabular data, and I must be able to perform calculations on the data in that file.

Here's the format instruction of that file:

A dataset has to start with a declaration of its name:

@relation name

followed by a list of all the attributes in the dataset

@attribute attribute_name specification

If an attribute is nominal, specification contains a list of the possible attribute values in curly brackets:

@attribute nominal_attribute {first_value, second_value, third_value}

If an attribute is numeric, specification is replaced by the keyword numeric:

@attribute numeric_attribute numeric

After the attribute declarations, the actual data is introduced by a

@data

tag, which is followed by a list of all the instances. The instances are listed in comma-separated format, with a question mark representing a missing value.

Comments are lines starting with % and are ignored.

I must be able to perform calculations on this comma-separated data, and must know which value is associated with which attribute.

Example dataset file:
1: https://drive.google.com/open?id=0By6GDPYLwp2cSkd5M0J0ZjczVW8
2: https://drive.google.com/open?id=0By6GDPYLwp2cejB5SVlhTFdubnM

I have no experience with parsing and very little experience with Python, so I thought I would ask the experts for an easy way to do it.

Thanks

Answer

Here is a simple solution that I came up with:

The idea is to read the file line by line and apply rules depending on the type of line encountered.

As you can see in the sample input, there are broadly 4 types of line you may encounter.

  1. A comment, which starts with '%' -> no action is needed here.

  2. A blank line, i.e. '\n' -> no action is needed here.

  3. A line that starts with @, which is the relation name, an attribute declaration, or the @data marker.

  4. If it is none of these, then it is the data itself.

The code follows simple if-else logic, taking an action at every step based on the above 4 rules.

with open("../Downloads/Reading_Data_Files.txt", "r") as dataFl:
    lines = [line.strip() for line in dataFl]  # drop newlines and surrounding spaces

relationName = None
attribute = []
data = []
for line in lines:
    if not line or line.startswith("%"):  # blank line or comment
        continue
    if line.startswith("@"):
        parts = line.split()
        keyword = parts[0].lower()
        if keyword == "@relation":
            relationName = parts[1]
        elif keyword == "@attribute":
            attribute.append(parts[1])
        # "@data" only marks the start of the data section; nothing to store
    else:
        data.append([value.strip() for value in line.split(",")])

print("Relation Name is : %s" % relationName)
print("Attributes are " + ','.join(attribute))
print(data)
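Since the question asks for calculations, here is a minimal sketch of how a column could be pulled out of the list-of-lists output by looking up the position of its name in attribute. The helper name column_mean and the sample values are made up for illustration, and it assumes the chosen column is numeric, with '?' marking a missing value.

```python
def column_mean(data, attribute, name):
    """Mean of one numeric column in the list-of-lists output, ignoring '?'."""
    idx = attribute.index(name)  # position of the attribute = position in each row
    values = [float(row[idx]) for row in data if row[idx] != "?"]
    return sum(values) / len(values) if values else None

# Hypothetical values, shaped like the parser's output above.
attribute = ["distance", "temperature", "BusArrival"]
data = [["45", "75", "on_time"], ["30", "?", "late"], ["15", "65", "on_time"]]

print(column_mean(data, attribute, "distance"))     # 30.0
print(column_mean(data, attribute, "temperature"))  # 70.0 (the '?' row is skipped)
```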

If you also want to see which value belongs to which attribute, here is essentially the same solution with a minor tweak. The only drawback of the first solution is that its output is a list of lists, which makes it hard to tell which attribute each value belongs to. A better approach is to annotate each data element with the corresponding attribute name, so that each row has the form: {'distance': '45', 'temperature': '75', 'BusArrival': 'on_time', 'Students': '25'}

with open("/Users/sreejithmenon/Downloads/Reading_Data_Files.txt", "r") as dataFl:
    lines = [line.strip() for line in dataFl]  # drop newlines and surrounding spaces

relationName = None
attribute = []
data = []
for line in lines:
    if not line or line.startswith("%"):  # blank line or comment
        continue
    if line.startswith("@"):
        parts = line.split()
        keyword = parts[0].lower()
        if keyword == "@relation":
            relationName = parts[1]
        elif keyword == "@attribute":
            attribute.append(parts[1])
        # "@data" only marks the start of the data section; nothing to store
    else:
        dataLine = [value.strip() for value in line.split(",")]
        dataDict = dict(zip(attribute, dataLine))  # each line of data is now a dictionary
        data.append(dataDict)

print("Relation Name is : %s" % relationName)
print("Attributes are " + ','.join(attribute))
print(data)
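Both versions above keep every value as a string. For calculations, the attribute specification can be used to convert numeric columns and to mark missing values. The following is only a sketch under a few assumptions: the function name parse_dataset and the sample text are hypothetical, '?' is mapped to None, nominal values contain no commas, and dictionaries preserve insertion order (Python 3.7+).

```python
def parse_dataset(text):
    """Parse the format described in the question from a string.

    Returns (relation, attributes, rows). attributes maps each name to
    "numeric" or to a list of nominal values; rows is a list of dicts.
    """
    relation, attributes, rows = None, {}, []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # blank line or comment
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # nominal attribute: keep the list of allowed values
                attributes[name] = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                attributes[name] = spec.lower()
        elif keyword == "@data":
            in_data = True
        elif in_data:
            row = {}
            for (name, spec), value in zip(attributes.items(), line.split(",")):
                value = value.strip()
                if value == "?":
                    row[name] = None          # missing value
                elif spec == "numeric":
                    row[name] = float(value)  # numeric attribute
                else:
                    row[name] = value         # nominal attribute stays a string
            rows.append(row)
    return relation, attributes, rows


# Hypothetical sample in the format from the question.
SAMPLE = """\
% hypothetical sample
@relation bus

@attribute distance numeric
@attribute BusArrival {on_time, late}

@data
45, on_time
?, late
15, on_time
"""

relation, attributes, rows = parse_dataset(SAMPLE)
distances = [r["distance"] for r in rows if r["distance"] is not None]
print(relation, sum(distances) / len(distances))  # bus 30.0
```

Keeping the nominal value lists around also makes it possible to validate each row against the declared values, which this sketch does not do.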

You could use pandas DataFrames for further analysis, slicing, querying, etc. This link should help you get started: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
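As a rough illustration of the pandas route, a list of row dictionaries can be passed straight to the DataFrame constructor. The rows below are made-up values in the same shape as the dictionary output, and the choice of which columns are numeric is an assumption; pd.to_numeric with errors="coerce" turns '?' into NaN, which mean() then skips.

```python
import pandas as pd

# Hypothetical rows, shaped like the list of dictionaries built above.
data = [
    {"distance": "45", "temperature": "75", "BusArrival": "on_time"},
    {"distance": "30", "temperature": "?", "BusArrival": "late"},
    {"distance": "15", "temperature": "65", "BusArrival": "on_time"},
]

df = pd.DataFrame(data)
for col in ["distance", "temperature"]:  # assumed to be the numeric attributes
    df[col] = pd.to_numeric(df[col], errors="coerce")  # "?" becomes NaN

print(df["temperature"].mean())                     # 70.0, NaN is skipped
print(df.groupby("BusArrival")["distance"].mean())  # mean distance per class
```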
