dizzyLife - 21 days ago 8
Python Question

Creating a matrix from CSV file

I've been working with Python for around 2 months now, so I have an OK understanding of it.

My goal is to create a matrix using CSV data, then populate that matrix with the data in the 3rd column of that CSV file.

I came up with this code thus far:

import csv

def readcsv(csvfile_name):
    with open(csvfile_name) as csvfile:
        file = csv.reader(csvfile, delimiter=",")

        # remove rubbish data in first few rows
        skiprows = int(input('Number of rows to skip? '))
        for i in range(skiprows):
            _ = next(file)

        # change strings into integers/floats
        for z in file:
            z[:2] = map(int, z[:2])
            z[2:] = map(float, z[2:])
            print(z[:2])
    return


After removing the rubbish data with the above code, the data in the CSV file looks like this:

Input:
1 1 51 9 3
1 2 39 4 4
1 3 40 3 9
1 4 60 2 .
1 5 80 2 .
2 1 40 6 .
2 2 28 4 .
2 3 40 2 .
2 4 39 3 .
3 1 10 . .
3 2 20 . .
3 3 30 . .
3 4 40 . .
. . . . .


The output should look like this:

   1  2  3  4 . .
1 51 39 40 60
2 40 28 40 39
3 10 20 30 40
.
.


The CSV file has a few thousand rows and columns; however, I'm only interested in the first 3 columns. The first and second columns are basically the coordinates for the matrix, and the matrix is populated with the data in the 3rd column.

After lots of trial and error, I realised that numpy was the way to go with matrices. This is what I tried thus far with example data:

left_column = [1, 2, 1, 2, 1, 2, 1, 2]
middle_column = [1, 1, 3, 3, 2, 2, 4, 4]
right_column = [1., 5., 3., 7., 2., 6., 4., 8.]

import numpy as np
# dtype=float: the np.float alias is no longer available in recent NumPy releases
m = np.zeros((max(left_column), max(middle_column)), dtype=float)
for x, y, z in zip(left_column, middle_column, right_column):
    x -= 1  # Because the indices are 1-based
    y -= 1  # Need to be 0-based
    m[x, y] = z
print(m)

#: array([[ 1., 2., 3., 4.],
#: [ 5., 6., 7., 8.]])


However, it is unrealistic for me to specify all of my data in my script to generate the matrix. I tried using generators to pull the data out of my CSV file, but it didn't work well for me.

I learnt as much numpy as I could; however, it appears to require my data to already be in matrix form, which it isn't.

Answer

This is my solution using only the csv library. It works with the index/position in the CSV, using an offset j to keep track of the current row:

import csv

with open('test.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    list_of_list = []
    j=0
    lines = [line for line in spamreader]
    for i in range(len(lines)):
        list_ = []
        if(len(lines)<=i+j):
            break;
        first = lines[i+j][0]
        while(first == lines[i+j][0]):
            list_.append(lines[i+j][2])
            j+=1
            if(len(lines)<=i+j):
                break;
        j-=1
        list_of_list.append(list(map(float,list_)))

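# max() on lists compares them element by element, so take the longest row explicitly
maxlen = max(len(row) for row in list_of_list)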
print("\t"+"\t".join([str(el) for el in range(1,maxlen+1)])+"\n")
for i in range(len(list_of_list)):
    print(str(i+1)+"\t"+"\t".join([str(el) for el in list_of_list[i]])+"\n")

Anyway, the solution posted by Saullo is more elegant.

This is my output:

        1       2       3       4       5

1       51.0    39.0    40.0    60.0    80.0

2       40.0    28.0    40.0    39.0

3       10.0    20.0    30.0    40.0
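
Note that the code above never reads the second column: it assumes the rows are sorted and that every group starts at column 1 with no gaps. A minimal sketch that uses both coordinate columns explicitly could look like the following. The `build_matrix` helper, the `fill` default, and the sample data are hypothetical stand-ins, not part of the original file, and the whole matrix is assumed to fit in memory.

```python
import csv
import io


def build_matrix(rows, fill=0.0):
    """Place each value at (row, col), reading 1-based coordinates from columns 0 and 1."""
    cells = {}
    n_rows = n_cols = 0
    for record in rows:
        r, c, value = int(record[0]), int(record[1]), float(record[2])
        cells[(r, c)] = value
        n_rows = max(n_rows, r)
        n_cols = max(n_cols, c)
    # Coordinates in the file are 1-based; missing cells get the fill value.
    return [[cells.get((r, c), fill) for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)]


# Hypothetical sample standing in for the cleaned CSV file (rows deliberately unsorted).
sample = io.StringIO("1,1,51,9\n1,2,39,4\n2,2,28,4\n2,1,40,6\n")
matrix = build_matrix(csv.reader(sample))
print(matrix)  # [[51.0, 39.0], [40.0, 28.0]]
```

With numpy, the same idea is the `m[x - 1, y - 1] = z` loop from the question, fed from `csv.reader` instead of hard-coded lists.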

I wrote a new version of the code that uses an iterator, because the CSV is too big to fit in memory:

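import csv

with open('test.csv', 'r', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    list_of_list = []
    first = None   # value of the first column for the current group
    list_ = []     # third-column values collected for the current group

    for line in spamreader:
        if line[0] != first:
            # the first column changed: store the finished group and start a new one
            if list_:
                list_of_list.append(list(map(float, list_)))
            list_ = []
            first = line[0]
        list_.append(line[2])

    # store the last group, which has no following row to trigger the flush
    if list_:
        list_of_list.append(list(map(float, list_)))

# take the longest row explicitly; max() on lists compares element by element
maxlen = max(len(row) for row in list_of_list)
print("\t"+"\t".join([str(el) for el in range(1,maxlen+1)])+"\n")
for i in range(len(list_of_list)):
    print(str(i+1)+"\t"+"\t".join([str(el) for el in list_of_list[i]])+"\n")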

Anyway, you will probably need to work on the matrix in chunks (and do swaps), because the data probably won't fit in a 2D array.
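
One possible reading of "chunks", which is an assumption rather than something spelled out above, is to treat each run of rows sharing the first column as a chunk and write the finished matrix row out immediately, so that only one row is held in memory at a time. A minimal sketch using `itertools.groupby` follows. The `write_matrix_rows` helper and the in-memory sample are hypothetical, and the input is assumed to be sorted by the first column.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter


def write_matrix_rows(reader, writer):
    """Write one matrix row per run of input rows sharing column 0; return the row count."""
    written = 0
    for key, group in groupby(reader, key=itemgetter(0)):
        # Only the current group is materialised; earlier rows are already written out.
        writer.writerow([key] + [float(record[2]) for record in group])
        written += 1
    return written


# Hypothetical in-memory stand-ins for the input and output files.
source = io.StringIO("1,1,51\n1,2,39\n2,1,40\n2,2,28\n")
target = io.StringIO()
count = write_matrix_rows(csv.reader(source), csv.writer(target))
print(count)
print(target.getvalue())
```

For real files, `source` and `target` would be replaced by files opened with `newline=''`, as the csv module documentation recommends.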
