Frederik Frederik - 1 month ago 10
C++ Question

Read specific row of data from tsv with 100k+ lines, using standard libraries only

Edit 2 (see below): it is now working, but I guess it is a bit clumsy. Perhaps someone knows how to do this without closing and reopening the file.

I have a data file of the following structure (It is a Design Point Matrix for Simulation Analysis):

+----------+----------------------+-----------------+----------+----------+----------+-----------+-------------+
| ConfigID | k_StrategiesPerAgent | K_StrategySpace | l_Lambda | m_Memory | n_Agents | p_crowded | s_Seed |
+----------+----------------------+-----------------+----------+----------+----------+-----------+-------------+
| 0.0 | 0.0 | 0.0 | 0.5 | 12.0 | 10.0 | 0.2 | 353756906.0 |
| 1.0 | 0.0 | 0.2 | 0.5 | 12.0 | 10.0 | 0.2 | 923055597.0 |
| 2.0 | 0.0 | 0.4 | 0.5 | 12.0 | 10.0 | 0.2 | 616881203.0 |
+----------+----------------------+-----------------+----------+----------+----------+-----------+-------------+


The file "DPM.tsv" is tab-separated and contains no spaces, blank lines, etc., i.e.:

ConfigID k_StrategiesPerAgent K_StrategySpace l_Lambda m_Memory n_Agents p_crowded s_Seed
0.0 0.0 0.0 0.5 12.0 10.0 0.2 353756906.0
1.0 0.0 0.2 0.5 12.0 10.0 0.2 923055597.0
2.0 0.0 0.4 0.5 12.0 10.0 0.2 616881203.0


It may contain over 100k rows and also a large number of columns. The first column is a unique identifier (integer, [0,...]), which I would like to use to access the parameter values associated with it. In general, the numbering of "ConfigID" should be consecutive. I do not know the number of columns in advance.

I am searching for a function that will read the header into a string vector and the data corresponding to the key into a double vector (in the same order). This should be done without any special libraries, as I would not know how to link them. I would also appreciate a very simple structure that works without a class, template, etc. Something like:

vector<string> Labels; //Hold the parameter labels
vector<double> Parameters; //Hold the parameter values
bool readPars(char * FilePath, int ConfigID); //load the label and value
//return [false] on error, else [true]


Small follow-up: I will then want to access the data in the vectors through a loop, passing the values to some macro from the "language" I am using for my simulations (Laboratory for Simulation Development, LSD). Therefore I will also need to turn each string into a C string (const char *). This can be done by appending ".c_str()" to the string, correct? E.g.:

for (int i=0;i<Labels.size();i++){
    const char * lab = Labels[i].c_str();
    double par = Parameters[i];
    LSD_MACRO(lab,par)//do something
}


It is fine that the ConfigID is also part of the Labels[] and Parameters[]
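
A minimal sketch of such a loop, assuming a hypothetical function lsdStandIn in place of the real LSD macro (which is not shown here); passAll, seenLabels and seenValues are likewise names made up for the illustration. It relies only on the fact that c_str() returns a const char * that stays valid as long as the string is neither modified nor destroyed:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the LSD macro: it only records what it was given.
std::vector<std::string> seenLabels;
std::vector<double>      seenValues;

void lsdStandIn(const char* lab, double par)
{
    seenLabels.emplace_back(lab);   // copies the characters while the pointer is valid
    seenValues.push_back(par);
}

// Hypothetical helper: passes every label/value pair to the stand-in.
void passAll(const std::vector<std::string>& Labels,
             const std::vector<double>& Parameters)
{
    // std::size_t avoids the signed/unsigned comparison of `int i < Labels.size()`.
    for (std::size_t i = 0; i < Labels.size() && i < Parameters.size(); ++i) {
        // The pointer stays valid as long as Labels[i] is neither
        // modified nor destroyed, which holds for the duration of this call.
        const char* lab = Labels[i].c_str();
        lsdStandIn(lab, Parameters[i]);
    }
}
```

The pointer should not be stored beyond the lifetime of the vector; if the macro keeps it around, the Labels vector has to outlive that use.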

Given my lack of programming experience, my past way of "solving" this was to write a Python script that hard-codes an array holding ALL the data, which I then included via #include, but there are limitations to such a procedure.

Many thanks! -Frederik




Edit (see also example input edit): So far, following Jonathan Mee's answer and support, I have the following. Edit2: Now working via "reopen".



#include <fstream>
#include <iostream>
#include <iterator>
#include <limits>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

int main()
{
    ifstream filePL("DPM.tsv"); //Read file in tsv format, with header line

    string label;
    getline(filePL, label, '\n');

    //Load the string vector with it
    istringstream gccNeedsThisOnASeparateLine{ label }; //http://www.cplusplus.com/reference/sstream/istringstream/istringstream/
    const vector<string> Labels{ istream_iterator<string>{ gccNeedsThisOnASeparateLine }, istream_iterator<string>{} };
    if (Labels.empty()) {
        cerr << "Could not read the header line" << endl;
        return 1;
    }

    //Close the file and reopen it to start again from the top
    filePL.close();
    ifstream fileP("DPM.tsv"); //Read file in tsv format, with header line

    //Skip the header line, then read the remainder and parse it to a 2d vector
    fileP.ignore(numeric_limits<streamsize>::max(), '\n');
    vector<vector<double>> Parameters;
    vector<double> input(Labels.size());
    size_t col = 0;
    while (fileP >> input[col]) {
        if (++col == input.size()) { //a full row (ConfigID included) has been read
            Parameters.push_back(input);
            col = 0;
        }
    }

    //Test:
    if (!Parameters.empty()) {
        for (size_t i = 0; i < Labels.size(); i++) {
            cout << Labels[i] << "\t" << Parameters[0][i] << endl;
        }
    }

    return 0;
}


Without closing and reopening the file, the output was shifted (due to getline?):
[screenshot from gdb]

The first entry from the parameters, here the config id "0", was skipped.

Now (edit2) it's working.

Answer

Let's talk about your data structures:

  1. The labels must be held separately from the structure containing your doubles; otherwise you'll end up with n copies of the same labels. So we'll put the labels in the container vector<string> Labels.
  2. If your key column is contiguous and starts at 0, simply place your doubles in a vector<vector<double>> Parameters and the row index will serve as the zero-based key. If not, you'll need to use a map<int, vector<double>> Parameters. Since it's simpler, we'll assume that the numbers are contiguous and use vector<vector<double>> Parameters.
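
To make the difference between the two containers concrete, here is a small sketch of a lookup by ConfigID for each of them. The aliases DenseTable and SparseTable and the functions findDense and findSparse are hypothetical names used only for this illustration:

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical aliases, introduced only for this sketch.
using DenseTable  = std::vector<std::vector<double>>;    // row index == ConfigID
using SparseTable = std::map<int, std::vector<double>>;  // explicit ConfigID key

// Contiguous keys starting at 0: the ConfigID is simply the row index.
// Returns nullptr when the ID is out of range.
const std::vector<double>* findDense(const DenseTable& table, int configID)
{
    if (configID < 0 || static_cast<std::size_t>(configID) >= table.size())
        return nullptr;
    return &table[static_cast<std::size_t>(configID)];
}

// Keys with gaps: look the ConfigID up in the map.
// Returns nullptr when the ID is not present.
const std::vector<double>* findSparse(const SparseTable& table, int configID)
{
    const auto it = table.find(configID);
    return it == table.end() ? nullptr : &it->second;
}
```

The vector lookup is a plain index operation, while the map costs a tree search per lookup but tolerates missing or unordered IDs.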

Given that you've successfully opened the file into ifstream fileP you can get your Labels like this:

string label;

getline(fileP, label, '\n');

istringstream labelStream{ label };   // a named stream: istream_iterator cannot bind to a temporary

const vector<string> Labels{ istream_iterator<string>{ labelStream }, istream_iterator<string>{} };

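Though there are fancier methods, we can simply use a nested loop to extract vector<vector<double>> Parameters. Note that the ignore call skips the key column of each row, so every row holds one value fewer than there are labels: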

while(fileP.ignore(std::numeric_limits<std::streamsize>::max(), '\t')) {   // skip the key column
    vector<double> input(size(Labels) - 1);
    size_t i = 0;

    while(i < size(input) && fileP >> input[i]) ++i;
    if(i == size(input)) Parameters.push_back(input);   // drop the incomplete row read at end of file
}
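
If only a single row is needed, as in the readPars declaration from the question, the whole table does not have to be kept in memory and the file does not have to be reopened. The following is a minimal sketch under a few assumptions: the globals Labels and Parameters follow the question's declarations, the path is taken as const char *, the ConfigID stays part of both vectors, and readParsFrom is a hypothetical helper introduced here so the parsing can be exercised on any stream.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> Labels;      // parameter labels from the header line
std::vector<double>      Parameters;  // values of the requested row, same order

// Hypothetical helper: does the actual work on any input stream.
bool readParsFrom(std::istream& in, int ConfigID)
{
    Labels.clear();
    Parameters.clear();

    std::string line;
    if (!std::getline(in, line))                 // header line
        return false;
    if (!line.empty() && line.back() == '\r')    // tolerate Windows line endings
        line.pop_back();

    std::istringstream header{ line };
    for (std::string field; std::getline(header, field, '\t'); )
        Labels.push_back(field);

    // Scan the rows one at a time; only the matching row is kept.
    while (std::getline(in, line)) {
        std::istringstream row{ line };
        double value;
        if (!(row >> value) || value != static_cast<double>(ConfigID))
            continue;                            // not the row we want

        Parameters.push_back(value);             // ConfigID stays part of the row
        while (row >> value)
            Parameters.push_back(value);
        return Parameters.size() == Labels.size();
    }
    return false;                                // ConfigID not found
}

// Returns false on error (file missing, ID not found, column mismatch), else true.
bool readPars(const char* FilePath, int ConfigID)
{
    std::ifstream file{ FilePath };
    return file.is_open() && readParsFrom(file, ConfigID);
}
```

A call such as readPars("DPM.tsv", 2) would then fill Labels and Parameters for that one configuration. Each call rescans the file from the top, so for many lookups in a row the vector<vector<double>> approach above is likely the better fit.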
