peter.murray.rust - 3 months ago
Java Question

indexing and searching CSV table

I am reading medium-sized CSV files (up to 100K rows and 50 columns), and currently storing them as:


headers: List<String>
data: List<List<String>>


I'd like to search this by cell values in a given column, returning {irow, jcol}. I have looked at Guava's HashBasedTable, but it has no concept of a numeric row index. Before writing my own (based on hashtable indexing), I would be grateful to know of lightweight open-source Java table tools that work with CSV structure.

Answer

H2 Database Engine

Why not use a relational database rather than twist your tabular data into non-tabular Java structures?

The H2 Database Engine is written in pure Java. It can be embedded in your Java app.

H2 can directly read in CSV files. See this tutorial on using CSVREAD and CSVWRITE. Or use the Apache Commons CSV library to read in the CSV files.

If you want each row to carry a sequential number, which appears to be what you mean by "row index", add an extra column holding an incrementing integer.

To maximize performance, you can have the database kept entirely in memory rather than persisted to storage.
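Putting those pieces together, here is a rough sketch of loading a CSV file into an in-memory H2 database over plain JDBC. The class name H2CsvSketch, the table name CSV_DATA, the column name IROW, and the index name are all invented for illustration. It assumes the H2 jar is on the classpath, that the CSV file has a header line (CSVREAD takes column names from it), and that ROWNUM() follows file order, which you should verify for your H2 version.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; CSV_DATA, IROW and IDX_LOOKUP are names made up for this sketch. */
final class H2CsvSketch {
    // Named in-memory database; DB_CLOSE_DELAY=-1 keeps it alive until the JVM exits.
    static final String URL = "jdbc:h2:mem:csvdemo;DB_CLOSE_DELAY=-1";

    static Connection open() throws SQLException {
        return DriverManager.getConnection(URL);
    }

    /** Builds the import statement. The path is inlined, so it must come from a trusted source. */
    static String importSql(String csvPath) {
        String quoted = "'" + csvPath.replace("'", "''") + "'";
        return "CREATE TABLE CSV_DATA AS SELECT ROWNUM() AS IROW, * FROM CSVREAD(" + quoted + ")";
    }

    /** Loads the CSV into CSV_DATA and indexes one column (a trusted identifier) for lookups. */
    static void load(Connection conn, String csvPath, String column) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute(importSql(csvPath));
            st.execute("CREATE INDEX IDX_LOOKUP ON CSV_DATA(" + column + ")");
        }
    }

    /** Returns the IROW values of all rows whose cell in the given column equals value. */
    static List<Integer> findRows(Connection conn, String column, String value) throws SQLException {
        String sql = "SELECT IROW FROM CSV_DATA WHERE " + column + " = ? ORDER BY IROW";
        List<Integer> rows = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, value);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.add(rs.getInt(1));
                }
            }
        }
        return rows;
    }
}
```

A caller would open a connection with H2CsvSketch.open(), call load(conn, "data.csv", "NAME") once, and then call findRows(conn, "NAME", "foo") as often as needed. The column names here are placeholders. H2 normally upper-cases unquoted identifiers, so check how your header names come through.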

Concurrency

Your comment mentions that this is a read-write situation, with data being added and deleted. That raises possible concurrency issues around multiple threads updating data structures in memory and synchronizing file writes to storage.

That makes a database solution even more appropriate, as concurrency is a tricky, complicated problem that databases already handle well.
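To give a sense of what handling this by hand involves, here is a minimal sketch that guards the in-memory rows with a ReentrantReadWriteLock. The class GuardedTable and its methods are hypothetical. It covers only in-memory access, not writing changes back to the file, and row indices shift after a deletion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical wrapper: guards the rows so readers never observe a half-applied update. */
final class GuardedTable {
    private final List<List<String>> data = new ArrayList<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    /** Writers take the exclusive lock; returns the index of the new row. */
    int addRow(List<String> row) {
        lock.writeLock().lock();
        try {
            data.add(new ArrayList<>(row)); // defensive copy
            return data.size() - 1;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Removes the row at irow; later rows move up by one. */
    void deleteRow(int irow) {
        lock.writeLock().lock();
        try {
            data.remove(irow);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Readers share the lock; returns indices of rows whose cell at jcol equals value. */
    List<Integer> findRows(int jcol, String value) {
        lock.readLock().lock();
        try {
            List<Integer> hits = new ArrayList<>();
            for (int irow = 0; irow < data.size(); irow++) {
                List<String> row = data.get(irow);
                if (jcol < row.size() && value.equals(row.get(jcol))) {
                    hits.add(irow);
                }
            }
            return hits;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Even this small amount of locking leaves open questions, such as what a row index means to a reader after another thread has deleted a row. A database answers those with transactions.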

Be sure to understand your database’s concurrency strategy. There is no magic, perfect solution to concurrency handling; trade-offs are always required. By default, the H2 database uses multiversion concurrency control (MVCC) as its strategy.