Sander_M Sander_M - 6 months ago 23
Java Question

Reading, comparing and merging multiple files in Java

Given there are some files Customer-1.txt, Customer-2.txt and Customer-3.txt and these files have the following content:

Customer-1.txt

1|1|MARY|SMITH
2|1|PATRICIA|JOHNSON
4|2|BARBARA|JONES


Customer-2.txt

1|1|MARY|SMITH
2|1|PATRICIA|JOHNSON
3|1|LINDA|WILLIAMS
4|2|BARBARA|JONES


Customer-3.txt

2|1|PATRICIA|JOHNSON
3|1|LINDA|WILLIAMS
5|2|ALEXANDER|ANDERSON


These files have a lot of duplicate data, but it is possible that each file contains some data that is unique.

And given that the actual files are sorted, big (a few GB each file) and there are many files...

Then what is the:

a) memory cheapest

b) cpu cheapest

c) fastest

way in Java to create one file out of these three files that will contain all the unique data of each file sorted and concatenated like such:


Customer-final.txt

1|1|MARY|SMITH
2|1|PATRICIA|JOHNSON
3|1|LINDA|WILLIAMS
4|2|BARBARA|JONES
5|2|ALEXANDER|ANDERSON


I looked into the following solution https://github.com/upcrob/spring-batch-sort-merge , but I would like to know if its possible to perhaps do it with the FileInputStream and/or a non spring batch solution.

A solution to use an in memory or real database to join them is not viable for my use case due to the size of the files and the absence of an actual database.

Answer

Since the input files are already sorted, a simple parallel iteration of the files, merging their content, is the memory cheapest, cpu cheapest, and fastest way to do it.

This is a multi-way merge join, i.e. a sort-merge join without the "sort", with elimination of duplicates, similar to a SQL DISTINCT.

Here is a version that can do unlimited number of input files (well, as many as you can have open files anyway). It uses a helper class to stage the next line from each input file, so the leading ID value only has to be parsed once per line.

private static void merge(StringWriter out, BufferedReader ... in) throws IOException {
    CustomerReader[] customerReader = new CustomerReader[in.length];
    for (int i = 0; i < in.length; i++)
        customerReader[i] = new CustomerReader(in[i]);
    merge(out, customerReader);
}

private static void merge(StringWriter out, CustomerReader ... in) throws IOException {
    List<CustomerReader> min = new ArrayList<>(in.length);
    for (;;) {
        min.clear();
        for (CustomerReader reader : in)
            if (reader.hasData()) {
                int cmp = (min.isEmpty() ? 0 : reader.compareTo(min.get(0)));
                if (cmp < 0)
                    min.clear();
                if (cmp <= 0)
                    min.add(reader);
            }
        if (min.isEmpty())
            break; // all done
        // optional: Verify that lines that compared equal by ID are entirely equal
        out.write(min.get(0).getCustomerLine());
        out.write(System.lineSeparator());
        for (CustomerReader reader : min)
            reader.readNext();
    }
}

private static final class CustomerReader implements Comparable<CustomerReader> {
    private BufferedReader in;
    private String         customerLine;
    private int            customerId;
    CustomerReader(BufferedReader in) throws IOException {
        this.in = in;
        readNext();
    }
    void readNext() throws IOException {
        if ((this.customerLine = this.in.readLine()) == null)
            this.customerId = Integer.MAX_VALUE;
        else
            this.customerId = Integer.parseInt(this.customerLine.substring(0, this.customerLine.indexOf('|')));
    }
    boolean hasData() {
        return (this.customerLine != null);
    }
    String getCustomerLine() {
        return this.customerLine;
    }
    @Override
    public int compareTo(CustomerReader that) {
        // Order by customerId only. Inconsistent with equals()
        return Integer.compare(this.customerId, that.customerId);
    }
}

TEST

String file1data = "1|1|MARY|SMITH\n" +
                   "2|1|PATRICIA|JOHNSON\n" +
                   "4|2|BARBARA|JONES\n";
String file2data = "1|1|MARY|SMITH\n" +
                   "2|1|PATRICIA|JOHNSON\n" +
                   "3|1|LINDA|WILLIAMS\n" +
                   "4|2|BARBARA|JONES\n";
String file3data = "2|1|PATRICIA|JOHNSON\n" +
                   "3|1|LINDA|WILLIAMS\n" +
                   "5|2|ALEXANDER|ANDERSON\n";
try (
    BufferedReader in1 = new BufferedReader(new StringReader(file1data));
    BufferedReader in2 = new BufferedReader(new StringReader(file2data));
    BufferedReader in3 = new BufferedReader(new StringReader(file3data));
    StringWriter out = new StringWriter();
) {
    merge(out, in1, in2, in3);
    System.out.print(out);
}

OUTPUT

1|1|MARY|SMITH
2|1|PATRICIA|JOHNSON
3|1|LINDA|WILLIAMS
4|2|BARBARA|JONES
5|2|ALEXANDER|ANDERSON

The code merges purely by ID value, and doesn't verify that rest of line is actually equal. Insert code at the optional comment to check for that, if needed.

Comments