OneSolitaryNoob - 3 months ago
Perl Question

split 10 billion line file into 5,000 files by column value in Perl or Python

I have a 10 billion line tab-delimited file that I want to split into 5,000 sub-files based on the value of its first column. How can I do this efficiently in Perl or Python?

This has been asked here before, but all of the approaches either open a file for every row read or put all of the data in memory.


This program will do as you ask. It expects the input file as a parameter on the command line, and writes output files whose names are taken from the first column of each input record.

It keeps a hash %fh of file handles and a parallel hash %opened of flags that indicate whether a given file has ever been opened before. A file is opened for append if its tag appears in the %opened hash, or for write if it has never been opened before. If the limit on open files is hit, then an arbitrary selection of 1,000 file handles is closed. There is no point in keeping track of when each handle was last used and closing the least recently used ones: if the data in the input file is randomly ordered, then every handle in the hash has the same chance of being the next one needed; alternatively, if the data is already sorted, then none of the closed file handles will ever be used again.

use strict;
use warnings 'all';

my %fh;      # tag => open output file handle
my %opened;  # tag => output file has already been created

while ( <> ) {

    my ($tag) = split;

    if ( not exists $fh{$tag} ) {

        # Append if the file was created earlier, otherwise overwrite
        my $mode = $opened{$tag} ? '>>' : '>';

        while () {

            eval {
                open $fh{$tag}, $mode, $tag
                        or die qq{Unable to open "$tag" for output: $!};
            };

            if ( not $@ ) {
                $opened{$tag} = 1;
                last;
            }

            die $@ unless $@ =~ /Too many open files/;

            # The failed open leaves an undefined entry behind; remove it
            delete $fh{$tag};

            # Close an arbitrary batch of up to 1,000 handles, then retry
            my $n;
            for my $tag ( keys %fh ) {
                my $fh = delete $fh{$tag};
                close $fh or die $!;
                last if ++$n >= 1_000 or keys %fh == 0;
            }
        }
    }

    print { $fh{$tag} } $_;
}

close $_ or die $! for values %fh;
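Since the question also allows Python, here is a rough sketch of the same handle-caching technique in Python. It is an illustration, not a drop-in port: the function name `split_by_first_column`, the `outdir` parameter, and the batch size of 1,000 are my own choices, and it catches `EMFILE` ("too many open files") instead of matching an error message.

```python
import errno
import os

CLOSE_BATCH = 1_000  # handles to evict when the OS limit is hit

def split_by_first_column(lines, outdir='.'):
    """Write each line to a file named after its first tab-separated
    column, caching open handles and evicting a batch on EMFILE."""
    fh = {}        # tag -> open file object
    opened = set() # tags whose output files already exist
    for line in lines:
        tag = line.split('\t', 1)[0].rstrip('\n')
        f = fh.get(tag)
        if f is None:
            # Append if the file was created earlier, else overwrite
            mode = 'a' if tag in opened else 'w'
            path = os.path.join(outdir, tag)
            while True:
                try:
                    f = fh[tag] = open(path, mode)
                    opened.add(tag)
                    break
                except OSError as e:
                    if e.errno != errno.EMFILE:
                        raise
                    # Too many open files: close a batch and retry
                    for victim in list(fh)[:CLOSE_BATCH]:
                        fh.pop(victim).close()
        f.write(line)
    for f in fh.values():
        f.close()
```

To stream a large file through it, call something like `split_by_first_column(open('input.tsv'), 'outdir')`; since it iterates lazily over `lines`, it never holds more than one line in memory.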