Dominic Fagan - 1 month ago
MySQL Question

PHP - Optimising preg_match of thousands of patterns

So I wrote a script to extract data from raw genome files. Here's what a raw genome file looks like:

# rsid chromosome position genotype
rs4477212 1 82154 AA
rs3094315 1 752566 AG
rs3131972 1 752721 AG
rs12124819 1 776546 AA
rs11240777 1 798959 AG
rs6681049 1 800007 CC
rs4970383 1 838555 AC
rs4475691 1 846808 CT
rs7537756 1 854250 AG
rs13302982 1 861808 GG
rs1110052 1 873558 TT
rs2272756 1 882033 GG
rs3748597 1 888659 CT
rs13303106 1 891945 AA
rs28415373 1 893981 CC
rs13303010 1 894573 GG
rs6696281 1 903104 CT
rs28391282 1 904165 GG
rs2340592 1 910935 GG


The raw text file has hundreds of thousands of these rows, but I only need about 10,000 specific ones. I have a list of rsids, and I just need the genotype from each matching line. So I loop through the rsid list and use preg_match to find each line I need:

$rawData = file_get_contents('genome_file.txt');
$rsids = $this->get_snps();

while ($row = $rsids->fetch_assoc()) {

    $searchPattern = "~rs{$row['rsid']}\t(.*?)\t(.*?)\t(.*?)\n~i";

    if (preg_match($searchPattern, $rawData, $matchedGene)) {

        $genotype = $matchedGene[3];

        // Do something with genotype

    }

}


NOTE: I stripped out a lot of code to show just the regexp extraction I'm doing. I'm also inserting each row into a database as I go along. Here's the code with the database work included:

$rawData = file_get_contents('genome_file.txt');
$rsids = $this->get_snps();

$query = "INSERT INTO wp_genomics_results (file_id,snp_id,genotype,reputation,zygosity) VALUES (?,?,?,?,?)";
$stmt = $ngdb->prepare($query);

$stmt->bind_param("iissi", $file_id, $snp_id, $genotype, $reputation, $zygosity);

$ngdb->query("START TRANSACTION");

while ($row = $rsids->fetch_assoc()) {

    $searchPattern = "~rs{$row['rsid']}\t(.*?)\t(.*?)\t(.*?)\n~i";

    if (preg_match($searchPattern, $rawData, $matchedGene)) {

        $genotype = $matchedGene[3];

        $stmt->execute();
        $insert++;

    }

}

$stmt->close();
$ngdb->query("COMMIT");

$snps->free();
$ngdb->close();


So unfortunately my script runs very slowly: 50 iterations take 17 seconds, so you can imagine how long 18,000 iterations will take. I'm looking into ways to optimise this.

Is there a faster way to extract the data I need from this huge text file? What if I explode it into an array of lines and use preg_grep()? Would that be any faster?

Something I tried is combining all 18,000 rsids into a single expression (i.e. (rs123|rs124|rs125)), like this:

$rsids = get_rsids();

$rsid_group = implode('|', $rsids);
$pattern = "~({$rsid_group})\t(.*?)\t(.*?)\t(.*?)\n~i";
preg_match($pattern, $rawData, $matches);


But unfortunately it gave me an error message about exceeding the PCRE expression limit: the needle was way too big. Another thing I tried was adding the S modifier to the expression; I read that it analyses the pattern in order to increase performance. It didn't speed things up at all. Maybe my pattern isn't compatible with it?
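One workaround for the pattern-size limit (not something from my original attempt, and the function name and chunk size here are my own) would be to split the rsid list into chunks and build one alternation per chunk, so each compiled pattern stays small:

```php
// Hypothetical sketch: split the rsids into chunks so each alternation
// stays under PCRE's compiled-pattern size limit. The chunk size of
// 1000 is an arbitrary choice.
function buildChunkedPatterns(array $rsids, int $chunkSize = 1000): array
{
    $patterns = [];
    foreach (array_chunk($rsids, $chunkSize) as $chunk) {
        // Escape each rsid, then join them into one alternation group.
        $group = implode('|', array_map('preg_quote', $chunk));
        $patterns[] = "~({$group})\t(.*?)\t(.*?)\t(.*?)\n~i";
    }
    return $patterns;
}
```

Each pattern would then be run once with preg_match_all against the raw data, rather than one preg_match per rsid.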

So then the second thing I need to optimise is the database inserts. I added a transaction hoping that would speed things up, but it didn't help at all. So I'm thinking maybe I should group the inserts together, so that I insert multiple rows at once rather than inserting them individually.
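By "grouping the inserts" I mean building one multi-row INSERT per batch. A rough sketch of what I have in mind (the helper name is mine; the table and columns are from my code above):

```php
// Hypothetical sketch: build one multi-row INSERT for a batch of rows
// instead of executing the prepared statement once per row.
// Each element of $rows is [file_id, snp_id, genotype, reputation, zygosity].
function buildBatchInsert(array $rows): array
{
    // One (?,?,?,?,?) placeholder group per row.
    $placeholders = implode(',', array_fill(0, count($rows), '(?,?,?,?,?)'));
    $sql = "INSERT INTO wp_genomics_results "
         . "(file_id,snp_id,genotype,reputation,zygosity) VALUES $placeholders";
    // Flatten the rows into one value list for binding.
    $values = array_merge(...$rows);
    return [$sql, $values];
}
```

The returned SQL and flattened values would then be prepared and bound once per batch (say, every 500 rows) instead of once per row.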

Another idea is something I read about: using LOAD DATA INFILE to load rows from a text file. In that case I'd need to generate a text file first. I wonder whether generating a text file would work out faster in this case.
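The idea would be something like this (a sketch under my own assumptions: the helper name and file path are mine, and the server would need to permit LOCAL infile loads):

```php
// Hypothetical sketch: write matched rows to a tab-separated file so
// MySQL can bulk-load them in one statement with LOAD DATA INFILE.
// Each element of $rows is [file_id, snp_id, genotype, reputation, zygosity].
function writeTsv(string $path, array $rows): void
{
    $handle = fopen($path, 'w');
    foreach ($rows as $row) {
        fwrite($handle, implode("\t", $row) . "\n");
    }
    fclose($handle);
}

// Then, in a single round trip to MySQL:
// LOAD DATA LOCAL INFILE '/tmp/genomics.tsv'
// INTO TABLE wp_genomics_results
// FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
// (file_id, snp_id, genotype, reputation, zygosity);
```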

EDIT: It seems like what's taking up most of the time is the regular expressions. Running that part of the program by itself takes a really long time: 10 rows takes 4 seconds.

Answer

This is slow because you're searching the same vast block of text over and over again.

It looks like you have a text file, not a dbms table, containing lines like these:

rs4477212   1   82154   AA
rs3094315   1   752566  AG
rs3131972   1   752721  AG
rs12124819  1   776546  AA

It looks like you have some other data structure containing a list of values like rs4477212. I think that's already in a table in the dbms.

I think you want exact matches for the rsxxxx values, not prefix or partial matches.

I think you want to process many different files of raw data, and extract the same batch of rsxxxx values from each of them.

So, here's what you do, in pseudocode. Don't load the whole raw data file into memory; process it line by line instead.

  1. Read your rows of rsid values from the dbms, just once, and store them in an associative array.
  2. for each file of raw data....
    1. for each line of data in the file...
      1. split the line of data to obtain the rsid. In php, $array = explode("\t", $line, 2); will yield your rsid in $array[0], and do it fast. (Your file looks tab-delimited, given the \t in your patterns.)
      2. Look in your array of rsid values for this value. In php, if ( array_key_exists( $array[0], $rsid_array )) { ... will do this.
      3. If the key does exist, you have a match.
      4. extract the last column from the raw text line ('GC' or whatever)
      5. write it to your dbms.

Notice how this avoids regular expressions, and how it processes your raw data line by line. You only have to touch each line of raw data once. That's good, because your raw data is also your largest quantity of data. It exploits PHP's associative-array feature (a hash-table lookup) to do the matching. All of that will be much faster than your method.
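The steps above can be sketched like this (a sketch, not a drop-in replacement: the function name is mine, the file is assumed tab-delimited as your \t patterns suggest, and the dbms write in step 5 is left as a comment):

```php
// Sketch of the pseudocode: stream the raw file line by line and
// collect the genotype for every rsid in the lookup table.
function extractGenotypes(string $path, array $rsids): array
{
    // Step 1: one associative array, built once, for O(1) exact matches.
    $lookup = array_fill_keys($rsids, true);
    $found  = [];

    $handle = fopen($path, 'r');
    while (($line = fgets($handle)) !== false) {
        if ($line === "\n" || $line[0] === '#') {
            continue; // skip the header comment and blank lines
        }
        // Step 2.1.1: split the tab-delimited line; $parts[0] is the rsid.
        $parts = explode("\t", rtrim($line, "\n"), 4);
        // Step 2.1.2: hash lookup instead of a regex scan.
        if (array_key_exists($parts[0], $lookup)) {
            // Steps 2.1.3-4: a match; the last column is the genotype.
            $found[$parts[0]] = $parts[3];
            // Step 2.1.5: write it to your dbms here.
        }
    }
    fclose($handle);

    return $found; // rsid => genotype
}
```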

To speed up the process of inserting tens of thousands of rows into a table, read this: Optimizing InnoDB Insert Queries.
