Python Question

Profile Millions of Text Files in Parallel Using an SQLite Counter?

A mountain of text files (of types A, B and C) is sitting on my chest, slowly, coldly refusing me desperately needed air. Over the years each type spec has had enhancements such that yesterday's typeA file has many more properties than last year's typeA. To build a parser that can handle the decade-long evolution of these file types it makes sense to inspect all 14 million of them iteratively, calmly, but before dying beneath their crushing weight.

I built a running counter such that every time I see a property (familiar or not) I add 1 to its tally. The `sqlite` tally board looks like this:

[screenshot: the tally_board table, one row per property with a counter column per file type]

In the special event that I see an unfamiliar property, I add it to the tally. On a typeA file that looks like:

[screenshot: tally_board after inserting a new property row for a typeA file]
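
For reference, the table behind those screenshots is roughly this shape (a sketch; the per-type counter columns and the file path are my reading of the images, not confirmed):

import sqlite3

conn = sqlite3.connect('tally.db')  # hypothetical path
conn.execute("""
    CREATE TABLE IF NOT EXISTS tally_board (
        property TEXT,            -- property name as it appears in a file
        typeA INTEGER DEFAULT 0,  -- times seen in typeA files
        typeB INTEGER DEFAULT 0,
        typeC INTEGER DEFAULT 0
    )
""")
conn.commit()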

I've got this system down! But it's slow: about 3M files per 36 hours in a single process. Originally I was using this trick to pass `sqlite` a list of properties needing an increment:

placeholder = '?'  # For SQLite. See DBAPI paramstyle.
placeholders = ', '.join(placeholder for _ in properties)
sql = """UPDATE tally_board
SET %s = %s + 1
WHERE property IN (%s)""" % (type_name, type_name, placeholders)
cursor.execute(sql, properties)


I learned that's a bad idea because

  1. `sqlite` string search is much slower than indexed search

  2. several hundred properties (some 160 characters long) make for really long SQL queries

  3. using `%s` instead of `?` is bad security practice... (not a concern ATM)
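
Point 1 at least has a cheap partial mitigation worth noting: make sure the lookup column is indexed. A sketch (the index name is my own):

# Index the lookup column so WHERE property IN (...) uses an indexed
# search rather than a full scan.
conn.execute("CREATE UNIQUE INDEX IF NOT EXISTS tally_property_idx "
             "ON tally_board (property)")
conn.commit()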



A "fix" was to maintain a script side
property
-
rowid
hash of the tally used in this loop:


  1. Read the file for `new_properties`

  2. Read `tally_board` for `rowid`, `property`

  3. Generate a script-side `client_hash` from 2's read

  4. Write rows to `tally_board` for every `new_property` not in `property` (nothing incremented yet). Update `client_hash` with the new properties

  5. Look up the `rowid` for every row in `new_properties` using the `client_hash`

  6. Write the increment for every `rowid` (now a proxy for `property`) to `tally_board`


Step 6 looks like:

sql = """UPDATE tally_board
SET %s = %s + 1
WHERE rowid IN %s""" % (type_name, type_name, tuple(target_rows))
cur.execute(sql)
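
One caveat with interpolating `tuple(target_rows)`: Python renders a one-element tuple as `(5,)`, and SQLite rejects the trailing comma. Building placeholders avoids that (the column name still has to be interpolated, so `type_name` must come from trusted code):

def increment_rows(conn, type_name, target_rows):
    # Bind the rowids as parameters instead of interpolating a tuple.
    placeholders = ', '.join('?' for _ in target_rows)
    sql = "UPDATE tally_board SET {0} = {0} + 1 WHERE rowid IN ({1})".format(
        type_name, placeholders)
    conn.execute(sql, target_rows)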


The problems with this are

  • It's still slow!

  • It manifests a race condition in parallel processing that introduces duplicates in the `property` column whenever threadA starts step 2 right before threadB completes step 6.



A solution to the race condition is to give steps 2-6 an exclusive lock on the db, though it doesn't look like reads can acquire those (Lock A Read).
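
One way to approximate that in Python's `sqlite3` (a sketch, not something I've verified at scale): open the connection in autocommit mode and issue `BEGIN EXCLUSIVE` yourself, which locks out other readers and writers for the duration:

import sqlite3

# Autocommit mode so BEGIN/COMMIT can be issued manually.
conn = sqlite3.connect('tally.db', isolation_level=None)

def tally_file_locked(conn, path, type_name):
    new_properties = read_file_properties(path)   # step 1, outside the lock
    conn.execute("BEGIN EXCLUSIVE")               # blocks other readers and writers
    try:
        client_hash = build_client_hash(conn)     # steps 2-3
        for prop in new_properties:               # step 4
            if prop not in client_hash:
                cur = conn.execute(
                    "INSERT INTO tally_board (property) VALUES (?)", (prop,))
                client_hash[prop] = cur.lastrowid
        target_rows = [client_hash[p] for p in new_properties]   # step 5
        increment_rows(conn, type_name, target_rows)              # step 6
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise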

Another attempt uses a genuine `UPSERT` to increment preexisting `property` rows AND insert (and increment) new `property` rows in one fell swoop.

There may be luck in something like this, but I'm unsure how to rewrite it to increment the tally.
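
If the SQLite on hand supports native upsert (3.24+, which may not be a given) and `property` has a unique index, my best guess at the increment looks like this (`type_name` is again interpolated as a column name, so it must be trusted):

# Upsert sketch: insert a property with count 1, or bump the existing
# row's counter. Requires SQLite 3.24+ and a UNIQUE index on property.
sql = ("INSERT INTO tally_board (property, {0}) VALUES (?, 1) "
       "ON CONFLICT (property) DO UPDATE SET {0} = {0} + 1").format(type_name)
conn.executemany(sql, ((p,) for p in new_properties))
conn.commit()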

Answer

The following sped things up immensely:

  • Wrote less often to SQLite. Holding most of my intermediate results in memory and then updating the DB with them every 50k files cut the run to about a third of the execution time (35 hours down to 11.5 hours). A sketch of the batching follows the list.
  • Moved the data onto my PC (for some reason my USB 3.0 port was transferring data well below USB 2.0 rates). This cut the run to about a fifth of the execution time (11.5 hours down to 2.5 hours).
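
A sketch of the batching idea, reusing the upsert from above (`files` and `read_file_properties` are hypothetical stand-ins for my file list and parser):

from collections import Counter

BATCH = 50000
pending = Counter()   # (property, type_name) -> pending increment

def flush(conn, pending):
    # One batch of upserts, each adding the accumulated count n.
    for (prop, type_name), n in pending.items():
        conn.execute(
            ("INSERT INTO tally_board (property, {0}) VALUES (?, ?) "
             "ON CONFLICT (property) DO UPDATE SET {0} = {0} + excluded.{0}"
             ).format(type_name), (prop, n))
    conn.commit()

for i, (path, type_name) in enumerate(files, 1):
    for prop in read_file_properties(path):
        pending[(prop, type_name)] += 1
    if i % BATCH == 0:
        flush(conn, pending)
        pending.clear()
flush(conn, pending)  # final partial batch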