Miniman Miniman - 16 days ago 5
Java Question

Most efficient way to determine if a row EXISTS and INSERT into MySQL using java JDBC

I'm looking at trying to query a table in a MySQL database (I have the primary key, which is comprised of two categories, a name and a number but string comparison), such that this table could have anywhere from very few rows to upwards of hundreds of millions. Now, for efficiency, I'm not exactly sure how costly it is to actually do an

INSERT
query but I have a few options as to go about it:


  1. I could query the database to see if the element
    EXISTS
    and then call an
    INSERT
    query if it doesn't.

  2. I could try to brute force
    INSERT
    into the database and if it succeeds or fails, so be it.

  3. I could initially on program execution, create a cache/store, grab the primary key columns and store them in a
    Map<String, List<Integer>>
    and then search the key for if the name exists, then if it does, does the key and value combination in the
    List<Integer>
    exists, if it doesn't, then
    INSERT
    query the database.

  4. ?



Option one really isn't on the table for what I would really implement, just on the list of possible choices. Option two would most likely average better for unique occurrences such that it isn't in the table already. Option three would favour if common occurrences are the case such that a lot are in the cache.

Bearing in mind that option chosen will be iterated over potentially millions of times. Memory usage aside (From option 3), from my calculations it's nothing significant in respect to the capacity available.

Answer

Let the database do the work.

You should do the second method. If you don't want to get a failure, you can use on duplicate key update:

insert into t(pk1, pk2, . . . )
    values ( . . . )
    on duplicate key update set pk1 = values(pk1);

The only purpose of on duplicate key update is to do nothing useful but not return an error.

Why is this the best solution? In a database, a primary key (or columns declared unique) have an index structure. This is efficient for the database to use.

Second, this requires only one round-trip to the database.

Third, there are no race conditions, if you have multiple threads or applications that might be attempting to insert the same record(s).

Fourth, the method with on duplicate key update will work for inserting multiple rows at once. (Without on duplicate key insert, then a multi-value statement would fail if a single row is duplicated.) Combining multiple inserts into a single statement can be another big efficiency.

Your second option is really the right way to go.