CaptainStiggz - 2 months ago
MySQL Question

Doctrine Paginator selects entire table (very slow)?

This is related to a previous question here: Doctrine/Symfony query builder add select on left join

I want to perform a complex join query using Doctrine ORM. I want to select 10 paginated blog posts, left joining each post's single author, the current user's like on the post, and the post's hashtags. My query builder looks like this:

$query = $em->createQueryBuilder()
        ->select('p')
        ->from('Post', 'p')
        ->leftJoin('p.author', 'a')
        ->leftJoin('p.hashtags', 'h')
        ->leftJoin('p.likes', 'l', 'WITH', 'l.post_id = p.id AND l.user_id = 10')
        ->where("p.foo = bar")
        ->addSelect('a AS post_author')
        ->addSelect('l AS post_liked')
        ->addSelect('h AS post_hashtags')
        ->orderBy('p.time', 'DESC')
        ->setFirstResult(0)
        ->setMaxResults(10);

// FAILS - because left joined hashtag collection breaks LIMITS
$result = $query->getQuery()->getResult();

// WORKS - but is extremely slow (count($result) shows over 80,000 rows)
$result = new \Doctrine\ORM\Tools\Pagination\Paginator($query, true);


Strangely, count($result) on the paginator shows the total number of rows in my table (over 80,000) but traversing the $result with foreach outputs 10 Post entities, as expected. Do I need to do some additional configuration to properly limit my paginator?

If this is a limitation of the paginator class what other options do I have? Writing custom paginator code or other paginator libraries?

(bonus): How can I hydrate an array, like $query->getQuery()->getArrayResult();?

EDIT: I left out a stray orderBy in my function. It looks like including both groupBy and orderBy causes the slowdown (using groupBy rather than the paginator). If I omit one or the other, the query is fast. I tried adding an index on the "time" column in my table, but didn't see any improvement.

Things I Tried

// works, but makes the query about 50x slower
$query->groupBy('p.id');
$result = $query->getQuery()->getArrayResult();

// adding an index on the time column (no improvement)
indexes:
    time_idx:
        columns: [ time ]

// the above two solutions don't work because MySQL ORDER BY
// ignores indexes if GROUP BY is used on a different column
// e.g. "GROUP BY p.id ORDER BY p.time" is slow

Answer

At the end of the day, many of the queries used in my application are too complex to make proper use of the Paginator, and I wasn't able to use array hydration mode with the Paginator.

According to the MySQL documentation, ORDER BY cannot be resolved by indexes if GROUP BY is used on a different column. Thus, I ended up using a couple of post-processing queries to populate my base results (ORDERed and LIMITed) with one-to-many relations (like hashtags).

For joins that load a single row from the joined table, I was able to join the desired values in the base ordered query. For example, when loading the "like status" for the current user, only one like from the set of likes needs to be loaded to indicate whether or not the current post has been liked. Similarly, because a post has only one author, joining the author produces a single joined row. For example:

$query = $em->createQueryBuilder()
        ->select('p')              
        ->from('Post', 'p')
        ->leftJoin('p.author', 'a')
        ->leftJoin('p.likes', 'l', 'WITH', 'l.post_id = p.id AND l.user_id = 10')
        ->where("p.foo = bar")
        ->addSelect('a AS post_author')
        ->addSelect('l AS post_liked')
        ->orderBy('p.time', 'DESC')
        ->setFirstResult(0)
        ->setMaxResults(10);

// SUCCEEDS - because joins only join a single author and single like
// no collections are joined, so LIMIT applies only to the posts, as intended
$result = $query->getQuery()->getArrayResult(); 

This produces a result in the form:

[
  [0] => [
    ['id'] => 1,
    ['text'] => 'foo',
    ['author'] => [
       ['id'] => 10,
       ['username'] => 'username',
    ],
    ['likes'] => [
       [0] => [
         ['post_id'] => 1,
         ['user_id'] => 10,
       ]
    ],
  ], 
  [1] => [...],
  ...
  [9] => [...]
]
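
The follow-up query needs the IDs of the posts on the current page. A minimal sketch of collecting them, assuming the array result has the shape shown above with the post `id` at the top level of each row (the sample rows below are made up for illustration):

```php
<?php
// Sample rows standing in for the first query's getArrayResult() output.
// The values are illustrative only, not real data.
$result = [
    ['id' => 1, 'text' => 'foo', 'author' => ['id' => 10, 'username' => 'username'], 'likes' => []],
    ['id' => 7, 'text' => 'bar', 'author' => ['id' => 11, 'username' => 'other'], 'likes' => []],
];

// Collect the post IDs in page order; these feed the IN (:post_ids) clause.
$pids = array_column($result, 'id');

print_r($pids);
```

If the page is empty, `$pids` is an empty array, and it is probably simplest to skip the second query entirely in that case.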

Then, in a second query, I load the hashtags for the posts returned by the previous query, e.g.

// we don't care about orders or limits here, we just want all the hashtags
// $pids holds the IDs of the posts returned by the first query
$query = $em->createQueryBuilder()
        ->select('p, h')
        ->from('Post', 'p')
        ->leftJoin('p.hashtags', 'h')
        ->where("p.id IN (:post_ids)")
        ->setParameter('post_ids', $pids);

Which produces the following:

[
  [0] => [
    ['id'] => 1,
    ['text'] => 'foo',
    ['hashtags'] => [
       [0] => [
         ['id'] => 1,
         ['name'] => '#foo',
       ],
       [1] => [
         ['id'] => 2,
         ['name'] => '#bar',
       ],
       ...
    ],
  ], 
  ...
]

Then I just traverse the results containing hashtags and append them to the original (ordered and limited) results. This approach ends up being much faster (even though it uses more queries), as it avoids GROUP BY and COUNT, fully leverages MySQL indexes, and allows for more complex queries, such as the one I posted here.
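
A minimal sketch of that merge step in plain PHP, assuming both result sets are arrays whose rows carry an `id` key as in the shapes above. The helper name `attachHashtags` and the sample data are hypothetical, not part of Doctrine or of my actual code:

```php
<?php
/**
 * Attach hashtags from the second result set to the ordered, limited posts.
 * Hypothetical helper; assumes every row carries an 'id' key.
 */
function attachHashtags(array $posts, array $postsWithHashtags): array
{
    // Index the hashtags by post ID so each lookup is a single array access.
    $hashtagsByPostId = [];
    foreach ($postsWithHashtags as $row) {
        $hashtagsByPostId[$row['id']] = $row['hashtags'] ?? [];
    }

    // Walk the base results so their ORDER BY / LIMIT ordering is preserved.
    foreach ($posts as $i => $post) {
        $posts[$i]['hashtags'] = $hashtagsByPostId[$post['id']] ?? [];
    }

    return $posts;
}

// Illustrative data: base results in display order, hashtags in arbitrary order.
$posts = [
    ['id' => 7, 'text' => 'bar'],
    ['id' => 1, 'text' => 'foo'],
];
$postsWithHashtags = [
    ['id' => 1, 'text' => 'foo', 'hashtags' => [
        ['id' => 1, 'name' => '#foo'],
        ['id' => 2, 'name' => '#bar'],
    ]],
    ['id' => 7, 'text' => 'bar', 'hashtags' => []],
];

$merged = attachHashtags($posts, $postsWithHashtags);
```

Indexing the second result set by post ID first keeps the merge linear in the number of rows, and posts without hashtags simply end up with an empty `hashtags` array.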