Renato Dinhani Conceição - 4 months ago 22

SQL Question

I want to do some statistical analysis in a table in my database.

First, I need to know the number of rows in that table, if the total count is greather than some predefined constant, I will use the constant value, otherwise, I will use the number of rows.

I can discover using the

`SELECT count(*) FROM table`

What I want to know if is possible at the moment the count pass my constant value, it will stop the counting (and not wait to finish the counting to inform the row count is greater).

I need to know the exact number of rows retrieved if below limit to calculate which is the percentual of each item returned. Otherwise, if is above limit, I want get the any answer soon as possible because I will use the limit value.

This is what I want to do:

`SELECT text,count(*), percentual_calculus()`

FROM token

GROUP BY text

ORDER BY count DESC

Answer

Counting rows in *big* tables is known to be slow in PostgreSQL. To get a precise number it has to do a full count of rows due to the nature of MVCC. There is a way to **speed this up dramatically** if the count does **not** have to be **exact** like it seems to be in your case.

Instead of getting the *exact* count (**slow** with big tables):

```
SELECT count(*) AS exact_count FROM myschema.mytable;
```

You get a close estimate like this (**extremely fast**):

```
SELECT reltuples::bigint AS estimate FROM pg_class where relname='mytable';
```

How close the estimate is depends on whether you run `ANALYZE`

enough. It is usually very close.

See the PostgreSQL Wiki FAQ.

Or the dedicated wiki page for count(*) performance.

The article in the PostgreSQL Wiki ~~is~~ was **a bit sloppy**. It ignored the possibility that there can be multiple tables of the same name in one database - in different schemas. To account for that:

```
SELECT c.reltuples::bigint AS estimate
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = 'mytable'
AND n.nspname = 'myschema'
```

```
SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE oid = 'myschema.mytable'::regclass;
```

Faster, simpler, safer, more elegant. See the manual on Object Identifier Types.

Use `to_regclass('myschema.mytable')`

in Postgres 9.4+ to avoid exceptions for invalid table names:

`TABLESAMPLE SYSTEM (n)`

in Postgres 9.5+```
SELECT 100 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (1);
```

Like @a_horse commented, the newly added clause for the `SELECT`

command might be useful if statistics in `pg_class`

are not current enough for some reason. For example:

- No
`autovacuum`

running. - Immediately after a big
`INSERT`

or`DELETE`

. `TEMPORARY`

tables (which are not covered by`autovacuum`

).

This only looks at a random *n* % (`1`

in the example) selection of blocks and counts rows in it. A bigger sample increases the cost and reduces the error, your pick. Accuracy depends on more factors:

- Distribution of row size. If a given block happens to hold wider than usual rows, the count is lower than usual etc.
- Dead tuples or a
`FILLFACTOR`

occupy space per block. If unevenly distributed across the table, the estimate may be off. - General rounding errors.

In most cases the estimate from `pg_class`

will be faster and more accurate.

First, I need to know the number of rows in that table, if the total count is greater than some predefined constant,

And whether it ...

... is possible at the moment the count pass my constant value, it will stop the counting (and not wait to finish the counting to inform the row count is greater).

**Yes.** You can use a **subquery with LIMIT**:

```
SELECT count(*) FROM (SELECT 1 FROM token LIMIT 500000) t;
```

Postgres *actually stops counting* beyond the given limit, you get an *exact and current* count for up to *n* rows (500000 in the example), and *n* otherwise. Not nearly as fast as the estimate in `pg_class`

, though.