Karthik Murugan Karthik Murugan - 7 months ago 21
SQL Question

Connect by query

I'm storing hierarchical data in a table. When a resource is accessed by its hierarchical path (grantParent/parent/resource), I need to locate the resource using a CONNECT BY query.

Note: SQL commands are exported from EnterpriseDB, but it should work in Oracle too.

Table structure:

CREATE TABLE resource_hierarchy
(
resource_id character varying(100) NOT NULL,
resource_type integer NOT NULL,
resource_name character varying(100),
parent_id character varying(100)
)
WITH (
OIDS=FALSE
);


Data:

INSERT INTO "resource_hierarchy" (resource_id,resource_type,resource_name,parent_id) VALUES ('36d27991', 3, 'areaName', 'a616f392');
INSERT INTO "resource_hierarchy" (resource_id,resource_type,resource_name,parent_id) VALUES ('a616f392', 3, 'townName', 'fcc1ebb7');
INSERT INTO "resource_hierarchy" (resource_id,resource_type,resource_name,parent_id) VALUES ('fcc1ebb7', 2, 'stateName', '8369cc88');
INSERT INTO "resource_hierarchy" (resource_id,resource_type,resource_name,parent_id) VALUES ('8369cc88', 5, 'countryName', null);


Now, when I receive a path like

countryName/stateName/townName/areaName


I'm executing a query like,

select LEVEL,* from resource_hierarchy
WHERE resource_name = (
CASE LEVEL
WHEN 1 THEN 'areaName'
WHEN 2 THEN 'townName'
WHEN 3 THEN 'stateName'
WHEN 4 THEN 'countryName'
ELSE ''
END
)
connect by prior parent_id = resource_id
start with resource_name = 'areaName';


My expected results are:

LEVEL resource_id resource_type resource_name parent_id
-------------------------------------------------------------
1 36d27991 3 areaName a616f392
2 a616f392 3 townName fcc1ebb7
3 fcc1ebb7 2 stateName 8369cc88
4 8369cc88 5 countryName <null>


This query works fine, but I'm not sure if it would run faster, when my table is big like hundreds of thousands of entries.

Can you optimize this query for my requirement?

Edited:

EXPLAIN for the above query: I've defined two indices - one on resource_id (primary key) and another on parent_id

Sort (cost=66.85..66.86 rows=1 width=694)
Sort Key: connectby_cte.siblingssortcol
CTE prior
-> Recursive Union (cost=0.00..65.83 rows=31 width=151)
-> WindowAgg (cost=0.00..3.12 rows=1 width=83)
-> Seq Scan on resource_hierarchy (cost=0.00..3.11 rows=1 width=83)
Filter: ((resource_name)::text = 'areaName'::text)
-> WindowAgg (cost=0.33..6.21 rows=3 width=151)
-> Hash Join (cost=0.33..6.15 rows=3 width=151)
Hash Cond: ((resource_hierarchy_1.resource_id)::text = (prior.parent_id)::text)
Join Filter: connectby_cyclecheck(prior.recursionpath, (resource_hierarchy_1.parent_id)::text)
-> Seq Scan on resource_hierarchy resource_hierarchy_1 (cost=0.00..2.89 rows=89 width=83)
-> Hash (cost=0.20..0.20 rows=10 width=286)
-> WorkTable Scan on prior (cost=0.00..0.20 rows=10 width=286)
-> CTE Scan on prior connectby_cte (cost=0.00..1.01 rows=1 width=694)
Filter: ((resource_name)::text = CASE level WHEN 1 THEN 'areaName'::text WHEN 2 THEN 'townName'::text WHEN 3 THEN 'stateName'::text WHEN 4 THEN 'countryName'::text ELSE ''::text END)

Answer

Disclaimer: My primary experience belongs to Oracle DBMS, so pay attention to details if applying solution to Postgres.


Where clause applied after full hierarchy already built, therefore in original query database engine started retrieving data with specified resource_name at any level and building a full tree for each found record. Filtering occurs only on the next step.
Documentation:

  1. Oracle selects the root row(s) of the hierarchy—those rows that satisfy the START WITH condition.

  2. Oracle selects the child rows of each root row. Each child row must satisfy the condition of the CONNECT BY condition with respect to one of the root rows.

  3. Oracle selects successive generations of child rows. Oracle first selects the children of the rows returned in step 2, and then the children of those children, and so on. Oracle always selects children by evaluating the CONNECT BY condition with respect to a current parent row.

  4. If the query contains a WHERE clause without a join, then Oracle eliminates all rows from the hierarchy that do not satisfy the condition of the WHERE clause. Oracle evaluates this condition for each row individually, rather than removing all the children of a row that does not satisfy the condition.

To optimize this situation query must be changed as follows(hierarchy reversed to more natural top-down order):

select 
  level, rh.* 
from 
  resource_hierarchy rh
start with 
  (resource_name = 'countryName')
  and 
  (parent_id is null) -- roots only
connect by 
  prior resource_id = parent_id
  and          
  -- at each step get only required records
  resource_name = (
    case level 
      when 1 then 'countryName'
      when 2 then 'stateName'
      when 3 then 'townName'
      when 4 then 'areaName'
      else null
    end
  )

Same query, but based on CTE syntax (recursive subquery factoring):

with hierarchy_query(lvl, resource_id) as (
    select
      1               lvl, 
      rh.resource_id  resource_id
    from
      resource_hierarchy rh
    where
     (resource_name = 'countryName') and (parent_id is null) 

  union all

    select
      hq.lvl+1        lvl,
      rh.resource_id  resource_id
    from
      hierarchy_query    hq,
      resource_hierarchy rh
    where
      rh.parent_id = hq.resource_id
      and
      -- at each step get only required records
      resource_name = (
        case (hq.lvl + 1)
          when 2 then 'stateName'
          when 3 then 'townName'
          when 4 then 'areaName'
          else null
        end
      )
)
select
  hq.lvl, rh.*
from
  hierarchy_query    hq,
  resource_hierarchy rh
where
  rh.resource_id = hq.resource_id
order by
  hq.lvl

It's only half of the work because we need to help database engine to locate records by creating appropriate indexes.
Query above contains two search actions:
1. Locate records to start with;
2. Choose records on each next level.

For the first action, we need to index resource_name field and, possible, parent_id field.
For the second action fields parent_id and resource_name must be indexed.

create index X_RESOURCE_HIERARCHY_ROOT on RESOURCE_HIERARCHY (resource_name);
create index X_RESOURCE_HIERARCHY_TREE on RESOURCE_HIERARCHY (parent_id, resource_name);

Maybe creating only X_RESOURCE_HIERARCHY_TREE index is enough. It depends on characteristics of data stored in a table.

P.S. String for each level can be constructed from full path by using substr and instr functions like in this example for Oracle:

with prm as (
  select 
    '/countryName/stateName/townName/areaName/' location_path 
  from dual
)
select 
  substr(location_path,
    instr(location_path,'/',1,level)+1,
    instr(location_path,'/',1,level+1)-instr(location_path,'/',1,level)-1
  )          
from prm connect by level < 7
Comments