Haziq Azmi Haziq Azmi - 7 months ago 44
SQL Question

How to count number of words in delimited string in Oracle SQL

For example i have the table called 'Table1'. and column called 'country'.
I want to count the value of word in string.below is my data for column 'country':

country:
"japan singapore japan chinese chinese chinese"


expected output: in above data we can see the japan appear two time, singapore once and chinese 3 times.i want to count value of word where japan is count as one, singapore as one and chinese as one. hence the ouput will be 3. please help me

ValueOfWord: 3

Answer

Firstly, it is a bad design to store multiple values in a single column as delimited string. You should consider normalizing the data as a permanent solution.

With the denormalized data, you could do it in a single SQL using REGEXP_SUBSTR:

SELECT COUNT(DISTINCT(regexp_substr(country, '[^ ]+', 1, LEVEL))) as "COUNT"
FROM table_name
  CONNECT BY LEVEL <= regexp_count(country, ' ')+1 
/

Demo:

SQL> WITH sample_data AS
  2    ( SELECT 'japan singapore japan chinese chinese chinese' str FROM dual
  3    )
  4  -- end of sample_data mocking real table
  5  SELECT COUNT(DISTINCT(regexp_substr(str, '[^ ]+', 1, LEVEL))) as "COUNT"
  6  FROM sample_data
  7    CONNECT BY LEVEL <= regexp_count(str, ' ')+1
  8  /

     COUNT
----------
         3

See Split single comma delimited string into rows in Oracle to understand how the query works.


UPDATE

For multiple delimited string rows you need to take care of the number of rows formed by the CONNECT BY clause.

See Split comma delimited strings in a table in Oracle for more ways of doing the same task.

Setup

Let's say you have a table with 3 rows like this:

SQL> CREATE TABLE t(country VARCHAR2(200));

Table created.

SQL> INSERT INTO t VALUES('japan singapore japan chinese chinese chinese');

1 row created.

SQL> INSERT INTO t VALUES('singapore indian malaysia');

1 row created.

SQL> INSERT INTO t VALUES('french french french');

1 row created.

SQL> COMMIT;

Commit complete.

SQL> SELECT * FROM t;

COUNTRY
---------------------------------------------------------------------------
japan singapore japan chinese chinese chinese
singapore indian malaysia
french french french
  • Using REGEXP_SUBSTR and REGEXP_COUNT:

We expect the output as 6 since there are 6 unique strings.

SQL> SELECT COUNT(DISTINCT(regexp_substr(t.country, '[^ ]+', 1, lines.column_value))) count
  2    FROM t,
  3      TABLE (CAST (MULTISET
  4      (SELECT LEVEL FROM dual
  5              CONNECT BY LEVEL <= regexp_count(t.country, ' ')+1
  6      ) AS sys.odciNumberList ) ) lines
  7    ORDER BY lines.column_value
  8  /

     COUNT
----------
         6

There are many other methods to achieve the desired output. Let's see how:

  • Using XMLTABLE
SQL> SELECT COUNT(DISTINCT(country)) COUNT
  2  FROM
  3    (SELECT trim(COLUMN_VALUE) country
  4    FROM t,
  5      xmltable(('"'
  6      || REPLACE(country, ' ', '","')
  7      || '"'))
  8    )
  9  /

     COUNT
----------
         6
  • Using MODEL clause
SQL> WITH
  2       model_param AS
  3       (
  4        SELECT country AS orig_str ,
  5               ' '
  6               || country
  7               || ' '                                 AS mod_str ,
  8               1                                      AS start_pos ,
  9              Length(country)                           AS end_pos ,
 10              (LENGTH(country) -
 11              LENGTH(REPLACE(country, ' '))) + 1        AS element_count ,
 12              0                                      AS element_no ,
 13              ROWNUM                                 AS rn
 14        FROM   t )
 15        SELECT COUNT(DISTINCT(Substr(mod_str, start_pos, end_pos-start_pos))) count
 16        FROM (
 17              SELECT *
 18              FROM   model_param
 19              MODEL PARTITION BY (rn, orig_str, mod_str)
 20              DIMENSION BY (element_no)
 21              MEASURES (start_pos, end_pos, element_count)
 22              RULES ITERATE (2000)
 23              UNTIL (ITERATION_NUMBER+1 = element_count[0])
 24            ( start_pos[ITERATION_NUMBER+1] =
 25                      instr(cv(mod_str), ' ', 1, cv(element_no)) + 1,
 26              end_pos[ITERATION_NUMBER+1] =
 27                      instr(cv(mod_str), ' ', 1, cv(element_no) + 1) )
 28            )
 29         WHERE    element_no != 0
 30    ORDER BY      mod_str , element_no
 31   /

     COUNT
----------
         6
Comments