Data … as usual

All things about data by Laurent Leturgez

Oracle 11g, frequency histograms, and character columns …

Recently I worked on a performance issue on two Oracle 11g instances (11.2.0.3 and 11.2.0.4) and hit a bug concerning histograms on character columns. Oracle 11g is now about 7 years old, but it’s a version that is still used in lots of systems … that’s why I decided to write this blog post.

Well, in Oracle 11g, you have two kinds of histograms:

  • Frequency histograms: they are built when the number of distinct values (NDV) in a column is lower than or equal to the number of buckets in the histogram. Frequency histograms are the most accurate kind of histogram, because they record the exact number of occurrences of each value.
  • Height-balanced histograms: they are built when the NDV is greater than the number of buckets in the histogram. Height-balanced histograms are less accurate, but they still give the optimizer precious information about data distribution.
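The decision rule can be sketched in a few lines of Python (a simplification, not Oracle’s actual code; real statistics gathering also involves sampling, and 12c adds further histogram types):

```python
# Simplified sketch of the 11g histogram-type decision rule (not Oracle code):
# a frequency histogram is only possible when every distinct value can get
# its own bucket, i.e. when NDV <= the requested number of buckets.
def histogram_type(ndv: int, requested_buckets: int) -> str:
    """Return the kind of histogram Oracle 11g would build for a column."""
    if ndv <= requested_buckets:
        return "FREQUENCY"        # one bucket per distinct value, exact counts
    return "HEIGHT BALANCED"      # buckets span value ranges, approximate

print(histogram_type(10, 11))     # FREQUENCY (the FOO example below)
print(histogram_type(5000, 254))  # HEIGHT BALANCED
```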

Histograms help the cost-based optimizer (CBO) make the right decision when it has to choose an access path or a join method. (You can find very interesting papers on this topic by Jonathan Lewis or Mohamed Houri.)

The problem I met was related to frequency histograms on character columns (CHAR, VARCHAR2 and NVARCHAR2).

In the dictionary, there are two views that give interesting information about how data is distributed:

  • DBA_TAB_COL_STATISTICS: this view gives information about column statistics (density, NDV, sample_size, etc.) and whether there’s a histogram on the column.
  • DBA_HISTOGRAMS: this view gives information about column histograms and how data values are distributed.

Let’s take an example:

LAURENT @ orcl > create table foo (id number);

Table created.

LAURENT @ orcl > begin
  2  for i in 1..10
  3  loop
  4  insert into foo select i from dual connect by level <=i;
  5  end loop;
  6  end;
  7  /

PL/SQL procedure successfully completed.

LAURENT @ orcl > commit;

Commit complete.

LAURENT @ orcl > select id,count(*) from foo group by id order by 1;

	ID   COUNT(*)
---------- ----------
	 1	    1
	 2	    2
	 3	    3
	 4	    4
	 5	    5
	 6	    6
	 7	    7
	 8	    8
	 9	    9
	10	   10

10 rows selected.

Now we gather statistics with enough buckets to get a frequency histogram (that’s why I chose 11 buckets, one more than the NDV):

LAURENT @ orcl > exec dbms_stats.gather_table_stats(ownname=>USER, tabname=>'FOO', method_opt=>'FOR COLUMNS ID SIZE 11', estimate_percent=>100);

PL/SQL procedure successfully completed.

LAURENT @ orcl > select table_name,column_name,num_distinct,density,num_nulls,sample_size,histogram
 2 from user_tab_col_statistics
 3 where table_name='FOO';

TABLE_NAME COLUMN_NAM NUM_DISTINCT    DENSITY  NUM_NULLS SAMPLE_SIZE HISTOGRAM
---------- ---------- ------------ ---------- ---------- ----------- ---------------
FOO        ID                   10 .009090909          0          55 FREQUENCY

Now let’s have a look at the DBA_HISTOGRAMS view (or the USER or ALL versions, depending on the context):


LAURENT @ orcl > select table_name,column_name,
  2  endpoint_number,endpoint_value,
  3  endpoint_actual_value
  4  from user_histograms
  5  where table_name='FOO';

TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_A
---------- ---------- --------------- -------------- ----------
FOO        ID                       1              1
FOO        ID                       3              2
FOO        ID                       6              3
FOO        ID                      10              4
FOO        ID                      15              5
FOO        ID                      21              6
FOO        ID                      28              7
FOO        ID                      36              8
FOO        ID                      45              9
FOO        ID                      55             10

The ENDPOINT_VALUE column represents the column value. Because it’s a frequency histogram, if we want to know how many rows have the value 7, we take the endpoint_number for that value (28) and subtract the endpoint_number of the preceding endpoint value, 6 (that is, 21) … So there are 28-21=7 rows with the value 7.
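In other words, ENDPOINT_NUMBER is a running total, and the per-value counts are obtained by differencing. A quick Python sketch of that arithmetic, using the FOO histogram above:

```python
# (value, endpoint_number) pairs copied from the USER_HISTOGRAMS output above.
# ENDPOINT_NUMBER is cumulative, so each value's row count is the difference
# with the previous bucket's endpoint_number.
endpoints = [(1, 1), (2, 3), (3, 6), (4, 10), (5, 15),
             (6, 21), (7, 28), (8, 36), (9, 45), (10, 55)]

def frequencies(endpoints):
    """Turn cumulative histogram endpoints into per-value row counts."""
    freq, previous = {}, 0
    for value, endpoint_number in endpoints:
        freq[value] = endpoint_number - previous
        previous = endpoint_number
    return freq

f = frequencies(endpoints)
print(f[7])             # 28 - 21 = 7
print(sum(f.values()))  # 55 rows in total, matching SAMPLE_SIZE
```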

Now that you’ve understood the basics, let’s take another example:


LAURENT @ orcl > begin
  2  for i in 1..10000
  3  loop
  4  case mod(i,10)
  5    when 0 then insert into t values(mod(i,10), concat(lpad('X',30,'X'),'01'));
  6    when 1 then insert into t values(mod(i,10), concat(lpad('X',30,'X'),'02'));
  7    when 2 then insert into t values(mod(i,10), concat(lpad('X',30,'X'),'03'));
  8    when 3 then insert into t values(mod(i,10), concat(lpad('X',30,'X'),'04'));
  9    when 4 then insert into t values(mod(i,10), concat(lpad('X',30,'X'),'05'));
 10    when 5 then insert into t values(mod(i,10), concat(lpad('X',30,'X'),'06'));
 11    when 6 then insert into t values(mod(i,10), concat(lpad('X',30,'X'),'07'));
 12    when 7 then insert into t values(mod(i,10), concat(lpad('X',30,'X'),'08'));
 13    when 8 then insert into t values(mod(i,10), concat(lpad('X',30,'X'),'09'));
 14    when 9 then insert into t values(mod(i,10), concat(lpad('X',30,'X'),'10'));
 15  end case;
 16  end loop;
 17  commit;
 18  end;
 19  /

PL/SQL procedure successfully completed.

LAURENT @ orcl > insert into t select 10,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' from dual connect by level <=100000; 

100000 rows created. 

LAURENT @ orcl > commit;

Commit complete.

LAURENT @ orcl > exec dbms_stats.gather_table_stats(ownname=>USER, tabname=>'T', method_opt=>'for all columns size 12', estimate_percent=>100);

PL/SQL procedure successfully completed.

LAURENT @ orcl > select table_name,column_name,density,histogram,sample_size from user_tab_col_statistics where table_name='T';

TABLE_NAME COLUMN_NAM    DENSITY HISTOGRAM       SAMPLE_SIZE
---------- ---------- ---------- --------------- -----------
T          ID         4.5455E-06 FREQUENCY            110000
T          V          4.5455E-06 FREQUENCY            110000

And if we have a look at the data distribution of column V, there’s no problem …


LAURENT @ orcl > select table_name,column_name,
  2  endpoint_number,endpoint_value,
  3  endpoint_actual_value
  4  from user_histograms
  5  where table_name='T' and column_name='V'
  6  order by ENDPOINT_ACTUAL_VALUE;

TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALUE
---------- ---------- --------------- -------------- --------------------------------
T          V                     1000     4.5871E+35 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX01
T          V                     2000     4.5871E+35 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX02
T          V                     3000     4.5871E+35 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX03
T          V                     4000     4.5871E+35 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX04
T          V                     5000     4.5871E+35 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX05
T          V                     6000     4.5871E+35 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX06
T          V                     7000     4.5871E+35 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX07
T          V                     8000     4.5871E+35 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX08
T          V                     9000     4.5871E+35 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX09
T          V                    10000     4.5871E+35 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX10
T          V                   110000     4.5871E+35 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

11 rows selected.

The problem comes when the size of the character column exceeds 32 bytes:


LAURENT @ orcl > drop table t purge;

Table dropped.

LAURENT @ orcl > create table t (id number, v varchar2(64));

Table created.

LAURENT @ orcl > begin
for i in 1..10000
loop
case mod(i,10)
  when 0 then insert into t values(mod(i,10), concat(lpad('X',32,'X'),'01'));
  when 1 then insert into t values(mod(i,10), concat(lpad('X',32,'X'),'02'));
  when 2 then insert into t values(mod(i,10), concat(lpad('X',32,'X'),'03'));
  when 3 then insert into t values(mod(i,10), concat(lpad('X',32,'X'),'04'));
  when 4 then insert into t values(mod(i,10), concat(lpad('X',32,'X'),'05'));
  when 5 then insert into t values(mod(i,10), concat(lpad('X',32,'X'),'06'));
  when 6 then insert into t values(mod(i,10), concat(lpad('X',32,'X'),'07'));
  when 7 then insert into t values(mod(i,10), concat(lpad('X',32,'X'),'08'));
  when 8 then insert into t values(mod(i,10), concat(lpad('X',32,'X'),'09'));
  when 9 then insert into t values(mod(i,10), concat(lpad('X',32,'X'),'10'));
end case;
end loop;
commit;
end;
/

PL/SQL procedure successfully completed.

LAURENT @ orcl > insert into t select 10,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' from dual connect by level <=100000;

100000 rows created.

LAURENT @ orcl > commit;

Commit complete.

LAURENT @ orcl > exec dbms_stats.gather_table_stats(ownname=>USER, tabname=>'T', method_opt=>'for all columns size 12', estimate_percent=>100);

PL/SQL procedure successfully completed.

LAURENT @ orcl > select table_name,column_name,density,histogram,sample_size from user_tab_col_statistics where table_name='T';

TABLE_NAME COLUMN_NAM    DENSITY HISTOGRAM       SAMPLE_SIZE
---------- ---------- ---------- --------------- -----------
T          ID         4.5455E-06 FREQUENCY            110000
T          V          4.5455E-06 FREQUENCY            110000

But … if we have a look at the DBA_HISTOGRAMS view …


LAURENT @ orcl > select table_name,column_name,
  2  endpoint_number,endpoint_value,
  3  endpoint_actual_value
  4  from user_histograms
  5  where table_name='T'
  6  order by COLUMN_NAME,ENDPOINT_ACTUAL_VALUE;

TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALUE
---------- ---------- --------------- -------------- --------------------------------
T          ID                    1000              0
T          ID                    2000              1
T          ID                    3000              2
T          ID                    4000              3
T          ID                    5000              4
T          ID                    6000              5
T          ID                    7000              6
T          ID                    8000              7
T          ID                    9000              8
T          ID                   10000              9
T          ID                  110000             10
T          V                   110000     4.5871E+35

LAURENT @ orcl > select count(*) from t where v=lpad('X',32,'X')||'05';

  COUNT(*)
----------
      1000

LAURENT @ orcl > select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS'));

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------
SQL_ID  aass8rg64h38b, child number 0
-------------------------------------
select count(*) from t where v=lpad('X',32,'X')||'05'

Plan hash value: 2966233522

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      2 |        |      2 |00:00:00.01 |    1416 |
|   1 |  SORT AGGREGATE    |      |      2 |      1 |      2 |00:00:00.01 |    1416 |
|*  2 |   TABLE ACCESS FULL| T    |      2 |   110K |   2000 |00:00:00.01 |    1416 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("V"='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX05')

19 rows selected.

LAURENT @ orcl > select count(*) from t where v='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX';

  COUNT(*)
----------
    100000

LAURENT @ orcl > select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS'));

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------
SQL_ID  adnfzsc6buqa3, child number 0
-------------------------------------
select count(*) from t where v='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

Plan hash value: 2966233522

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:00.01 |     708 |
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.01 |     708 |
|*  2 |   TABLE ACCESS FULL| T    |      1 |   110K |   100K |00:00:00.01 |     708 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("V"='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')

19 rows selected.

With this kind of difference between E-Rows and A-Rows, you can easily face performance problems, especially if the column has been indexed. So we have a frequency histogram, but it doesn’t help us determine the cardinality of values in the character column …
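What is going on becomes clearer once you know that, in this situation, DBMS_STATS effectively considers only a 32-byte prefix of a character value when building the histogram endpoints. A hypothetical Python sketch of that truncation (assuming a single-byte character set, so 1 character = 1 byte):

```python
# The eleven distinct values inserted above: ten 34-character strings that
# differ only in their last two characters, plus one 36-character string.
values = ["X" * 32 + f"{i:02d}" for i in range(1, 11)] + ["X" * 36]

# What the histogram effectively "sees" when values are cut at 32 bytes:
seen = {v[:32] for v in values}

print(len(values))  # 11 genuinely distinct values
print(len(seen))    # 1  -> all collapse to the same 32-byte prefix
```

With a single “distinct” value covering all 110,000 rows, the optimizer has nothing to differentiate on, which matches the single-bucket histogram on V and the 110K E-Rows estimate above.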

This behaviour has been reported as Bug 18377553: POOR CARDINALITY ESTIMATE WITH HISTOGRAMS AND VALUES > 32 BYTES.

In my case, the problem occurred with NVARCHAR2 columns (the national character set was AL16UTF16), so performance problems appeared as soon as a column value contained more than 16 characters.
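The 16-character threshold follows directly from the encoding: under AL16UTF16 each character occupies 2 bytes, so the 32-byte limit is hit after only 16 characters. A quick check with Python’s UTF-16BE codec (AL16UTF16 is essentially big-endian UTF-16 for these characters):

```python
# Each 'X' encodes to 2 bytes under UTF-16BE, the same width AL16UTF16 uses,
# so a 17-character NVARCHAR2 value already exceeds a 32-byte prefix.
for n_chars in (16, 17):
    n_bytes = len(("X" * n_chars).encode("utf-16-be"))
    print(n_chars, "chars ->", n_bytes, "bytes")
```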

Fortunately, Oracle has published a one-off patch for this bug (available for 11.2.0.3 and 11.2.0.4 on Linux x86-64). As a workaround, you can remove the histogram from this kind of column (with the method_opt clause FOR COLUMNS … SIZE 1).

In 12.1, the problem occurs only when the size of the character column exceeds 64 bytes (32 characters under the AL16UTF16 character set):


SQL> select * from v$version;

BANNER                                                                               CON_ID
-------------------------------------------------------------------------------- ----------
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production              0
PL/SQL Release 12.1.0.2.0 - Production                                                    0
CORE	12.1.0.2.0	Production                                                        0
TNS for Linux: Version 12.1.0.2.0 - Production                                            0
NLSRTL Version 12.1.0.2.0 - Production                                                    0

SQL> create table t (id number, v nvarchar2(64));

Table created.

SQL> begin
for i in 1..10000
loop
case mod(i,10)
  when 0 then insert into t values(mod(i,10), concat(lpad('X',32,'X'),'01'));
  when 1 then insert into t values(mod(i,10), concat(lpad('X',32,'X'),'02'));
  when 2 then insert into t values(mod(i,10), concat(lpad('X',32,'X'),'03'));
  when 3 then insert into t values(mod(i,10), concat(lpad('X',32,'X'),'04'));
  when 4 then insert into t values(mod(i,10), concat(lpad('X',32,'X'),'05'));
  when 5 then insert into t values(mod(i,10), concat(lpad('X',32,'X'),'06'));
  when 6 then insert into t values(mod(i,10), concat(lpad('X',32,'X'),'07'));
  when 7 then insert into t values(mod(i,10), concat(lpad('X',32,'X'),'08'));
  when 8 then insert into t values(mod(i,10), concat(lpad('X',32,'X'),'09'));
  when 9 then insert into t values(mod(i,10), concat(lpad('X',32,'X'),'10'));
end case;
end loop;
commit;
end;
/

PL/SQL procedure successfully completed.

SQL> insert into t select 10,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' from dual connect by level <=100000;

100000 rows created.

SQL> commit;

Commit complete.

SQL> exec dbms_stats.gather_table_stats(ownname=>USER, tabname=>'T', method_opt=>'for all columns size 12', estimate_percent=>100);

PL/SQL procedure successfully completed.

SQL> select table_name,column_name,density,histogram,sample_size from user_tab_col_statistics where table_name='T';

TABLE_NAME COLUMN_NAM    DENSITY HISTOGRAM       SAMPLE_SIZE
---------- ---------- ---------- --------------- -----------
T          ID         4.5455E-06 FREQUENCY            110000
T          V          4.5455E-06 FREQUENCY            110000

SQL> select table_name,column_name,
  2  endpoint_number,endpoint_value,
  3  endpoint_actual_value
  4  from user_histograms
  5  where table_name='T'
  6  order by COLUMN_NAME,ENDPOINT_NUMBER;

TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALUE
---------- ---------- --------------- -------------- ----------------------------------
T          ID                    1000              0 0
T          ID                    2000              1 1
T          ID                    3000              2 2
T          ID                    4000              3 3
T          ID                    5000              4 4
T          ID                    6000              5 5
T          ID                    7000              6 6
T          ID                    8000              7 7
T          ID                    9000              8 8
T          ID                   10000              9 9
T          ID                  110000             10 10
T          V                   110000     1.7849E+33  X X X X X X X X X X X X X X X X X
                                                      X X X X X X X X X X X X X X X

12 rows selected.

The problem with 12.1 databases is that the one-off patch has not been released yet; the bug will be fixed in 12.2.


2 responses to “Oracle 11g, frequency histograms, and character columns …”

  1. hourim February 17, 2016 at 12:16 PM

    Hello Laurent

The issue you’ve mentioned here has been solved in 12.1.0.2.0. I have just replayed your test case and the cardinality is correctly estimated in that version. It is worth noticing that, prior to 12.1.0.2.0, the generated frequency histogram has a single bucket:

    select
    column_name
    ,num_distinct
    ,num_buckets
    ,sample_size
    ,histogram
    from
    user_tab_col_statistics
    where table_name = 'T_L'
    and column_name = 'VC';

    COLUMN_NAME NUM_DISTINCT NUM_BUCKETS SAMPLE_SIZE HISTOGRAM
    ----------- ------------ ----------- ----------- ---------
    VC                    11           1      110000 FREQUENCY

While in 12.1.0.2.0 there are 11 buckets:

    select
    column_name
    ,num_distinct
    ,num_buckets
    ,sample_size
    ,histogram
    from
    user_tab_col_statistics
    where table_name = 'T_L'
    and column_name = 'VC';

    COLUMN_NAME NUM_DISTINCT NUM_BUCKETS SAMPLE_SIZE HISTOGRAM
    ----------- ------------ ----------- ----------- ---------
    VC                    11          11      110000 FREQUENCY

    Best regards
    Mohamed Houri

    • Laurent February 17, 2016 at 7:40 PM

      Thanks Mohamed.
      Strange behaviour because in user_histograms … my case shows only one bucket.

      I will try to reproduce your test case and have a closer look in a CBO trace for example.

      Feel free to share more information about this on your blog 😉

      Laurent
