Data … as usual

All things about data by Laurent Leturgez

Category Archives: Internals

Oracle Database 12c CDB$VIEW function

In Oracle 12c multitenant container, there are some new views that complement the USER_, ALL_ and DBA_Views, the CDB_ views.

For example, CDB_TABLES, CDB_OBJECTS, CDB_VIEWS etc. Those views give information for all containers. For example, CDB_TABLES references all tables information in all containers (CDB$ROOT, PDB$SEED and all plugabble databases). Those views have a new column CON_ID that reference the container id, for example:

[SYS@CDB$ROOT | SID:CDB]> select con_id,count(*) from cdb_tables group by con_id;

    CON_ID   COUNT(*)
---------- ----------
         2       2316
         3       2316
         1       2324
 If you have a closer look to the view definition, you will find that every CDB_ view uses a specific function named CDB$VIEW:
[SYS@CDB$ROOT | SID:CDB]> select dbms_metadata.get_ddl('VIEW','CDB_TABLES') from dual;

DBMS_METADATA.GET_DDL('VIEW','CDB_TABLES')
--------------------------------------------------------------------------------
  CREATE OR REPLACE FORCE NONEDITIONABLE VIEW "SYS"."CDB_TABLES"  CONTAINER_DATA
 ("OWNER", "TABLE_NAME", "TABLESPACE_NAME", "CLUSTER_NAME", "IOT_NAME", "STATUS"
, "PCT_FREE", "PCT_USED", "INI_TRANS", "MAX_TRANS", "INITIAL_EXTENT", "NEXT_EXTE
NT", "MIN_EXTENTS", "MAX_EXTENTS", "PCT_INCREASE", "FREELISTS", "FREELIST_GROUPS
", "LOGGING", "BACKED_UP", "NUM_ROWS", "BLOCKS", "EMPTY_BLOCKS", "AVG_SPACE", "C
HAIN_CNT", "AVG_ROW_LEN", "AVG_SPACE_FREELIST_BLOCKS", "NUM_FREELIST_BLOCKS", "D
EGREE", "INSTANCES", "CACHE", "TABLE_LOCK", "SAMPLE_SIZE", "LAST_ANALYZED", "PAR
TITIONED", "IOT_TYPE", "TEMPORARY", "SECONDARY", "NESTED", "BUFFER_POOL", "FLASH
_CACHE", "CELL_FLASH_CACHE", "ROW_MOVEMENT", "GLOBAL_STATS", "USER_STATS", "DURA
TION", "SKIP_CORRUPT", "MONITORING", "CLUSTER_OWNER", "DEPENDENCIES", "COMPRESSI
ON", "COMPRESS_FOR", "DROPPED", "READ_ONLY", "SEGMENT_CREATED", "RESULT_CACHE",
"CLUSTERING", "ACTIVITY_TRACKING", "DML_TIMESTAMP", "HAS_IDENTITY", "CONTAINER_D
ATA", "CON_ID") AS
  SELECT "OWNER","TABLE_NAME","TABLESPACE_NAME","CLUSTER_NAME","IOT_NAME","STATU
S","PCT_FREE","PCT_USED","INI_TRANS","MAX_TRANS","INITIAL_EXTENT","NEXT_EXTENT",
"MIN_EXTENTS","MAX_EXTENTS","PCT_INCREASE","FREELISTS","FREELIST_GROUPS","LOGGIN
G","BACKED_UP","NUM_ROWS","BLOCKS","EMPTY_BLOCKS","AVG_SPACE","CHAIN_CNT","AVG_R
OW_LEN","AVG_SPACE_FREELIST_BLOCKS","NUM_FREELIST_BLOCKS","DEGREE","INSTANCES","
CACHE","TABLE_LOCK","SAMPLE_SIZE","LAST_ANALYZED","PARTITIONED","IOT_TYPE","TEMP
ORARY","SECONDARY","NESTED","BUFFER_POOL","FLASH_CACHE","CELL_FLASH_CACHE","ROW_
MOVEMENT","GLOBAL_STATS","USER_STATS","DURATION","SKIP_CORRUPT","MONITORING","CL
USTER_OWNER","DEPENDENCIES","COMPRESSION","COMPRESS_FOR","DROPPED","READ_ONLY","
SEGMENT_CREATED","RESULT_CACHE","CLUSTERING","ACTIVITY_TRACKING","DML_TIMESTAMP"
,"HAS_IDENTITY","CONTAINER_DATA","CON_ID" FROM CDB$VIEW("SYS"."DBA_TABLES")
After searching a long time, I didn’t find any definition of this function, so I decided to search more about its behaviour.
First of all, this function seems “to transform” the view or the table given in parameter by adding a CON_ID table. We will see later that it’s not a CBO transformation as we know it but a “low level” transformation performed before SQL parsing.
CDB$VIEW function can be used with static views (DBA_TABLES, DBA_OBJECTS etc.), dictionary tables (OBJ$, USER$, FILE$ etc.), on dynamic performance views and X$ fixed tables, but on those two last items, the result is a little bit different than other.
[SYS@CDB$ROOT | SID:CDB]> select count(*) from cdb$view("DBA_TABLES");

  COUNT(*)
----------
      6956

[SYS@CDB$ROOT | SID:CDB]> select count(*) from cdb$view("OBJ$");

  COUNT(*)
----------
    272326
In a multitenant database, each container has its own dictionary, for example, CDB$ROOT has its OBJ$ table, the PDB$SEED has its own one and every pluggable database has its own OBJ$ table. So, the goal of this function is to give a global view from all containers.
But, as the OBJ$ table has the same name in every container, and as there’s no method to access a specific dictionary table of a pluggable database from the root container, the CDB$VIEW function is used to aggregate data from a specific view or table executed in each container:
[SYS@CDB$ROOT | SID:CDB]> show con_id;

CON_ID
------------------------------
1
[SYS@CDB$ROOT | SID:CDB]> select count(*) from obj$;

  COUNT(*)
----------
     90847

[SYS@CDB$ROOT | SID:CDB]>  select con_id,count(*) from cdb$view("OBJ$") group by con_id;

    CON_ID   COUNT(*)
---------- ----------
         1      90847
         2      90716
         3      90763
Now, let’s have a closer look to the generated plan:
[SYS@CDB$ROOT | SID:CDB]> select count(*) from obj$;

  COUNT(*)
----------
     90847

Execution Plan
----------------------------------------------------------
Plan hash value: 3951003077

------------------------------------------------------------------------
| Id  | Operation             | Name   | Rows  | Cost (%CPU)| Time     |
------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |        |     1 |    91   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE       |        |     1 |            |          |
|   2 |   INDEX FAST FULL SCAN| I_OBJ1 | 90910 |    91   (0)| 00:00:01 |
------------------------------------------------------------------------
[SYS@CDB$ROOT | SID:CDB]>  select count(*) from cdb$view("OBJ$");

  COUNT(*)
----------
    272326

Execution Plan
----------------------------------------------------------
Plan hash value: 2345629731

-----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name             | Rows  | Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                  |     1 |     2 (100)| 00:00:01 |       |       |        |      |            |
|   1 |  SORT AGGREGATE           |                  |     1 |            |          |       |       |        |      |            |
|   2 |   PX COORDINATOR          |                  |       |            |          |       |       |        |      |            |
|   3 |    PX SEND QC (RANDOM)    | :TQ10000         |     1 |            |          |       |       |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE        |                  |     1 |            |          |       |       |  Q1,00 | PCWP |            |
|   5 |      PX PARTITION LIST ALL|                  | 90910 |     2 (100)| 00:00:01 |     1 |   254 |  Q1,00 | PCWC |            |
|   6 |       FIXED TABLE FULL    | X$CDBVW$c2fad3da | 90910 |     2 (100)| 00:00:01 |       |       |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------------------------
There are two interesting things to note:
  • CDB$VIEW transform the statement to query an X$CDBVW$c2fad3da fixed table
  • This fixed table is read in parallel

About the transformation, I noticed many things.

First, if we query another table or view, another X$CDBVW$ is generated:

[SYS@CDB$ROOT | SID:CDB]> select count(*) from cdb$view("TAB$");

  COUNT(*)
----------
      7098

Execution Plan
----------------------------------------------------------
Plan hash value: 111784239

-----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name             | Rows  | Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                  |     1 |     0   (0)| 00:00:01 |       |       |        |      |            |
|   1 |  SORT AGGREGATE           |                  |     1 |            |          |       |       |        |      |            |
|   2 |   PX COORDINATOR          |                  |       |            |          |       |       |        |      |            |
|   3 |    PX SEND QC (RANDOM)    | :TQ10000         |     1 |            |          |       |       |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE        |                  |     1 |            |          |       |       |  Q1,00 | PCWP |            |
|   5 |      PX PARTITION LIST ALL|                  |  2372 |     0   (0)| 00:00:01 |     1 |   254 |  Q1,00 | PCWC |            |
|   6 |       FIXED TABLE FULL    | X$CDBVW$00303d4d |  2372 |     0   (0)| 00:00:01 |       |       |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------------------------
So, it seems that there’s a hash added at the end of this X$CDBVW$. It’s interesting to see that the fixed table X$CDBVW$ exists in the instance but have no rows recorded in:
[SYS@CDB$ROOT | SID:CDB]> desc X$CDBVW$
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 ADDR                                               RAW(8)
 INDX                                               NUMBER
 INST_ID                                            NUMBER
 CON_ID                                             NUMBER

[SYS@CDB$ROOT | SID:CDB]> select * from X$CDBVW$;

no rows selected
Second thing to note is this transformation is not a CBO transformation, indeed if you have a closer look to a 10053 trace file, the query has already been rewrited before parsing:
Stmt: ******* UNPARSED QUERY IS *******
SELECT COUNT(*) "COUNT(*)" FROM "SYS"."X$CDBVW$00303d4d" "TAB$"
Objects referenced in the statement
  X$CDBVW$00303d4d[TAB$] 4, type = 1
Objects in the hash table
  Hash table Object 4, type = 1, ownerid = 795352840147398986:
    Dynamic Sampling Directives at location 1:
       dirid = 11259849835960924452, state = 5, flags = 1, loc = 1 {E(4)[30]}
Return code in qosdInitDirCtx: ENBLD
Concerning the parallel execution, I noticed partition pruning is made upon the CON_ID column:
[SYS@CDB$ROOT | SID:CDB]> select count(*) from cdb$view("TAB$") where con_id=1;

  COUNT(*)
----------
      2372

Execution Plan
----------------------------------------------------------
Plan hash value: 2579428923

----------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name             | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                  |     1 |    13 |     0   (0)| 00:00:01 |       |       |        |      |            |
|   1 |  SORT AGGREGATE              |                  |     1 |    13 |            |          |       |       |        |      |            |
|   2 |   PX COORDINATOR             |                  |       |       |            |          |       |       |        |      |            |
|   3 |    PX SEND QC (RANDOM)       | :TQ10000         |     1 |    13 |            |          |       |       |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE           |                  |     1 |    13 |            |          |       |       |  Q1,00 | PCWP |            |
|   5 |      PX PARTITION LIST SINGLE|                  |    24 |   312 |     0   (0)| 00:00:01 |     1 |     1 |  Q1,00 | PCWC |            |
|   6 |       FIXED TABLE FULL       | X$CDBVW$00303d4d |    24 |   312 |     0   (0)| 00:00:01 |       |       |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------------------------
 To sum up, cdb$view function creates a fixed table on the fly and load data taken from all containers, the CON_ID column in this fixed table references the container id from where data are read. This fixed table is partitioned by CON_ID and reading this table is made by default in parallel mode (partition pruning is made on the CON_ID column).
There are two hidden parameters used to change this behaviour:
  • _partition_cdb_view_enabled: this parameter seems to disable the partitioning of the X$CDBVW$ view
[SYS@CDB$ROOT | SID:CDB]> alter session set "_partition_cdb_view_enabled"=FALSE;

Session altered.

[SYS@CDB$ROOT | SID:CDB]> select count(*) from cdb$view("TAB$");

  COUNT(*)
----------
      7098

Execution Plan
----------------------------------------------------------
Plan hash value: 4209588238

------------------------------------------------------------------------------
| Id  | Operation         | Name             | Rows  | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                  |     1 |     0   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |                  |     1 |            |          |
|   2 |   FIXED TABLE FULL| X$CDBVW$00303d4d |  2372 |     0   (0)| 00:00:01 |
------------------------------------------------------------------------------
  • _px_cdb_view_enabled: this parameter seems to disable parallel scan of the X$CDBVW$ view
[SYS@CDB$ROOT | SID:CDB]> alter session set "_px_cdb_view_enabled"=FALSE;

Session altered.

[SYS@CDB$ROOT | SID:CDB]> select count(*) from cdb$view("TAB$");

  COUNT(*)
----------
      7098

Execution Plan
----------------------------------------------------------
Plan hash value: 3630495286

------------------------------------------------------------------------------------------------
| Id  | Operation           | Name             | Rows  | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |                  |     1 |     0   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE     |                  |     1 |            |          |       |       |
|   2 |   PARTITION LIST ALL|                  |  2372 |     0   (0)| 00:00:01 |     1 |   254 |
|   3 |    FIXED TABLE FULL | X$CDBVW$00303d4d |  2372 |     0   (0)| 00:00:01 |       |       |
------------------------------------------------------------------------------------------------

On the CBO side, I tried to have a closer look on how statistics are evaluated for this particular view.

It seems that the CBO uses the cardinality of the dictionary table available in the CDB$ROOT container and ignore the statistics of the other dictionary table (taken from the other containers):

[SYS@CDB$ROOT | SID:CDB]> select con_id,table_name,num_rows from cdb_tables where table_name='TAB$';

    CON_ID TABLE_NAME        NUM_ROWS
---------- --------------- ----------
         1 TAB$                  2372
         2 TAB$                  1462
         3 TAB$                  2363

[SYS@CDB$ROOT | SID:CDB]> alter session set events '10053 trace name context forever, level 1';

Session altered.

[SYS@CDB$ROOT | SID:CDB]> select /* PARSE_ME */ count(*) from cdb$view("TAB$");

  COUNT(*)
----------
      7098
 But If we have a closer look to the generated 10053 trace file, and the cardinalities seems to be a little bit different from reality:
***************************************
BASE STATISTICAL INFORMATION
***********************
Table Stats::
  Table: X$CDBVW$00303d4d  Alias: TAB$
  #Rows: 2372  #Blks:  1487  AvgRowLen:  134.00  ChainCnt:  0.00
  FixedTabRowCost:  1000.00
=======================================
SPD: BEGIN context at query block level
=======================================
Query Block SEL$1 (#0)
Return code in qosdSetupDirCtx4QB: NOCTX
=====================================
SPD: END context at query block level
=====================================
Access path analysis for X$CDBVW$00303d4d
***************************************
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for X$CDBVW$00303d4d[TAB$]
SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  Table: X$CDBVW$00303d4d  Alias: TAB$
    Card: Original: 2372.000000  Rounded: 2372  Computed: 2372.00  Non Adjusted: 2372.00
  Access Path: TableScan
    Cost:  0.06  Resp: 0.06  Degree: 0
      Cost_io: 0.00  Cost_cpu: 2372000
      Resp_io: 0.00  Resp_cpu: 2372000
  Best:: AccessPath: TableScan
         Cost: 0.06  Degree: 1  Resp: 0.06  Card: 2372.00  Bytes: 0

***************************************
 .../...
============
Plan Table
============
-----------------------------------------------------+-----------------------------------+-------------------------+---------------+
| Id  | Operation                  | Name            | Rows  | Bytes | Cost  | Time      |  TQ  |IN-OUT|PQ Distrib | Pstart| Pstop |
-----------------------------------------------------+-----------------------------------+-------------------------+---------------+
| 0   | SELECT STATEMENT           |                 |       |       |     1 |           |      |      |           |       |       |
| 1   |  SORT AGGREGATE            |                 |     1 |       |       |           |      |      |           |       |       |
| 2   |   PX COORDINATOR           |                 |       |       |       |           |      |      |           |       |       |
| 3   |    PX SEND QC (RANDOM)     | :TQ10000        |     1 |       |       |           |:Q1000| P->S |QC (RANDOM)|       |       |
| 4   |     SORT AGGREGATE         |                 |     1 |       |       |           |:Q1000| PCWP |           |       |       |
| 5   |      PX PARTITION LIST ALL |                 |  2372 |       |     0 |           |:Q1000| PCWC |           | 1     | 254   |
| 6   |       FIXED TABLE FULL     | X$CDBVW$00303d4d|  2372 |       |     0 |           |:Q1000| PCWP |           |       |       |
-----------------------------------------------------+-----------------------------------+-------------------------+---------------+
Now imagine, you have consolidated a multitenant database with 200 pdbs, each pdb host an ERP with tons of table, you will have a small difference between evaluation and real row count.
To conclude this part, If you perform the same analysis but you query the CDB_TABLES view, this one will be rewrite in CDB$VIEW(“DBA_TABLES”). As the DBA_TABLES is a view and is not analyzed, Oracle seems to set the cardinality to 10000 rows. I have made some tests on many views and it’s always the same cardinality : 10000 rows and 0 block.
***************************************
BASE STATISTICAL INFORMATION
***********************
Table Stats::
  Table: X$CDBVW$73c6ce97  Alias: DBA_TABLES  (NOT ANALYZED)
  #Rows: 10000  #Blks:  0  AvgRowLen:  464.00  ChainCnt:  0.00
  FixedTabRowCost:  1000.00
=======================================
SPD: BEGIN context at query block level
=======================================
Query Block SEL$F5BB74E1 (#0)
Return code in qosdSetupDirCtx4QB: NOCTX
=====================================
SPD: END context at query block level
=====================================
Access path analysis for X$CDBVW$73c6ce97
***************************************
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for X$CDBVW$73c6ce97[DBA_TABLES]
SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  Table: X$CDBVW$73c6ce97  Alias: DBA_TABLES
    Card: Original: 10000.000000  Rounded: 10000  Computed: 10000.00  Non Adjusted: 10000.00
  Access Path: TableScan
    Cost:  0.25  Resp: 0.25  Degree: 0
      Cost_io: 0.00  Cost_cpu: 10000000
      Resp_io: 0.00  Resp_cpu: 10000000
  Best:: AccessPath: TableScan
         Cost: 0.25  Degree: 1  Resp: 0.25  Card: 10000.00  Bytes: 0

***************************************
To conclude this part, even if statistics on CDB$VIEW(“DBA_TABLES”) are always set to 10000 rows, it is safe to use CDB_ views directly instead on CDB$VIEW function on a dictionary table, even if I didn’t find any more accurate method.
The last researches I’ve made is on how the Oracle kernel will transform the query before parsing.
As we saw previously, the CBO receives an unparsed query which is already rewritten with the X$CDBVW$ view, so the transformation might be performed at a low level.
In Oracle 12c there’s a new object file fitted in libserver.a library (you can get this file in the 12.1 PSU2 files directory). This object file is named kpdbcv.o and it contains many function named kpdbcv* (I think this means Kernel PDB Cdb View).
[oracle@oel63 libserver12.a]$ pwd
/home/oracle/patch/17552800/files/lib/libserver12.a
[oracle@oel63 libserver12.a]$ readelf -a kpdbcv.o | grep FUNC
     7: 0000000000000270   448 FUNC    LOCAL  DEFAULT    4 kpdbcvRwtToInlineView
    10: 00000000000008f0  1216 FUNC    LOCAL  DEFAULT    4 kpdbcvLoadFragDescr
    13: 0000000000001c60    96 FUNC    LOCAL  DEFAULT    4 kpdbcvLoadBaseInfo
    14: 00000000000003b0    80 FUNC    LOCAL  DEFAULT    3 kpdbcvBuildColNames
    15: 0000000000009d50  1680 FUNC    LOCAL  DEFAULT    3 kpdbcvGetSysCtxCurrUser
    16: 0000000000001940  7616 FUNC    LOCAL  DEFAULT    3 kpdbcvFetchCbkCon
    17: 0000000000003700    96 FUNC    LOCAL  DEFAULT    3 kpdbcvFetchCbkSys
    18: 0000000000004ff0  1680 FUNC    LOCAL  DEFAULT    3 kpdbcvInsBaseViewStats
    19: 0000000000005680  1744 FUNC    LOCAL  DEFAULT    3 kpdbcvGetBaseViewRowCnt
    20: 0000000000006bc0  1792 FUNC    LOCAL  DEFAULT    3 kpdbcvGetBaseViewStats
    21: 00000000000072c0  1136 FUNC    LOCAL  DEFAULT    3 kpdbcvGetFxdViewStats
    22: 0000000000007730   544 FUNC    LOCAL  DEFAULT    3 kpdbcvFxdViewStatsCbk
    23: 0000000000007b30    80 FUNC    LOCAL  DEFAULT    3 kpdbcvGetConIdCnt
    24: 0000000000007b80   448 FUNC    LOCAL  DEFAULT    3 kpdbcvConIdMorph
    25: 0000000000007d40    80 FUNC    LOCAL  DEFAULT    3 kpdbcvGatherBind
    26: 0000000000007d90  2304 FUNC    LOCAL  DEFAULT    3 kpdbcvPrintPredText
    27: 0000000000008690   304 FUNC    LOCAL  DEFAULT    3 kpdbcvVisitPreds
    28: 00000000000087c0   384 FUNC    LOCAL  DEFAULT    3 kpdbcvVisitOpns
    29: 0000000000008940   384 FUNC    LOCAL  DEFAULT    3 kpdbcvReplaceBind
    30: 0000000000008ac0  2944 FUNC    LOCAL  DEFAULT    3 kpdbcvIsValidWhere
    31: 0000000000009640  1648 FUNC    LOCAL  DEFAULT    3 kpdbcvPredicateToText
    37: 0000000000000010    32 FUNC    GLOBAL DEFAULT    3 kpdbcdbvwcbk
    38: 0000000000000030    16 FUNC    GLOBAL DEFAULT    3 kpdbcvComDataCbk
    40: 0000000000000010   608 FUNC    GLOBAL DEFAULT    4 kpdbcvXformFrodef
    52: 0000000000000430  1216 FUNC    GLOBAL DEFAULT    4 kpdbcvLoadPartDescr
    73: 0000000000000db0   320 FUNC    GLOBAL DEFAULT    4 kpdbcvIsPartitioned
    88: 0000000000000050   784 FUNC    GLOBAL DEFAULT    3 kpdbcvIsParallelizable
    93: 0000000000000360    64 FUNC    GLOBAL DEFAULT    3 kpdbcvGetDOP
    99: 0000000000000f00  3232 FUNC    GLOBAL DEFAULT    4 kpdbcvLoad
   135: 0000000000000400  5440 FUNC    GLOBAL DEFAULT    3 kpdbcvFetchStart
   201: 0000000000003760  4560 FUNC    GLOBAL DEFAULT    3 kpdbcvFetch
   216: 0000000000004930   304 FUNC    GLOBAL DEFAULT    3 kpdbcvCleanup
   217: 0000000000001bb0   160 FUNC    GLOBAL DEFAULT    4 kpdbcvIsCDBViewOwner
   219: 0000000000004a70  1408 FUNC    GLOBAL DEFAULT    3 kpdbcvDelALLBaseViewStats
   221: 0000000000005d50  3696 FUNC    GLOBAL DEFAULT    3 kpdbcvInsALLBaseViewStats
   222: 0000000000007950    48 FUNC    GLOBAL DEFAULT    3 kpdbcvCDBViewStatsCbk
   223: 0000000000007980    48 FUNC    GLOBAL DEFAULT    3 kpdbcvComDataStatsCbk
   224: 00000000000079b0   384 FUNC    GLOBAL DEFAULT    3 kpdbcvGetParamValue
   241: 0000000000009cb0   160 FUNC    GLOBAL DEFAULT    3 kpdbcvAllocate
Depending on the context (cdb view partitoning enabled or not, parallel scan enabled or not), you will have a different call stack, but with gdb I found that the common point is the first function called which is kpdbcvXformFrodef function.
If you are curious and want to view the kpdbcv* function call stack during a select on CDB_ view, you can use gdb:
(gdb) rbreak ^kpdbcv
(gdb) commands
 Type commands for breakpoint(s) 1-38, one per line.
 End with a line saying just "end".
 >continue
 >end
(gdb) c
Continuing.

Breakpoint 1, 0x0000000002578620 in kpdbcvXformFrodef ()

Breakpoint 6, 0x0000000002579510 in kpdbcvLoad ()

Breakpoint 5, 0x00000000025793c0 in kpdbcvIsPartitioned ()

Breakpoint 5, 0x00000000025793c0 in kpdbcvIsPartitioned ()

Breakpoint 3, 0x0000000002578a40 in kpdbcvLoadPartDescr ()

Breakpoint 4, 0x0000000002578f00 in kpdbcvLoadFragDescr ()

Breakpoint 4, 0x0000000002578f00 in kpdbcvLoadFragDescr ()

.../...
Breakpoint 4, 0x0000000002578f00 in kpdbcvLoadFragDescr ()

Breakpoint 4, 0x0000000002578f00 in kpdbcvLoadFragDescr ()

Breakpoint 4, 0x0000000002578f00 in kpdbcvLoadFragDescr ()

Breakpoint 7, 0x000000000257a1c0 in kpdbcvIsCDBViewOwner ()

Breakpoint 25, 0x00000000083cd580 in kpdbcvCDBViewStatsCbk ()

Breakpoint 24, 0x00000000083cd360 in kpdbcvFxdViewStatsCbk ()

Breakpoint 23, 0x00000000083ccef0 in kpdbcvGetFxdViewStats ()

Breakpoint 22, 0x00000000083cc7f0 in kpdbcvGetBaseViewStats ()

Breakpoint 22, 0x00000000083cc7f0 in kpdbcvGetBaseViewStats ()

Breakpoint 10, 0x00000000083c5c80 in kpdbcvIsParallelizable ()

Breakpoint 5, 0x00000000025793c0 in kpdbcvIsPartitioned ()

Breakpoint 11, 0x00000000083c5f90 in kpdbcvGetDOP ()

Breakpoint 37, 0x00000000083cf8e0 in kpdbcvAllocate ()

Breakpoint 36, 0x00000000083cf270 in kpdbcvPredicateToText ()
To sum up this post, we have to keep in mind that queries on CDB_ views use an internal function CDB$VIEW that transform the sql text before parsing and as a result, there is a new fixed table X$CDBVW$ concatenated to a hash that might identify the source table or view. The statement that will run by default in parallel mode because the X$CDBVW$ fixed table is build as partitioned (by default).
On the CBO side, the result of the CDB_ view transformation give an object which have a default cardinality that seems to be always set at a value of 10000 rows. The result is quite different if you try to run CDB$VIEW on dictionary tables (OBJ$, TAB$ etc) and can give some trouble if you have a lot of PDBs opened in your container.
All this transformations seem to be hard coded (partially) in an object file kpdbcv.o included in the libserver12.a library.
Advertisements

Oracle 12c invisible columns … behind the scene

In Oracle 12c, there’s a new feature called table’s invisible columns. As indexes in oracle 11g, you can now makes columns visible or invisible.

This feature has a strange behavior that we will see later and see how it really works.

First, I created a table with 3 columns:

SQL> create table test(a number, b number, c number);

Table created.

SQL> insert into test(a,b,c) values(1,2,3);

1 row created.

SQL> desc test
 Name                    Null?    Type
 ----------------------- -------- ----------------
 A                                NUMBER
 B                                NUMBER
 C                                NUMBER
And then, I modified the table to transform the B column as invisible:
SQL> alter table test modify (b invisible);

Table altered.

SQL> desc test
 Name                    Null?    Type
 ----------------------- -------- ----------------
 A                                NUMBER
 C                                NUMBER

SQL> select * from test;

         A          C
---------- ----------
         1          3
If we tried to specifically query the B column, data appear:
SQL> select a,b,c from test;

         A          B          C
---------- ---------- ----------
         1          2          3
If we have a look deep inside the block, we can see that the visibility of a specific column is defined as the dictionary level and not at the block level:
SQL> select dbms_rowid.rowid_relative_fno(rowid) File#,
  2  dbms_rowid.rowid_block_number(rowid) Block#
  3  from test;

     FILE#     BLOCK#
---------- ----------
         1      99241

SQL> alter system dump datafile 1 block 99241;

System altered.

SQL> 
data_block_dump,data header at 0x7f59e1a66a5c
===============
tsiz: 0x1fa0
hsiz: 0x14
pbl: 0x7f59e1a66a5c
     76543210
flag=--------
ntab=1
nrow=1
frre=-1
fsbo=0x14
fseo=0x1f94
avsp=0x1f80
tosp=0x1f80
0xe:pti[0]      nrow=1  offs=0
0x12:pri[0]     offs=0x1f94
block_row_dump:
tab 0, row 0, @0x1f94
tl: 12 fb: --H-FL-- lb: 0x1  cc: 3
col  0: [ 2]  c1 02
col  1: [ 2]  c1 03 <<<<<<< OUR B COLUMN
col  2: [ 2]  c1 04
end_of_block_dump
End dump data blocks tsn: 0 file#: 1 minblk 99241 maxblk 99241

SQL> select dump(a,16),dump(b,16), dump(c,16) from test;

DUMP(A,16)        DUMP(B,16)        DUMP(C,16)
----------------- ----------------- -----------------
Typ=2 Len=2: c1,2 Typ=2 Len=2: c1,3 Typ=2 Len=2: c1,4
Now, let’s modify the B column to be visible:
SQL> alter table test modify (b visible);

Table altered.

SQL> desc test;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 A                                                  NUMBER
 C                                                  NUMBER
 B                                                  NUMBER
Well, the columns order seems to have changed. What is more funny is when you try now to insert a new row without specifying the column, it takes the new order definition:
SQL> insert into test values(1,2,3);

1 row created.

SQL> select * from test;

         A          C          B
---------- ---------- ----------
         1          3          2
         1          2          3
Now, let’s dump the block (after a necessary checkpoint).
data_block_dump,data header at 0x7f59e1a66a5c
===============
tsiz: 0x1fa0
hsiz: 0x16
pbl: 0x7f59e1a66a5c
     76543210
flag=--------
ntab=1
nrow=2
frre=-1
fsbo=0x16
fseo=0x1f88
avsp=0x1f72
tosp=0x1f72
0xe:pti[0]      nrow=2  offs=0
0x12:pri[0]     offs=0x1f94
0x14:pri[1]     offs=0x1f88
block_row_dump:
tab 0, row 0, @0x1f94
tl: 12 fb: --H-FL-- lb: 0x1  cc: 3
col  0: [ 2]  c1 02
col  1: [ 2]  c1 03
col  2: [ 2]  c1 04
tab 0, row 1, @0x1f88
tl: 12 fb: --H-FL-- lb: 0x2  cc: 3
col  0: [ 2]  c1 02
col  1: [ 2]  c1 04
col  2: [ 2]  c1 03
end_of_block_dump
End dump data blocks tsn: 0 file#: 1 minblk 99241 maxblk 99241
We confirm here that column visibility and column order definition are not defined at the block level. So it may be at the dictionary level. Let’s have a closer look to the col$ system table:
SQL> select o.obj#,col#,segcol#,o.name object_name,c.name col_name
  2  from obj$ o, col$ c
  3  where o.obj#=c.obj#
  4  and o.name='TEST'
  5  /

      OBJ#       COL#    SEGCOL# OBJECT_NAME          COL_NAME
---------- ---------- ---------- -------------------- ------------------------------
     92056          1          1 TEST                 A
     92056          3          2 TEST                 B
     92056          2          3 TEST                 C
Column position in the segment (ie. in the block) is defined by the SEGCOL# column. COL# column defines the rank of the column when you perform a SELECT * or an INSERT without specifying the corresponding column. The COL# value can change depending on visibility modifications made on the column.
So be very careful about apps code that makes inserts without column definition (INSERT INTO t VALUES (val1,val2, … , valN)). This could trigger new errors or worse, involve data integrity errors (like in the previous example).

Write SQL statements on internal structures in multitenant databases.

In a multitenant configuration, if you are querying X$ structures and fixed tables (OBJ$, TAB$ etc.), you will face differents behaviour depending on the fact you are connected to the root container or to a pluggable database.

  • If you are connected to a root container (CDB$ROOT)
    • In fixed tables, for example OBJ$, there’s no CON_ID column defined in this table so this will give you all objects for your root container, and not for all containers.
SQL> show con_name
CON_NAME
------------------------------
CDB$ROOT
SQL> desc obj$
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 OBJ#                                      NOT NULL NUMBER
 DATAOBJ#                                           NUMBER
 OWNER#                                    NOT NULL NUMBER
 NAME                                      NOT NULL VARCHAR2(128)
 NAMESPACE                                 NOT NULL NUMBER
 SUBNAME                                            VARCHAR2(128)
 TYPE#                                     NOT NULL NUMBER
 CTIME                                     NOT NULL DATE
 MTIME                                     NOT NULL DATE
 STIME                                     NOT NULL DATE
 STATUS                                    NOT NULL NUMBER
 REMOTEOWNER                                        VARCHAR2(128)
 LINKNAME                                           VARCHAR2(128)
 FLAGS                                              NUMBER
 OID$                                               RAW(16)
 SPARE1                                             NUMBER
 SPARE2                                             NUMBER
 SPARE3                                             NUMBER
 SPARE4                                             VARCHAR2(1000)
 SPARE5                                             VARCHAR2(1000)
 SPARE6                                             DATE
 SIGNATURE                                          RAW(16)
 SPARE7                                             NUMBER
 SPARE8                                             NUMBER
 SPARE9                                             NUMBER
    • X$ structures will give you information for all containers (ROOT$CDB, SEED and and all pluggable database). And there’s a CON_ID column defined in all those structures.
SQL> show con_name

CON_NAME
------------------------------
CDB$ROOT

SQL> desc x$ksppi
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 ADDR                                               RAW(8)
 INDX                                               NUMBER
 INST_ID                                            NUMBER
 CON_ID                                             NUMBER
 KSPPINM                                            VARCHAR2(80)
 KSPPITY                                            NUMBER
 KSPPDESC                                           VARCHAR2(255)
 KSPPIFLG                                           NUMBER
 KSPPILRMFLG                                        NUMBER
 KSPPIHASH                                          NUMBER

SQL> select c.name,KSPPINM
  2  from x$ksppi x, v$containers c
  3  where x.con_id=c.con_id and  KSPPINM='open_cursors';

NAME                           KSPPINM
------------------------------ ------------------------------
CDB$ROOT                       open_cursors
PDB$SEED                       open_cursors
PDB1                           open_cursors
PDB2                           open_cursors
  • If you are connected to a PDB
    • In fixed tables, for example OBJ$, there’s still no CON_ID, so this will give you all objects on your PDB.
    • In X$ structures, there’s a CON_ID column, but if you are connected to a PDB, you will see only the data related to this PDB.
To summarize, static fixed tables contain data for the actual container (CDB$ROOT or PDB), but X$ structures contain data for all containers if you are connected to the CDB$ROOT container, and contain data related to the PDB you are connected to, in case of a PDB.
If you disassemble static views like CDB_TABLES, you will see a new function used to have data for all containers. This is CDB$VIEW function :
SQL> show con_id

CON_ID
------------------------------
1
SQL> show con_name

CON_NAME
------------------------------
CDB$ROOT
SQL> select count(*) from obj$;

  COUNT(*)
----------
     91005

SQL> select con_id,count(*) from CDB$VIEW("SYS"."OBJ$") group by con_id order by 1;

    CON_ID   COUNT(*)
---------- ----------
         1      91005
         2      90708
         3      90960
         4      90948
 Ok, now we have the information for all our containers. But if we want to join an X$ Structure and a CDB$VIEW transformed object, time for execute this is too long:
select x.con_id,o.name,count(*)
from x$bh x, (select name,dataobj#,con_id from CDB$VIEW("SYS"."OBJ$")) o
where x.con_id=o.con_id
and o.dataobj#=x.obj
and o.name like 'T_PDB%'
group by x.con_id,o.name
/
... never ends :(
To execute it in a better time, I used query factorization with a WITH block, and forced materialization of it:
SQL> with o as (select /*+ MATERIALIZE */ name,dataobj#,con_id from CDB$VIEW("SYS"."OBJ$"))
  2  select x.con_id,o.name,count(*)
  3  from x$bh x,o
  4  where x.con_id=o.con_id
  5  and o.dataobj#=x.obj
  6  and o.name like 'T_PDB%'
  7  group by x.con_id,o.name
  8  /

    CON_ID NAME                   COUNT(*)
---------- -------------------- ----------
         4 T_PDB2                        9
         3 T_PDB1                        9

Elapsed: 00:00:02.40

Trace Oracle Internal triggers

When you create a replication between two tables, a snapshot log for your materialized views, etc. Oracle creates and manages internal triggers that execute SQL statements. For example, when you insert a row in a table with a snapshot log, sql statements are executed in the related snapshot log table (MLOG$ table).

But if you trace it with 10046 event, nothing will appear in the trace file.

SQL> create materialized view log on OBJ$CLONE;

Materialized view log created.

SQL> select * from mlog$_OBJ$CLONE;

no rows selected

SQL> @10046_on.sql

Session altered.

SQL> delete from OBJ$CLONE where obj#=3000;

1 row deleted.

SQL> insert into OBJ$CLONE select * from sys.obj$ where obj#=3000;

1 row created.

SQL> commit;

Commit complete.

SQL> @10046_off.sql
The tracefile doesn’t mention  any DML operation on the table MLOG$_OBJ$CLONE (see the tkprof output below):
SQL ID: 46m19q6258mvs Plan Hash: 4136978372

SELECT /* OPT_DYN_SAMP */ /*+ ALL_ROWS IGNORE_WHERE_CLAUSE
  NO_PARALLEL(SAMPLESUB) opt_param('parallel_execution_enabled', 'false')
  NO_PARALLEL_INDEX(SAMPLESUB) NO_SQL_TUNE */ NVL(SUM(C1),0), NVL(SUM(C2),0)
FROM
 (SELECT /*+ NO_PARALLEL("MLOG$_OBJ$CLONE") FULL("MLOG$_OBJ$CLONE")
  NO_PARALLEL_INDEX("MLOG$_OBJ$CLONE") */ 1 AS C1, 1 AS C2 FROM
  "LAURENT"."MLOG$_OBJ$CLONE" "MLOG$_OBJ$CLONE") SAMPLESUB

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        2      0.00       0.05          0          0          0           0
Execute      2      0.00       0.00          0          0          0           0
Fetch        2      0.00       0.00          0         14          0           2
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        6      0.00       0.05          0         14          0           2

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: SYS   (recursive depth: 2)
Number of plan statistics captured: 2

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  SORT AGGREGATE (cr=7 pr=0 pw=0 time=204 us)
         4          4          5   TABLE ACCESS FULL MLOG$_OBJ$CLONE (cr=7 pr=0 pw=0 time=105 us cost=2 size=0 card=82)

********************************************************************************

SQL ID: 2ya3sb8sagzxw Plan Hash: 3957508444

delete from OBJ$CLONE
where
 obj#=3000

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.01       0.07          0          8          0           0
Execute      1      0.00       0.00          0          3          8           1
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        2      0.01       0.07          0         11          8           1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 90
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         0          0          0  DELETE  OBJ$CLONE (cr=3 pr=0 pw=0 time=372 us)
         1          1          1   INDEX UNIQUE SCAN PK_OBJ$CLONE (cr=2 pr=0 pw=0 time=34 us cost=1 size=13 card=1)(object id 127619)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                       1        0.00          0.00
  SQL*Net message from client                     1        3.68          3.68
********************************************************************************

SQL ID: 4n7u7f6u7kd2y Plan Hash: 1218588913

insert into OBJ$CLONE select * from sys.obj$ where obj#=3000

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          8          4           0
Execute      1      0.00       0.00          0          5          8           1
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        2      0.00       0.00          0         13         12           1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 90
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         0          0          0  LOAD TABLE CONVENTIONAL  (cr=5 pr=0 pw=0 time=518 us)
         1          1          1   TABLE ACCESS BY INDEX ROWID OBJ$ (cr=3 pr=0 pw=0 time=75 us cost=3 size=86 card=1)
         1          1          1    INDEX RANGE SCAN I_OBJ1 (cr=2 pr=0 pw=0 time=46 us cost=2 size=0 card=1)(object id 36)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                       1        0.00          0.00
  SQL*Net message from client                     1        2.44          2.44
********************************************************************************

SQL ID: 23wm3kz7rps5y Plan Hash: 0

commit

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          1           0
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        2      0.00       0.00          0          0          1           0

Misses in library cache during parse: 0
Parsing user id: 90

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                       1        0.00          0.00
  SQL*Net message from client                     1        4.10          4.10
********************************************************************************
but there are some datas in this table:
SQL> select * from mlog$_OBJ$CLONE;

      OBJ# SNAPTIME$ D O CHANGE_VECTOR$$                     XID$$
---------- --------- - - ------------------------------ ----------
      3000 01-JAN-00 D O 000000                         5.6296E+15
      3000 01-JAN-00 I N FEFFFF                         5.6296E+15
And internal triggers is created to do this:
SQL> select * from user_internal_triggers;

TABLE_NAME                     INTERNAL_TRIGGER_TY
------------------------------ -------------------
OBJ$CLONE                      MVIEW LOG
By searching in USER_INTERNAL_TRIGGERS  code, we can see that this view references internal triggers on ‘DEFERRED RPC QUEUE’, ‘MVIEW LOG’, ‘UPDATABLE MVIEW LOG’ and ‘CONTEXT’. (based on the TAB$ table’s TRIGFLAG column).
select dbms_metadata.get_ddl('VIEW','USER_INTERNAL_TRIGGERS','SYS') ddl from dual

DDL
------------------------------------------------------------------------------------------------------------------------

  CREATE OR REPLACE FORCE VIEW "SYS"."USER_INTERNAL_TRIGGERS" ("TABLE_NAME", "INTERNAL_TRIGGER_TYPE") AS
  select o.name, 'DEFERRED RPC QUEUE'
from sys.tab$ t, sys.obj$ o
where o.owner# = userenv('SCHEMAID')
      and t.obj# = o.obj#
      and bitand(t.trigflag,1) = 1
union
select o.name, 'MVIEW LOG'
from sys.tab$ t, sys.obj$ o
where o.owner# = userenv('SCHEMAID')
      and t.obj# = o.obj#
      and bitand(t.trigflag,2) = 2
union
select o.name, 'UPDATABLE MVIEW LOG'
from sys.tab$ t, sys.obj$ o
where o.owner# = userenv('SCHEMAID')
       and t.obj# = o.obj#
       and bitand(t.trigflag,4) = 4
union
select o.name, 'CONTEXT'
from sys.tab$ t, sys.obj$ o
where o.owner# = userenv('SCHEMAID')
      and t.obj# = o.obj#
      and bitand(t.trigflag,8) = 8
If you want to trace what is really done by the internal trigger, you have a bunch of undocumented events that will trace specific triggers (Note there are events for repcat or synchronous replication for which there is no trigflag definition in USER_INTERNAL_TRIGGERS view code):
10302, 00000, "trace create or drop internal trigger"
// *Document: NO
// *Cause:
// *Action:

10303, 00000, "trace loading of library cache for internal triggers"
// *Document: NO
// *Cause:
// *Action:

10304, 00000, "trace replication trigger"
// *Document: NO
// *Cause:
// *Action:

10305, 00000, "trace updatable materialized view trigger"
// *Document: NO
// *Cause:
// *Action:

10306, 00000, "trace materialized view log trigger"
// *Document: NO
// *Cause:
// *Action:

10307, 00000, "trace RepCat execution"
// *Document: NO
// *Cause:
// *Action:

10308, 00000, "replication testing event"
// *Document: NO
// *Cause:
// *Action:

10309, 00000, "Trigger Debug event"
// *Document: NO
// *Cause:
// *Action:
// *Comment: This event replaces the earlier event number 10250
//           which had multiple definitions

10310, 00000, "trace synchronous change table trigger"
// *Document: NO
// *Cause:
// *Action:
I have chosen the 10306 event to trace internal triggers on Materialized view log (but you can use others depending on what you want to trace):
SQL> create materialized view log on OBJ$CLONE;

Materialized view log created.

SQL> select * from user_internal_triggers;

TABLE_NAME                     INTERNAL_TRIGGER_TY
------------------------------ -------------------
OBJ$CLONE                      MVIEW LOG

SQL> alter session set events '10306 trace name context forever, level 10';

Session altered.

SQL> delete from OBJ$CLONE where obj#=3000;

1 row deleted.

SQL> insert into OBJ$CLONE select * from sys.obj$ where obj#=3000;

1 row created.

SQL> commit;

Commit complete.
The related trace file shows SQL statements now executed on MLOG$_OBJ$CLONE table:
kntgslm()+
  ccl_kntsl = 1
  kntzcid =
        1
  lcc_kntsl = 0
  kntzlcs =
  kntzlcm =
  cpk_kntsl = 1
  pkc_kntsl =
        1
  flg_kntsl = 270434
  len_kntsl = 220
  sql_kntsl = INSERT /*+ IDX(%d) */ INTO "LAURENT"."MLOG$_OBJ$CLONE" (dmltype$$,old_new$$,snaptime$$,change_vector$$,xid$$,"OBJ#") VALUES (:d,:o,to_date('4000-01-01:00:00:00','YYYY-MM-DD:HH24:MI:SS'),:c,:x%s,:1)
  len2_kntsl = 0
  sql2_kntsl =
kntgslm()-
kntkca()
  ccl_kntsl = 1
  kntzcid =
        1
  lcc_kntsl = 0
  kntzlcs =
  kntzlcm =
  cpk_kntsl = 1
  pkc_kntsl =
        1
  flg_kntsl = 270434
  len_kntsl = 220
  sql_kntsl = INSERT /*+ IDX(%d) */ INTO "LAURENT"."MLOG$_OBJ$CLONE" (dmltype$$,old_new$$,snaptime$$,change_vector$$,xid$$,"OBJ#") VALUES (:d,:o,to_date('4000-01-01:00:00:00','YYYY-MM-DD:HH24:MI:SS'),:c,:x%s,:1)
  len2_kntsl = 0
  sql2_kntsl =
kntxslt()
  ccl_kntsl = 1
  kntzcid =
        1
  lcc_kntsl = 0
  kntzlcs =
  kntzlcm =
  cpk_kntsl = 1
  pkc_kntsl =
        1
  flg_kntsl = 270434
  len_kntsl = 220
  sql_kntsl = INSERT /*+ IDX(%d) */ INTO "LAURENT"."MLOG$_OBJ$CLONE" (dmltype$$,old_new$$,snaptime$$,change_vector$$,xid$$,"OBJ#") VALUES (:d,:o,to_date('4000-01-01:00:00:00','YYYY-MM-DD:HH24:MI:SS'),:c,:x%s,:1)
  len2_kntsl = 0
  sql2_kntsl =

*** 2013-01-16 11:53:21.065
kntkca()
  ccl_kntsl = 1
  kntzcid =
        1
  lcc_kntsl = 0
  kntzlcs =
  kntzlcm =
  cpk_kntsl = 1
  pkc_kntsl =
        1
  flg_kntsl = 270434
  len_kntsl = 220
  sql_kntsl = INSERT /*+ IDX(%d) */ INTO "LAURENT"."MLOG$_OBJ$CLONE" (dmltype$$,old_new$$,snaptime$$,change_vector$$,xid$$,"OBJ#") VALUES (:d,:o,to_date('4000-01-01:00:00:00','YYYY-MM-DD:HH24:MI:SS'),:c,:x%s,:1)
  len2_kntsl = 0
  sql2_kntsl =
kntxslt()
  ccl_kntsl = 1
  kntzcid =
        1
  lcc_kntsl = 0
  kntzlcs =
  kntzlcm =
  cpk_kntsl = 1
  pkc_kntsl =
        1
  flg_kntsl = 270434
  len_kntsl = 220
  sql_kntsl = INSERT /*+ IDX(%d) */ INTO "LAURENT"."MLOG$_OBJ$CLONE" (dmltype$$,old_new$$,snaptime$$,change_vector$$,xid$$,"OBJ#") VALUES (:d,:o,to_date('4000-01-01:00:00:00','YYYY-MM-DD:HH24:MI:SS'),:c,:x%s,:1)
  len2_kntsl = 0
  sql2_kntsl =
You can notice we have 5 instructions in sql_kntsl for 2 rows modified. Each statement seems to be controlled by 2 internal functions:  kntkca() and kntxslt(). There is one more function used to start this sequence kntgslm().
If I trace only one DML instruction in my OBJ$CLONE table, the result is one call to kntgslm() followed by a call to kntkca() and kntxslt().
So the sequence might start with a call to kntgslm followed by a call to  kntkca() and kntxslt() per row in the source table.
Finally, there’s a last problem: the sql text mentioned in sql_kntsl. It seems to be incomplete because of %d or %s in text but with those events, we can find in the sql statements executed we didn’t fin in the sql trace of the original statement.

How ASM disk header block repair works

In this post, I will explain one of my last works about the ASM disk header block.

First, I will create a TEST tablespace in my orcl database. The TEST’s datafile will be managed by ASM.

[oracle@oel ~]$ . oraenv
ORACLE_SID = [+ASM] ? orcl
The Oracle base remains unchanged with value /u01/app/oracle
[oracle@oel ~]$ sqlplus / as sysdba
SQL> create tablespace test datafile '+data' size 5M autoextend on maxsize unlimited;
Tablespace created.

SQL>  select instance_name from v$instance;
INSTANCE_NAME
----------------
orcl

SQL> select file_id,tablespace_name,file_name,status,online_status from dba_data_files
  2  /

   FILE_ID TABLESPACE_NAME                FILE_NAME                                          STATUS    ONLINE_
---------- ------------------------------ -------------------------------------------------- --------- -------
         4 USERS                          /u02/oradata/orcl/users01.dbf                      AVAILABLE ONLINE
         3 UNDOTBS1                       /u02/oradata/orcl/undotbs01.dbf                    AVAILABLE ONLINE
         2 SYSAUX                         /u02/oradata/orcl/sysaux01.dbf                     AVAILABLE ONLINE
         1 SYSTEM                         /u02/oradata/orcl/system01.dbf                     AVAILABLE SYSTEM
         5 EXAMPLE                        /u02/oradata/orcl/example01.dbf                    AVAILABLE ONLINE
         6 TEST                           +DATA/orcl/datafile/test.258.798578613             AVAILABLE ONLINE

6 rows selected.

My ASM instance manages two diskgroups DATA (external redundancy) and RL (normal redundancy).

DATA is a 2 disks diskgroup managed by asmlib.

SQL> select group_number,name,type,state from v$asm_diskgroup
  2  /

GROUP_NUMBER NAME TYPE   STATE
------------ ---- ------ -----------
           1 DATA EXTERN MOUNTED
           2 RL   NORMAL MOUNTED
SQL> select path from v$asm_disk where group_number=1;

PATH
---------------------------------
ORCL:ASM1
ORCL:ASM2

And now, we erase the header block of each disk. Header block is the first one of an ASM disk device, and its default size is 4096 bytes (This size is available in _asm_blksize undocumented parameter).

[oracle@oel ~]$ dd if=/dev/zero of=/dev/oracleasm/disks/ASM1 bs=4096 count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 2.2e-05 seconds, 186 MB/s
[oracle@oel ~]$ dd if=/dev/zero of=/dev/oracleasm/disks/ASM2 bs=4096 count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000336 seconds, 12.2 MB/s

Even my block headers destroyed, I can still write in my diskgroup and allocate extents in it.

SQL> alter database datafile 6 resize 10M;

Database altered.

SQL> create table t tablespace TEST as select * from dba_source;

Table created.

SQL> alter system checkpoint;

System altered.

SQL> select file#,checkpoint_time from v$datafile_header where file#=6;

     FILE# CHECKPOINT_TIME
---------- -------------------
         6 05/11/2012 19:27:43

Well, let’s try to restart the rdbms instance:

SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

Total System Global Area 1068937216 bytes
Fixed Size                  2235208 bytes
Variable Size             377488568 bytes
Database Buffers          683671552 bytes
Redo Buffers                5541888 bytes
Database mounted.
Database opened.
SQL> select count(*) from t;

  COUNT(*)
----------
    702070

Damned ! I can still restart it and read all file extents.

There’s something strange in this demo, my ASM disk header is invalid and I still can read files. I will verify with kfed my headers state (only one ASM disk is shown below):

oracle@oel ~]$ kfed read /dev/oracleasm/disks/ASM1
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                           0 ; 0x00c: 0x00000000
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
2B469E002400 00000000 00000000 00000000 00000000  [................]
  Repeat 255 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]

[oracle@oel ~]$ sudo /u01/app/oracle/product/11.2.0.3/dbhome_1/bin/kfed read /dev/sdb1
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                           0 ; 0x00c: 0x00000000
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
2B413559C400 00000000 00000000 00000000 00000000  [................]
  Repeat 255 times
KFED-00322: file not found; arguments: [kfbtTraverseBlock] [Invalid OSM block type] [] [0]

So KFED has confirmed my header blocks are invalid. Now, I will try to unmount and remount the diskgroup to see if there’s any effect on diskgroup mount operation.

[oracle@oel ~]$ sqlplus / as sysasm

SQL> alter diskgroup DATA dismount;

Diskgroup altered.

SQL> alter diskgroup DATA mount;
alter diskgroup DATA mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15017: diskgroup "DATA" cannot be mounted
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "DATA"

So ASM disk header seems to be important for mounting the diskgroup but without effect on asm extent allocation etc.

Now I will repair asm disks with kfed and repair operation to restore my asm header blocks:

[oracle@oel ~]$ kfed repair /dev/oracleasm/disks/ASM1
[oracle@oel ~]$ kfed repair /dev/oracleasm/disks/ASM2

[oracle@oel ~]$ sqlplus / as sysasm

SQL> alter diskgroup DATA mount;

Diskgroup altered.

So KFED is able to restore the header (with the repair operation).

[oracle@oel ~]$ . oraenv
ORACLE_SID = [+ASM] ? orcl
The Oracle base remains unchanged with value /u01/app/oracle

[oracle@oel ~]$ sqlplus / as sysdba

Connected to an idle instance.

SQL> startup
...
Database opened.
SQL> select count(*) from T;

  COUNT(*)
----------
    702070

And datas in the T table are still there … no problem !

Well, let’s summarize … I erased the header block of both asm disks in my diskgroup without any impact on file extent allocation. The only impact I had was on the mount capability. Next, I have restored the header block of each disk with a kfed repair operation … yes, but without any backup of the disk or the disk header. So I wonder … where was the backup of my header block?

First, I will try a very simple method. I will have a look at all metadata blocks in my asm disk device. I hope I will find another header block. For this purpose, i will use a basic shell script to analyze each block. If the script finds a block with the type KFBTYP_DISKHEAD, it will keep the block position, and print them at the end of script execution:

The basic script:

#!/bin/bash
#set -x
export ORAENV_ASK=NO
export ORACLE_SID=+ASM
. oraenv

i=0
ret=0
tab_cnt=0;
blk_typ='';

while [ "$blk_typ" != "KFBTYP_INVALID" ];
do
  blk_typ=`kfed read $1 blkn=$i | grep kfbh.type | awk '{print $5;}' | sed 's/^ *//g' | sed 's/ *$//g' `
  ret=$?
  if [ "$blk_typ" = "KFBTYP_DISKHEAD" ]; then
    t[$i]=$i
  fi
  let i=$i+1
done
echo "list of header block with KFBTYP_DISKHEAD type"
echo ${t[@]}

Execution results:

[oracle@oel ~]$ ./asm_surgery.sh /dev/oracleasm/disks/ASM1
The Oracle base remains unchanged with value /u01/app/oracle
list of header block with KFBTYP_DISKHEAD type
0 510
[oracle@oel ~]$ ./asm_surgery.sh /dev/oracleasm/disks/ASM2
The Oracle base remains unchanged with value /u01/app/oracle
list of header block with KFBTYP_DISKHEAD type
0 510

Ok, so there is a copy of the header block in the 510th block in my disk. Indeed, as Bane Radulović mentioned it in its blog, a backup copy of ASM disk header is in the second last block of allocation unit 1. So, if my AU is 1Mb and my block size 4096 bytes, a copy of my header block will be available in ((1048576 / 4096) * 2 – 1  = 511), as the block# starts at 0, it is located in the 510th block.

To double check this, I will erase the header block of my first ASM disk, and use kfed repair and strace to see what really happens:

[oracle@oel ~]$ dd if=/dev/zero of=/dev/oracleasm/disks/ASM1 bs=4096 count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 2.3e-05 seconds, 178 MB/s

[oracle@oel ~]$ strace kfed repair /dev/oracleasm/disks/ASM1
.../...

stat("/dev/oracleasm/disks/ASM1", {st_mode=S_IFBLK|0660, st_rdev=makedev(8, 17), ...}) = 0
access("/dev/oracleasm/disks/ASM1", F_OK) = 0
statfs("/dev/oracleasm/disks/ASM1", {f_type=0x958459f6, f_bsize=4096, f_blocks=0, f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={0, 0}, f_namelen=255, f_frsize=4096}) = 0
open("/dev/oracleasm/disks/ASM1", O_RDWR) = 7
lseek(7, 2088960, SEEK_SET)             = 2088960
read(7, "\1\202\1\1\376\200<\206\371\7"..., 4096) = 4096
lseek(7, 0, SEEK_SET)                   = 0
read(7, ""..., 4096) = 4096
lseek(7, 0, SEEK_SET)                   = 0
write(7, "\1\202\1\1\200\302\206\371\7"..., 4096) = 4096
close(7)                                = 0

This is very interesting.

First operation, it opens the device and read the block after 2088960 bytes. But 2088960/4096 = 510, so it reads the 510th block of the disk. Next it reads the block at position 0 (header block), and then writes in it the content of the 510th block.

Well, now I know there’s a copy of each header block in a secret position !!! (block #510) and kfed uses this block to repair the main header block.

During my tests, I noticed that sometimes, asm disks lost their asm label (I don’t know why). As a consequence, repaired disks won’t be recreated by the oracleasm scandisks (or after a reboot).

[root@oel ~]# ls /dev/oracleasm/disks/*
/dev/oracleasm/disks/ASM1  /dev/oracleasm/disks/ASM2  /dev/oracleasm/disks/ASM3  /dev/oracleasm/disks/ASM4
[root@oel ~]# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Cleaning disk "ASM1"
Scanning system for ASM disks...
[root@oel ~]# ls /dev/oracleasm/disks/*
/dev/oracleasm/disks/ASM2  /dev/oracleasm/disks/ASM3  /dev/oracleasm/disks/ASM4

According the fact that you know which device is corresponding to asmlib disk device, you can relabel this disk with oracleasm renamedisk command (be very careful with this command):

[root@oel ~]# oracleasm renamedisk -f /dev/sdb1 ASM1
Writing disk header: done
Instantiating disk "ASM1": done

[root@oel ~]# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Scanning system for ASM disks...

[root@oel ~]# ls /dev/oracleasm/disks/*
/dev/oracleasm/disks/ASM1  /dev/oracleasm/disks/ASM2  /dev/oracleasm/disks/ASM3  /dev/oracleasm/disks/ASM4