If you have ever faced RAC problems, you know that the entry points for analysis are the alert.log files.
But since Oracle Clusterware logs its events in many log files (usually one per process), and each server has its own log directory, it can be difficult to identify the causes, and especially the root cause, of a specific problem.
Recently I wrote a small Perl script to help with this. It is named “crs_log_aggregator.pl” and is available for download at this URL: https://app.box.com/s/xi3h7bpssrqtkp55q43c.
The script works as follows. As input, it takes one or more directories to analyze recursively; these directories contain all the CRS log files of the cluster nodes you want to analyze.
For each directory, it parses every log file (except the alert.log) and every rotated log file (file extension: l??). Parsing extracts the timestamp mentioned on each line; if a line has no timestamp (for example because a message spans several lines), the script reuses the preceding timestamp. If no timestamp precedes it at all (in a file header, for example), the timestamp is flagged as ‘N/A’. Every analyzed line is also tagged with its origin, the relative path of its source file.
As a result, each analyzed line is represented in the output file by three parts:
- the timestamp of the message
- the origin file of the message
- the message
As a separator, I used a “<SEP>” tag.
At this stage, each record is written to a single unsorted file. If a filter expression has been passed to the program (which is highly recommended), it is applied during the parsing phase.
The program then sorts the records by timestamp and produces a single, optionally filtered output file that is ready for analysis.
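To make the approach concrete, here is a minimal sketch of that parse/filter/sort logic. This is not the actual crs_log_aggregator.pl, just an illustration under the assumptions described above (timestamp format, <SEP> separator, alert.log exclusion):

#!/usr/bin/perl
# Minimal sketch of the aggregation logic; the real script handles more cases.
use strict;
use warnings;
use File::Find;

my ($dir, $filter) = @ARGV;    # source directory and optional regex filter
die "usage: $0 dir [filter]\n" unless defined $dir;
my @records;

find(sub {
    return unless -f $_ && $_ =~ /\.(log|l\d\d)$/;  # *.log and rotated *.l??
    return if $_ =~ /^alert/;                       # alert*.log are excluded
    my $path = $File::Find::name;                   # relative origin path
    open my $fh, '<', $_ or return;
    my $ts = 'N/A';    # lines before the first timestamp are flagged N/A
    while (my $line = <$fh>) {
        chomp $line;
        # Continuation lines without a timestamp reuse the preceding one.
        $ts = $1 if $line =~ /^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)/;
        next if defined $filter && $line !~ /$filter/;  # filter while parsing
        push @records, "$ts<SEP>$path<SEP>$line";
    }
    close $fh;
}, $dir);

# Sorting the records lexically orders them by their timestamp prefix.
print "$_\n" for sort @records;

Note that sorting everything in memory, as above, is only viable for reasonably small volumes, which is one more reason to filter during parsing.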
Well, let’s see how it works:
For example, let’s consider a two-node RAC (hostnames are rac1 and rac2).
Since alert.log files are not analyzed, review them first to determine when the problem occurred.
In my case, I want to analyze two days: 2013-05-30 and 2013-05-31.
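For example, a quick grep on each node (assuming the usual $GRID_HOME/log/<hostname>/alert<hostname>.log location) is enough to confirm that the incident falls within this window:

[oracle@rac1 ~]$ grep '^2013-05-3[01]' $GRID_HOME/log/rac1/alertrac1.log | head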
So the first step is to copy all the log files to a filesystem or to your desktop.
I recommend putting all the directories related to a single hostname into a single directory. Here, I have two directories coming from both cluster nodes ($GRID_HOME/log); the copy itself is shown in the scp example after the listing below:
[oracle@rac1 aggregator]$ tree -d .
.
|-- rac1
| |-- acfslog
| |-- acfsrepl
| |-- acfsreplroot
| |-- acfssec
| |-- admin
| |-- agent
| | |-- crsd
| | | |-- oraagent_oracle
| | | |-- orarootagent_root
| | | `-- scriptagent_oracle
| | `-- ohasd
| | |-- oraagent_oracle
| | |-- oracssdagent_root
| | |-- oracssdmonitor_root
| | `-- orarootagent_root
| |-- client
| |-- crflogd
| |-- crfmond
| |-- crsd
| |-- cssd
| |-- ctssd
| |-- cvu
| | |-- cvulog
| | `-- cvutrc
| |-- diskmon
| |-- evmd
| |-- gipcd
| |-- gnsd
| |-- gpnpd
| |-- mdnsd
| |-- ohasd
| |-- racg
| | |-- racgeut
| | |-- racgevtf
| | `-- racgmain
| `-- srvm
`-- rac2
|-- acfslog
|-- acfsrepl
|-- acfsreplroot
|-- acfssec
|-- admin
|-- agent
| |-- crsd
| | |-- oraagent_oracle
| | |-- orarootagent_root
| | `-- scriptagent_oracle
| `-- ohasd
| |-- oraagent_oracle
| |-- oracssdagent_root
| |-- oracssdmonitor_root
| `-- orarootagent_root
|-- client
|-- crflogd
|-- crfmond
|-- crsd
|-- cssd
|-- ctssd
|-- cvu
| |-- cvulog
| `-- cvutrc
|-- diskmon
|-- evmd
|-- gipcd
|-- gnsd
|-- gpnpd
|-- mdnsd
|-- ohasd
|-- racg
| |-- racgeut
| |-- racgevtf
| `-- racgmain
`-- srvm
74 directories
[oracle@rac1 aggregator]$ ls -l
total 20
drwxr-xr-x 24 oracle oinstall 4096 Sep 8 14:39 rac1
drwxr-xr-x 24 oracle oinstall 4096 Sep 8 14:00 rac2
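As mentioned above, the copy can be done with a simple scp from each node (assuming ssh access to both nodes and that $GRID_HOME resolves to the Grid Infrastructure home in your local environment; otherwise substitute the actual path):

[oracle@rac1 aggregator]$ scp -r rac1:$GRID_HOME/log/rac1 .
[oracle@rac1 aggregator]$ scp -r rac2:$GRID_HOME/log/rac2 .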
If you want help with this tool, you can use the -h option:
[oracle@rac1 aggregator]$ ./crs_log_aggregator.pl -h
RAC Logs consolidator
---------------------------------------------
Usage
./crs_log_aggregator.pl -c dir1[,dir2,dir3,...,dirN] -o outputfile [-f filter]
[MANDATORY]: -c dir1,dir2,...,dirN
comma separated values of source directories to analyze
[MANDATORY]: -o outputfile
unique sorted and filtered output file
[OPTIONAL] : -f filter
regex filter to apply
[OPTIONAL] : -h
print this help
Examples:
$ ./crs_log_aggregator.pl -c data/ -o outfile
will sort without any filter all log files (*.log, *.l??) recursively found in the data directory. outfile is produced in the current directory as a result.
$ ./crs_log_aggregator.pl -c data/rac1/cssd,data/rac1/crsd -o outfile
will sort without any filter only the content of data/rac1/cssd,data/rac1/crsd directories.
$ ./crs_log_aggregator.pl -c data/ -o outfile_filtered -f '2012-10-26|2012-10-27'
will sort with a filter (regex) that will filter 2 days 2012-10-26 and 2012-10-27.
more information on perl regex: http://perldoc.perl.org/perlre.html
Next, I launch the program with the appropriate options:
[oracle@rac1 aggregator]$ ./crs_log_aggregator.pl -c ./rac1,./rac2 -o output_file -f '2013-05-30|2013-05-31'
---------Alert.log : ./rac1/alertrac1.log excluded from analyze
---------Alert.log : ./rac2/alertrac2.log excluded from analyze
[oracle@rac1 aggregator]$ ls -lh output_file
-rw-r--r-- 1 oracle oinstall 243M Sep 8 15:16 output_file
Please note that if an alert.log file is found, it is excluded and reported on standard output.
I highly recommend using a filter to reduce the amount of data to sort (the sort can burn a lot of CPU and memory). The -f option is used for this and takes a regex-style filter, which is applied directly in the Perl script; if you are not familiar with Perl regular expressions, have a look at http://perldoc.perl.org/perlre.html.
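If two full days still produce too much data, the same mechanism lets you narrow the window further. For example (a hypothetical run reusing the dates above), filtering on a single hour of 2013-05-31:

[oracle@rac1 aggregator]$ ./crs_log_aggregator.pl -c ./rac1,./rac2 -o output_file_small -f '2013-05-31 09:'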
As a result, the output file contains messages like these:
2013-05-31 09:14:59.573<SEP>../rac2/cssd/ocssd.log<SEP> [ CSSD][1111411008]clssgmRPCBroadcast: rpc(0x2a1002a), status(0), sendcount(0), filtered by specific properties:
2013-05-31 09:14:59.573<SEP>../rac2/cssd/ocssd.log<SEP> [ CSSD][1111411008]clssgmFreeRPCIndex: freeing rpc 673
2013-05-31 09:14:59.573<SEP>../rac2/cssd/ocssd.log<SEP> [ CSSD][1111411008]clssgmDeleteGrock: (0x7f3f38637990) grock(EVMDMAIN2) deleted
2013-05-31 09:14:59.576<SEP>../rac1/crfmond/crfmond.log<SEP> [ COMMCRS][1105488192]clsc_connect: (0x7f23b40531f0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_LOGD))
2013-05-31 09:14:59.576<SEP>../rac1/crfmond/crfmond.log<SEP> [ default][1083722048]clsdmc_send error code: -7
2013-05-31 09:14:59.576<SEP>../rac1/crfmond/crfmond.log<SEP> [ default][1083722048]Mond is being shutdown by default
2013-05-31 09:14:59.695<SEP>../rac1/agent/ohasd/orarootagent_root/orarootagent_root.l01<SEP> [ USRTHRD][1102350656] {0:0:2} Thread:[DaemonCheck:crf]Daemon Select Thread exiting
2013-05-31 09:14:59.695<SEP>../rac1/agent/ohasd/orarootagent_root/orarootagent_root.l01<SEP> [ USRTHRD][1102350656] {0:0:2} Thread:[DaemonCheck:crf]Skipping Agent Initiated a check action
2013-05-31 09:14:59.695<SEP>../rac1/agent/ohasd/orarootagent_root/orarootagent_root.l01<SEP> [ USRTHRD][1102350656] {0:0:2} Thread:[DaemonCheck:crf]isRunning is reset to false here
2013-05-31 09:15:00.270<SEP>../rac2/ctssd/octssd.l01<SEP> [ CTSS][1096657216]sclsctss_gvss1: NTP default config file found
2013-05-31 09:15:00.270<SEP>../rac2/ctssd/octssd.l01<SEP> [ CTSS][1096657216]sclsctss_gvss8: Return [0] and NTP status [2].
2013-05-31 09:15:00.270<SEP>../rac2/ctssd/octssd.l01<SEP> [ CTSS][1096657216]ctss_check_vendor_sw: Vendor time sync software is detected. status [2].
2013-05-31 09:15:00.314<SEP>../rac2/ctssd/octssd.l01<SEP> [ CTSS][1109387584]ctsselect_vermon7: Expecting clssgsevGRPPRIV event. Ignoring 2 event.
2013-05-31 09:15:00.314<SEP>../rac2/ctssd/octssd.l01<SEP> [ CTSS][1109387584]ctsselect_vermon7: Expecting clssgsevGRPPRIV event. Ignoring 2 event.
2013-05-31 09:15:00.314<SEP>../rac2/ctssd/octssd.l01<SEP> [ CTSS][1109387584]ctsselect_vermon7: Expecting clssgsevGRPPRIV event. Ignoring 3 event.
2013-05-31 09:15:00.314<SEP>../rac2/ctssd/octssd.l01<SEP> [ CTSS][1109387584]ctsselect_vermon7: Expecting clssgsevGRPPRIV event. Ignoring 1 event.
2013-05-31 09:15:00.345<SEP>../rac2/gipcd/gipcd.l04<SEP> [ CLSINET][1093364032] Returning NETDATA: 1 interfaces
2013-05-31 09:15:00.345<SEP>../rac2/gipcd/gipcd.l04<SEP> [ CLSINET][1093364032] # 0 Interface 'eth2',ip='10.10.10.12',mac='08-00-27-92-f1-98',mask='255.255.255.0',net='10.10.10.0',use='cluster_interconnect'
2013-05-31 09:15:00.354<SEP>../rac2/agent/ohasd/orarootagent_root/orarootagent_root.l01<SEP> [ora.crf][1089526080] {0:0:613} [stop] (:CLSN00108:) clsn_agent::stop }
2013-05-31 09:15:00.354<SEP>../rac2/agent/ohasd/orarootagent_root/orarootagent_root.l01<SEP> [ AGFW][1089526080] {0:0:613} Command: stop for resource: ora.crf 1 1 completed with status: SUCCESS
Now, you can analyze the output file directly, but I have also developed a small Java app for this. It is available at the same URL: https://app.box.com/s/xi3h7bpssrqtkp55q43c.
To launch it, you need Java 1.7.
This app presents the content of the output file in a friendlier interface; messages coming from the same file share the same color:
[oracle@rac1 aggregator]$ java -jar Aggr_visualisator.jar
Update:
- In v0.2, I implemented a filter function; the filter applies to the “Message” column.
- I also implemented colorization for the “Message” column:
- if the term “error” is found, the cell is colored red
- if the term “warn” is found, the cell is colored orange
- if the term “fail” is found, the cell is colored yellow
Two snapshots of this version are shown below 😉
- Filtering messages (based on a regexp)
If you have any suggestions to improve these tools, send me an email or leave a comment and I will study them.
I hope this will help you 😉
This is truly inspirational…
Very useful tool !!
This tool is great. The only issue, in my experience, is poor memory management by some tools. I use MobaXTerm (with a Cygwin base) and I often hit the following error when there is too much data:
Out of memory during request for XXXX bytes, total sbrk()
Based on a PerlMonks article, I had changed the Cygwin heap size by running:
regtool -i set /HKLM/Software/Cygnus\ Solutions/Cygwin/heap_chunk_in_mb 8192
But that was not enough. I also had to install the “rebase” package, which provides the “peflags” executable, and run:
peflags --cygwin-heap /bin/perl
peflags --cygwin-heap=8192 /bin/perl
This works around the initial limit.
Be careful though: in some cases it ends with a segmentation fault (I have not had time to investigate why yet, but I suspect it is again a data-volume problem).
I also ran into these problems with Cygwin, but not with the Perl that ships with Linux.
Thanks Yoann for the info… 🙂
Hi,
Does it also work for RAC 12.1?
Thanks
Ben
I didn’t test it, but it should work if you use the CRS log files that are now located in the ADR.
Feel free to test and give your feedback.