wtorek, 19 lutego 2008

Oracle RAC 11g - very first impressions

After few hours of observation ( in non production environment ) i decided to write down my remarks.

In plus :
#1 - It is possible to make rolling update from CRS 10.2.0.3, great!
#2 - We install CRS 11g without any problems known for earlier versions
#3 - There is no need to manualy reallocate cluster services after node crash or just node maintanance process.

In minus:
#1 - Not working with Quest Spotlight.

poniedziałek, 18 lutego 2008

Hitting bugs while installing Oracle CRS 10g on Solaris

While installing CRS 10.2.0.2 on Solaris 10 ( and probably other operating systems ) installation process may fail, reporting you a problem with setting up VIP adresses.
Why? Let's take a look..

At the and of CRS installation, OUI will dictate you to run two shell scripts as privileged user. The second one ( root.sh ) calls vipca to make your public addres accessible for database/cluster clients. Before configuring it as an alias for primary interface, vipca checks propriety of all given parameters.
If you decided to configure VIP as set of non-routable IP addreses ( RFC1918 ) from :

10.0.0.0 - 10.255.255.255 (10/8 prefix)
172.16.0.0 - 172.31.255.255 (172.16/12 prefix)
192.168.0.0 - 192.168.255.255 (192.168/16 prefix)

subnets - vipca will fail. Why? Is it a bug or just fussing "feature"? I don't know.. but in 10.2.0.3 Oracle Corporation decided to modify vipca not to check propriety of VIP IPs.

So, how to deal it?
The simpliest way is to install 10.2.0.2 CRS without setting up VIP ( force OK in OUI ), patch CRS software to 10.2.0.3 and finally run vipca by your own.

$CRS_HOME/bin/vipca -silent -nodelist $CRS_NODELIST -nodevips $CRS_NODEVIPS

Unluckily, in CRS 10.2.0.3 for Solaris 64-bit there is a bug in ONS makeing your cluster unstable.
I recommend you to install 5749953 patch asap. You can download it from metalink and follow up README.txt instructions.

Oracle CRS 10g running on Sun Solaris 10

Instead of rewriting cofiguration instructions, i'd like to shed some light on potential problems you may hit during CRS installation process on SUN Solaris 10 64-bit operating system.

Let's start from the begining..

Folowing documentation provided by Oracle Corporation, you set up all preinstallation parameters required to set up Clusterware software ( kernel parameters, environment variables and many other mentioned on manual pages) and you think you're done with operating system preparation? Unfortunately it is barely truth..

Oracle installer will search for SSC/SCP software in hardcoded locations, which are simply not avaliable in standard Sun Solaris 10 installation.
You have to create symlinks:
/usr/local/bin/ssh -> /usr/bin/ssh
/usr/local/bin/scp -> /usr/bin/scp

Than, if your system is 64-bit capable, you have to "tell" Oracle Universal Installer to prepare 64-bit libraries by setting up:

crle -64 -u -l $CRS_HOME/lib

Otherwise you may hit the problem with libclntsh.so. Of course it's corrigible, and it's not much complicated:

export ORACLE_HOME="/oracle/crs/"
cd $ORACLE_HOME/bin
Edit ./genclntsh file and replace:
LD="ld -m -Dfiles -i -G -z text -L${ORACLE_HOME}/${LIB}"
with:
LD="ld -m -Bsymbolic -i -G -z text -L${ORACLE_HOME}/${LIB}"

Run "relink all" again.
./genclntsh

/bin/clscfg -install -nn $CRS_NODE_NAME_LIST -pn $CRS_PRIVATE_NAME_LIST \
$HOSTNAMES_ARGS -o $CH -c $CRS_CLUSTER_NAME -l "$CRS_LANGUAGE_ID" \
-q $CRS_VOTING_DISKS $CLSCFG_MISCNT

Eg.
./clscfg -install -nn node1,1,node2,2,node3,3 -pn node1-priv,1,node2-priv,2,node3-priv,3 -hn node1,1,node2,2,node3,3 -o /oracle/crs/ -c mycluster -l "'AMERICAN_AMERICA.WE8ISO8859P1'" -q /dev/rdsk/c4t600508B400105E660001000003960000d0s6 -force

After above operations, you can check your RAC health.

root@node1:/oracle/crs/bin# ./olsnodes
node1
node2
node3

root@node1:/oracle/crs/bin# ./crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy

Juppie! ;)

The last step is to configure VIP interfaces:
$CRS_HOME/bin/vipca -silent -nodelist $CRS_NODELIST -nodevips $CRS_NODEVIPS

sobota, 22 grudnia 2007

Uninstalling Oracle CRS software?

Something went wrong ( errors during installation process? ) and you have to deinstall Oracle CRS related software? How?
Yes.., of course! You can use Universall installer to stand up the trouble.. but unfortunetly, in most circumstances it's simply unhelpful.
So, is there any possibility to completly clean up crs software from your system?

Sure, and all you have to do is ( example based on Solaris 10 OS, but you can easily port it to all kind of Unix/Linux based database systems ) to:

Stop node related applications:
srvctl stop nodeapps -n node_name

Remove files:
/etc/init.d/init.cssd
/etc/init.d/init.crs
/etc/init.d/init.crsd
/etc/init.d/init.evmd
/etc/rc3.d/K96init.crs
/etc/rc3.d/S96init.crs
/etc/inittab.crs

And directories:
/var/opt/oracle/scls_scr
/var/opt/oracle/oprocd

Replace inittab:
/etc/inittab.orig => /etc/inittab

Kill all of CRS daemons:
You can find them in processlist using
ps -ef | egrep 'crs|evm|css'
command.

Remove directories:
/var/tmp/.oracle
/tmp/.oracle
/oracle/crs

And finally, clear data on devices used for OCR ( Oracle Cluster Registry ) and VD ( Voting Disk ).

# dd if=/dev/zero of=/dev/rdsk/XXXXXXXXX count=10240
# dd if=/dev/zero of=/dev/rdsk/XXXXXXXXX count=10240

As you can see on presented example, there is no need overwrite whole devices.

środa, 14 listopada 2007

Lagrange polynomial interpolation..

.. used for hacking online banking systems?
Have you ever think about appointing polynomial to discover singe-use passwords generating model?
Yesterday, my friend run me against it, bothering my head about it. We were discussing problem for couple of minutes, trying to find the best mathematical algorithm..
I think we did it, but it seems to be barely possible ( math. NP-hard problem ) to do it without any aditional conditions ( knowledge of general principles used by banks to pregenerate single-use passwords lists ).
I know some directions, that can simplify ordeal.. but i'm passing it over in silence, attempting to produce resolution by all possible, but non privileged means.


If you bethink onself to do it, you are kindly welcome to renew a debate here.

środa, 15 sierpnia 2007

Oracle 10g bug, and its resolution story.

During strong reorganizations on our Oracle based datawarehouse we found something interesting. Simple DML queries running on new configuration were causing 100% of CPU usage ( IBM Power6, running AIX 5.3 OS ). We have newer seen that kind of bottlenecks while operating on old, less powerfull hardware and OS. After several hours of investigation, we decided to create TAR..

Next few weeks elapsed on excuseing the problem by metalink, generating hundereds of tracefiles, exploring execution plans and participating in technical and argle-bargle phone calls..
Finally, the service request was qualified as bug with severity #1. I had a white hope, it will be ravine providing us to terminate performance problems. Illusory feeling.. ;) We had to generate more ( still the same! ) dumps, backtraces and paper work. We were closer to the reason of bottlenecks, but still in "the middle of nowhere". We knew, that 32k blocksize used by us as primary organization unit may produce troubles with database stability.
Following Oracle Corporation:
"The theory is understood for the infinite loop, where space layer presents a
corrupted block with high AVSP(more than the blocksize) with a huge blocksize
.
Normally datalayer marks the rejected block to full and returns to space
layer so that this block is not presented anymore.But in this corrupted block
the AVSP stands at higher size, Datalayer thinks this could be a candidate in
the future and does not mark it full, so space layer keeps presenting the
block to try leading to a loop."

They suggested us to change the block size to 8k to solve the problem! It was such a terrible news, we had to recreate whole ( as i wrote at the begining - datawarehouse ~ 10TB ) database. Yes, I know.. sounds crazy, but what we had to do? We worked under company pressure, we couldn't spent more time dealing with unexpected issues.
We created new LPAR with fresh database installed, we put into some data from main database and run sample queries.

Gee! Imagine our wonderment revealing the same issue on 8k block! We updated service report, created dumps, traces.. made aditional calls to Oracle Corporation and.. we were still nowhere :(
After aditional two weeks, Oracle answered:
"We reproduced situation in our test environment, developers are working on patch preparation. It may take 2-4 weeks to complete."


No comment needed..
We were so close to migrate whole database to another dbms, but..
- which one? ( remember, few TB of data.. )
- we had no time
- there was nothing we can say to our business unit. They paid a lot of money for Enterprise licences, for support..

Thank god we hit the bug on non OLTP system. We could "separate" problematic aggregation procedures to old system, and run them for example over dblinks.
At long last, Oracle prepared patch for us, but only for 8k block. They told us, we have to weit for next 4 weeks to get 32k block patch. It was not acceptable.
We decided to move to 8k block, and made whole database recreation.

We were save. Finally!

By the way.. for now, we have 8k block database with all disadvantages of using it.
( Yes, i know, we can mix blocksizes in single instance.. but it's next apprehend reorganization ).

piątek, 10 sierpnia 2007

sluggish ROWNUM statement in Oracle 10g

It seems, that in some circumstances it might be faster and more efficient to use SQL queries without rownum limitation, even if you're asking for hundreds of records. We hit the exigence, where simple SELECT statement using rownum clause may paralyze whole database system by hanging up CPU ( 100% usage ).
What is database doing? Database optimizer uses another plans for both queries, unfortunately - plan for limited query causes very expensive nested loops.

Why did database spent so much time to serve few records from indexed table? I dunno.. we're working on it.. and Oracle corporation too..

Information from Oracle Corporation in reply to service request - "It is not a bug. It is an expected behavior if the nested loop order is chosen wrongly."

In my opinion it is an unexpected behavior of preparing and choosing completely wrong execution plan by database engine.

To be continued..

Finally! I'm close to the answer. Database choosen plan was inadequate, and causes not anticipate results. The only way to deal with it is to set hidden init parameter _optimizer_rownum_pred_based to false. Until used, you can't show its value using "show parameter optimizer". Parameter is hidden, not well documented and using it
is not permissive by Oracle.
With this parameter set to false, database will use first K rows plan for getting the first K rows to be returned, and plan for both queries ( with and without rownum ) looks the same. Database processes them the same way, applying stop count condition after K rows in rownum limited query.
Of course there is possibility not to change optimizer parameter. You can use hints in SQL queries, but this solution may not come true for all the situations, specially for OLTP systems and queries based on users requests.