Wednesday, August 15, 2007

An Oracle 10g bug, and the story of its resolution.

During a major reorganization of our Oracle-based data warehouse we found something interesting. Simple DML queries running on the new configuration were causing 100% CPU usage (IBM Power6, running AIX 5.3). We had never seen that kind of bottleneck while operating on the old, less powerful hardware and OS. After several hours of investigation, we decided to open a TAR..

The next few weeks went by with Metalink making excuses about the problem, with us generating hundreds of trace files, exploring execution plans and taking part in technical and argle-bargle phone calls..
Finally, the service request was qualified as a bug with severity #1. I had high hopes that it would be the breakthrough letting us put an end to the performance problems. An illusory feeling.. ;) We had to generate more (still the same!) dumps, backtraces and paperwork. We were getting closer to the cause of the bottlenecks, but still in "the middle of nowhere". We learned that the 32k block size we used as the primary organization unit could cause trouble with database stability.
According to Oracle Corporation:
"The theory is understood for the infinite loop, where space layer presents a
corrupted block with high AVSP(more than the blocksize) with a huge blocksize
.
Normally datalayer marks the rejected block to full and returns to space
layer so that this block is not presented anymore.But in this corrupted block
the AVSP stands at higher size, Datalayer thinks this could be a candidate in
the future and does not mark it full, so space layer keeps presenting the
block to try leading to a loop."

They suggested that we change the block size to 8k to solve the problem! That was terrible news: it meant recreating the whole database (as I wrote at the beginning, a data warehouse of ~10TB). Yes, I know.. it sounds crazy, but what else could we do? We were working under company pressure and couldn't spend more time dealing with unexpected issues.
We created a new LPAR with a fresh database installed, loaded some data from the main database into it and ran sample queries.

Gee! Imagine our astonishment at hitting the same issue on the 8k block! We updated the service request, created dumps, traces.. made additional calls to Oracle Corporation and.. we were still nowhere :(
After an additional two weeks, Oracle answered:
"We reproduced situation in our test environment, developers are working on patch preparation. It may take 2-4 weeks to complete."


No comment needed..
We were very close to migrating the whole database to another DBMS, but..
- which one? (remember, a few TB of data..)
- we had no time
- there was nothing we could say to our business unit. They had paid a lot of money for the Enterprise licences and for support..

Thank god we hit the bug on a non-OLTP system. We could "separate" the problematic aggregation procedures out to the old system and run them, for example, over dblinks.
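Roughly speaking, running an aggregation over a dblink boils down to something like the sketch below; the link name, credentials, TNS alias and table names are made up for illustration:

    -- One-time setup: a database link pointing at the old data warehouse
    -- (old_dwh, dwh_user and OLD_DWH_TNS are placeholders).
    CREATE DATABASE LINK old_dwh
      CONNECT TO dwh_user IDENTIFIED BY some_password
      USING 'OLD_DWH_TNS';

    -- Let the old system do the heavy aggregation and pull only the result back.
    INSERT INTO sales_agg (region_id, total_amount)
      SELECT region_id, SUM(amount)
      FROM   sales@old_dwh
      GROUP  BY region_id;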
At long last, Oracle prepared a patch for us, but only for the 8k block size. They told us we would have to wait another 4 weeks for the 32k block patch. That was not acceptable.
We decided to move to the 8k block size and recreated the whole database.

We were safe. Finally!

By the way.. for now, we have an 8k block database with all the disadvantages of using it.
(Yes, I know, we can mix block sizes in a single instance.. but that means yet another dreaded reorganization; a sketch follows below).
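
For completeness, mixing block sizes in a single 10g instance comes down to configuring a buffer cache for the non-default size and creating a tablespace with that block size. A minimal sketch with made-up names, sizes and paths:

    -- A buffer cache for the non-default block size has to exist first.
    ALTER SYSTEM SET db_32k_cache_size = 256M;

    -- Then a tablespace using that block size can be created
    -- (datafile path and sizes are placeholders).
    CREATE TABLESPACE dwh_32k
      DATAFILE '/oradata/dwh/dwh_32k_01.dbf' SIZE 10G
      BLOCKSIZE 32K;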
