środa, 15 sierpnia 2007

Oracle 10g bug, and its resolution story.

During strong reorganizations on our Oracle based datawarehouse we found something interesting. Simple DML queries running on new configuration were causing 100% of CPU usage ( IBM Power6, running AIX 5.3 OS ). We have newer seen that kind of bottlenecks while operating on old, less powerfull hardware and OS. After several hours of investigation, we decided to create TAR..

Next few weeks elapsed on excuseing the problem by metalink, generating hundereds of tracefiles, exploring execution plans and participating in technical and argle-bargle phone calls..
Finally, the service request was qualified as bug with severity #1. I had a white hope, it will be ravine providing us to terminate performance problems. Illusory feeling.. ;) We had to generate more ( still the same! ) dumps, backtraces and paper work. We were closer to the reason of bottlenecks, but still in "the middle of nowhere". We knew, that 32k blocksize used by us as primary organization unit may produce troubles with database stability.
Following Oracle Corporation:
"The theory is understood for the infinite loop, where space layer presents a
corrupted block with high AVSP(more than the blocksize) with a huge blocksize
.
Normally datalayer marks the rejected block to full and returns to space
layer so that this block is not presented anymore.But in this corrupted block
the AVSP stands at higher size, Datalayer thinks this could be a candidate in
the future and does not mark it full, so space layer keeps presenting the
block to try leading to a loop."

They suggested us to change the block size to 8k to solve the problem! It was such a terrible news, we had to recreate whole ( as i wrote at the begining - datawarehouse ~ 10TB ) database. Yes, I know.. sounds crazy, but what we had to do? We worked under company pressure, we couldn't spent more time dealing with unexpected issues.
We created new LPAR with fresh database installed, we put into some data from main database and run sample queries.

Gee! Imagine our wonderment revealing the same issue on 8k block! We updated service report, created dumps, traces.. made aditional calls to Oracle Corporation and.. we were still nowhere :(
After aditional two weeks, Oracle answered:
"We reproduced situation in our test environment, developers are working on patch preparation. It may take 2-4 weeks to complete."


No comment needed..
We were so close to migrate whole database to another dbms, but..
- which one? ( remember, few TB of data.. )
- we had no time
- there was nothing we can say to our business unit. They paid a lot of money for Enterprise licences, for support..

Thank god we hit the bug on non OLTP system. We could "separate" problematic aggregation procedures to old system, and run them for example over dblinks.
At long last, Oracle prepared patch for us, but only for 8k block. They told us, we have to weit for next 4 weeks to get 32k block patch. It was not acceptable.
We decided to move to 8k block, and made whole database recreation.

We were save. Finally!

By the way.. for now, we have 8k block database with all disadvantages of using it.
( Yes, i know, we can mix blocksizes in single instance.. but it's next apprehend reorganization ).

piątek, 10 sierpnia 2007

sluggish ROWNUM statement in Oracle 10g

It seems, that in some circumstances it might be faster and more efficient to use SQL queries without rownum limitation, even if you're asking for hundreds of records. We hit the exigence, where simple SELECT statement using rownum clause may paralyze whole database system by hanging up CPU ( 100% usage ).
What is database doing? Database optimizer uses another plans for both queries, unfortunately - plan for limited query causes very expensive nested loops.

Why did database spent so much time to serve few records from indexed table? I dunno.. we're working on it.. and Oracle corporation too..

Information from Oracle Corporation in reply to service request - "It is not a bug. It is an expected behavior if the nested loop order is chosen wrongly."

In my opinion it is an unexpected behavior of preparing and choosing completely wrong execution plan by database engine.

To be continued..

Finally! I'm close to the answer. Database choosen plan was inadequate, and causes not anticipate results. The only way to deal with it is to set hidden init parameter _optimizer_rownum_pred_based to false. Until used, you can't show its value using "show parameter optimizer". Parameter is hidden, not well documented and using it
is not permissive by Oracle.
With this parameter set to false, database will use first K rows plan for getting the first K rows to be returned, and plan for both queries ( with and without rownum ) looks the same. Database processes them the same way, applying stop count condition after K rows in rownum limited query.
Of course there is possibility not to change optimizer parameter. You can use hints in SQL queries, but this solution may not come true for all the situations, specially for OLTP systems and queries based on users requests.

Hello internet!

After several dozen hours at work.. I've screved up and decided to create this blog.
Why? Huh, don't know.. just like I still don't know the reason why my new voice activation system is not
working fine!!


Hello. Hello internet! Hello!!! Heellooo...