Oracle and Linux I/O Scheduler
Part 1
Following the paper on the block size, I decided to write something
more on the Linux I/O schedulers and their interaction with Oracle.
This paper involves a series of tests stressing Oracle with a TPC-C
workload while the Oracle DB relies on different Linux schedulers.
The purpose of the I/O scheduler is to sort and merge the I/O requests
in the I/O queues in order to increase efficiency and boost performance.
Using the /sys pseudo file system you can change and tune the I/O
scheduler for a given block device.
Each scheduler has its own directory tree representing its tuning
options.
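As a minimal sketch of how to inspect such a tree, the function below prints every tunable under a device's queue directory with its current value. The function name list_queue_tunables and its optional second argument (an alternative sysfs root, useful for trying it on a copy of the tree) are assumptions made for this illustration and are not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: list each file under <sysroot>/block/<dev>/queue
# together with its current value. <sysroot> defaults to /sys.
list_queue_tunables() {
    dev="$1"
    sysroot="${2:-/sys}"
    qdir="$sysroot/block/$dev/queue"
    if [ ! -d "$qdir" ]; then
        echo "no queue directory for $dev" >&2
        return 1
    fi
    find "$qdir" -type f | sort | while IFS= read -r f; do
        # Some entries may not be readable; report that instead of failing.
        val=$(cat "$f" 2>/dev/null) || val="(unreadable)"
        printf '%s = %s\n' "$f" "$val"
    done
}
```

For example, list_queue_tunables sdb shows the generic queue settings plus the iosched entries of whichever scheduler is currently active on sdb.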
There are four schedulers available at the moment:
The noop scheduler is a simple FIFO queue.
It provides only I/O merging, with no sorting. It is a good choice if
your application already sorts its I/O.
/sys/block/sdb/queue/scheduler
/sys/block/sdb/queue/max_sectors_kb
/sys/block/sdb/queue/max_hw_sectors_kb
/sys/block/sdb/queue/read_ahead_kb
/sys/block/sdb/queue/nr_requests
The deadline scheduler
uses a round-robin algorithm to minimize the latency of any I/O
request. It implements merging and sorting plus a deadline mechanism to
avoid starvation. It favors reads over writes.
/sys/block/sdc/queue/iosched/fifo_batch
/sys/block/sdc/queue/iosched/front_merges
/sys/block/sdc/queue/iosched/writes_starved
/sys/block/sdc/queue/iosched/write_expire
/sys/block/sdc/queue/iosched/read_expire
/sys/block/sdc/queue/scheduler
/sys/block/sdc/queue/max_sectors_kb
/sys/block/sdc/queue/max_hw_sectors_kb
/sys/block/sdc/queue/read_ahead_kb
/sys/block/sdc/queue/nr_requests
The anticipatory
scheduler tries to predict the future workload, delaying I/O in order
to merge requests and decrease the number of seeks. It implements
merging and sorting plus an algorithm to minimize disk head movements.
It is suggested for workstations and old hardware.
Its tree:
/sys/block/sdb/queue/iosched/write_batch_expire
/sys/block/sdb/queue/iosched/read_batch_expire
/sys/block/sdb/queue/iosched/antic_expire
/sys/block/sdb/queue/iosched/write_expire
/sys/block/sdb/queue/iosched/read_expire
/sys/block/sdb/queue/iosched/est_time
/sys/block/sdb/queue/scheduler
/sys/block/sdb/queue/max_sectors_kb
/sys/block/sdb/queue/max_hw_sectors_kb
/sys/block/sdb/queue/read_ahead_kb
/sys/block/sdb/queue/nr_requests
The cfq (Completely Fair Queuing) scheduler is
the default for SLES10 (and SLES9). It uses a round-robin algorithm and
tries to be fair, dividing the available I/O bandwidth among all I/O
requests.
It implements merging and sorting.
/sys/block/sdb/queue/iosched/max_depth
/sys/block/sdb/queue/iosched/slice_idle
/sys/block/sdb/queue/iosched/slice_async_rq
/sys/block/sdb/queue/iosched/slice_async
/sys/block/sdb/queue/iosched/slice_sync
/sys/block/sdb/queue/iosched/back_seek_penalty
/sys/block/sdb/queue/iosched/back_seek_max
/sys/block/sdb/queue/iosched/fifo_expire_async
/sys/block/sdb/queue/iosched/fifo_expire_sync
/sys/block/sdb/queue/iosched/queued
/sys/block/sdb/queue/iosched/quantum
/sys/block/sdb/queue/scheduler
/sys/block/sdb/queue/max_sectors_kb
/sys/block/sdb/queue/max_hw_sectors_kb
/sys/block/sdb/queue/read_ahead_kb
/sys/block/sdb/queue/nr_requests
The command:
# cat /sys/block/sdb/queue/scheduler
noop [anticipatory] deadline cfq
tells you which scheduler you are using: the active one is shown in
square brackets.
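When the check has to be scripted, the bracketed name can be extracted with sed. This is only a sketch: the function name active_scheduler is made up for this example, and it takes the path of the scheduler file as its argument.

```shell
#!/bin/sh
# Hypothetical helper: print the scheduler name that appears between
# square brackets in a scheduler file such as
# /sys/block/sdb/queue/scheduler.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}
```

With the output shown above, active_scheduler /sys/block/sdb/queue/scheduler would print anticipatory.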
On newer kernels you can change the scheduler without a reboot by
simply issuing:
# echo cfq > /sys/block/sdb/queue/scheduler
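A slightly more defensive variant first checks that the requested name is among those the kernel offers for that device. The sketch below assumes it runs as root; set_scheduler is a hypothetical name chosen for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: switch scheduler only if the requested name is
# listed in the scheduler file.
# $1: scheduler name, $2: path to the scheduler file.
set_scheduler() {
    want="$1"
    file="$2"
    # Drop the brackets around the active entry, put one name per line,
    # then look for an exact match.
    if ! sed 's/[][]//g' "$file" | tr ' ' '\n' | grep -qxF "$want"; then
        echo "scheduler '$want' is not available in $file" >&2
        return 1
    fi
    echo "$want" > "$file"
}
```

For example: set_scheduler deadline /sys/block/sdb/queue/scheduler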
The testing software:
The chosen tool is hammerora,
which generates a TPC-C workload trying to "hammer" Oracle as much as
possible. Definitely a good stress test.
With the latest version (1.26) I had scalability problems. The number of
transactions per minute (tpm) was low and I noticed lots of
'read by other session' among my DB wait events.
Investigating further I saw that the ITEM table (used by hammerora) was
growing and that a lot of table scans were performed on it.
I simply created an index with this DDL:
CREATE INDEX TPCC.ITEM_ID
ON TPCC.ITEM (I_ID)
INITRANS 255 MAXTRANS 255
TABLESPACE USERS
PCTFREE 60;
And the problem disappeared.
I also increased INITRANS to 255 on every index and table, trying to
increase the concurrency.
The result was a hundredfold difference in the number of tpm.
The initial tests used 10 virtual users.
My DB:
Oracle 10.2.0.2 (10g Release 2 with the first patchset).
SQL> show sga
Total System Global Area  838860800 bytes
Fixed Size                  1263572 bytes
Variable Size              83888172 bytes
Database Buffers          746586112 bytes
Redo Buffers                7122944 bytes
SQL> show parameter sga_target
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
sga_target                           big integer 800M
SQL> show parameter pga_aggregate_target
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
pga_aggregate_target                 big integer 103M
Asynchronous I/O is enabled while direct I/O is disabled (to be sure
the features of the I/O scheduler are actually used).
I configured the AWR to take a snapshot every 10 minutes.
I'm going to measure the results using the reports created with the AWR
(similar to the old statspack).
Disk Layout:
For the first test all the database files are on the same disk: sdb.
They are split across two reiser file systems: one for the datafiles,
with a 4 KB block size, and one for the redo logs, with a 512-byte
block size.
Hardware:
IBM x335
2 CPU Xeon(TM) 2.00GHz
1.5 GB RAM
6 disks of 36 GB in three different RAID 1 (/dev/sda, /dev/sdb, /dev/sdc)
Operating system:
SUSE Linux Enterprise Server 10 beta8.
I chose this version since it is going to be certified soon with
Oracle and because it is the first SUSE Enterprise release where the
I/O scheduler can be changed on the fly.
This last characteristic is really important.
With a simple command like:
echo deadline > /sys/block/sdb/queue/scheduler
the scheduler is changed.
On older SUSE versions like SLES9 the I/O scheduler can be changed only
at boot time with the parameter elevator=[name of the scheduler] where
the name can be: noop, deadline, as, cfq.
Unfortunately with this method you have one scheduler for all the block
devices of the system.
It is not possible to combine several I/O schedulers, so the tuning
capabilities are limited.
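To see whether a running system was booted with such a parameter, the kernel command line in /proc/cmdline can be inspected. The function below is a small sketch; the name boot_elevator and the optional file argument (handy for trying it on a saved command line) are assumptions for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the value of any elevator= option found on
# the kernel command line. Prints nothing if the option is absent.
# $1: file holding the command line (default /proc/cmdline).
boot_elevator() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | sed -n 's/^elevator=//p'
}
```

On a system booted with elevator=deadline appended to the kernel line of the boot loader configuration, boot_elevator would print deadline.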
Testing methodology:
With 10 virtual users a constant workload is kept on the database.
Every 30 minutes the scheduler is changed. The default parameters are
kept in place.
After three cycles through all the I/O schedulers the AWR snapshots are
used to generate reports and to compare them.
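The rotation described above could be automated along these lines. This is a sketch under assumptions, not the procedure actually used for the tests: the function name cycle_schedulers, the order of the schedulers and the timestamped log line are all choices made for this illustration, and the loop must run as root.

```shell
#!/bin/sh
# Hypothetical helper: rotate through the four schedulers.
# $1: scheduler file, $2: seconds to keep each scheduler, $3: cycles.
cycle_schedulers() {
    file="$1"
    hold="$2"
    cycles="$3"
    i=0
    while [ "$i" -lt "$cycles" ]; do
        for s in noop anticipatory deadline cfq; do
            echo "$s" > "$file" || return 1
            # Log the switch time so that the AWR snapshots can later
            # be matched to the scheduler that was active.
            echo "$(date '+%Y-%m-%d %H:%M:%S') $s"
            sleep "$hold"
        done
        i=$((i + 1))
    done
}
```

Three cycles of 30 minutes per scheduler would be: cycle_schedulers /sys/block/sdb/queue/scheduler 1800 3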
First results:
For each scheduler you can see an AWR report by following the links:
| scheduler    | transactions per second | log file sync % | user calls | physical reads | physical writes |
| noop         | 74.61                   | 3.2             | 350.39     | 153.69         | 117.57          |
| anticipatory | 30.12                   | 44.4            | 140.62     | 67.31          | 53.94           |
| deadline     | 77.74                   | 3.1             | 362.25     | 151.96         | 118.71          |
| cfq          | 23.13                   | 36.8            | 107.97     | 51.58          | 40.22           |
The winner is the deadline scheduler. It is interesting to see that cfq
and anticipatory have the lowest number of transactions per second
(about 23 and 30 against more than 74 for deadline and noop).
This is probably due to the high 'log file sync' of cfq and
anticipatory (36.8% and 44.4% against roughly 3% for the other two).
They are the clear losers on the redo log file writes!
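To put the gap in numbers, the figures from the table can be divided directly. The helper below is only a convenience for that arithmetic; the name ratio is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print $1 / $2 with two decimals.
ratio() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}
```

ratio 77.74 23.13 prints 3.36 (deadline against cfq, transactions per second), while ratio 77.74 74.61 prints 1.04 (deadline against noop).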
This is worrying since cfq is the
default scheduler of the SLES distribution (and RedHat AS).
If you are going to implement an
OLTP system then you had better test your application using different
schedulers. Maybe the default is not right for you.
Deadline seems the best scheduler for this kind of workload, but it
wins only narrowly against noop.
It would be interesting to separate the redo logs from the datafiles on
different block devices, then set the deadline scheduler on the redo
log device and retest, switching the scheduler only on the datafile
device (you can set a different scheduler for each block device).
This second test is going to be performed here.
Contact information:
fabrizio.magni _at_ gmail.com