Total CPU utilization for a workload: SUM over k of RATE(Q_k) · COST_CPU(Q_k),
the rate at which each query Q_k runs times its CPU cost.
Similarly for I/O. Tells us how many disks to buy, how powerful a CPU.
(Costs go up approximately linearly with CPU power within small ranges).
Ideally, the weights W1 and W2 used to calculate COST(PLAN) from I/O and
CPU costs will reflect actual costs of equipment.
Good DBA tries to estimate workload in advance to make purchases, have
equipment ready for a new application.
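A minimal Python sketch of this cost arithmetic (function and variable names are mine; the workload numbers are made up for illustration):

```python
def plan_cost(cost_io, cost_cpu, w1=1.0, w2=1.0):
    # COST(PLAN) = W1 * COST_IO + W2 * COST_CPU; the weights W1, W2
    # should reflect actual equipment costs, as the notes say.
    return w1 * cost_io + w2 * cost_cpu

def workload_cpu_demand(queries):
    # Sum of RATE(Qk) * COST_CPU(Qk) over all queries Qk: CPU seconds
    # demanded per second of wall time. The same sum over COST_IO(Qk)
    # sizes the disk purchase.
    return sum(rate * cost for rate, cost in queries)

# Hypothetical workload: 5 queries/sec at 0.02 CPU-sec each, plus
# 1 query/sec at 0.3 CPU-sec.
cpu_demand = workload_cpu_demand([(5, 0.02), (1, 0.3)])
```

If cpu_demand exceeds 1.0 per CPU, the workload needs more processors.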
*** Note that one other major expense is response time. Could have poor
response time (1-2 minutes) even on lightly loaded inexpensive system.
This is because many queries need to perform a LOT of I/O, and some
commercial database systems have no parallelism: all I/Os in sequence.
But long response time also has a cost, in that it wastes user time. Pay
workers for wasted time. Employees quit from frustration and others must
be trained. Vendors are trying hard to reduce response times.
Usually there exist parallel versions of these systems as well; worth it if
extremely heavy I/O and response time is a problem, while number of queries
running at once is small compared to number of disks.
Note that in plans that follow, we will not try to estimate CPU. Too hard.
Assume choose plan with best I/O and try to estimate that. Often total cost
is proportional to I/O since each I/O entails extra CPU cost as well.
Statistics. Need to gather, put in System tables. System does not auto-
matically gather them with table load, index create, updates to table.
In DB2, use Utility called RUNSTATS. Fig. 9.3, pg. 538.
RUNSTATS ON TABLE username.tablename
[WITH DISTRIBUTION [AND DETAILED] {INDEXES ALL | INDEX indexname}]
[other clauses not covered or deferred]
e.g.: runstats on table poneil.customers;
System learns how many rows, how many data pages, stuff about indexes,
etc., placed in catalog tables.
ORACLE uses ANALYZE command. Fig. 9.4, pg. 538.
ANALYZE {INDEX | TABLE | CLUSTER}
[schema.] {indexname | tablename | clustername}
{COMPUTE STATISTICS | other alternatives not covered}
{FOR TABLE | FOR ALL [INDEXED] COLUMNS [SIZE n]
| other alternatives not covered}
Will see how statistics kept in catalog tables in DB2 a bit later.
Retrieving the Query Plans. In DB2 and ORACLE, perform SQL statement. In
DB2:
EXPLAIN PLAN [SET QUERYNO = n] [SET QUERYTAG = 'string'] FOR
explainable-sql-statement;
For example:
explain plan set queryno = 1000 for
select * from customers
where city = 'Boston' and discnt between 12 and 14;
The Explain Plan statement puts rows in a "plan_table" to represent individual
procedural steps in a query plan. Can get rows back by:
select * from plan_table where queryno = 1000;
Recall that a Query Plan is a sequence of procedural access steps that carry
out a program to answer the query. Steps are peculiar to the DBMS.
From one DBMS to another, difference in steps used is like difference
between programming languages. Can't learn two languages at once.
We will stick to a specific DBMS in what follows, MVS DB2, so we can end up
with the informative benchmark we had in the first edition.
But we will have occasional references to ORACLE, DB2 UDB. Here is the
ORACLE Explain Plan syntax.
EXPLAIN PLAN [SET STATEMENT_ID = 'text-identifier'] [INTO
[schema.]tablename]
FOR explainable-sql-statement;
This inserts a sequence of rows into a user-created DB2/ORACLE table known
as PLAN_TABLE, one row for each access step. To learn more about this, see
ORACLE8 documentation named in text.
Need to understand what basic procedural access steps ARE in the particular
product you're working with.
The set of steps allowed is the "bag of tricks" the query optimizer can use.
Think of these procedural steps as the "instructions" a compiler can use to
create "object code" in compiling a higher-level request.
A system that has a smaller bag of tricks is likely to have less efficient ac-
cess plans for some queries.
MVS DB2 (and the architecturally allied DB2 UDB) have a wide range of tricks,
but not bitmap indexing or hashing capability.
Still, very nice capabilities for range search queries, and probably the most
sophisticated query optimizer.
Basic procedural steps covered in the next few Sections (thumbnail):
Table Scan: Look through all rows of table
Unique Index Scan: Retrieve row through unique index
Unclustered Matching Index Scan: Retrieve multiple rows through a non-unique index; rows not in same order
Clustered Matching Index Scan: Retrieve multiple rows through a non-unique clustered index
Index-Only Scan: Query answered in index, not rows
Note that the steps we have listed access all the rows restricted by the
WHERE clause in some single table query.
Need two tables in the FROM clause to require two steps of this kind and
Joins come later. A multi-step plan for a single table query will also be
covered later.
Such a multi-step plan on a single table is one that combines multiple indexes
to retrieve data. Up to then, only one index per table can be used.
9.2 Table Space Scans and I/O
Single step. The plan table (plan_table) will have a column ACCESSTYPE with
value R (ACCESSTYPE = R for short).
Example 9.2.1. Table Space Scan Step. Look through all rows in table
to answer query, maybe because there is no index that will help.
Assume in DB2 an employees table with 200,000 rows, each row of 200
bytes, each 4 KByte page just 70% full. Thus 2800 usable bytes per page, 14
rows/pg. Need CEIL(200,000/14) = 14,286 pages.
Consider the query:
select eid, ename from employees where socsecno = 113353179;
If there is no index on socsecno, only way to answer query is by reading in all
rows of table. (Stupid not to have an index if this query occurs with any
frequency at all!)
In Table Space Scan, might not stop when find proper row, since statistics
for a non-indexed column might not know socsecno is unique.
Therefore have to read all pages in table in from disk, and COST_IO(PLAN)
= 14,286 I/Os. Does this mean 14,286 random I/Os? Maybe not.
But if we assume random I/O, then at 80 I/Os per second, need about
14286/80 = 178.6 seconds, a bit under three minutes. This dominates CPU
by a large factor and would predict elapsed time quite well.
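The page-count and elapsed-time arithmetic of Example 9.2.1 can be sketched in Python (numbers and the 80 I/Os per second rule come from these notes; names are mine):

```python
import math

PAGE_BYTES = 4000                          # the notes' 4 KByte page, 70% full
usable = int(PAGE_BYTES * 0.70)            # 2800 usable bytes per page
rows_per_page = usable // 200              # 200-byte rows -> 14 rows/page
pages = math.ceil(200_000 / rows_per_page) # 14,286 data pages

seconds = pages / 80                       # random I/O at 80/sec: ~178.6 s
```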
Homework 4, non-dotted Exercises through Exercise 9.9. This is due when we
finish Section 9.6.
Assumptions about I/O
We are now going to talk about I/O assumptions a bit. First, there might be
PARALLELISM in performing random I/Os from disk.
The pages of a table might be striped across several different disks, with
the database system making requests in parallel for a single query to keep
all the disk arms busy. See Figure 9.5, pg. 542
When the first edition came out, it was rare for most DBMS systems to
make multiple requests at once (a form of parallelism), now it's common.
DB2 has a special form of sequential prefetch now where it stripes 32 pages
at a time on multiple disks, requests them all at once.
While parallelism speeds up TOTAL I/O per second (especially if there's only
one user process running), it doesn't really save any RESOURCE COST.
If it takes 12.5 ms (0.0125 seconds) to do a random I/O, doesn't save
resources to do 10 random I/Os at once on 10 different disks.
Still have to make all the same disk arm movements, cost to rent the disk
arms is the same if there is parallelism: just spend more per second.
Will speed things up if there are few queries running, fewer than the number
of disks, and there is extra CPU not utilized. Then can use more of the disk
arms and CPUs with this sort of parallelism.
Parallelism shows up best when there is only one query running!
But if there are lots of queries compared to number of disks and accessed
pages are randomly placed on disks, probably keep all disk arms busy already.
But there's another factor operating. Two disk pages that are close to each
other on one disk can be read faster because there's a shorter seek time.
Recall that the system tries to make extents contiguous on disk, so I/Os in
sequence are faster. Thus, a table that is made up of a sequence of (mainly)
contiguous pages, one after another within a track, will take much less time
to read in.
In fact it seems we should be able to read in successive pages at full
transfer speed, which would take about 0.00125 secs per page.
Used to be that by the time the disk controller has read in the page to a
memory buffer and looked to see what the next page request is, the page
immediately following has already passed by under the head.
But now with multiple requests to the disk outstanding, we really COULD get
the disk arm to read in the next disk page in sequence without a miss.
Another factor supports this speedup: the typical disk controller buffers an
entire track in its memory whenever a disk page is requested.
Reads in whole track containing the disk page, returns the page requested,
then if later request is for page in track doesn't have to access disk again.
So when we're reading in pages one after another on disk, it's like we're
reading from the disk an entire track at a time.
I/O is about TEN TIMES faster for disk pages in sequence compared to
randomly placed I/O. (Accurate enough for a rule of thumb.)
PUT ON BOARD: We can do 800 I/Os per second when pages in sequence
(S) instead of 80 for randomly placed pages (R). Sequential I/O takes
0.00125 secs instead of 0.0125 secs for random I/O.
DB2 Sequential Prefetch makes this possible even if turn off buffering on
disk (which actually hurts performance of random I/O, since reads whole
track it doesn't need: adds 0.008 sec to random I/O of 0.0125 sec)
IBM puts a lot of effort into making I/O requests sequentially in a query plan
to gain this I/O advantage!
Example 9.2.2. Table Space Scan with Sequential Advantage. The
14286R of Example 9.2.1 becomes 14286S (S for Sequential Prefetch I/O
instead of Random I/O). And 14286S requires 14286/800 = 17.86 seconds
instead of the 178.6 seconds of 14286R. Note that this is a REAL COST
SAVINGS, that we are actually using the disk arm for a smaller period.
Striping reduces elapsed time but not COST.
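The sequential-versus-random comparison above can be sketched with the notes' rule-of-thumb rates (80 R, 800 S per second; names are mine):

```python
RATE = {"R": 80, "S": 800}     # random vs sequential I/Os per second

def io_seconds(pages, kind):
    # Elapsed time for a count of page I/Os of the given kind.
    return pages / RATE[kind]

random_secs = io_seconds(14286, "R")    # 14286R: ~178.6 seconds
seq_secs = io_seconds(14286, "S")       # 14286S: ~17.9 seconds
```

The factor-of-ten ratio between the two rates is exactly the rule of thumb stated above.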
Cover idea of List Prefetch. 32 pages, not in perfect sequence, but rela-
tively close together. Difficult to predict time.
We use the rule of thumb that List Prefetch reads in 200 pages per second.
See Figure 9.10, page 546, for table.
Plan table row for an access step will have PREFETCH = S for sequential
prefetch, PREFETCH = L for list prefetch, PREFETCH = blank if random I/O.
See Figure 9.10. And of course ACCESSTYPE = R when really Random I/O.
Note that sequential prefetch is just becoming available on UNIX database
systems. Often just put a lot of requests out in parallel and depend on
smart I/O system to use the arm efficiently.
Class 14.
Exam 1.
Class 15.
9.3 Simple Indexed Access in DB2.
Index helps efficiency of query plan. There is a great deal of complexity
here. Remember, we are not yet covering queries with joins: only one table in
FROM clause and NO subquery in WHERE clause.
Examples with tables: T1, T2, . . , columns C1, C2, C3, . .
Example 9.3.1. Assume index C1X exists on column C1 of table T1 (always
a B-tree secondary index in DB2). Consider:
select * from T1 where C1 = 10;
This is a Matching Index Scan. In plan table: ACCESSTYPE = I, ACCESSNAME =
C1X, MATCHCOLS = 1. (MATCHCOLS might be >1 in multiple column index.)
Perform matching index scan by walking down B-tree to LEFTMOST entry of
C1X with C1 = 10. Retrieve row pointed to.
Loop through entries at leaf level from left to right until run out of entries
with C1 = 10. For each such entry, retrieve row pointed to. No assumption
about clustering or non-clustering of rows here.
In Example 9.3.2, assume other restrictions in WHERE clause, but matching
index scan used on C1X. Then other restrictions are validated as rows are
accessed (row is qualified: look at row, check if matches restrictions).
Not all predicates are indexable. In DB2, indexable predicate is one that can
be used in a matching index scan, i.e. a lookup that uses a contiguous section
of an index. Covered in full in Section 9.5.
For example, looking up words in the dictionary that start with the letters
'pre' is a matching index scan. Looking up words ending with 'tion' is not.
DB2 considers the predicate C1 <> 10 to be non-indexable. It is not im-
possible that an index will be used in a query with this predicate:
select * from T1 where C1 <> 10;
But the statistics usually weigh against its use and so the query will be
performed by a table space scan. More on indexable predicates later.
OK, now what about query:
select * from T1 where C1 = 10 and C2 between 100 and 200
and C3 like 'A%';
These three predicates are all indexable. If have only C1X, will be like
previous example with retrieved rows restricted by tests on other two
predicates.
If have index combinx, created by:
create index combinx on T1 (C1, C2, C3) . . .
Will be able to limit (filter) RIDs of rows to retrieve much more completely
before going to data. Like books in a card catalog, looking up
authorlname = 'James' (c1 = 10) and authorfname between 'H' and 'K'
and title begins with letter 'A'
Finally, we will cover the question of how to filter the RIDs of rows to
retrieve if we have three indexes, C1X, C2X, and C3X. This is not simple.
See how to do this by taking out cards for each index, ordering by RID, then
merge-intersecting.
It is an interesting query optimization problem whether this is worth it.
OK, now some examples of simple index scans.
Example 9.3.3. Index Scan Step, Unique Match. Continuing with
Example 9.2.1, employees table with 200,000 rows of 200 bytes and pctfree
= 30, so 14 rows/pg and CEIL(200,000/14) = 14,286 data pages. Assume in
index on eid, also have pctfree = 30, and eid||RID takes up 10 bytes, so 280
entries per pg, and CEIL(200,000/280) = 715 leaf level pages. Next level up
CEIL(715/280) = 3. Root next level up. Write on board:
employees table: 14,286 data pages
index on eid, eidx: 715 leaf nodes, 3 level 2 nodes, 1 root node.
Now query: select ename from employees where eid = '12901A';
Root, one level 2 node, 1 leaf node, 1 data page. Seems like 4R. But what
about buffered pages? Five minute rule says should purchase enough
memory so pages referenced more frequently than about once every 120
seconds (popular pages) should stay in memory. Assume we have done this.
If workload assumes 1 query per second of this form (retrieve ename with an
eid = predicate, no others on this table), then leaf nodes and data pages are
not buffered, but upper nodes of eidx are. So really 2R is cost of query.
This Query Plan is a single step, with ACCESSTYPE = I, ACCESSNAME = eidx,
MATCHCOLS = 1.
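The eidx sizing and the buffered-lookup cost of Example 9.3.3 can be sketched as follows (a sketch under the notes' assumptions; names are mine):

```python
import math

ENTRIES_PER_PAGE = 280                          # 10-byte entries, pctfree = 30
leaf = math.ceil(200_000 / ENTRIES_PER_PAGE)    # 715 leaf pages
level2 = math.ceil(leaf / ENTRIES_PER_PAGE)     # 3 nodes, plus the root above

total_reads = 4                 # root, level-2 node, leaf, data page
buffered = 2                    # root and level-2 nodes assumed memory-resident
cost_r = total_reads - buffered # 2R
seconds = cost_r / 80           # 0.025 sec per lookup at 80 random I/Os/sec
```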
Class 16.
OK now we introduce a new table called prospects. Based on direct mail
applications (junk mail). People fill out warranty cards, name hobbies, salary
range, address, etc.
50M rows of 400 bytes each. FULL data pages (pctfree = 0) and on all in-
dexes: 10 rows on 4 KByte page, so 5M data pages.
prospects table: 5M data pages
Now: create index addrx on prospects (zipcode, city, straddr) cluster . . .;
zipcode is integer or 4 bytes, city requires 12 bytes, straddr 20 bytes, RID 4
bytes, and assume NO duplicate values so no compression.
Thus each entry requires 40 bytes, and we can fit 100 on a 4 KByte page.
With 50M total entries, that means 500,000 leaf pages. 5000 directory
nodes at level 2. 50 level 3 node pages. Then root page. Four levels.
Also assume a nonclustering hobbyx index on hobbies, 100 distinct hobbies (.
. . cards, chess, coin collecting, . . .). We say CARD(hobby) = 100.
(Like we say CARD(zipcode) = 100,000. Not all possible integer zipcodes can
be used, but for simplicity say they are.)
Duplicate compression on hobbyx, each key (8 bytes?) amortized over 255
RIDS (more?), so can fit 984 RIDs (or more) per 4 KByte page, call it 1000.
Thus 1000 entries per leaf page. With 50M entries, have 50,000 leaf pages.
Then 50 nodes at level 2. Then root.
prospects table: 50,000,000 rows; 5,000,000 data pages (10 rows per page);
  CARD(zipcode) = 100,000; CARD(hobby) = 100
addrx index: 500,000 leaf pages; 5,000 level 3 nodes; 50 level 2 nodes; 1 root node
hobbyx index: 50,000 leaf pages (1000 entries/leaf); 151 level 2 nodes; 1 root node
Figure 9.12. Some statistics for the prospects table, page 552
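The sizing arithmetic behind Figure 9.12 can be checked in Python (entry sizes and page capacities come from the notes above; names are mine):

```python
rows = 50_000_000
data_pages = rows // 10            # 400-byte rows, full 4 KB pages: 10 rows/page

addrx_leaf = rows // 100           # 40-byte entries, 100 per page
addrx_level3 = addrx_leaf // 100   # 5,000 directory nodes
addrx_level2 = addrx_level3 // 100 # 50 nodes; then 1 root: four levels

hobbyx_leaf = rows // 1000         # RID compression: call it 1000 entries/leaf
```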
Example 9.3.4. Matching Index Scan Step, Unclustered Match.
Consider the following query:
select name, straddr from prospects where hobby = 'chess';
Query optimizer assumes each of 100 hobbies equally likely (knows there are
100 from RUNSTATS), so restriction cuts 50M rows down to 500,000.
Walk down hobby index (2R for directory nodes) and across 500,000 entries
(1000 per page, so 500 leaf pages; sequential prefetch, so 500S).
For every entry, read in row -- non clustered so all random choices out of 5M
data pages, 500,000 distinct I/Os (not in order, so R), 500,000R.
Total I/O is 500S + 500,002R. Time is 500/800 + 500,002/80, about
500,000/80 = 6250 seconds. Or about 1.75 hours (2 hrs = 7200 secs).
Really only picking up 500,000 rows; they will lie on fewer than 500,000
distinct pages (out of 5M). Would this mean fewer than 500,000R because
buffering keeps some pages around for double/triple hits?
VERY TRIVIAL EFFECT! Hours of access, but pages stay in buffer only about
120 seconds.
Can generally assume that upper level index pages are buffer resident (skip
2R) but leaf level pages and maybe one level up are not. Should calculate
index time and can then ignore it if insignificant.
If we used a table space scan for Example 9.3.4, qualifying rows to ensure
hobby = 'chess', how would time compare to what we just calculated?
Simple: 5M pages using sequential prefetch, 5,000,000/800 = 6,250 seconds.
(Yes, CPU is still ignored; in fact it is relatively insignificant.)
But this is the same elapsed time as for indexed access of 1/100 of rows!!
Yes, surprising. But 10 rows per page so about 1/10 as many pages hit, and
S is 10 times as fast as R.
Query optimizer compares these two approaches and chooses the faster
one. Would probably select Table Space Scan here. But minor variation in
CARD(hobby) could make either plan a better choice.
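The near tie between the two plans of Example 9.3.4 can be worked out with the 80 R / 800 S rates from Section 9.2 (a sketch; names are mine):

```python
rows = 50_000_000
matched = rows // 100              # FF(hobby = 'chess') = 1/100: 500,000 rows

# Indexed plan: 2R directory walk, 500S leaf pages, one random read per row.
indexed_secs = 2 / 80 + 500 / 800 + matched / 80      # ~6,250 seconds

# Table space scan: all 5,000,000 data pages with sequential prefetch.
scan_secs = 5_000_000 / 800                           # 6,250 seconds
```

Ten times as many pages read, but at ten times the rate: the times coincide.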
Example 9.3.5. Matching Index Scan Step, Clustered Match.
Consider the following query:
select name, straddr from prospects
where zipcode between 02159 and 03158;
Recall CARD(zipcode) = 100,000. Range of zipcodes is 1000. Therefore, cut
number of rows down by a factor of 1/100. SAME AS 9.3.4.
Bigger index entries. Walk down to leaf level and walk across 1/100 of leaf
level: 500,000 leaf pages, so 5000 pages traversed. I/O of 5000S.
And data is clustered by index, so walk across 1/100 of 5M data pages,
50,000 data pages, and they're in sequence on disk, so 50,000S.
Compared to the unclustered index scan of Example 9.3.4, walk across 1/10 as
many data pages and do it with S I/O instead of R. Ignore directory walk.
Then I/O cost is 55,000S, with elapsed time 55,000/800 = 68.75 seconds, a
bit over a minute, compared with 1.75 hrs for the unclustered index scan.
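Using the 800-pages/sec sequential rule of thumb from Section 9.2, the clustered-scan cost of Example 9.3.5 works out as (names are mine):

```python
leaf_pages = 500_000 // 100        # 1/100 of addrx leaf level: 5,000S
data_pages = 5_000_000 // 100      # clustered rows in sequence: 50,000S
secs = (leaf_pages + data_pages) / 800     # 55,000S at 800/sec
```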
The difference between Examples 9.3.4 and 9.3.5 doesn't show up in the
PLAN table. Have to look at ACCESSNAME = addrx and note that this index is
clustered (CLUSTERRATIO), whereas ACCESSNAME = hobbyx is not.
(1) Clusterratio determines if index still clustered in case rows exist that
don't follow clustering rule. (Inserted when no space left on page.)
(2) Note that entries in addrx are 40 bytes, rows of prospects are 400
bytes. Seems natural that 5000S for index, 50,000S for rows.
Properties of index:
1. Index has directory structure, can retrieve range of values
2. Index entries are ALWAYS clustered by values
3. Index entries are smaller than the rows.
Example 9.3.6. Concatenated Index, Index-Only Scan. Assume (just
for this example) a new index, naddrx:
create index naddrx on prospects (zipcode, city, straddr, name)
. . . cluster . . .;
Now same query as before:
select name, straddr from prospects where zipcode
between 02159 and 03158;
Can be answered in the INDEX ONLY (because we find the range of zipcodes
and read name and straddr off components of the index). Show components:
naddrx keyvalue: zipcodeval.cityval.straddrval.nameval
This is called an Index Only scan, and with EXPLAIN plan table gets new
column: INDEXONLY = Y (ACCESSTYPE = I, ACCESSNAME = naddrx). Previous
plans had INDEXONLY = N.
(All these columns always reported; I just mention them when relevant.)
Time? Assume naddrx entries take 60 bytes instead of 40 bytes; then the
amount read in the index, instead of 5000S, is 7500S, elapsed time
7500/800 = 9.4 seconds. Compare to the 68.75 seconds (55,000S) of
Example 9.3.5.
Valuable idea, Index Only. Select count(*) from . . . is always index only, if
an index can answer the query in a single step at all, since we just count
entries.
But can't build index on the spur of the moment. If don't have needed one
already, out of luck. E.g., consider query:
select name, straddr, age from prospects where zipcode
between 02159 and 02258;
Now naddrx doesn't have all needed components. Out of luck.
If we try to foresee all needed components in an index, we are essentially
duplicating the rows, and lose the performance boost from smaller size.
Indexes cost something. Disk media cost (not commonly crucial). With
inserts or updates of indexed rows, a lot of extra I/O (not common).
With read-only, like prospects table, load time increases. Still often have
every col of a read-only table indexed.
9.4 Filter Factors and Statistics
Recall: the filter factor of a predicate is the estimated probability that a
random row makes the predicate true. By statistics, determine the fraction
FF(pred) of rows retrieved.
E.g., hobby column has 100 values. Generally assume uniform distribution,
and get: FF(hobby = const) = 1/100 = .01.
And zipcode column has 100,000 values, FF(zipcode = const) = 1/100,000.
FF(zipcode between 02159 and 03158) = 1000 · (1/100,000) = 1/100.
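The uniform-distribution filter factors just computed can be sketched as (function names are mine):

```python
def ff_equal(colcard):
    # FF(Col = const) = 1/COLCARD under the uniform assumption
    return 1 / colcard

def ff_between(c1, c2, low2key, high2key):
    # FF(Col between c1 and c2) = (c2 - c1)/(HIGH2KEY - LOW2KEY)
    return (c2 - c1) / (high2key - low2key)

ff_hobby = ff_equal(100)                       # 0.01
ff_zip = ff_between(2159, 3158, 1, 99998)      # approximately 1/100
```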
How does the DB2 query optimizer make these estimates?
DB2 statistics.
See Figure 9.13, pg. 558. After use RUNSTATS, these statistics are up to
date. (Next pg. of these notes) Other statistics as well, not covered.
DON'T WRITE THIS ON BOARD -- SEE IN BOOK
Catalog Name  Statistic Name  Default Value            Description
SYSTABLES     CARD            10,000                   Number of rows in the table
              NPAGES          CEIL(1 + CARD/20)        Number of data pages containing rows
SYSCOLUMNS    COLCARD         25                       Number of distinct values in this column
              HIGH2KEY        n/a                      Second highest value in this column
              LOW2KEY         n/a                      Second lowest value in this column
SYSINDEXES    NLEVELS         0                        Number of Levels of the Index B-tree
              NLEAF           CARD/300                 Number of leaf pages in the Index B-tree
              FIRSTKEYCARD    25                       Number of distinct values in the first
                                                       column, C1, of this key
              FULLKEYCARD     25                       Number of distinct values in the full
                                                       key, all components: e.g. C1.C2.C3
              CLUSTERRATIO    0% if CLUSTERED = 'N',   Percentage of rows of the table that
                              95% if CLUSTERED = 'Y'   are clustered by these index values

Figure 9.13. Some Statistics gathered by RUNSTATS used for access plan determination
Statistics gathered into DB2 Catalog Tables named. Assume that index
might be composite, (C1, C2, C3)
Go over table. CARD, NPAGES for table. For column, COLCARD, HIGH2KEY,
LOW2KEY. For Indexes, NLEVELS, NLEAF, FIRSTKEYCARD, FULLKEYCARD,
CLUSTERRATIO. E.g., from Figure 9.12, statistics for prospects table (given
on pp. 552-3). Write these on Board.
SYSTABLES
NAME        CARD         NPAGES
. . .       . . .        . . .
prospects   50,000,000   5,000,000
. . .       . . .        . . .
SYSCOLUMNS
NAME      TBNAME      COLCARD   HIGH2KEY   LOW2KEY
. . .     . . .       . . .     . . .      . . .
hobby     prospects   100       Wines      Bicycling
zipcode   prospects   100,000   99998      00001
. . .     . . .       . . .     . . .      . . .
SYSINDEXES
NAME     TBNAME      NLEVELS   NLEAF     FIRSTKEYCARD   FULLKEYCARD   CLUSTERRATIO
. . .    . . .       . . .     . . .     . . .          . . .         . . .
addrx    prospects   4         500,000   100,000        50,000,000    100
hobbyx   prospects   3         50,000    100            100           0
. . .    . . .       . . .     . . .     . . .          . . .         . . .
CLUSTERRATIO is a measure of how well the clustering property holds for an
index. With CLUSTERRATIO 80 or more, DB2 will use Sequential Prefetch in
retrieving rows.
Indexable Predicates in DB2 and their Filter Factors
Look at Figure 9.14, pg. 560. QOPT guesses at Filter Factor. Product rule
assumes independent distributions of columns. Still no subquery predicate.
Predicate Type                  Filter Factor                  Notes
Col = const                     1/COLCARD                      "Col <> const" same as "not (Col = const)"
Col op const                    Interpolation formula          "op" is any comparison predicate other
                                                               than equality; an example follows
Col < const or Col <= const     (const - LOW2KEY) /            LOW2KEY and HIGH2KEY are estimates for
                                (HIGH2KEY - LOW2KEY)           extreme points of the range of Col values
Col between const1 and const2   (const2 - const1) /            "Col not between const1 and const2" same
                                (HIGH2KEY - LOW2KEY)           as "not (Col between const1 and const2)"
Col in list                     (list size)/COLCARD            "Col not in list" same as "not (Col in list)"
Col is null                     1/COLCARD                      "Col is not null" same as "not (Col is null)"
Col like 'pattern'              Interpolation Formula          Based on the alphabet
Pred1 and Pred2                 FF(Pred1) · FF(Pred2)          As in probability
Pred1 or Pred2                  FF(Pred1) + FF(Pred2)          As in probability
                                - FF(Pred1) · FF(Pred2)
not Pred1                       1 - FF(Pred1)                  As in probability

Figure 9.14. Filter Factor formulas for various predicate types

Class 17.
Matching Index Scans with Composite Indexes
(Finish -> 9.6 Class 19, homework due Class 20 (Wed, April 12))
Assume new index mailx:
create index mailx on prospects (zipcode, hobby, incomeclass, age);
NOT clustered. Column incomeclass has 10 distinct values, age has 50.
FULLKEYCARD(mailx) could be as much as CARD(zipcode) · CARD(hobby) ·
CARD(incomeclass) · CARD(age) = 100,000 · 100 · 10 · 50 = 5,000,000,000.
Can't be that much, only 50,000,000 rows, so assume FULLKEYCARD is
50,000,000, with no duplicate rows. (Actually, 50M darts in 5G slots. About
1/100 of slots hit, so only about 1% duplicate keyvalues.)
Entries for mailx have length: 4 (integer zipcode) + 8 (hobby) + 2
(incomeclass) + 2 (age) + 4 (RID) = 20 bytes. So 200 entries per page.
NLEAF = 50,000,000/200 = 250,000 pages. Next level up has 1,250 nodes,
next level 7, next is root, so NLEVELS = 4.
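The mailx sizing can be checked as follows (entry sizes from the notes; this assumes 200 directory entries per node as well, so it is only a sketch):

```python
import math

entry_bytes = 4 + 8 + 2 + 2 + 4            # zipcode, hobby, incomeclass, age, RID
per_page = 4000 // entry_bytes             # 200 entries per 4 KB page
nleaf = math.ceil(50_000_000 / per_page)   # 250,000 leaf pages
level_up = math.ceil(nleaf / per_page)     # 1,250 nodes
next_up = math.ceil(level_up / per_page)   # 7 nodes; then root: NLEVELS = 4
```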
SYSINDEXES
NAME     TBNAME      NLEVELS   NLEAF     FIRSTKEYCARD   FULLKEYCARD   CLUSTERRATIO
. . .    . . .       . . .     . . .     . . .          . . .         . . .
mailx    prospects   4         250,000   100,000        50,000,000    0
. . .    . . .       . . .     . . .     . . .          . . .         . . .
Example 9.5.1. Concatenated Index, Matching Index Scan.
select name, straddr from prospects where
zipcode = 02159 and hobby = 'chess' and incomeclass = 10;
Matching Index Scan here means that the three predicates in the WHERE
clause match the INITIAL column in concatenated mailx index.
Argue that matching means all entries to be retrieved are contiguous in the
index.
Full filter factor for the three predicates given is 1/100,000 · 1/100 · 1/10 =
1/100M. With 50M rows, only 0.5 rows selected: 0.5R.
Interpret this probabilistically, and expected time for retrieving rows is only
0.5/80 = 1/160 second. Have to add index I/O of course: 2R, 0.025 sec.
Example 9.5.2. Concatenated Index, Matching index scan.
select name, straddr from prospects
where zipcode between 02159 and 04158
and hobby = 'chess' and incomeclass = 10;
Now, important. Not one contiguous interval in index. There is one interval
for: z = 02159 and h = 'c' and inc = 10, and then another for z = 02160 and
h = 'c' and inc = 10, and . . . But there is stuff between them.
Analogy in telephone directory: last name between 'Sma' and 'Smz' and first
name 'John'. Lot of directory to look through, not all matches.
Query optimizer here traverses from leftmost z = 02159 to rightmost z =
04158 and uses h = 'c' and inc = 10 as screening predicates.
We say the first predicate is a MATCHING predicate (used for cutting down
interval of index considered) and other two are SCREENING predicates.
(This MATCHING predicate is what we mean by Matching Index Scan.)
So index traversed is: 2000/100,000 (filter factor) of 250,000 leaf
pages, = 5,000 leaf pages. Query optimizer actually calculates FF as
(04158 - 02159)/(HIGH2KEY - LOW2KEY) = 2000/(99998 - 00001)
= 2000/99997, or approximately 2000/100,000 = 1/50
Have to look through 1/50 · NLEAF = 5,000 leaf pages; I/O cost is 5,000S with
elapsed time: 5,000/800 = 6.25 seconds.
How many rows retrieved? (1/50)(1/100)(1/10) = 1/50,000; with 50M
rows, 1,000 rows retrieved. Sequential, class?
No. 1000R, with elapsed time 1000/80 = 12.5 seconds. Total elapsed time is
18.75 secs.
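The matching-plus-screening cost of Example 9.5.2, using NLEAF = 250,000 and the 800 S / 80 R rates from Section 9.2 (a sketch; names are mine):

```python
nleaf = 250_000
leaf_read = nleaf // 50                 # matching zipcode range: 5,000 leaf pages (S)
rows = 50_000_000 // 50_000             # FF product 1/50 * 1/100 * 1/10: 1,000 rows (R)
secs = leaf_read / 800 + rows / 80      # sequential leaf scan plus random row reads
```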
Example 9.5.3. Concatenated Index, Non-Matching Index Scan.
select name, straddr from prospects where
hobby = 'chess' and incomeclass = 10 and age = 40;
Like saying First name = 'John' and City = 'Waltham' and street = 'Main'.
Have to look through whole index, no matching column, only screening
predicates.
Still get small number of rows back, but have to look through whole index.
250,000S. Elapsed time 250,000/800 = 312.5 seconds, about 5 minutes.
Number of rows retrieved: (1/100)(1/10)(1/50)(50,000,000) = 1000.
1000R = 12.5 seconds.
In PLAN TABLE, for Example 9.5.2, have ACCESSTYPE = I, ACCESSNAME =
mailx, MATCHCOLS = 1. In Example 9.5.3, have MATCHCOLS = 0.
Definition 9.5.1. Matching Index Scan. A plan to execute a query where at
least one indexable predicate matches the first column of an index (known
as the matching predicate; the index is a matching index). There may be
more matching predicates.
What is an indexable predicate? Equal match predicate is one: Col = const
See Definition 9.5.3. Pg. 565
Say have index C1234X on table T, composite index on columns (C1, C2, C3,
C4). Consider following compound predicates.
C1 = 10 and C2 = 5 and C3 = 20 and C4 = 25 (matches all columns)
C2 = 5 and C3 = 20 and C1 = 10 (matches first three: needn't be in order)
C2 = 5 and C4 = 22 and C1 = 10 and C6 = 35 (matches first two)
C2 = 5 and C3 = 20 and C4 = 25 (NOT a matching index scan)
Screening predicates are ones that match non-leading columns in index. E.g.,
in first example all are matching, in second all are matching, in third, two are
matching, one is screening, and one is not in index, in fourth all three are
screening.
Finish through Section 9.6 by next class. Homework due next class
(Wednesday after Patriot's day). NEXT homework is rest of Chapter 9 non-
dotted exercises if you want to work ahead.
Definition 9.5.2. Basic Rules of Matching Predicates
(1) A matching predicate must be an indexable predicate. See pg. 560,
Figure 9.14 for a list of indexable predicates.
(2) Matching predicates must match successive columns, C1, C2, . . . of an
index. Procedure: Look at index columns from left to right. If we find a
matching predicate for this column, then this is a matching column. As soon
as a column fails to be matching, terminate the search.
Idea is that the sequence of matching predicates cuts down the index search
to a smaller contiguous range. (One exception: the In-list predicate, covered
shortly.)
(3) A non-matching predicate in an index scan can still be a screening
predicate.
Look at rule (1) again. This is actually a kind of circular definition. For a
predicate to be matching it must be indexable and:
Definition 9.5.3: An indexable predicate is one that can be used to match
a column in a matching index scan.
Calling such a predicate indexable is confusing. Even if a predicate is not
indexable, the predicate can use the index for screening.
Would be much better to call such predicates matchable, but this nomen-
clature is embedded in the field for now.
When K leading columns of index C1234X are matching for a query, EXPLAIN
into plan table, get ACCESSTYPE = I, ACCESSNAME = C1234X, MATCHCOLS =
K. When non-matching index scan, MATCHCOLS = 0.
Recall Indexable Predicates in Figure 9.14, pg. 560, and relate to telephone
directory. Does the predicate give you a contiguous range?
Col > const? between? In-list is special. like 'pattern' with no leading wild
card? Col1 like Col2? (same middle name as street name) Predicate and?
Predicate or? Predicate not?
OK, a few more rules on how to determine matching predicates. Page 566,
Def 9.5.4. Match cols. in index left to right until run out of predicates. But
(3) Stop at first range predicate (between, <, >, <=, >=, like).
(4) At most one In-list predicate.
In-list is special because it is considered a sequence of equal matching
predicates that the query optimizer agrees to bridge in the access plan:
C1 in (6, 8, 10) and C2 = 5 and C3 = 20 is like a plan for C1 = 6 and . . .;
then C1 = 8 and . . .; then C1 = 10 and . . .; etc.
But the following has only two matching columns since only one in-list can be
used.
C1 in (6, 8, 10) and C2 = 5 and C3 in (20, 30, 40)
When In-list is used, say ACCESSTYPE = 'N'.
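The matching-column rules above (equality predicates keep matching, the first range predicate matches and then stops the scan, at most one in-list predicate) can be sketched in Python. This is our own illustration, not DB2 code; the predicate classification scheme is invented for the example.

```python
# Sketch of the matching-column rules of Def. 9.5.4 (our own encoding):
# preds maps a column name to 'eq', 'range', or 'inlist'.

def matching_columns(index_cols, preds):
    match, used_inlist = 0, False
    for col in index_cols:              # walk index columns left to right
        kind = preds.get(col)
        if kind is None:
            break                       # no matching predicate: stop
        if kind == 'inlist':
            if used_inlist:
                break                   # at most one in-list predicate
            used_inlist = True
        match += 1
        if kind == 'range':
            break                       # stop after first range predicate
    return match

# Example (2) below: C1 = 5 and C2 >= 7 and C3 = 9 on index C1234X
print(matching_columns(['C1', 'C2', 'C3', 'C4'],
                       {'C1': 'eq', 'C2': 'range', 'C3': 'eq'}))  # 2
```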
Example 9.5.4. In following examples, have indexes C1234X, C56X, Unique
index C7X.
(1) select C1, C5, C8 from T where C1 = 5 and C2 = 7 and C3 <> 9;
ACCESSTYPE = I, ACCESSNAME = C1234X, MATCHCOLS = 2
(2) select C1, C5, C8 from T where C1 = 5 and C2 >= 7 and C3 = 9;
C3 predicate is indexable but stop at range predicate.
ACCESSTYPE = I, ACCESSNAME = C1234X, MATCHCOLS = 2
(3) select C1, C5, C8 from T
where C1 = 5 and C2 = 7 and C5 = 8 and C6 = 13;
We ignore for now the possibility of combining multiple indexes
Note, we don't know what QOPT will choose until we see plan table row
ACCESSTYPE = I, ACCESSNAME = C56X, MATCHCOLS = 2
(4) select C1, C4 from T
where C1 = 10 and C2 in (5, 6) and (C3 = 10 or C4 = 11);
C1 and C2 predicates are matching. The "or" operator doesn't give
indexable predicate, but would be used as screening predicate (not
mentioned in plan table, but all predicates used to filter, ones that
exist in index certainly would be used when possible). ACCESSTYPE = 'N',
ACCESSNAME = C1234X, MATCHCOLS = 2 Also: INDEXONLY = 'Y'
(5) select C1, C5, C8 from T where C1 = 5 and C2 = 7 and C7 = 101;
ACCESSTYPE = I, ACCESSNAME = C7X, MATCHCOLS = 1
(Because unique match, but nothing said about this in plan table)
(6) select C1, C5, C8 from T
where C2 = 7 and C3 = 10 and C4 = 12 and C5 = 16;
Will see can't be multiple index. Either non-matching in C1234X or
matching on C56X. ACCESSTYPE = I, ACCESSNAME = C1234X,
MATCHCOLS = 2
Some Special Predicates
Pattern Match Search. Leading wildcards not indexable.
"C1 like 'pattern'" with a leading '%' in pattern (or leading '_'?) is like looking
in a dictionary for all words ending in 'tion'. Non-matching scan (non-indexable
predicate).
Exist dictionaries that index by backward spelling, and DBA can use this trick:
look for word with match: Backwards = 'noit%'
Expressions. Use in predicate makes not indexable.
select * from T where 2*C1 <= 56;
DB2 doesn't do algebra. You can re-write: where C1 <= 28;
Never indexable if two different columns are used in predicate: C1 = C2.
One-Fetch Access. Select min/max . . .
select min(C1) from T;
Look at index C1234X, leftmost value, read off value of C1. Say have index
C12D3X on T (C1, C2 DESC, C3). Each of the following queries has one-fetch access.
select min(C1) from T where C1 > 5; (NOT obviously 5)
select min(C1) from T where C1 between 5 and 6;
select max(C2) from T where C1 = 5;
select max(C2) from T where C1 = 5 and C2 < 30;
select min(C3) from T where C1 = 6 and C2 = 20 and C3 between 6 and 9;
Class 18.
9.6 Multiple Index Access
Assume index C1X on (C1), C2X on (C2), C345X on (C3, C4, C5), query:
(9.6.1) select * from T where C1 = 20 and C2 = 5 and C3 = 11;
By what we've seen up to now, would have to choose one of these indexes.
Then only one of these predicates could be matched.
Other two predicates are not even screening predicates. Don't appear in
index, so must retrieve rows and validate predicates from that.
BUT if only use FF of one predicate, terrible inefficiency may occur. Say T
has 100,000,000 rows, FF for each of these predicates is 1/100.
Then after applying one predicate, get 1,000,000 rows retrieved. If none of
indexes are clustered, will take elapsed time of: 1,000,000/40 = 25,000
seconds, or nearly seven hours.
If somehow we could combine the filter factors, get composite filter factor
of (1/100)(1/100)(1/100) = 1/1,000,000, only retrieve 100 rows.
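The arithmetic above can be checked directly in Python (a sanity check of the text's numbers only; 40 random I/Os per second is the text's rule of thumb):

```python
# 100,000,000 rows, three predicates with filter factor 1/100 each.
rows = 100_000_000
ff = 1 / 100

one_pred = rows * ff          # rows left after applying one predicate
seconds = one_pred / 40       # random I/O at 40 reads per second
combined = rows * ff ** 3     # composite filter factor 1/1,000,000

print(one_pred, seconds, combined)
```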
Trick. Multiple index access. For each predicate, matching on different
index, extract RID list. Sort it by RID value.
(Draw a picture - three extracts from leaf of wedge into lists)
Intersect (AND) all RID lists (Picture) (easy in sorted order). Result is list of
RIDs for answer rows. Use List prefetch to read in pages. (Picture)
This is our first multiple step access plan.
See Figure 9.15. Put on board.
TNAME ACCESSTYPE MATCHCOLS ACCESSNAME PREFETCH MIXOPSEQ
T M 0 L 0
T MX 1 C1X S 1
T MX 1 C2X S 2
T MX 1 C345X S 3
T MI 0 4
T MI 0 5
Figure 9.15 Plan table rows of a Multiple Index Access plan for Query (9.6.1)
Row M, start of Multiple index access. Happens at the end, after Intersect.
Diagram steps in picture. RID lists placed in RID Pool in memory.
Note Plan acts like reverse Polish calculation: push RID lists as created by
MX step; with MI, pop two and intersect, push result back on stack.
MX steps require reads from index. MI steps require no I/O: already in
memory, memory area known as RID Pool.
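The reverse-Polish behavior can be sketched in Python, modeling RID lists as sets. The step encoding, index names, and RID values here are invented for illustration; only the push/pop discipline mirrors the plan-table rows.

```python
# MX pushes a RID list; MI pops two lists and intersects (AND);
# MU pops two lists and unions (OR). The survivor is the answer list.

def run_plan(steps, rid_lists):
    stack = []
    for op, arg in steps:
        if op == 'MX':
            stack.append(rid_lists[arg])   # RID list from one index scan
        elif op == 'MI':
            b, a = stack.pop(), stack.pop()
            stack.append(a & b)            # intersect
        elif op == 'MU':
            b, a = stack.pop(), stack.pop()
            stack.append(a | b)            # union
    return stack.pop()                     # final RID list for list prefetch

# Shape of Query (9.6.2): C1 = 20 and (C2 = 5 or C3 = 11)
rids = {'C1X': {1, 2, 3, 4}, 'C2X': {2, 5}, 'C345X': {3, 6}}
plan = [('MX', 'C1X'), ('MX', 'C2X'), ('MX', 'C345X'),
        ('MU', None), ('MI', None)]
print(sorted(run_plan(plan, rids)))        # [2, 3]
```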
Final row access uses List prefetch, program disk arms for most efficient
path to bring in 32 page blocks that are not contiguous, but sequentially
listed. Most efficient disk arm movements.
(Remember that an RID consists of (page_number, slot_number), so if RIDs
are placed in ascending order, can just read off successive page numbers).
Speed of I/O depends on distance apart. Rule of thumb for Random I/O is
40/sec, Seq. prefetch 400/sec, List prefetch is 100/sec.
Have mentioned ACCESSTYPEs = M, MX, MI. One other type for multiple index
access, known as MU (to take the Union (OR) of two RID lists). E.g.
(9.6.2) select * from T where C1 = 20 and (C2 = 5 or C3 = 11);
Here is Query Plan (Figure 9.16):
TNAME ACCESSTYPE MATCHCOLS ACCESSNAME PREFETCH MIXOPSEQ
T M 0 L 0
T MX 1 C1X S 1
T MX 1 C2X S 2
T MX 1 C345X S 3
T MU 0 4
T MI 0 5
Figure 9.16 Plan table rows of a Multiple Index Access plan for Query (9.6.2)
Actually, query optimizer wouldn't generate three lists in a row if avoidable.
Tries not to have > 2 in existence (not always possible).
Figure 9.17. 1. MX C2X, 2. MX C345X, 3. MU, 4. MX C1X, 5. MI
Example 9.6.1. Multiple index access. prospects table, addrx index,
hobbyx index (see Figure 9.11, p 544), index on age (agex) and incomeclass
(incomex). What is NLEAF for these two, class? (Like hobbyx: 50,000.)
select name, straddr from prospects
where zipcode = 02159 and hobby = 'chess' and incomeclass = 10;
Same query as Example 9.5.1 when had zhobincage index, tiny cost, 2R index,
only .5R for row. But now have three different indexes: PLAN.
TNAME ACCESSTYPE MATCHCOLS ACCESSNAME PREFETCH MIXOPSEQ
T M 0 L 0
T MX 1 hobbyx S 1
T MX 1 addrx S 2
T MI 0 3
T MX 1 incomex S 4
T MI 0 5
Figure 9.18 Plan table rows of Multiple Index Access plan for Example 9.6.1
Calculate I/O cost. FF(hobby = 'chess') = 1/100, on hobbyx (NLEAF =
50,000), 500S (and ignore directory walk).
For MIXOPSEQ = 2, FF(zipcode = 02159) = 1/100,000 on addrx (NLEAF =
500,000), so 5S.
Intersect steps MI require no I/O.
FF(incomeclass = 10) = 1/10, on incomex (NLEAF = 50,000), so 5,000S.
Still end up with 0.5 rows at the end by taking product of FFs for three
predicates: (1/100,000)(1/100)(1/10) = 1/100,000,000
Only 0.5 of the 50,000,000 rows; ignore list prefetch of .5L.
So 5,505S total, elapsed time 5,505/400 = 13.8 seconds.
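A quick Python check of that index I/O arithmetic (NLEAF figures and the 400-per-second sequential-prefetch rate are from the text):

```python
# Leaf pages scanned for each RID-list extraction in Figure 9.18.
hobbyx  = (1 / 100)     * 50_000    # FF(hobby = 'chess') on 50,000 leaves
addrx   = (1 / 100_000) * 500_000   # FF(zipcode = 02159) on 500,000 leaves
incomex = (1 / 10)      * 50_000    # FF(incomeclass = 10) on 50,000 leaves

total_S = hobbyx + addrx + incomex  # the MI steps cost no I/O
print(total_S, total_S / 400)       # 5505.0 sequential reads, about 13.8 sec
```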
List Prefetch and RID Pool
All the things we've been mentioning up to now have been relatively universal,
but RID list rules are quite product specific. Probably won't see all this
duplicated in ORACLE.
-> We need RID extraction to access the rows with List Prefetch, but since
this is faster than Random I/O couldn't we always use RID lists, even in single
index scan to get at rows with List Prefetch in the end?
YES, BUT: There are restrictive rules that make it difficult. Most of these
rules arise because RID list space is a scarce resource in memory.
-> The RID Pool is a separate memory storage area that is 50% the size of all
four Disk Buffer Pools, except cannot exceed 200 MBytes.
Thus if DBA sets aside 2000 Disk Buffer Pages, will have 1000 pages (about
4 MBytes) of RID Pool. (Different memory area, though.)
Def. 9.6.1. Rules for RID List Use.
(1) QOPT, when it constructs a query plan, predicts size of RID lists active at
any time. Cannot use > 50% of total capacity. If it guesses wrong, abort during
runtime.
(2) No Screening predicates can be used in an Index scan that extracts a RID
Li st.
(3) An In-list predicate cannot be used when extracting a RID list.
Example 9.6.3. RID List Size Limit. addrx, hobbyx, incomex, add sexx
on column sex. Two values, 'M' and 'F', uniformly distributed.
select name, straddr from prospects where zipcode between
02159 and 02358 and incomeclass = 10 and sex = 'F';
Try multiple index access. Extract RID lists for each predicate and intersect
the lists.
But consider, sexx has 50,000 leaf pages. (Same as hobbyx, from
50,000,000 RID size entries.) Therefore sex = 'F' extracts 25,000 page RID
list. 4 Kbytes each page, nearly 100 MBytes.
Can only use half of RID Pool, so need 200 MByte RID Pool.
Not unreasonable. A proof that we should use large buffers.
Example 9.6.5. No Index Screening. Recall zhobincage index in Example
9.3.8.
select name, straddr from prospects where zipcode between 02159
and 04158 and hobby = 'chess' and incomeclass = 10;
Two latter predicates were used as screening predicates. But screening
predicates can't be used in List Prefetch. To do List Prefetch, must resolve
two latter predicates after bring in rows.
The 'zipcode between' predicate has a filter factor of 2000/100,000 =
1/50, so with 50 M rows, get 1M rows, RID list size of 4 MBytes, 1000 buffer
pages, at least 2000 in pool. Assume RID pool is large enough.
So can do List Prefetch, 1,000,000/100 = 10,000 seconds, nearly 3 hours.
If use screening predicates, compound FF is (1/50)(1/100)(1/10) =
1/50,000. End with 1000 rows. Can't use List Prefetch, so 1000R but
1000/40 is only 25 seconds. Same index I/O cost in both cases. Clearly
better not to use RID list.
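The two plans just compared can be checked in Python, using the text's rule-of-thumb rates (100 I/Os per second with list prefetch, 40 per second random):

```python
rows = 50_000_000
ff_zip = 1 / 50                       # FF(zipcode between ...)

# Plan 1: RID list on zipcode alone (no screening allowed), then fetch
# the qualifying rows with list prefetch.
rids = rows * ff_zip                  # about 1,000,000 RIDs
t_list = rids / 100                   # about 10,000 seconds

# Plan 2: screen hobby and incomeclass in the index, then random reads
# for the few surviving rows.
ff_all = (1 / 50) * (1 / 100) * (1 / 10)
survivors = rows * ff_all             # about 1,000 rows
t_random = survivors / 40             # about 25 seconds

print(t_list, t_random)
```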
What is reason not to allow List Prefetch with Screening Predicates? Just
because resource is scarce, don't want to tie up RID list space for a long
time while add new RIDs slowly. If refuse screening, QOPT might use other
plans without RIDs.
Point of Diminishing Returns in MX Access.
Want to use RID lists in MIX. >> Talk it
Def. 9.6.2. (1) In what follows, assume have multiple indexes, each with a
disjoint set of matching predicates from a query. Want to use RID lists.
(2) Way QOPT sees it: List RID lists from smallest size up (by increasing
Filter Factor). Thus start with smaller index I/O costs (matching only) and
larger effect in saving data page I/O.
(3) Generate successive RID lists and calculate costs; stop when cost of
generating new RID list doesn't save enough in eventual data page reads by
reducing RID set.
Example 9.6.6.
select name, straddr from prospects
where zipcode between 02159 and 02658
and age = 40 and hobby = 'chess' and incomeclass = 10;
(1) FF(zipcode between 02159 and 02658) = 500/100,000 = 1/200.
(2) FF(hobby = 'chess') = 1/100
(3) FF(age = 40) = 1/50 (Typo in text)
(4) FF(incomeclass = 10) = 1/10
Apply predicate 1. (1/200)(50M rows) retrieved, 250,000 rows, all on
separate pages from 5M, and 250,000L is 2,500 seconds. Ignore the index
cost.
Apply predicate 2 after 1. Leaf pages of hobbyx scanned are (1/100)
(50,000) = 500S, taking 500/400 = 1.25 seconds. Reduce number of rows
retrieved from 250,000 to about 2,500, all on separate pages, and 2,500L
takes about 25 seconds. So at cost of 1.25 seconds of index I/O, reduce
table page I/O from 2,500 seconds to 25 seconds. Clearly worth it.
Apply predicate 3 after 1 and 2. Leaf pages of agex scanned are (1/50)
(50,000) = 1000S, taking 1000/400 = 2.5 seconds. Rows retrieved down to
(1/50) (2500) = 50, 50L is .5 seconds. Thus at cost of 2.5 seconds of index
I/O reduce table page I/O from 25 seconds to 0.5 seconds. Worth it.
With predicate 4, need index I/O at leaf level of (1/10) (50,000) = 5000S or
12.5 seconds. Not enough table page I/O left (0.5 seconds) to pay for it.
Don't use predicate 4. We have reached point of diminishing returns.
Chapter 10
Class 20.
We have encountered the idea of a transaction before in Embedded SQL.
Def. 10.1 Transaction. A transaction is a means to package together a
number of database operations performed by a process, so the database
system can provide several guarantees, called the ACID properties.
Think of writing: BEGIN TRANSACTION op1 op2 . . . opN END TRANSACTION
Then all ops within the transaction are packaged together.
There is no actual BEGIN TRANSACTION statement in SQL. A transaction is
begun by a system when there is none in progress and the application first
performs an operation that accesses data: Select, Insert, Update, etc.
The application logic can end a transaction successfully by executing:
exec sql commit work; /* called simply a Commit */
Then any updates performed by operations in the transaction are success-
fully completed and made permanent and all locks on data items are re-
leased. Alternatively:
exec sql rollback work; /* called an Abort */
means that the transaction was unsuccessful: all updates are reversed, and
locks on data items are released.
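The same commit/abort behavior can be illustrated with Python's sqlite3 module as a stand-in for embedded SQL (the table and amounts are invented for the example; sqlite3 starts an implicit transaction on the first data-changing statement):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table accts (id integer primary key, bal integer)')
conn.execute('insert into accts values (1, 50), (2, 50)')
conn.commit()                       # like "exec sql commit work"

# A transfer that we abandon: rollback reverses both updates.
conn.execute('update accts set bal = bal - 30 where id = 1')
conn.execute('update accts set bal = bal + 30 where id = 2')
conn.rollback()                     # like "exec sql rollback work"

print(list(conn.execute('select bal from accts order by id')))
# [(50,), (50,)] -- the aborted updates are gone
```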
The ACID guarantees are extremely important -- these and SQL are what
differentiate a database from a file system.
Imagine that you are trying to do banking applications on the UNIX file system
(which has its own buffers, but no transactions). There will be a number of
problems, the kind that faced database practitioners in the 50s.
1. Inconsistent result. Our application is transferring money from one
account to another (different pages). One account balance gets out to disk
(run out of buffer space) and then the computer crashes.
When bring computer up again, have no idea what used to be in memory
buffer, and on disk we have destroyed money.
2. Errors of concurrent execution. (One kind: Inconsistent Analysis.)
Teller 1 transfers money from Acct A to Acct B of the same customer,
while Teller 2 is performing a credit check by adding balances of A and B.
Teller 2 can see A after transfer subtracted, B before transfer added.
3. Uncertainty as to when changes become permanent. At the very
least, we want to know when it is safe to hand out money: don't want to
forget we did it if system crashes, then only data on disk is safe.
Want this to happen at transaction commit. And don't want to have to write
out all rows involved in transaction (teller cash balance -- very popular, we
buffer it to save reads and want to save writes as well).
To solve these problems, systems analysts came up with idea of transaction
(formalized in 1970s). Here are ACID guarantees:
Atomicity. The set of record updates that are part of a transaction are
indivisible (either they all happen or none happen). This is true even in
presence of a crash (see Durability, below).
Consistency. If all the individual processes follow certain rules (money is
neither created nor destroyed) and use transactions right, then the rules
won't be broken by any set of transactions acting together. Implied by
Isolation, below.
Isolation. Means that operations of different transactions seem not to be
interleaved in time -- as if ALL operations of one Tx before or after all
operations of any other Tx.
Durability. When the system returns to the logic after a Commit Work
statement, it guarantees that all Tx Updates are on disk. Now ATM machine
can hand out money.
The system is kind of clever about Durability. It doesn't want to force all
updated disk pages out of buffer onto disk with each Tx Commit.
So it writes a set of notes to itself on disk (called logs). After crash run
Recovery (also called Restart) and makes sure notes translate into ap-
propriate updates.
What about Read-Only Tx? (No data updates, only Selects.) Atomicity and
Durability have no effect, but Isolation does.
Money spent on Transactional systems today is about SIX BILLION DOLLARS A
YEAR. We're being rigorous about some of this for a BUSINESS reason.
10.1 Transactional Histories.
Reads and Writes of data items. A data item might be a row of a table or it
might be an index entry or set of entries. For now talking about rows.
Read a data item when access it without changing it. Often a select.
select val into :pgmval1 from T1 where uniqueid = A;
We will write this as Ri(A): transaction with identification number i reads
data item A. Kind of rough: we won't always be retrieving by uniqueid =
A. But it means that we are reading a row identified as A. Now:
update T1 set val = :pgmval2 where uniqueid = B;
we will write this as Wj(B); Tx j writes B; say Update results in Write.
Can get complicated. Really reading an index entry as well to write B.
Consider:
update T1 set val = val + 2 where uniqueid = B;
Have to read an index entry, Rj(predicate: uniqueid = B), then a pair of row
operations: Rj(B) (have to read it first, then update it) Wj(B). Have to read
it in this case before we can write it.
update T set val = val + 2 where uniqueid between :low and :high;
Will result in a lot of operations: Rj(predicate: uniqueid between :low and
:high), then Rj(B1) Wj(B1) Rj(B2) Wj(B2) . . . Rj(BN) Wj(BN).
>>The reason for this notation is that often have to consider complex inter-
leaved histories of concurrent transactions; Example history:
(10.1.2) . . . R2(A) W2(A) R1(A) R1(B) R2(B) W2(B) C1 C2 . . .
Note Ci means commit by Tx i. A sequence of operations like this is known
as a History or sometimes a Schedule.
A history results from a series of operations submitted by users, translated
into R & W operations at the level of the Scheduler. See Fig. 10.1.
It is the job of the scheduler to look at the history of operations as it comes
in and provide the Isolation guarantee, by sometimes delaying some
operations, and occasionally insisting that some transactions be aborted.
By this means it assures that the sequence of operations is equivalent in
effect to some serial schedule (all ops of a Tx are performed in sequence
with no interleaving with other transactions). See Figure 10.1, pg. 640.
In fact, (10.1.2) above is an ILLEGAL schedule. Because we can THINK of a
situation where this sequence of operations gives an inconsistent result.
Example 10.1.1. Say that the two elements A and B in (10.1.2) are Acct
records with each having balance 50 to begin with. Inconsistent Analysis.
T1 is adding up balances of two accounts, T2 is transferring 30 units from A
to B.
. . . R2(A, 50) W2(A, 20) R1(A, 20) R1(B, 50) R2(B, 50) W2(B, 80) C1 C2 . . .
And T1 determines that the customer fails the credit check (because under
balance total of 80, say).
But this could never have happened in a serial schedule, where all operations
of T2 occurred before or after all operations of T1.
. . . R2(A, 50) W2(A, 20) R2(B, 50) W2(B, 80) C2 R1(A, 20) R1(B, 80) C1 . . .
or
. . . R1(A, 50) R1(B, 50) C1 R2(A, 50) W2(A, 20) R2(B, 50) W2(B, 80) C2 . . .
And in both cases, T1 sees total of 100, a Consistent View.
Notice we INTERPRETED the Reads and Writes of (10.1.2) to create a model
of what was being read and written to show there was an inconsistency.
This would not be done by the Scheduler. It simply follows a number of rules
we explain shortly. We maintain that a serial history is always consistent
under any interpretation.
10.2. Interleaved Read/Write Operations
Quick>>
If a serial history is always consistent, why don't we just enforce serial
histories?
The scheduler could take the first operation that it encounters of a given
transaction (T2 in the above example) and delay all ops of other Txs (the
Scheduler is allowed to do this) until all operations of T2 are completed and
the transaction commits (C2).
Reason we don't do this? Performance. It turns out that an average Tx has
relatively small CPU bursts and then I/O during which CPU has nothing to do.
See Fig 10.3, pg. 644. When I/O is complete, CPU can start up again.
What do we want to do? Let another Tx run (call this another thread) during
slack CPU time. (Interleave). Doesn't help much if have only one disk (disk is
bottleneck). See Fig 10.4, pg. 644.
But if we have two disks in use all the time we get about twice the
throughput. Fig 10.5, pg. 645.
And if we have many disks in use, we can keep the CPU 100% occupied. Fig
10.6, pg 646.
In actuality, everything doesn't work out perfectly evenly as in Fig 10.6.
Have multiple threads and multiple disks, and like throwing darts at slots.
Try to have enough threads running to keep lots of disk occupied so CPU is
90% occupied. When one thread does an I/O, want to find another thread
with completed I/O ready to run again.
Leave this to you -- covered in Homework.
10.3 Serializability and the Precedence Graph.
We want to come up with a set of rules for the Scheduler to allow operations
by interleaved transactions and guarantee Serializability.
Serializability means the series of operations is EQUIVALENT to a Serial
schedule, where operations of Tx are not interleaved.
How can we guarantee this? First notice that if two transactions never
access the same data items, it doesn't matter that they're interleaved.
We can commute ops in the history of requests permitted by the scheduler
until all ops of one Tx are together (serial history). The operations don't
affect each other, and order doesn't matter.
We say that the Scheduler is reading a history H (order operations are
submitted) and is going to create a serializable history S(H) (by delay, etc.)
where operations can be commuted to a serial history.
OK, now if we have operations by two different transactions that do affect
the same data item, what then?
There are only four possibilities: R or W by T1 followed by R or W by T2.
Consider history:
. . . R1(A) . . . W2(A) . . .
Would it matter if the order were reversed? YES. Can easily imagine an
interpretation where T2 changes data T1 reads: if T1 reads it first, sees old
version, if reads it after T2 changes it, sees later version.
We use the notation:
R1(A) <<_H W2(A)
to mean that R1(A) comes before W2(A) in H, and what we have just noticed
is that whenever we have the ordering in H we must also have:
R1(A) <<_S(H) W2(A)
That is, these ops must occur in the same order in the serializable schedule
put out by the Scheduler. If R1(A) <<_H W2(A) then R1(A) <<_S(H) W2(A).
Now these transaction numbers are just arbitrarily assigned labels, so it is
clear we could have written the above as follows:
If R2(A) <<_H W1(A) then R2(A) <<_S(H) W1(A).
Here Tx 1 and Tx 2 have exchanged labels. This is another one of the four
cases. Now what can we say about the following?
R1(A) <<_H R2(A)
This can be commuted -- reads can come in any order since they don't affect
each other. Note that if there is a third transaction, T3, where:
R1(A) <<_H W3(A) <<_H R2(A)
Then the reads cannot be commuted (because we cannot commute either
one with W3(A)), but this is because of application of the earlier rules, not
depending on the reads as they affect each other.
Finally, we consider:
W1(A) <<_H W2(A)
And it should be clear that these two operations cannot commute. The ul-
timate outcome of the value of A would change. That is:
If W1(A) <<_H W2(A) then W1(A) <<_S(H) W2(A)
To summarize our discussion, we have Definition 10.3.1, pg. 650.
Def. 10.3.1. Two operations Xi(A) and Yj(B) in a history are said to conflict
(i.e., the order matters) if and only if the following three conditions hold:
(1) A = B. Operations on distinct data items never conflict.
(2) i ≠ j. Operations conflict only if they are performed by different Txs.
(3) One of the two operations X or Y is a write, W. (Other can be R or W.)
Note in connection with (2) that two operations of the SAME transaction also
cannot be commuted in a history, but not because they conflict. If the
scheduler delays the first, the second one will not be submitted.
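Def. 10.3.1 translates directly into a small Python predicate. The tuple encoding of operations is our own, e.g. ('W', 2, 'A') for W2(A):

```python
# An operation is (action, tx, item); conflict needs the same item,
# different transactions, and at least one write.

def conflicts(op1, op2):
    (x, i, a), (y, j, b) = op1, op2
    return a == b and i != j and 'W' in (x, y)

print(conflicts(('R', 1, 'A'), ('W', 2, 'A')))   # True
print(conflicts(('R', 1, 'A'), ('R', 2, 'A')))   # False: two reads
print(conflicts(('W', 1, 'A'), ('W', 1, 'A')))   # False: same transaction
```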
Class 21.
We have just defined the idea of two conflicting operations. (Repeat?)
We shall now show how some histories can be shown not to be serial-
izable. Then we show that such histories can be characterized by an easily
identified characteristic in terms of conflicting operations.
To show that a history is not serializable (SR), we use an interpretation of
the history.
Def. 10.3.2. An interpretation of an arbitrary history H consists of 3 parts.
(1) A description of the purpose of the logic being performed. (2) Spec-
ification of precise values for data items being read and written in the
history. (3) A consistency rule, a property that is obviously preserved by
isolated transactions of the logic defined in (1).
Example 10.3.1. Here is a history, H1, we claim is not SR.
H1 = R2(A) W2(A) R1(A) R1(B) R2(B) W2(B) C1 C2
Here is an interpretation. T1 is doing a credit check, adding up the balances
of A and B. T2 is transferring money from A to B. Here is the consistency
rule: Neither transaction creates or destroys money. Values for H1 are:
H1' = R2(A,50) W2(A,20) R1(A,20) R1(B,50) R2(B,50) W2(B,80) C1 C2
The schedule H1 is not SR because H1' shows an inconsistent result: sum of
70 for balances A and B, though no money was destroyed by T2 in the
transfer from A to B. This could not have occurred in a serial execution.
The concept of conflicting operations gives us a direct way to confirm that
the history H1 is not SR. Note the second and third operations of H1,
W2(A) and R1(A). Since W2(A) comes before R1(A) in H1, written:
W2(A) <<_H1 R1(A)
We know since these operations conflict that they must occur in the same
order in any equivalent serial history, S(H1), i.e.: W2(A) <<_S(H1) R1(A)
Now in a serial history, all operations of a transaction occur together.
Thus W2(A) <<_S(H1) R1(A) means that T2 <<_S(H1) T1, i.e. T2 occurs before
T1 in any serial history S(H1) (there might be more than one).
But now consider the fourth and sixth operations of H1. We have:
R1(B) <<_H1 W2(B)
Since these operations conflict, we also have R1(B) <<_S(H1) W2(B)
But this implies that T1 comes before T2, T1 <<_S(H1) T2, in any equivalent
serial history S(H1). And this is at odds with our previous conclusion.
In any serial history S(H1), either T1 <<_S(H1) T2 or T2 <<_S(H1) T1, not both.
Since we conclude from examining H1 that both occur, S(H1) must not really
exist. Therefore, H1 is not SR.
We illustrate this technique a few more times, and then prove a general
characterization of SR histories in terms of conflicting operations.
Ex. 10.3.2. Consider the history:
H2 = R1(A) R2(A) W1(A) W2(A) C1 C2
We give an interpretation of this history as a paradigm called a lost update.
Assume that A is a bank balance starting with the value 100 and T1 tries to
add 40 to the balance at the same time that T2 tries to add 50.
H2' = R1(A, 100) R2(A, 100) W1(A, 140) W2(A, 150) C1 C2
Clearly the final result is 150, and we have lost the update where we added
40. This couldn't happen in a serial schedule, so H2 is non-SR.
In terms of conflicting operations, note that operations 1 and 4 imply that
T1 <<_S(H2) T2. But operations 2 and 3 imply that T2 <<_S(H2) T1. No SR
schedule could have both these properties, therefore H2 is non-SR.
By the way, this example illustrates that a conflicting pair of the form R1(A)
. . . W2(A) does indeed impose an order on the transactions, T1 << T2, in any
equivalent serial history.
H2 has no other types of pairs that COULD conflict and make H2 non-SR.
Ex 10.3.3. Consider the history:
H3 = W1(A) W2(A) W2(B) W1(B) C1 C2
This example will illustrate that a conflicting pair W1(A) . . . W2(A) can
impose an order on the transactions T1 << T2 in any equivalent SR history.
Understand that these are what are known as "Blind Writes": there are no
Reads at all involved in the transactions.
Assume the logic of the program is that T1 and T2 are both meant to "top
up" the two accounts A and B, setting the sum of the balances to 100.
T1 does this by setting A and B both to 50, T2 does it by setting A to 80 and
B to 20. Here is the result for the interleaved history H3.
H3' = W1(A, 50) W2(A, 80) W2(B, 20) W1(B, 50) C1 C2
Clearly in any serial execution, the result would have A + B = 100. But with
H3' the end value for A is 80 and for B is 50.
To show that H3 is non-SR by using conflicting operations, note that op-
erations 1 and 2 imply T1 << T2, and operations 3 and 4 that T2 << T1.
The argument that an interleaved history H is non-SR thus seems to
reduce to looking at conflicting pairs of operations and keeping track of the
order in which the transactions must occur in an equivalent S(H).
When there are two transactions, T1 and T2, we expect to find in a non-SR
schedule that T1 <<_S(H) T2 and T2 <<_S(H) T1, an impossibility.
If we don't have such an impossibility arise from conflicting operations in a
history H, does that mean that H is SR?
And what about histories with 3 or more transactions involved? Will we ever
see something impossible other than T1 <<_S(H) T2 and T2 <<_S(H) T1?
We start by defining a Precedence Graph. The idea here is that this allows
us to track all conflicting pairs of operations in a history H.
Def. 10.3.3. The Precedence Graph. A precedence graph for a history H is a
directed graph denoted by PG(H).
The vertices of PG(H) correspond to the transactions that have COMMITTED
in H: that is, transactions Ti where Ci exists as an operation in H.
An edge Ti -> Tj exists in PG(H) whenever two conflicting operations Xi and Yj
occur in that order in H. Thus, Ti -> Tj should be interpreted to mean that Ti
must precede Tj in any equivalent serial history S(H).
Whenever a pair of operations conflict in H for COMMITTED transactions, we
can draw the corresponding directed arc in the Precedence Graph, PG(H).
The reason uncommitted transactions don't count is that we're trying to
figure out what the scheduler can allow. Uncommitted transactions can
always be aborted, and then it will be as if they didn't happen.
It would be unfair to hold uncommitted transactions against the scheduler by
saying the history is non-SR because of them.
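Building PG(H) from a history can be sketched in Python (the operation encoding is ours). Applied to H1 of Example 10.3.1, it finds both edges of the circuit:

```python
# Edges come from conflicting pairs of operations of committed
# transactions, in the order they appear in the history.

def precedence_graph(history):
    committed = {tx for act, tx, _ in history if act == 'C'}
    edges = set()
    for n, (x, i, a) in enumerate(history):
        for y, j, b in history[n + 1:]:
            if (a == b and i != j and 'W' in (x, y)
                    and i in committed and j in committed):
                edges.add((i, j))      # Ti must precede Tj in S(H)
    return edges

H1 = [('R', 2, 'A'), ('W', 2, 'A'), ('R', 1, 'A'), ('R', 1, 'B'),
      ('R', 2, 'B'), ('W', 2, 'B'), ('C', 1, None), ('C', 2, None)]
print(sorted(precedence_graph(H1)))   # [(1, 2), (2, 1)] -- a circuit
```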
The examples we have given above of impossible conditions arising from
conflicting operations look like this:
T1 ----------------------> T2
^<------------------------/
Of course this is what is called a circuit in a directed graph (a digraph). This
suggests other problems that could arise with 3 or more Txs.
It should be clear that if PG(H) has a circuit, there is no way to put the
transactions in serial order so Ti always comes before Tj for all edges Ti ->
Tj in the circuit.
There'll always be one edge "pointing backward" in time, and that's a con-
tradiction, since Ti -> Tj means Ti should come BEFORE Tj in S(H).
How do we make this intuition rigorous? And if PG(H) doesn't have a circuit,
does that mean the history is SR?
Thm. 10.3.4. The Serializability Theorem. A history H has an equivalent
serial execution S(H) iff the precedence graph PG(H) contains no circuit.
Proof. Leave only if for exercises at end of chapter. I.e., will show there
that a circuit in PG(H) implies there is no serial ordering of transactions.
Here we prove that if PG(H) contains no circuit, there is a serial ordering of
the transactions so no edge of PG(H) ever points from a later to an earlier
transaction.
Assume there are m transactions involved, and label them T1, T2, . . ., Tm.
We are trying to find a reordering of the integers 1 to m, i(1), i(2), . . ., i(m),
so that Ti(1), Ti(2), . . ., Ti(m) is the desired serial schedule.
Assume a lemma to prove later: In any directed graph with no circuit there is
always a vertex with no edge entering it.
OK, so we are assuming PG(H) has no circuit, and thus there is a vertex, or
transaction, Tk, with no edge entering it. We choose Tk to be Ti(1).
Note that since Ti(1) has no edge entering it, there is no conflict in H that
forces some other transaction to come earlier.
(This fits our intuition perfectly. All other transactions can be placed after
it in time, and there won't be an edge going backward in time.)
Now remove this vertex, Ti(1), from PG(H) and all edges leaving it. Call the
resulting graph PG1(H).
By the Lemma, there is now a vertex in PG1(H) with no edge entering it. Call
that vertex Ti(2).
(Note that an edge from Ti(1) might enter Ti(2), but that edge doesn't count
because it's been removed from PG1(H).)
Continue in this fashion, removing Ti(2) and all its edges to form PG2(H), and
so on, choosing Ti(3) from PG2(H), . . ., Ti(m) from PG(m-1)(H).
By construction, no edge of PG(H) will ever point backward in the sequence
S(H), from Ti(m) to Ti(n), m > n.
The algorithm we have used to determine this sequence is known as a
topological sort. This was a hiring question I saw asked at Microsoft.
-242-
The proof is complete, and we now know how to create an equivalent serial
schedule from a history whose precedence graph has no circuit.
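The construction in this proof is exactly a topological sort, and it can be sketched in a few lines of Python (the function name and the representation of PG(H) as a set of edge pairs are our own choices, not from the text): repeatedly pick a vertex with no edge entering it, output it, and delete it and its outgoing edges from the graph.

```python
from collections import defaultdict

def serial_order(transactions, edges):
    """Topologically sort a precedence graph PG(H).

    transactions: iterable of Tx names, e.g. ["T1", "T2"]
    edges: set of (Ti, Tj) pairs, meaning Ti must come before Tj.
    Returns a serial order S(H), or None if PG(H) contains a circuit.
    """
    in_degree = {t: 0 for t in transactions}
    out_edges = defaultdict(list)
    for ti, tj in edges:
        out_edges[ti].append(tj)
        in_degree[tj] += 1

    # Repeatedly remove a vertex with no edge entering it (Lemma 10.3.5
    # guarantees one exists as long as the remaining graph is acyclic).
    ready = [t for t in transactions if in_degree[t] == 0]
    order = []
    while ready:
        t = ready.pop()
        order.append(t)
        for succ in out_edges[t]:
            in_degree[succ] -= 1
            if in_degree[succ] == 0:
                ready.append(succ)

    # If vertices remain, every one of them has an incoming edge: a circuit.
    return order if len(order) == len(transactions) else None

print(serial_order(["T1", "T2", "T3"], {("T1", "T2"), ("T2", "T3")}))
# A circuit T1 -> T2 -> T1 admits no serial order:
print(serial_order(["T1", "T2"], {("T1", "T2"), ("T2", "T1")}))
```

The first call prints a valid serial schedule; the second prints None, matching the "only if" direction of the theorem.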
Lemma 10.3.5. In any finite directed acyclic graph G there is always a
vertex with no edges entering it.
Proof. Choose any vertex v1 from G. Either this has the desired property,
or there is an edge entering it from another vertex v2.
(There might be several edges entering v1, but choose one.)
Now v2 either has the desired property or there is an edge entering it from
vertex v3. We continue in this way, and either the sequence stops at some
vertex vm, or the sequence continues forever.
If the sequence stops at a vertex vm, that's because there is no edge
entering vm, and we have found the desired vertex.
But if the sequence continues forever, since this is a finite graph, sooner or
later in the sequence we will have to have a repeated vertex.
Say that when we add vertex vn, it is the same vertex as some previously
mentioned vertex in the sequence, vi.
Then there is a path vn -> v(n-1) -> . . . -> v(i+1) -> vi, where vi = vn. But
this is the definition of a circuit, which we said was impossible.
Therefore the sequence had to terminate with vm and that vertex was the
one desired with no edges entering.
-243-
Class 22.
10.4 Locking Ensures Serializability
See Fig. 10.8, pg. 609. TM passes on calls such as fetch, select, insert,
delete, abort; the Scheduler interprets them as: Ri(A), Wj(B).
It is the job of the scheduler to make sure that no non-SR schedules get
past. This is normally done with Two-Phase Locking, or 2PL.
Def. 10.4.1. 2PL. Locks are taken and released following three rules.
(1) Before Tx i can read a data item, Ri(A), the scheduler attempts to Read
Lock the item on its behalf, RLi(A); before Wi(A), it tries for a Write
Lock, WLi(A).
(2) If a conflicting lock on the item exists, the requesting Tx must WAIT.
(Conflicting locks correspond to conflicting ops: two locks on a data item
conflict if they are taken by different Txs and at least one of them is a WL.)
(3) There are two phases to locking, the growing phase and the shrinking
phase (when locks are released: RUi(A)). The scheduler must ensure that a Tx
can't shrink (drop a lock) and then grow again (take a new lock).
Rule (3) implies can release locks before Commit; More usual to release all
locks at once on Commit, and we shall assume this in what follows.
Note that a transaction can never conflict with its own locks! If Ti holds RL
on A, can get WL so long as no other Tx holds a lock (must be RL).
A Tx with a WL doesn't need a RL (WL more powerful than RL).
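The conflict rule in (2), together with the self-conflict exception just noted, can be written as a tiny predicate. This is a sketch; the (tx, item, mode) tuple representation is our own choice:

```python
def locks_conflict(lock1, lock2):
    """Two lock requests conflict iff they are on the same data item, come
    from different Txs, and at least one is a Write lock (2PL Rule 2).
    A lock is modeled as a (tx, item, mode) tuple with mode "R" or "W"."""
    tx1, item1, mode1 = lock1
    tx2, item2, mode2 = lock2
    if tx1 == tx2 or item1 != item2:
        return False            # a Tx never conflicts with its own locks
    return mode1 == "W" or mode2 == "W"

print(locks_conflict(("T1", "A", "R"), ("T2", "A", "R")))  # False: two RLs coexist
print(locks_conflict(("T1", "A", "R"), ("T2", "A", "W")))  # True: RL vs WL
print(locks_conflict(("T1", "A", "R"), ("T1", "A", "W")))  # False: same Tx
```

A scheduler would call this for each held lock on the requested item; any True means the requester must WAIT.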
Clearly locking is defined to guarantee that a circuit in the Precedence Graph
can never occur. The first Tx to lock an item forces any other Tx that gets
to it second to "come later" in any SG.
But what if other Tx already holds a lock the first one now needs? This
would mean a circuit, but in the WAIT rules of locking it means NEITHER Tx
CAN EVER GO FORWARD AGAIN. This is a DEADLOCK. Example shortly.
Side effect of 2PL is that Deadlocks can occur: When a deadlock occurs,
scheduler will recognize it and force one of the Txs involved to Abort.
(Note, there might be more than 2 Txs involved in a Deadlock.)
-244-
Ex. Here is a history that is not SR (Error in text: this is a variant of Ex. 10.4.1).
H4 = R1(A) R2(A) W2(A) R2(B) W2(B) R1(B) C1 C2
Same idea as 10.3.1 why it is non-SR: T1 reads two balances that start out
A=50 and B=50, while T2 moves 30 from A to B. Non-SR history because T1 sees
A=50 and B=80. Now try locking and releasing locks at commit.
RL1(A) R1(A) RL2(A) R2(A) WL2(A) (conflicting lock held by T1 so T2 must
WAIT) RL1(B) R1(B) C1 (now T2 can get WL2(A)) W2(A) RL2(B) R2(B) WL2(B)
W2(B) C2
Works fine: T1 now sees A=50, B=50. Serial schedule, T1 then T2.
But what if a Tx is allowed to Unlock and then acquire more locks later? We
get a non-SR schedule, which shows the necessity of 2PL Rule (3).
RL1(A) R1(A) RU1(A) RL2(A) R2(A) WL2(A) W2(A) WU2(A) RL2(B) R2(B)
WL2(B) W2(B) WU2(B) RL1(B) R1(B) C1 C2
So H4 above is possible. The only 2PL rule broken is that T1 and T2 unlock
data items, then lock other data items later.
The Waits-For Graph. How the scheduler checks if a deadlock occurs. Vertices
are currently active Txs, with a directed edge Ti -> Tj iff Ti is waiting for
a lock held by Tj.
(Note, might be waiting for lock held by several other Txs. And possibly get in
queue for W lock behind others who are also waiting. Draw picture.)
The scheduler performs lock operations and if Tx required to wait, draws
new directed edges resulting, then checks for circuit.
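The circuit check on the Waits-For Graph can be sketched as a depth-first search. This is a minimal illustration, assuming the graph is represented as a dict mapping each active Tx to the set of Txs it currently waits for:

```python
def has_deadlock(waits_for):
    """Detect a circuit in the Waits-For Graph by depth-first search.
    waits_for maps each active Tx to the set of Txs it waits on."""
    DONE, IN_PROGRESS = 1, 2
    state = {}

    def visit(tx):
        state[tx] = IN_PROGRESS
        for holder in waits_for.get(tx, ()):
            if state.get(holder) == IN_PROGRESS:
                return True                 # back edge: circuit found
            if state.get(holder) != DONE and visit(holder):
                return True
        state[tx] = DONE
        return False

    return any(visit(tx) for tx in waits_for if tx not in state)

# T1 waits for a lock held by T2, and T2 waits for one held by T1: deadlock.
print(has_deadlock({"T1": {"T2"}, "T2": {"T1"}}))   # True
print(has_deadlock({"T1": {"T2"}, "T2": set()}))    # False
```

A real scheduler would run a check like this after adding each new wait edge, then pick a victim Tx on any circuit and abort it.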
Ex 10.4.2. Here is a schedule like H4 above, where T2 reverses the order in
which it touches A and B (now touches B first); the same reasoning shows it is
non-SR.
H5 = R1(A) R2(B) W2(B) R2(A) W2(A) R1(B) C1 C2
Locking result:
-245-
RL1(A) R1(A) RL2(B) R2(B) WL2(B) W2(B) RL2(A) R2(A) WL2(A) (Fails:
RL1(A) held, T2 must WAIT for T1 to complete and release locks) RL1(B)
(Fails: WL2(B) held, T1 must wait for T2 to complete: But this is a
deadlock! Choose T2 as victim (T1 chosen in text)) A2 (now RL1(B) will
succeed) R1(B) C1 (start T2 over, retry, it gets Tx number 3) RL3(B)
R3(B) WL3(B) W3(B) RL3(A) R3(A) WL3(A) W3(A) C3.
Locking serialized T1, then T2 (retried as T3).
Thm. 10.4.2. Locking Theorem. A history of transactional operations that
follows the 2PL discipline is SR.
First, Lemma 10.4.3. If H is a Locking Extended History that is 2PL and
the edge Ti -> Tj is in PG(H), then there must exist a data item D and two
conflicting operations Xi(D) and Yj(D) such that XUi(D) <<H YLj(D).
Proof. Since Ti -> Tj in PG(H), there must be two conflicting ops Xi(D) and
Yj(D) such that Xi(D) <<H Yj(D).
By the definition of 2PL, there must be locking and unlocking ops on either
side of both ops, e.g.: XLi(D) <<H Xi(D) <<H XUi(D).
Now between the lock and unlock for Xi(D), the X lock is held by Ti and
similarly for Yj(D) and Tj. Since X and Y conflict, the locks conflict and the
intervals cannot overlap. Thus, since Xi(D) <<H Yj(D), we must have:
XLi(D) <<H Xi(D) <<H XUi(D) <<H YLj(D) <<H Yj(D) <<H YUj(D)
And in particular XUi(D) <<H YLj(D).
Proof of Thm. 10.4.2. We want to show that every 2PL history H is SR.
Assume in contradiction that there is a cycle T1 -> T2 -> . . . -> Tn -> T1 in
PG(H). By the Lemma, for each pair Tk -> T(k+1), there is a data item Dk
where XUk(Dk) <<H YL(k+1)(Dk). We write this out as follows:
1. XU1(D1) <<H YL2(D1)
2. XU2(D2) <<H YL3(D2)
. . .
n-1. XU(n-1)(D(n-1)) <<H YLn(D(n-1))
n. XUn(Dn) <<H YL1(Dn) (Note, T1 is T(n+1) too.)
-246-
Within each Tk, any lock precedes any unlock (2PL), so YLk(D(k-1)) <<H
XUk(Dk) for each k; chaining lines 1 through n gives XU1(D1) <<H YL1(Dn).
But then T1 unlocks a data item (in 1.) before locking a data item (in n.).
So T1 is not 2PL after all. Contradiction.
Therefore H is 2PL implies no circuit in the PG(H), and thus H is SR.
Note that not all SR schedules would obey 2PL. E.g., the following is SR:
H7 = R1(A) W1(A) R2(A) R1(B) W1(B) R2(B) C1 C2
But it is not 2PL (T2 breaks through locks held by T1). We can optimistically
allow a Tx to break through locks in the hopes that a circuit won't
occur in PG(H). But most databases don't do that.
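Tying Sections 10.3 and 10.4 together: here is a sketch of building the PG(H) edges directly from a history, using an (op, tx, item) triple representation of our own choosing. Feeding the result to a topological sort (or any cycle check) tells you whether the history is SR.

```python
def precedence_edges(history):
    """Build the PG(H) edge set from a history of (op, tx, item) triples,
    op in {"R", "W"}. Ti -> Tj iff some op of Ti precedes a conflicting
    op of Tj (same item, different Txs, at least one "W")."""
    edges = set()
    for i, (op1, tx1, item1) in enumerate(history):
        for op2, tx2, item2 in history[i + 1:]:
            if tx1 != tx2 and item1 == item2 and "W" in (op1, op2):
                edges.add((tx1, tx2))
    return edges

# H4 = R1(A) R2(A) W2(A) R2(B) W2(B) R1(B): edges run both ways, so non-SR.
h4 = [("R", "T1", "A"), ("R", "T2", "A"), ("W", "T2", "A"),
      ("R", "T2", "B"), ("W", "T2", "B"), ("R", "T1", "B")]
print(sorted(precedence_edges(h4)))   # [('T1', 'T2'), ('T2', 'T1')]
```

R1(A) before W2(A) gives T1 -> T2, and W2(B) before R1(B) gives T2 -> T1: a circuit, confirming that H4 is non-SR.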
-247-
Class 23.
10.5 Levels of Isolation
The idea of Isolation Levels, defined in ANSI SQL-92, is that people might
want to gain more concurrency, even at the expense of imperfect isolation.
A paper by Tay showed that when there is serious loss of throughput due to
locking, it is generally not because of deadlock aborts (having to retry) but
simply because of transactions being blocked and having to wait.
Recall that the reason for interleaving transaction operations, rather than
just insisting on serial schedules, was so we could keep the CPU busy.
We want there always to be another transaction ready to run when the running
transaction does an I/O wait.
But if we assume that a lot of transactions are waiting for locks, we lose
this. There might be only one transaction running even if we have 20 trying
to run. All but one of the transactions are in a wait chain!
So the idea is to be less strict about locking and let more transactions run.
The problem is that dropping proper 2PL might cause SERIOUS errors in
applications. But people STILL do it.
The idea behind the ANSI SQL-92 Isolation Levels is to weaken how locks are
held. Locks aren't always taken, and even when they are, many locks are
released before EOT.
And more locks are taken after some locks are released in these schemes.
Not Two-Phase, so not perfect Isolation.
(Note in passing, that ANSI SQL-92 was originally intended to define isolation
levels that did not require locking, but it has been shown that the definitions
failed to do this. Thus the locking interpretation is right.)
Define short-term locks to mean a lock is taken prior to the operation (R or
W) and released IMMEDIATELY AFTERWARD. This is the only alternative to
long-term locks, which are held until EOT.
Then ANSI SQL-92 Isolation levels are defined as follows (Fig. 10.9 -- some
difference from the text):
-248-
                      | Write locks on  | Read locks on   | Read locks on
                      | rows of a table | rows of a table | predicates are
                      | are long term   | are long term   | long term
 Read Uncommitted     | NA (Read Only)  | No Read Locks   | No Read Locks
  (Dirty Reads)       |                 | taken at all    | taken at all
 Read Committed       | Yes             | No              | No
 Repeatable Read      | Yes             | Yes             | No
 Serializable         | Yes             | Yes             | Yes
Note that Write Predicate Locks are taken and held long-term in
all isolation levels listed. What this means is explained later.
In Read Uncommitted (RU), no Read locks are taken, thus can read data on
which Write lock exists (nothing to stop you if don't have to WAIT for RL).
Thus can read uncommitted data; it will be wrong if Tx that changed it later
aborts. But RU is just to get a STATISTICAL idea of sales during the day
(say). CEO wants to know ballpark figure -- OK if not exact.
In Read Committed (RC), we take Write locks and hold them to EOT, and take
Read locks on rows read and on predicates but release them immediately. (We
cover predicates below.)
Problem that can arise is serious one, Lost Update (Example 10.3.2):
... R1(A,100) R2(A,100) W1(A,140) W2(A,150) C1 C2 ...
Since R locks are released immediately, nothing stops the later Writes, and
the increment of 40 is overwritten by an increment of 50, instead of the two
increments adding to give 90.
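The anomaly can be simulated with plain variables; the point is only the interleaving of reads and writes, not any real locking machinery:

```python
# Two transactions each read balance 100, then write back their own
# incremented value. Under Read Committed the Read locks are released
# immediately, so nothing serializes the writes, and T1's increment is lost.
balance = 100

r1 = balance          # T1 reads 100
r2 = balance          # T2 reads 100 (T1's RL already released, no conflict)
balance = r1 + 40     # T1 writes 140
balance = r2 + 50     # T2 writes 150, overwriting T1's update

print(balance)        # 150, not the 190 a serial execution would give
```

With long-term Read locks (Repeatable Read or above), T2's write would have to wait for T1, and the deadlock/abort path would prevent the lost increment.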
Call this the Scholar's Lost Update Anomaly (since many people say Lost
Update only happens at Read Uncommitted).
This is EXTREMELY serious, obviously, and an example of lost update in SQL is
given in Figure 10.12 (pg. 666) for a slightly more restrictive level: Cursor
Stability. Applications that use RC must avoid this kind of update.
In Figure 10.11, we see how to avoid this by doing the Update indivisibly in a
single operation.
-249-
Not all Updates can be done this way, however, because of complex cases
where the rows to be updated cannot be determined by a Boolean search
condition, or where the amount to update is not a simple function.
It turns out that IBM's Cursor Stability guarantees a special lock will be held
on the current row under the cursor, and at first it was thought that ANSI Read
Committed guaranteed that, but it does not.
Probably most products implement a lock on current of cursor row, but
there is no guarantee. NEED TO TEST if going to depend on this.
Repeatable Read is the isolation level that most people think is all that is
meant by 2PL. All data items read and written have RLs and WLs taken and
held long-term, until EOT.
So what's wrong? What can happen?
Example 10.5.3, pg. 666, Phantom Update Anomaly.
R1(predicate: branch_id = 'SFBay') R1(A1, 100.00) R1(A2, 100.00)
R1(A3, 100.00) I2(A4, branch_id = 'SFBay', balance = 100.00)
R2(branch_totals, branch_id = 'SFBay', 300.00)
W2(branch_totals, branch_id = 'SFBay', 400.00) C2
R1(branch_totals, branch_id = 'SFBay', 400.00) (prints out error message) C1
T1 is reading all the accounts with branch_id = SFBay and testing that the
sum of balances equals the branch_total for that branch (accounts and
branch_totals are in different tables)
After T1 has gone through the rows of accounts, T2 inserts another row
into accounts with branch_id = SFBay (T1 will not see this as it's already
scanned past the point where it is inserted), T2 then updates the
branch_total for SFBay, and commits.
Now T1, having missed the new account, looks at the branch_total and sees
an error.
There is no error really, just a new account row that T1 didn't see.
Note that nobody is breaking any rules about data item locks. The insert by
T2 holds a write lock on a new account row that T1 never read. T2 locks the
branch_total row, but then commits before T1 tries to read it.
-250-
No data item lock COULD help with this problem. But we have non-SR behavior
nonetheless.
The solution is this: When T1 reads the predicate branch_id = SFBay on
accounts, it takes a Read lock ON THAT PREDICATE, that is to say a Read
lock on the SET of rows to be returned from that Select statement.
Now when T2 tries to Insert a new row in accounts that will change the set of
rows to be returned for SFBay, it must take a Write lock on that predicate.
Clearly this Write lock and Read lock will conflict. Therefore T2 will have to
wait until T1 reaches EOT and releases all locks.
So the history of Example 10.5.3 can't happen. (In reality, use a type of
locking called Key-Range locking to guarantee predicate locks. Cover in
Database Implementation course.)
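The predicate-lock conflict test itself is conceptually simple: an insert conflicts with a Read lock on a predicate exactly when the new row satisfies the predicate. A sketch (remember that real systems approximate this with Key-Range locking rather than evaluating predicates directly):

```python
def insert_conflicts_with_predicate_lock(predicate, new_row):
    """A Write (insert) conflicts with a Read lock on a predicate iff the
    inserted row satisfies the predicate, i.e. it would change the set of
    rows the earlier Select returned."""
    return predicate(new_row)

# T1 holds a Read lock on the predicate branch_id = 'SFBay'.
sfbay = lambda row: row["branch_id"] == "SFBay"

# T2's insert of a new SFBay account conflicts: T2 must WAIT for T1's EOT.
print(insert_conflicts_with_predicate_lock(
    sfbay, {"branch_id": "SFBay", "balance": 100.00}))    # True
# An insert into a different branch does not conflict.
print(insert_conflicts_with_predicate_lock(
    sfbay, {"branch_id": "Boston", "balance": 100.00}))   # False
```

The conflict blocks exactly the phantom in Example 10.5.3: T2's insert of A4 waits until T1 commits.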
ANSI Repeatable Read Isolation doesn't provide Predicate Locks, but ANSI
Serializable does.
Note that Oracle doesn't ever perform predicate locks. Oracle's
SERIALIZABLE isolation level uses a different approach, based on snapshot
reads, that is beyond what we can explain in this course.
10.6 Transactional Recovery.
The idea of transactional recovery is this.
Memory is "Volatile", meaning that at unscheduled times we will lose memory
contents (or become unsure of their validity).
But a database transaction, in order to work on data from disk, must read it
into memory buffers.
Remember that a transaction is "atomic", meaning that all update operations
a transaction performs must either ALL succeed or ALL fail.
If we read two pages into memory during a transaction and update them
both, we might (because of buffering) have one of the pages go back out to
disk before we commit.
-251-
What are we to do about this after a crash? An update has occurred to disk
where the transaction did not commit. How do we put the old page back in
place?
How do we even know what happened? That we didn't commit?
A similar problem arises if we have two pages in memory and after commit
we manage to write one of the pages back to disk, but not the other.
(In fact, we always attempt to minimize disk writes from popular buffers,
just as we minimize disk reads.)
How do we fix it so that the page that didn't get written out to disk gets out
during recovery?
The answer is that as a transaction progresses we write notes to ourselves
about what changes have been made to disk pages. We ensure that these
notes get out to disk to allow us to correct any errors after a crash.
These notes are called "logs", or "log entries". The log entries contain
"Before Images" and "After Images" of every update made by a Transaction.
In recovery, we can back up an update that shouldn't have gotten to disk (the
transaction didn't commit) by applying a Before Image.
Similarly, we can apply After Images to correct for disk pages that should
have gotten to disk (the transaction did commit) but never made it.
There is a "log buffer" in memory (quite long), and we write the log buffer
out to the "log on disk" every time one of the following events occurs.
(1) The log buffer fills up. We write it to disk and meanwhile continue filling
another log buffer with logs. This is known as "double buffering" and saves us
from having to wait until the disk write completes.
(2) Some transaction commits. We write the log buffer, including all logs up
to the present time, before we return from commit to the application (and
the application hands out the money at the ATM). This way we're sure we
won't forget what happened.
Everything else in the next few sections is details: what do the logs look like,
how does recovery take place, how can we speed up recovery, etc.
-252-
10.7 Recovery in Detail: Log Formats.
Consider the following history H5 of operations as seen by the scheduler:
(10.7.1) H5 = R1(A,50) W1(A,20) R2(C,100) W2(C,50) C2 R1(B,50) W1(B,80) C1
Because of buffering, some of the updates shown here might not get out to
disk as of the second commit, C1. Assume the system crashes immediately
after. How do we recover all these lost updates?
While the transaction was occurring, we wrote out the following logs as each
operation occurred (Figure 10.13, pg. 673).
OPERATION    LOG ENTRY                     *** LEAVE UP ON BOARD ***
R1(A,50)     (S, 1)  Start transaction T1. No log entry is written for a
             Read operation, but this operation is the start of T1.
W1(A,20)     (W, 1, A, 50, 20)  T1 Write log for the update of A.balance.
             The value 50 is the Before Image (BI) for the A.balance
             column in row A; 20 is the After Image (AI).
R2(C,100)    (S, 2)  Another start-transaction log entry.
W2(C,50)     (W, 2, C, 100, 50)  Another Write log entry.
C2           (C, 2)  Commit T2 log entry. (Write Log Buffer to Log File.)
R1(B,50)     No log entry.
W1(B,80)     (W, 1, B, 50, 80)
C1           (C, 1)  Commit T1 log entry. (Write Log Buffer to Log File.)
Assume that a System Crash occurred immediately after the W1(B,80)
operation.
-253-
This means that the log entry (W, 1, B, 50, 80) has been placed in the log
buffer, but the last point at which the log buffer was written out to disk was
with the log entry (C, 2)
This is the final log entry we will find when we begin to recover from the
crash. Assume that the values out on disk are A = 20 (the update to 20
drifted out to disk), B = 50 (update didn't get to disk), and C = 100 (same).
If you look carefully at the sequence, where T2 committed and T1 didn't, you
will see that the values should be: A = 50, B = 50, C = 50.
After the crash, a command is given by the system operator that initiates
recovery. This is usually called the RESTART command.
The process of recovery takes place in two phases, Roll Back and Roll
Forward. The Roll Back phase backs out updates by uncommitted transactions,
and Roll Forward reapplies updates of committed transactions.
In Roll Back, the entries in the disk log are read backward to the beginning,
System Startup, when A = 50, B = 50, and C = 100.
In Roll Back, the system makes a list of all transactions that did and did not
commit. This is used to decide what gets backed out and reapplied.
LOG ENTRY             ROLL BACK/ROLL FORWARD ACTION PERFORMED
1. (C, 2)             Put T2 on the "Committed List".
2. (W, 2, C, 100, 50) Since T2 is on the "Committed List", we do nothing.
3. (S, 2)             Make a note that T2 is no longer "Active".
4. (W, 1, A, 50, 20)  Transaction T1 never committed (its last operation
                      was a Write). Therefore, the system performs UNDO of
                      this update by writing the Before Image value (50)
                      into data item A. Put T1 on the "Uncommitted List".
5. (S, 1)             Make a note that T1 is no longer "Active". Now that
                      no transactions are active, we can end the ROLL BACK
                      phase.
-254-
ROLL FORWARD
6. (S, 1)             No action required.
7. (W, 1, A, 50, 20)  T1 is Uncommitted: no action required.
8. (S, 2)             No action required.
9. (W, 2, C, 100, 50) Since T2 is on the Committed List, we REDO this
                      update by writing the After Image value (50) into
                      data item C.
10. (C, 2)            No action required.
11.                   We note that we have rolled forward through all log
                      entries and terminate Recovery.
Note that at this point, A = 50, B = 50, and C = 50.
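The whole Roll Back / Roll Forward procedure fits in a short sketch. The log-entry tuples mirror the formats above, and the code assumes the simplified setting of this section (no checkpoints, whole log scanned in each phase):

```python
def restart(disk_log, disk_data):
    """Recover disk_data from disk_log by Roll Back, then Roll Forward.
    Log entries: ("S", tx), ("W", tx, item, before, after), ("C", tx)."""
    committed = set()

    # ROLL BACK: read the log backward; UNDO writes of uncommitted Txs.
    for entry in reversed(disk_log):
        if entry[0] == "C":
            committed.add(entry[1])
        elif entry[0] == "W":
            _, tx, item, before, _after = entry
            if tx not in committed:
                disk_data[item] = before          # apply Before Image
    # ROLL FORWARD: read forward; REDO writes of committed Txs.
    for entry in disk_log:
        if entry[0] == "W":
            _, tx, item, _before, after = entry
            if tx in committed:
                disk_data[item] = after           # apply After Image
    return disk_data

# The log that reached disk for H5 ends with (C, 2); disk holds A=20, B=50, C=100.
log = [("S", 1), ("W", 1, "A", 50, 20), ("S", 2), ("W", 2, "C", 100, 50), ("C", 2)]
print(restart(log, {"A": 20, "B": 50, "C": 100}))   # {'A': 50, 'B': 50, 'C': 50}
```

Running it on the crash scenario above reproduces the hand trace: T1's write to A is undone, T2's write to C is redone, and the final values are A = 50, B = 50, C = 50.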
Guarantees That Needed Log Entries are on Disk
How could a problem occur with our method of writing logs and recovering?
Look at the history earlier again and think what would happen if we ended up
with B = 80 because the final written value of B got out to disk.
Since we have been assuming that the log (W, 1, B, 50, 80) did NOT get out
to disk, we wouldn't be able to UNDO this update (which should not occur,
since T1 did not commit).
This is a problem that we solve with a policy that ties the database buffer
writes to the Log. The policy is called Write-Ahead Log (WAL).
It guarantees that no dirty buffer page gets written back to disk before the
log entry that would be able to UNDO it gets written to the disk Log file.
OK, this would solve the problem of UNDOs. Do we ever have a problem with
REDOs? No, because we always write the Log buffer to the Log file as part
of the Commit. So we're safe in doing REDO for committed Txs.
The text has a "proof" that recovery will work, given these log formats, the
RESTART procedure of Roll Back/Roll Forward, and the WAL policy.
-255-
-256-
Class 24.
10.8 Checkpoints
In the recovery process we just covered, we performed ROLLBACK to
System Startup Time, when we assume that all data is valid.
We assume that System Startup occurs in the morning, database processing
continues during the day, and everything completes at the end of the day.
(This is a questionable assumption nowadays, with many companies needing
to perform 24X7 processing, 24 hours a day, 7 days a week, with no time
when transactions are guaranteed to not be active.)
Even if we have an 8-hour day of processing, however, we can run into real
problems recovering a busy transactional system (heavy update throughput)
with the approach we've outlined so far.
The problem is that it takes nearly as much processing time to RECOVER a
transaction as it did to run it in the first place.
If our system is strained to the limit keeping up with updates from 9:00 AM
to 5:00 PM, and the system crashes at 4:59, it will take nearly EIGHT HOURS
TO RECOVER.
This is the reason for checkpointing. When a "Checkpoint" is taken at a given
time (4:30 PM) this makes it possible for Recovery to limit the logs it needs
to ROLLBACK and ROLLFORWARD.
A simple type of Checkpoint, a "Commit Consistent Checkpoint," merely
duplicates the process of shutting down for the night, but then transactions
start right up again.
The problem is that it might take minutes to take a Commit Consistent
Checkpoint, and during that time NO NEW TRANSACTIONS CAN START UP.
For this reason, database systems programmers have devised two other
major checkpointing schemes that reduce the "hiccup" in transaction
processing that occurs while a checkpoint is being performed.
-257-
The Commit Consistent Checkpoint is improved on by using something called
a "Cache Consistent Checkpoint". Then an even more complicated checkpoint
called a "Fuzzy Checkpoint" improves the situation further.
So this is what we will cover now, in order (put on board):
Commit Consistent Checkpoint
Cache Consistent Checkpoint
Fuzzy Checkpoint
We define the Checkpoint Process. From time to time, a Checkpoint is
triggered, probably by a system clock event once a set time has elapsed
since the last checkpoint.
Def 10.8.1. Commit Consistent Checkpoint steps. After the "performing
checkpoint state" is entered, we have the following rules.
(1) No new transactions can start until the checkpoint is complete.
(2) Database operation processing continues until all existing transactions
Commit, and all their log entries are written to disk. (Thus we are Commit
Consistent.)
(3) Then the current log buffer is written out to the log file, and after
this the system ensures that all dirty pages in buffers have been written
out to disk.
(4) When steps (1)-(3) have been performed, the system writes a special log
entry, (CKPT), to disk, and the Checkpoint is complete.
It should be clear that these steps are basically the same ones that would be
performed to BRING THE SYSTEM DOWN for the evening.
We allow transactions in progress to finish, but don't allow new ones, and
everything in volatile memory that reflects a disk state is put out to disk.
As a matter of fact, the Disk Log File can now be emptied. We needed it
while we were performing the checkpoint in case we crashed in the middle,
but now we don't need it any longer.
We will never need the Log File again to UNDO uncommitted transactions that
have data on disk (there are no such uncommitted transactions) or REDO
-258-
committed transactions that are missing updates on disk (all updates have
gone out to disk already).
From this, it should be clear that we can modify the Recovery approach we
have been talking about so that instead of a ROLLBACK to the Beginning of
the Log File at System Startup, we ROLLBACK to the LAST CHECKPOINT!!!
If we take a Checkpoint every five minutes, we will never have to recover
more than five minutes of logged updates, so recovery will be fast.
The problem is that the Checkpoint Process itself might not be very fast.
Note that we have to allow all transactions in progress to complete before
we can perform successive steps. If all applications we have use very short
transactions, there should be no problem.
Cache Consistent Checkpoint
But what if some transactions take more than five minutes to execute?
Then clearly we can't guarantee a Checkpoint every five minutes!!!
Worse, while the checkpoint is going on (and the last few transactions are
winding up) nobody else can start any SHORT transactions to read an
account balance or make a deposit!
We address this problem with something called the "Cache Consistent
Checkpoint". With this scheme, transactions can continue active through the
checkpoint. We don't have to wait for them all to finish and commit.
Definition 10.8.2. Cache Consistent Checkpoint procedure steps.
(1) No new transactions are permitted to start.
(2) Existing transactions are not permitted to start any new operations.
(3) The current log buffer is written out to disk, and after this the
system ensures that all dirty pages in cache buffers have been written
out to disk. (Thus, we are "Cache" (i.e., Buffer) Consistent on disk.)
(4) Finally, a special log entry, (CKPT, List), is written out to disk, and
the Checkpoint is complete. NOTE: this (CKPT) log entry contains a list of
active transactions at the time the Checkpoint occurs.
The recovery procedure using Cache Consistent Checkpoints differs from
Commit Consistent Checkpoint recovery in a number of ways.
Ex 10.8.1 Cache Consistent Checkpoint Recovery.
Consider the history H5:
H5: R1(A, 10) W1(A, 1) C1 R2(A, 1) R3(B, 2) W2(A, 3) R4(C, 5) CKPT
W3(B, 4) C3 R4(B, 4) W4(C, 6) C4 CRASH
Here is the series of log entry events resulting from this history. The
last one that gets out to disk is the (C, 3) log entry.
(S, 1) (W, 1, A, 10, 1) (C, 1) (S, 2) (S, 3) (W, 2, A, 1, 3) (S, 4)
(CKPT, (LIST = T2, T3, T4)) (W, 3, B, 2, 4) (C, 3) (W, 4, C, 5, 6) (C, 4)
At the time we take the Cache Consistent Checkpoint, we will have values
out on disk: A = 3, B = 2, C = 5. (The dirty page in cache containing A at
checkpoint time is written to disk.)
Assume that no other updates make it out to disk before the crash, and
so the data item values remain the same.
Here is a diagram of the time scale of the various events. Transaction Tk
begins with the (S, k) log, and ends with (C, k). **LEAVE ON BOARD**
              Checkpoint                      Crash
T1 |-------|
T2     |--------------------------------------------
T3       |------------------------------|
T4             |---------|
Next, we outline the actions taken in recovery, starting with ROLL BACK.
-260-
ROLL BACK
1. (C, 3)             Note T3 is a committed Tx; put it on the Committed
                      List.
2. (W, 3, B, 2, 4)    Committed transaction; wait for ROLL FORWARD.
3. (CKPT, (LIST = T2, T3, T4))  Note active transactions T2, T3, and T4.
                      T3 has already committed (step 1); T2 and T4 HAVE
                      NOT COMMITTED (no (C, 2) or (C, 4) logs have been
                      encountered).
4. (S, 4)             List of Active transactions now shorter: {T2, T3}
5. (W, 2, A, 1, 3)    Not Committed. UNDO: A = 1
6. (S, 3)             List of Active Transactions shorter: {T2}
7. (S, 2)             List of Active Transactions empty. STOP ROLLBACK.
With a Cache Consistent Checkpoint, when ROLL BACK encounters the CKPT log
entry, the system takes note of the transactions that were active at the
checkpoint, even though it may never have seen any of their operations in
the log file.
We now take our list of active transactions, remove those that we have
seen committed, and have a list of transactions whose updates we need to
UNDO. Since transactions can live through Checkpoints, we may have to go
back PRIOR to the Checkpoint while UNDO steps might remain.
We continue in the ROLL BACK phase until we complete all such UNDO
actions. We can be sure when this happens because we encounter (S, k) logs
as we roll backward.
When all Active Uncommitted Tk have been removed, the ROLL BACK is
complete, even though there may be more entries occurring earlier in the
log file.
ROLL FORWARD
8. (CKPT, (LIST = T2, T3, T4))  Skip forward in the log file to this
                      entry; start after it.
9. (W, 3, B, 2, 4)    Since T3 is on the Committed List, we REDO this
                      update by writing the After Image value (4) into
                      data item B.
10. (C, 3)            No action required. Last entry on the disk log:
                      ROLL FORWARD is complete.
In starting the Roll Forward Phase, we merely need to REDO all updates by
committed transactions that might not have gone out to disk.
-261-
We can jump forward to the first operation after the Checkpoint, since we
know that all earlier updates were flushed from buffers.
Roll Forward continues to the end of the disk Log File. Recall that the
values on disk at the time of the crash were: A = 3, B = 2, C = 5. At the
end of Recovery, we have set A = 1 (Step 5) and B = 4 (Step 9).
We still have C = 5: T4's (W, 4, C, 5, 6) and (C, 4) log entries never
reached the disk log, so as far as recovery can tell T4 never committed,
and its update must not be applied. A glance at the time scale figure
shows that we want updates performed by the committed transaction T3 to be
applied, and those by T2 and T4 backed out. There were no writes performed
by T4 that got out to disk, so nothing need be undone for it, and we have
achieved what is necessary for recovery: A = 1, B = 4, C = 5.
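Extending the earlier RESTART sketch to Cache Consistent Checkpoints: ROLL BACK stops once the start of every Tx in the checkpoint's active list has been seen, and ROLL FORWARD begins just after the CKPT entry. This is a sketch under the assumptions of Ex 10.8.1 (a CKPT entry exists, and the disk log ends with (C, 3)):

```python
def restart_cc(disk_log, disk_data):
    """Recovery with a Cache Consistent Checkpoint.
    Entries: ("S", tx), ("W", tx, item, before, after), ("C", tx),
    and ("CKPT", active_list)."""
    committed = set()
    need_start = None        # Txs whose (S, tx) ROLL BACK must still see
    ckpt_pos = None

    # ROLL BACK: scan the disk log backward.
    for i in range(len(disk_log) - 1, -1, -1):
        entry = disk_log[i]
        if entry[0] == "C":
            committed.add(entry[1])
        elif entry[0] == "W":
            _, tx, item, before, _after = entry
            if tx not in committed:
                disk_data[item] = before           # UNDO uncommitted write
        elif entry[0] == "CKPT":
            ckpt_pos = i
            need_start = set(entry[1])             # active Txs at checkpoint
        elif entry[0] == "S" and need_start is not None:
            need_start.discard(entry[1])
            if not need_start:
                break                              # ROLL BACK complete

    # ROLL FORWARD: start just after the CKPT entry.
    for entry in disk_log[ckpt_pos + 1:]:
        if entry[0] == "W":
            _, tx, item, _before, after = entry
            if tx in committed:
                disk_data[item] = after            # REDO committed write
    return disk_data

# Disk log of Ex 10.8.1 (last entry to reach disk is (C, 3)); disk: A=3, B=2, C=5.
log = [("S", 1), ("W", 1, "A", 10, 1), ("C", 1), ("S", 2), ("S", 3),
       ("W", 2, "A", 1, 3), ("S", 4), ("CKPT", [2, 3, 4]),
       ("W", 3, "B", 2, 4), ("C", 3)]
print(restart_cc(log, {"A": 3, "B": 2, "C": 5}))   # {'A': 1, 'B': 4, 'C': 5}
```

The trace matches the hand-worked steps: T2's write to A is undone, T3's write to B is redone, and C is untouched.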
Fuzzy Checkpoint.
A problem can still arise that makes the Cache Consistent Checkpoint a
major hiccup in Transaction Processing.
Note in the procedure that we can't let any Active Transactions continue, or
start any new ones, until all buffers are written to disk. What if there are a
LOT of Buffers?
Some new machines have several GBytes of memory! That's probably
Minutes of I/O, even if we have a lot of disks. DISK I/O is SLOW!!!
OK, with a Fuzzy Checkpoint, each checkpoint, when it completes, makes the
PREVIOUS checkpoint a valid place to stop ROLLBACK.
Definition 10.8.3. Fuzzy Checkpoint procedure steps.
(1) Prior to Checkpoint start, the remaining pages that were dirty at the
prior checkpoint are forced out to disk (but the rate of writes should
leave I/O capacity to support current transactions in progress; there is
no critical hurry in doing this).
(2) No new transactions are permitted to start. Existing transactions
are not permitted to start any new operations.
(3) The current log buffer is written out to disk with an appended log
entry, (CKPT_N, List), as in the Cache Consistent Checkpoint procedure.
(4) The set of pages in buffer that have become dirty since the last
checkpoint log, CKPT_(N-1), is noted.
-262-
This will probably be accomplished by special flags on the Buffer
directory. There is no need for this information to be made disk resident,
since it will be used only to perform the next checkpoint, not in case of
recovery. At this point the Checkpoint is complete.
As explained above, the recovery procedure with Fuzzy Checkpoints differs
from the procedure with Cache Consistent Checkpoints only in that ROLL
FORWARD must start with the first log entry following the SECOND to last
checkpoint log. We have homework on this.
Class 25.
Covered last time the various Checkpoints: Commit Consistent, Cache
Consistent, and Fuzzy. Any questions?
What really happens with commercial databases. Used to be all Commit
Consistent, now often Fuzzy.
Also used to be VERY physical. BI and AI meant physical copies of the entire
PAGE. Still need to do this sometimes, but for long-term log can be more
"logical".
Instead of "Here is the way the page looked after this update/before this
update", have: this update was ADD 10 to Column A of row with RID 12345
with version number 1121.
Version number is important to keep updates idempotent.
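Why the version number keeps a logical update idempotent can be shown in a few lines (a minimal sketch; the row layout and log-entry fields here are illustrative assumptions, not any real system's format). The update is applied only if the row's version is older than the entry's, so replaying the same log entry after a crash has no extra effect:

```python
# Illustrative logical redo: "ADD delta to column of row with this RID",
# guarded by a version number so that replay is idempotent.
def redo(row, entry):
    if row["version"] < entry["version"]:     # already applied? then skip
        row[entry["column"]] += entry["delta"]
        row["version"] = entry["version"]

row = {"rid": 12345, "A": 90, "version": 1120}
entry = {"rid": 12345, "column": "A", "delta": 10, "version": 1121}

redo(row, entry)
redo(row, entry)   # replayed during recovery: no double-add
assert row["A"] == 100 and row["version"] == 1121
```

Without the version check, applying "ADD 10" twice would leave A = 110, which a physical page AI could never do; the version number buys back that safety for logical logging.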
Note that locking granularity is intertwined with the type of recovery. It
doesn't do any good to have row-level locking if we have page-level recovery.
T1 changes Row 12345 column A from 123 to 124, and the log gives a page BI
with A = 123; T1 hasn't committed yet.
T2 changes Row 12347 (same page) column B from 321 to 333 and
commits; the log gives a page AI with Row 12345 showing 124 and Row 12347
column B showing 333.
Transaction T2 commits, T1 doesn't, then we have a crash. What do we do?
Put the AI of T2 in place? Gives the wrong value for A. Put the BI of T1 in
place? Wrong value for B.
Page level logging implies page level locking is all we can do.
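The conflict can be replayed in a few lines (a toy model: the "page" is just a dict of two columns, which is an assumption for illustration). The correct post-crash state undoes uncommitted T1 but keeps committed T2, and neither page-level image can produce it:

```python
page = {"A": 123, "B": 321}
t1_bi = dict(page)         # page-level before-image taken for T1
page["A"] = 124            # T1's update (uncommitted)
page["B"] = 333            # T2's update; T2 then commits
t2_ai = dict(page)         # page-level after-image logged for T2

correct = {"A": 123, "B": 333}   # undo T1, preserve committed T2
assert t2_ai != correct          # redoing T2's page AI resurrects A = 124
assert t1_bi != correct          # undoing with T1's page BI loses B = 333
```

Either whole-page image drags the other transaction's row along with it, which is exactly why page-level logging forces page-level locking.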
Sybase SQL Server STILL doesn't have row-level locking.
10.9 Media Recovery
Problem is that Disk can fail. (Not just stop running: head can score disk
surface.) How do we recover?
First, we write our Log to TWO disk backups. Try to make sure two disks
have Independent Failure Modes (not same controller, same power supply).
We say that storage that has two duplicates is called stable storage, as
compared to nonvolatile storage for a normal disk copy.
Before System Startup run BACKUP (bulk copy of disks/databases).
Then, when perform Recovery from Media failure, put backup disk in place
and run ROLL FORWARD from that disk, as if from startup in the morning.
As if normal recovery on this disk except that all pages on this disk were
VERY popular and never got written out to disk. Don't need Roll Back except
conceptually.
RAID Disks
Something they have nowadays is RAID disks. RAID stands for Redundant
Arrays of Inexpensive Disks. Invented at Berkeley. The Inexpensive part
rather got lost when this idea went commercial.
The simplest kind, RAID 1, mirrors Writes. Could have two disks and write
everything in two copies. Every time we Write, we need 2 Writes.
So if one disk lost, just use other disk. Put another blank disk in for mirror,
and while operating with normal Reads and Writes do BACKUP to new disk
until have mirror.
This approach saves the time needed to do media recovery. And of course
works for OS files, where there IS no media recovery.
You can buy these now. As complex systems get more and more disks, will
eventually need RAID. The more units there are on a system, the more
frequent the failures.
Note that mirrored Writes are handled by controller. Doesn't waste time of
System to do 2 Writes.
But when Read, can Read EITHER COPY. Use disk arms independently.
So we take twice as much media, and if all we do is Writes, need twice as
many disk arms to do the same work.
But if all we do is Reads, get independent disk arm movements, so get twice
as many Reads too.
But in order for twice as many Reads to work, need warm data, where disk
capacity is not the bottleneck, but disk arm movement is.
Definitely lose the capacity in RAID 1. But if we were only going to use half
the capacity of the disk because we have so many Reads, RAID 1 is fine.
There is an alternative form of RAID, RAID 5, that uses less capacity than
mirroring. Trick is to have 6 disks: 5 holding real data pages, one holding
a checksum page.
Use XOR for Checksum. CK = D1 XOR D2 XOR D3 XOR D4 XOR D5.
If (say) D1 disappears, can figure out what it was:
D1 = CK XOR D2 XOR D3 XOR D4 XOR D5
(Prove this: if A = B XOR C, then B = A XOR C and C = A XOR B.
1 = 1 XOR 0 => 1 = 1 XOR 0 and 0 = 1 XOR 1
1 = 0 XOR 1 => 0 = 1 XOR 1 and 1 = 1 XOR 0
0 = 1 XOR 1 => etc.
0 = 0 XOR 0)
So if one disk drops out, keep accessing data on it using XOR of other 5.
Recover all pages of the failed disk onto a replacement in the same way.
This takes a lot of time to recover, but it DOES save disk media.
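The XOR reconstruction above is easy to demonstrate (a minimal sketch with toy 8-byte blocks; note the text's fixed checksum disk is a simplification — real RAID 5 rotates the parity block among the disks):

```python
# CK = D1 XOR D2 XOR D3 XOR D4 XOR D5, computed byte by byte.
def xor_blocks(blocks):
    ck = bytes(len(blocks[0]))               # start from all zeros
    for b in blocks:
        ck = bytes(x ^ y for x, y in zip(ck, b))
    return ck

data = [bytes([i] * 8) for i in range(1, 6)]  # toy blocks D1..D5
ck = xor_blocks(data)                         # the checksum block

# Disk 1 fails: D1 = CK XOR D2 XOR D3 XOR D4 XOR D5
rebuilt = xor_blocks([ck] + data[1:])
assert rebuilt == data[0]
```

The same call reconstructs any one missing block from the checksum plus the surviving four, which is exactly how the array keeps serving data after a single-disk failure.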
10.10 TPC-A Benchmark
The TPC-A Benchmark is now out of date. Newer TPC-C Benchmark: more
complex.
See Fig 10.16, pg. 686. Size of tables determined by TPS.
See Fig 10.17, pg. 687. All threads do the same thing. Run into each other
in concurrency control because of Branch table and History table.
Benchmark specifies how many threads there are, how often each thread
runs a Tx, costs of terminals, etc.
On a good system, just add disks until use 95% of CPU. On a bad system, run
into bottlenecks.
Ultimate measure is TPS and $/TPS.