Hi all,
I have not received any update on my query above, but I found a few things out myself. I am sharing them here in the hope that they help others.
1. I created the table below.
CREATE TABLE SCPLN.CLASSES, NO FALLBACK
(
CLASSUID INTEGER NOT NULL,
COURSEUID INTEGER NOT NULL,
CREATEDDATE TIMESTAMP(6),
ETL_ACTION VARCHAR(1)
)
UNIQUE PRIMARY INDEX(CLASSUID)
;
Each course ID may have many class IDs, so course:class is a 1:many relationship. Then I collected statistics:
COLLECT STATS ON SCPLN.CLASSES COLUMN(COURSEUID);
COLLECT STATS ON SCPLN.CLASSES COLUMN(ETL_ACTION);
2. The second table is:
CREATE TABLE SCPLN.COURSES, NO FALLBACK
(
COURSEUID INTEGER NOT NULL,
COURSENAME VARCHAR(20)
)
UNIQUE PRIMARY INDEX(COURSEUID)
;
This table has around 30,000 rows, evenly distributed.
3. Then I ran the SQL below, which joins the two tables.
SELECT O.CLASSUID, O.CREATEDDATE, C.COURSEUID, C.COURSENAME
FROM SCPLN.COURSES C
INNER JOIN
SCPLN.CLASSES O
ON C.COURSEUID = O.COURSEUID
WHERE O.ETL_ACTION = 'I'
;
The initial row count of CLASSES was 246,077, of which 4,529 rows (about 1.8%) had COURSEUID = 383712.
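For anyone reproducing this, the per-value counts can be checked with a query along these lines. This is only a sketch, not the exact statement used above, and the alias ROW_CNT is an illustrative name.

```sql
-- Rows per COURSEUID, largest first. A value that holds a very large
-- share of the table is a candidate for skew whenever the rows are
-- redistributed by the hash of this column.
SELECT COURSEUID,
       COUNT(*) AS ROW_CNT
FROM SCPLN.CLASSES
GROUP BY COURSEUID
ORDER BY ROW_CNT DESC;
```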
The first EXPLAIN of the SQL said:
4) We do an all-AMPs RETRIEVE step from SCPLN.O by way of an
all-rows scan with a condition of ("SCPLN.O.ETL_ACTION = 'I'")
into Spool 2 (all_amps), which is redistributed by the hash code
of (SCPLN.O.COURSEUID) to all AMPs. The size of Spool 2 is
estimated with high confidence to be 246,077 rows (7,628,387
bytes). The estimated time for this step is 0.08 seconds.
5) We do an all-AMPs JOIN step from SCPLN.C by way of an all-rows
scan with no residual conditions, which is joined to Spool 2 (Last
Use) by way of an all-rows scan. SCPLN.C and Spool 2 are joined
using a single partition hash join, with a join condition of (
"SCPLN.C.COURSEUID = COURSEUID"). The result goes into Spool 1
(group_amps), which is built locally on the AMPs. The size of
Spool 1 is estimated with low confidence to be 246,077 rows (
15,502,851 bytes). The estimated time for this step is 0.03
seconds.
4. I inserted approximately 100,000 more rows. The row count of CLASSES is now 327,292 (a net increase of 81,215), of which 85,744 rows (about 26%) have COURSEUID = 383712.
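Note for anyone reproducing this: the optimizer learns about the new distribution largely through statistics, so it is probably worth refreshing them after the load before comparing plans. A sketch, assuming the same two column statistics as in step 1 and the same query as in step 3:

```sql
-- Refresh the column statistics from step 1 so the optimizer sees the
-- post-insert distribution of COURSEUID and ETL_ACTION.
COLLECT STATS ON SCPLN.CLASSES COLUMN(COURSEUID);
COLLECT STATS ON SCPLN.CLASSES COLUMN(ETL_ACTION);

-- Prefix the query with EXPLAIN to see the plan without running it.
EXPLAIN
SELECT O.CLASSUID, O.CREATEDDATE, C.COURSEUID, C.COURSENAME
FROM SCPLN.COURSES C
INNER JOIN
SCPLN.CLASSES O
ON C.COURSEUID = O.COURSEUID
WHERE O.ETL_ACTION = 'I'
;
```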
5. I ran the SQL again. The EXPLAIN is now:
4) We execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from SCPLN.O by way of an
all-rows scan with a condition of ("SCPLN.O.ETL_ACTION =
'I'") into Spool 2 (all_amps), which is built locally on the
AMPs. The size of Spool 2 is estimated with high confidence
to be 327,292 rows (10,146,052 bytes). The estimated time
for this step is 0.02 seconds.
2) We do an all-AMPs RETRIEVE step from SCPLN.C by way of an
all-rows scan with no residual conditions into Spool 3
(all_amps), which is duplicated on all AMPs. The size of
Spool 3 is estimated with high confidence to be 2,758,860
rows (68,971,500 bytes). The estimated time for this step is
0.06 seconds.
5) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of an
all-rows scan, which is joined to Spool 3 (Last Use) by way of an
all-rows scan. Spool 2 and Spool 3 are joined using a single
partition hash join, with a join condition of ("COURSEUID =
COURSEUID"). The result goes into Spool 1 (group_amps), which is
built locally on the AMPs. The size of Spool 1 is estimated with
low confidence to be 327,292 rows (20,619,396 bytes). The
estimated time for this step is 0.05 seconds.
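A note on the Spool 3 estimate: for a duplicated spool, the row estimate is normally the total over all AMPs, so 2,758,860 should be roughly the COURSES row count multiplied by the number of AMPs. A sketch for checking that on your own system (the aliases AMP_COUNT and COURSES_ROWS are illustrative names):

```sql
-- HASHAMP() with no argument returns the highest AMP number (zero-based),
-- so adding 1 gives the number of AMPs in the system.
SELECT HASHAMP() + 1 AS AMP_COUNT;

-- Actual row count of the duplicated table, for comparison with
-- (Spool 3 row estimate / AMP_COUNT).
SELECT COUNT(*) AS COURSES_ROWS
FROM SCPLN.COURSES;
```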
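So I think the optimizer is smart enough to detect, presumably from the statistics on COURSEUID, when redistributing by that column would skew the data. In the first plan it redistributes the CLASSES rows by the hash of COURSEUID and joins them to COURSES directly. In the second plan it leaves the CLASSES rows local on each AMP and duplicates the much smaller COURSES table to all AMPs instead, so that the join work stays roughly even across AMPs.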
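One way to check this conclusion is to estimate how the qualifying CLASSES rows would spread across AMPs if they were redistributed by COURSEUID. A minimal sketch using the built-in hash functions HASHROW, HASHBUCKET and HASHAMP; the aliases AMP_NO and ROW_CNT are illustrative names:

```sql
-- Count how many qualifying CLASSES rows would land on each AMP if the
-- rows were redistributed by the hash of COURSEUID (as in the first plan).
SELECT HASHAMP(HASHBUCKET(HASHROW(COURSEUID))) AS AMP_NO,
       COUNT(*) AS ROW_CNT
FROM SCPLN.CLASSES
WHERE ETL_ACTION = 'I'
GROUP BY 1
ORDER BY 2 DESC;
```

If one AMP shows far more rows than the rest, redistribution by that column is skewed. Here the AMP that owns COURSEUID = 383712 would receive every qualifying row with that value.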
Please correct me if I am missing anything.
Thanks
Santanu