Why is the most natural query (i.e. using INNER JOIN instead of LEFT JOIN) very slow?

(As directed, I'm putting part of my comment in an answer, as it solved the problem.) Convert the EXISTS expressions into IN expressions. This works better in this instance because the query will now effectively be evaluated from the "inside out", starting with the query that contains your most limiting factor: the full text search lookup. That query returns a small set of rows that can be looked up directly against the primary key of the outer query (WHERE x IN (SELECT x ...)), as opposed to calling the "inner" query once per value of the outer query (or for all values in your original case, if I am reading it correctly).
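As a rough illustration of the rewrite itself, here is a minimal sketch using Python's built-in sqlite3 module. The two-table schema is a hypothetical simplification of the tables in the question, and a plain equality filter stands in for the full text search. SQLite's planner is not PostgreSQL's, so this only shows that the two forms return the same rows; it says nothing about their relative speed.

```python
import sqlite3

# Hypothetical, heavily simplified schema: the real query joins many more tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (company_rec_id TEXT PRIMARY KEY, company TEXT);
    CREATE TABLE parcel  (parcel_rec_id TEXT PRIMARY KEY,
                          client_rec_id TEXT, parameter TEXT);
    INSERT INTO company VALUES ('c1', 'Acme'), ('c2', 'Bolt'), ('c3', 'Crux');
    INSERT INTO parcel VALUES
        ('p1', 'c1', 'cadmium'),
        ('p2', 'c1', 'lead'),
        ('p3', 'c2', 'zinc'),
        ('p4', 'c3', 'cadmium');
""")

# Correlated form: the subquery refers to the outer row (c.company_rec_id).
exists_sql = """
    SELECT c.company_rec_id FROM company c
    WHERE EXISTS (SELECT 1 FROM parcel ord
                  WHERE ord.client_rec_id = c.company_rec_id
                    AND ord.parameter = ?)
    ORDER BY c.company_rec_id
"""

# Uncorrelated form: the subquery can run once, then be matched against the key.
in_sql = """
    SELECT c.company_rec_id FROM company c
    WHERE c.company_rec_id IN (SELECT ord.client_rec_id FROM parcel ord
                               WHERE ord.parameter = ?)
    ORDER BY c.company_rec_id
"""

exists_rows = conn.execute(exists_sql, ("cadmium",)).fetchall()
in_rows = conn.execute(in_sql, ("cadmium",)).fetchall()
print(exists_rows == in_rows, in_rows)  # True [('c1',), ('c3',)]
```

The key difference is that the IN subquery no longer mentions the outer table, which is what lets a planner evaluate the most selective filter first.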

The EXISTS method here results in Nested Loops (one evaluation of one query for each value in another), whereas the IN method uses Hash Joins (a much more efficient execution method in many, if not most, cases). Notice that with the EXISTS method there are four Nested Loops, each running at least 3,000 times. That cost adds up.

While it's not a direct comparison, you can treat Nested Loops like you would FOR loops in application code: each time you nest another loop, your big-O estimate goes up by a factor of n: O(n) to O(n^2) to O(n^3), etc. A Hash Join is more like a map, where two arrays are stepped through at the same time and an operation is performed on both. This is roughly linear (O(n)). Nesting these is additive, so it goes O(n) to O(2n) to O(3n), etc. Yeah, yeah, I know it's not quite the same thing, but the point is that multiple nested loops usually indicate a slow query plan, and comparing the two big-O style makes that easier to recognize, I believe.
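To make the analogy concrete, here is a small Python sketch that counts the work done by a naive nested loop join and by a hash join over the same two key lists. The function names and the key lists are made up for illustration. It is a toy model of the analogy, not of PostgreSQL's executor: a real Nested Loop often probes an index on the inner side instead of rescanning it in full.

```python
def nested_loop_join(outer, inner):
    """Compare every outer key with every inner key: len(outer) * len(inner) steps."""
    matches, steps = [], 0
    for o in outer:
        for i in inner:
            steps += 1
            if o == i:
                matches.append(o)
    return matches, steps


def hash_join(outer, inner):
    """Build a hash set from one input, then probe it once per row of the other."""
    steps = 0
    table = set()
    for i in inner:  # build phase: one step per inner row
        steps += 1
        table.add(i)
    matches = []
    for o in outer:  # probe phase: one step per outer row
        steps += 1
        if o in table:
            matches.append(o)
    return matches, steps


outer_keys = list(range(3000))           # roughly the table sizes in the plan
inner_keys = list(range(0, 3000, 100))   # 30 keys that survive a selective filter

nl_matches, nl_steps = nested_loop_join(outer_keys, inner_keys)
hj_matches, hj_steps = hash_join(outer_keys, inner_keys)
print(nl_steps, hj_steps)  # 90000 3030
```

Both functions return the same matches; only the step counts differ, multiplicative for the nested loop and additive for the hash join.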

Nested Loops and EXISTS are not evil, per se, but in most cases where there is a base filter condition that ultimately affects everything (for example, the full text search in the question), an IN expression (or, in some cases, a proper JOIN) yields a much more efficient plan.

Is your query essentially the following (this is hard to ask as a comment):

    select c.company_rec_id, c.the_company_code, c.company
    from company c
    where exists
    (
        select *
        from parameter p
        join mlist_detail_parameter mdp on mdp.parameter_rec_id = p.parameter_rec_id
        join mlist_detail md on md.mlist_detail_rec_id = mdp.mlist_detail_rec_id
        join mlist m on m.mlist_rec_id = md.mlist_rec_id
        join parcel_application ord_app on ord_app.parcel_application_rec_id = m.parcel_application_rec_id
        join parcel ord on ord.parcel_rec_id = ord_app.parcel_rec_id
        join tlist t on t.mlist_rec_id = m.mlist_rec_id
        where ord.client_rec_id = c.company_rec_id
          and to_tsvector(extract_words(p.parameter)) @@ plainto_tsquery(extract_words('cadmium'))
    )

EDIT: 2010-07-06, added by Michael Buen

    Hash Join (cost=21510-07-067..21710-07-06 rows=232 width=71) (actual time=71.106..71.207 rows=84 loops=1)
      Hash Cond: ((c.company_rec_id)::text = (ord.client_rec_id)::text)
      -> Seq Scan on company c (cost=0.00..10-07-06 rows=295 width=71) (actual time=0.004..0.030 rows=295 loops=1)
      -> Hash (cost=2150.04..2150.04 rows=232 width=37) (actual time=71.077..71.077 rows=84 loops=1)
        -> HashAggregate (cost=21410-07-067..2150.04 rows=232 width=37) (actual time=71.033..71.040 rows=84 loops=1)
          -> Nested Loop (cost=17810-07-068..21410-07-06 rows=652 width=37) (actual time=51.029..70.187 rows=1918 loops=1)
            -> Hash Join (cost=17810-07-068..19310-07-06 rows=652 width=111) (actual time=51.014..55.913 rows=1918 loops=1)
              Hash Cond: ((ord_app.parcel_rec_id)::text = (ord.parcel_rec_id)::text)
              -> Hash Join (cost=16610-07-06..1810-07-067 rows=652 width=111) (actual time=48.360..52.004 rows=1918 loops=1)
                Hash Cond: ((ord_app.parcel_application_rec_id)::text = (m.parcel_application_rec_id)::text)
                -> Seq Scan on parcel_application ord_app (cost=0.00..1210-07-068 rows=3218 width=74) (actual time=0.003..1.485 rows=3218 loops=1)
                -> Hash (cost=16510-07-069..16510-07-069 rows=652 width=111) (actual time=48.331..48.331 rows=1918 loops=1)
                  -> Hash Join (cost=1610-07-069..16510-07-069 rows=652 width=111) (actual time=4.755..46.122 rows=1918 loops=1)
                    Hash Cond: ((md.mlist_rec_id)::text = (m.mlist_rec_id)::text)
                    -> Nested Loop (cost=10-07-067..14810-07-069 rows=652 width=37) (actual time=1.638..40.974 rows=1918 loops=1)
                      -> Hash Join (cost=10-07-067..11610-07-06 rows=652 width=37) (actual time=1.590..18.090 rows=1918 loops=1)
                        Hash Cond: ((mdp.parameter_rec_id)::text = (p.parameter_rec_id)::text)
                        -> Seq Scan on mlist_detail_parameter mdp (cost=0.00..10110-07-06 rows=37187 width=74) (actual time=0.003..5.499 rows=37187 loops=1)
                        -> Hash (cost=10-07-06..10-07-06 rows=1 width=37) (actual time=1.568..1.568 rows=1 loops=1)
                          -> Seq Scan on parameter p (cost=0.00..10-07-06 rows=1 width=37) (actual time=1.324..1.564 rows=1 loops=1)
                            Filter: (to_tsvector(regexp_replace((parameter)::text, '\\(\\)\\! \\. \\/,\\-\\? +'::text, ' '::text, 'g'::text)) @@ plainto_tsquery('cadmium'::text))
                      -> Index Scan using pk_mlist_detail on mlist_detail md (cost=0.00..0.48 rows=1 width=74) (actual time=0.011..0.011 rows=1 loops=1918)
                        Index Cond: ((md.mlist_detail_rec_id)::text = (mdp.mlist_detail_rec_id)::text)
                    -> Hash (cost=1110-07-069..1110-07-069 rows=3631 width=74) (actual time=3.096..3.096 rows=3631 loops=1)
                      -> Seq Scan on mlist m (cost=0.00..1110-07-069 rows=3631 width=74) (actual time=0.003..0.994 rows=3631 loops=1)
              -> Hash (cost=710-07-06..710-07-06 rows=3087 width=74) (actual time=2.640..2.640 rows=3087 loops=1)
                -> Seq Scan on parcel ord (cost=0.00..710-07-06 rows=3087 width=74) (actual time=0.004..0.876 rows=3087 loops=1)
            -> Index Scan using fki_tlist__mlist on tlist t (cost=0.00..0.31 rows=1 width=37) (actual time=0.006..0.006 rows=1 loops=1918)
              Index Cond: ((t.mlist_rec_id)::text = (m.mlist_rec_id)::text)
    Total runtime: 71.373 ms

Hmm, your answer is also fast (between 0.06 and 0.08 seconds), but not as fast as the IN version (0.03 to 0.05 seconds). +1 nonetheless, your answer works. Your answer is semantically the same as my query here. Even so, I cannot use your query in my actual code: I refactored my original query so it can be used in two (or more) modules, and I need ord_app outside of the subquery, since some modules do some sort of GROUP_CONCAT on ord_app. I wish I could +2 you for taking the time to deduce the intent of my code and writing an actual query for it :-) – Michael Buen Jul 6 '10 at 1:53

@Michael, I'd be interested in the explain analyze of this query - can you edit this answer to include it? – Stephen Denne Jul 6 '10 at 1:56
