开发者

Find rows without many-to-many children meeting a certain condition

开发者 https://www.devze.com 2023-01-25 16:03 出处:网络
Here\'s a generic version of what I\'m trying to do: The table recipes has fields id and name. The table ingredients has fields id, name, and sweetness, describing how sweet that ingredient is on a s

Here's a generic version of what I'm trying to do:

The table recipes has fields id and name. The table ingredients has fields id, name, and sweetness, describing how sweet that ingredient is on a scale of 1-10. Recipes have many ingredients and ingredients are in many recipes, so the two are related in a ingredients_recipes table, with fields ingredient_id and recipe_id.

It's easy to find recipes that contain an ingredient with sweetness of 10.

SELECT DISTINCT recipes.* FROM recipes
INNER JOIN recipes_ingredients ri ON ri.recipe_id = recipes.id
INNER JOIN ingredients ON ingredients.id = ri.ingredient_id
WHERE ingredients.sweetness = 10

However, I'm having trouble with negating that query to find recipes with no ingredients with sweetness 10. My first thought was this:

SELECT DISTINCT recipes.* FROM recipes
INNER JOIN recipes_ingredients ri ON ri.recipe_id = recipes.id
INNE开发者_如何学PythonR JOIN ingredients ON ingredients.id = ri.ingredient_id
WHERE ingredients.sweetness != 10

However, that finds recipes that contain any non-sweetness-10 ingredients.

My next attempt was the following, which seems to work:

SELECT * FROM recipes WHERE
(
  SELECT count(*) FROM ingredients INNER JOIN recipes_ingredients ri ON
  ri.ingredient_id = ingredients.id WHERE ingredients.sweetness = 10 AND
  ri.recipe_id = recipes.id
) = 0

However, my general experience is that dependent subqueries run slowly compared to equivalent, well-crafted JOINs. I played around with joining, grouping, etc. but couldn't quite wrap my head around it, especially since, though it seems like LEFT JOIN and IS NULL were the proper tools, having two joins already made things nasty. Great SQL wizards, what query can I run to get the best results? Thanks!


Try this:

SELECT DISTINCT recipes.* 
FROM recipes r LEFT JOIN
(SELECT ri.recipe_id
FROM recipes_ingredients ri 
INNER JOIN ingredients ON ingredients.id = ri.ingredient_id
WHERE ingredients.sweetness = 10) i on i.recipe_id=r.recipe_id
WHERE i.recipe_id is null


Try:

select
  r.*
from
  recipes r
where
  not exists (
    select
      1
    from
      recipe_ingredients ri
      join ingredients i on ri.ingredient_id = ri.ingredient_id
    where
      ri.recipie_id = r.recipe_id
      and i.sweetness = 10
  )

It's still a correlated subquery, but exists and not exists have some optimizations that should make them perform better than your original query.

For a direct join solution, this should work:

select distinct
  r.*
from
  recipes r
  join recipe_ingredients ri on ri.recipe_id = r.recipe_id
  left join ingredents i on i.ingredient_id = ri.ingredient_id and i.sweetness = 10
where
  i.ingredient_id is null

Depending on indexing, the not exists solution could be faster as not exists returns immediately upon figuring out if any rows satisfy the given conditions without looking at any more of the table than necessary. For example, if it finds a single row of sweetness 10, it stops looking at the table and returns false.


I played around with the answers given me here (which I've since upvoted), and, from their inspiration, have come up with a query that seems to do the job with surprisingly outstanding performance:

SELECT r.* FROM recipes r
LEFT JOIN recipes_ingredients ri ON ri.parent_id = r.id
LEFT JOIN ingredients i ON i.id = ri.ingredient_id AND i.sweetness = 10
GROUP BY r.id HAVING MAX(i.id) IS NULL

The joins with the condition inside (inspired by @Donnie) bring out recipe-ingredient combinations, with NULL rows if the ingredient is not of sweetness 10. We then group by recipe ID, and select the "max" ingredient ID. (The MAX function will return null if and only if there are no actual IDs to select, i.e., there are absolutely no non-sweetness-10 items associated with this recipe to choose instead.) If that "max" ingredient ID is null, then there were no sweetness-10 items for the MAX function to select, and, therefore, rows HAVING a null MAX(i.id) are selected.

I ran both the NOT EXISTS version of the query and the above version of the query a number of times with the query cacher disabled. Against about 400 recipes, the NOT EXISTS query would consistently take about 1.0 seconds to complete, whereas this query's runtime was usually around 0.1 seconds. Against about 5000 recipes, the NOT EXISTS query took about 30 seconds, whereas the above query usually still took 0.1 seconds, and was almost always under 1.0.

It's worth noting that, checking EXPLAINs on each, the query listed here is able to run almost entirely on the indices I've given these tables, which probably explains why it is able to do all sorts of joining and grouping without batting an eye. The NOT EXISTS query, on the other hand, has to do dependent subqueries. The two might perform more equally if these indices weren't in place, but that query optimizer is pretty darn powerful when given the chance to use raw joins, it would seem.

Moral of the story: well-formed JOINs are super-duper powerful :) Thanks, all!

0

精彩评论

暂无评论...
验证码 换一张
取 消