Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
I am the chief technical officer at InVision App, Inc - a prototyping and collaboration platform for designers, built by designers. I also rock out in JavaScript and ColdFusion 24x7.
Meanwhile on Twitter
Loading latest tweet...
Ben Nadel at Blue Smoke (New York City, NY) with:

Ask Ben: Pulling Unique SQL Records That Match ALL Join Conditions

By Ben Nadel on

The reader question is far too long to put here, but the jist of it was that the reader had a primary table that joined to several "property" tables in a One-to-Many kind of relationship. These properties could be selected in a search form and the reader wanted to be able to return all primary table records that matched at least every selected property. To demonstrate the way that I handle this, I set up a simple SQL database with the following tables:

girl
-----------------
id (int)
name (varchar)

trait
-----------------
id (int)
name (varchar)

girl_trait_jn
-----------------
girl_id (int)
trait_id (int)

So that we are all on the same page, I made this video walk through:

 
 
 
 
 
 
 
 
 
 

This is a simple, one page form that posts to itself:

  • <!--- Param form value. --->
  • <cfparam
  • name="FORM.list_trait_id"
  • type="regex"
  • pattern="[\d,]*"
  • default=""
  • />
  •  
  •  
  • <!--- Check to see if we have any trait. --->
  • <cfif Len( FORM.list_trait_id )>
  •  
  • <!--- Query for girls. --->
  • <cfquery name="qGirl" datasource="ben">
  • SELECT
  • g.id,
  • g.name
  • FROM
  • girl g
  • INNER JOIN
  • (
  •  
  • <!---
  • Get only girl IDs from the JOIN table where
  • ALL of the trait IDs match. Because we know
  • that the JOIN records are unique, we know
  • that the COUNT(*) on the GROUP BY must equal
  • the length of the trait ID list.
  • --->
  • SELECT
  • gtjn.girl_id
  • FROM
  • girl_trait_jn gtjn
  • WHERE
  • gtjn.trait_id IN ( <cfqueryparam value="#FORM.list_trait_id#" cfsqltype="cf_sql_integer" list="true" /> )
  • GROUP BY
  • gtjn.girl_id
  • HAVING
  • COUNT( * ) = <cfqueryparam value="#ListLen( FORM.list_trait_id )#" cfsqltype="cf_sql_integer" />
  •  
  • ) AS filter
  • ON
  • g.id = filter.girl_id
  • ORDER BY
  • g.name ASC
  • </cfquery>
  •  
  • </cfif>
  •  
  •  
  • <!--- Query for all traits. --->
  • <cfquery name="qTrait" datasource="ben">
  • SELECT
  • id,
  • name
  • FROM
  • trait
  • ORDER BY
  • name ASC
  • </cfquery>
  •  
  •  
  • <cfoutput>
  •  
  • <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  • <html>
  • <head>
  • <title>Select Records My ID Set Matching</title>
  • </head>
  • <body>
  •  
  • <h1>
  • Select Traits
  • </h1>
  •  
  • <form action="#CGI.script_name#" method="post">
  •  
  • <p>
  • <cfloop query="qTrait">
  •  
  • <label>
  •  
  • <input
  • type="checkbox"
  • name="list_trait_id"
  • value="#qTrait.id#"
  • <cfif ListFind( FORM.list_trait_id, qTrait.id )>
  • checked="true"
  • </cfif>
  • />
  •  
  • #qTrait.name#
  •  
  • </label>
  • <br />
  •  
  • </cfloop>
  • </p>
  •  
  • <p>
  • <input type="submit" value="Get Girls!" />
  • </p>
  •  
  • </form>
  •  
  •  
  • <!--- Check to see if we have a search. --->
  • <cfif Len( FORM.list_trait_id )>
  •  
  • <h2>
  • Matching Girls
  • </h2>
  •  
  • <ul>
  • <cfloop query="qGirl">
  •  
  • <li>
  • #qGirl.name#
  • </li>
  •  
  • </cfloop>
  • </ul>
  •  
  • </cfif>
  •  
  • </body>
  • </html>
  •  
  • </cfoutput>

The technique that I use leverages the fact that the join table has unique records, meaning that the combinations of girl_id and trait_id are all unique. Because of this, I know that I can filter the join table using the list of selected trait IDs and then GROUP it by the girl_id. If the COUNT(*) of the resultant GROUP is equal to the list length of the selected traits, I know that the current girl matches all necessary traits. If the GROUP COUNT(*) is less than the length of the trait ID list, then I know that at least one of the selected traits was not found in the join table for the given girl.

The benefit of doing it this way is that you only have to filter the join table once for the query and it allows you to do some filtering early on. If we put any of this in the WHERE clause, it would mean we'd have to be running queries for every girl returned. Yes, the WHERE-style execution might make better use of indexes, but I feel that the sheer number of sub-queries required outweighs that; of course, that will probably depend on the number of joins and the expected result set size.

There are a number of ways to solve this problem and this is the one that I have found to be the most easy to write, read, and maintain. Of course, I expect Rick Osborne to come in at some point and comment on how this is completely inefficient; but, until that happens, the above is my best advice.



Reader Comments

Thanks for the insight Ben!

We have an advanced search page we're building for a client, and were thinking of ways of doing something similar to your example. Your solution's nice and streamlined. I look forward to hearing what others have to say.

Your blog's a huge help and great training reference. Thanks again.

Reply to this Comment

Ben,

you really rock !! Whatever I get to do in CF, i search in your site first than in Google and i find a solution for sure.

I was working on a search with ranking based on relevance, and there you go. You readily had my solution, though my search was not for girls ;)

Reply to this Comment

Absolutely wonderful! I searched online for 3 days trying to figure this out - trying combinations of ANDs, CONTAINS, etc.

I highly recommend visitors watch the short video, as I didn't realize the importance of the COUNT(*) initially and excluded it.

It's very clever to pull in a list matching "any" (IN clause) and then filter it by checking to see if the list of "any" items equals the amount (COUNT) of pre selected options - in this case "traits"

BRAVO!

Reply to this Comment

I forgot to mention that I did not get a unique result-set, I did have to use a DISTINCT keyword on the first SELECT

SQL Server 2005

Reply to this Comment

I am providing my statement as thanks :)
I have a tblDocument, tblTag, and a tblDocument_Tag linking them - be sure to adjust the COUNT when you add, remove values to match

SELECT DISTINCT doctag.dt_tagKey
FROM tblDocument_Tag AS doctag
INNER JOIN
(
SELECT dt_tagKey
FROM tblDocument_Tag AS tmp
WHERE (dt_docKey IN (17, 18, 44, 21)
)
GROUP BY dt_tagKey
HAVING (COUNT(*) = 4)) AS filter
ON doctag.dt_tagKey = filter.dt_tagKey

Reply to this Comment

Crap! I just rewatched the video at home (as opposed to work) and discovered that the GROUP BY needs to be inside the "inline\temp" selection, no wonder it worked perfectly in SQL Manager Studio yet returned seemingly random results in my application :)

I can't gaurentee the validity of this statement, (because I'm at home) but I believe using my dataset from before the CORRECT statement would be:

SELECT DISTINCT doctag.dt_tagKey
FROM tblDocument_Tag AS doctag
INNER JOIN
(
SELECT dt_tagKey
FROM tblDocument_Tag AS tmp
WHERE (dt_docKey IN (17, 18, 44, 21)
GROUP BY dt_tagKey
HAVING (COUNT(*) = 4)) AS filter
)
ON doctag.dt_tagKey = filter.dt_tagKey

and then to go one step further and actually get all the necessary "Tag" information by packaging this SQL inside one more, which is simple, since you already have a reference to the tag you want, so you don't need any fancy filtration tricks, since the only data available to you now...is the correct data.

SELECT *
FROM tblTAG
WHERE (tagKey IN
(
SELECT DISTINCT doctag.dt_tagKey
FROM tblDocument_Tag AS doctag
INNER JOIN
(
SELECT dt_tagKey
FROM tblDocument_Tag AS tmp
WHERE (dt_docKey IN (17, 18, 44, 21)
GROUP BY dt_tagKey
HAVING (COUNT(*) = 4)) AS filter
)
ON doctag.dt_tagKey = filter.dt_tagKey
)
)

This *should* return all the tags which have document(s) in common with ALL the existing\selected Tags - thereby allowing you to filter out documents based on which tags you've selected - so if you chose: "vegetable", you would get a list of veggies, and if you chose "red" you would retrieve a list of all the "vegetables" that are "red" - i.e. Tomatoes, Radishes, and Red Beans.

if you were then to select an additional tag: "fruit" you would see only "Tomatoes" since that one could be logically filed under "Vegetable", "Red" and "Fruit"

Am I right Ben?

Reply to this Comment

@Ryan,

You have some parenthesis in places that I don't fully understand (wrapping the WHERE clause to rename as filter). I am not sure what that is doing. My gut was thinking you are renaming the deeply nested SELECT result as filter, but the parens are confusing me.

I don't quite follow what the middle SELECT is doing. All you are gonna do is end up joining one tag back to itself. I am not sure the middle SELECT is adding anything.

Reply to this Comment

This is what ended up working for me, I'll explain how I see it, and if you can tell me where my logic is incorrect I would really appreciate
it...first the lump:

Application files Documents by "Tags" SO theres a Document, Tag, and Document_Tag table joining them

I want the user to be able to select a Tag and have it populate which tags have a document in common with ALL previously selected Tag (a similar, easier query fetches the documents)

SELECT t.*
FROM tblTag AS t
INNER JOIN
(
SELECT DISTINCT dt_tagKey
FROM tblDocument_Tag AS tmp2
WHERE (dt_docKey IN
(
SELECT dt_docKey FROM tblDocument_Tag AS tmp
WHERE (dt_tagKey IN
(4)
)
GROUP BY dt_docKey
HAVING (COUNT(*) = 1)
)
)
) AS filter ON t.tagKey = filter.dt_tagKey
WHERE (tagKey <> 4)

------

So, first I get the keys of the documents that are connected to Tag #4

-QUERY A-
SELECT dt_docKey FROM tblDocument_Tag AS tmp
WHERE (dt_tagKey IN (4))
^-List,of,Tags

Then using that record set I grab all the tags which have a connected doc that matches ALL the selected tags (thanks to the COUNT TRICK)-

-QUERY B-
SELECT DISTINCT dt_tagKey
FROM tblDocument_Tag AS tmp2
WHERE (dt_docKey IN
(

-QUERY A-

)
GROUP BY dt_docKey
HAVING (COUNT(*) = 1)
^-amount of selected tags

Now I should have all the Keys of Tags that have a document associated with all the previously selected tags (filtering down results)

Finnaly, I use the INNER JOIN to relate the keys to the actual Tag Table and grab everything I want to know about them, excluding the Tag I Just clicked, and ALL previous Tags By adding AND <>'s after the initial WHERE

SELECT t.*
FROM tblTag AS t
INNER JOIN
(

-QUERY B-

) AS filter ON t.tagKey = filter.dt_tagKey
WHERE (tagKey <> 4)

...AND (tagKey <> 16)
AND (tagKey <> 22)
AND (tagKey <> 19)...

It appears to work, except that it doesn't remove tags that have the exact SAME documents, so you can click more Tags and the results don't change - I assume the solution would be to test and see if the matching amount of docs in a Tag is equal to what is already displayed, and then just exclude that tag from being printed

As for all the extra parenthesis around WHERE's and AND's, these were auto inserted by SQL Server Management Studio, when I was developing the query probably their own take on "correct" T-SQL

If I excluded them, it would probably work - but I wonder if the database would parse the preferred formating on execution of the query (in my application, because it certainly does in the SQL pane in SQL SMS, and if that would cause some sort of performance hit?

Any help clearing up any misnomers I have would be appreciated.

Reply to this Comment

Ben, you are my superhero!!

Been pulling my hair for the whole of this evening looking for the solution to this problem...

Virtual beers on me...

Reply to this Comment

I was looking for some ideas while I'm working on a similar query, and this is really helpful. I offer this comment in a spirit of constructive criticism; I feel compelled to mention that I could have done without the casual sexism in your choice of example.

Reply to this Comment

Post A Comment

You — Get Out Of My Dreams, Get Into My Comments
Live in the Now
Oops!
Comment Etiquette: Please do not post spam. Please keep the comments on-topic. Please do not post unrelated questions or large chunks of code. And, above all, please be nice to each other - we're trying to have a good conversation here.