[thesite] rudy, this may intersest you..

rudy r937 at interlog.com
Fri Oct 5 12:46:54 CDT 2001

you guys kill me

it takes a lot longer than 21 days, eh

i still haven't got it right   ;o)

>  select who,  lastlogin, createdate
>      from users
>    group by email
>  having count(*) > 1
> is giving me a "Not a group by" expression

as well it should

i understand what you want, though

> I don't think that'll work because you cannot
> return this sort of row level detail when you have
> a group by in there.

exactly, joshua

think about what happens with a group by -- the rows
in the group are collapsed, preserving the values of
the group by columns

> select who,  lastlogin, createdate
>      from users
>     where exists
>   (select 1 from users u2
>       where u2.email = users.email
>         and u2.userid != user.userid)
> Or something like that.

indeed, that's pretty close -- in fact it does give you the
right answer, except that it includes every user twice!

(he said, confidently, not having bothered to test it)

> something like:
>   select email, who,  lastlogin, createdate
>          from users
>  group by email, who, lastlogin, createdate
>   having count(email) > 1

good try scott, but this will probably come up empty -- you're looking for
groups where they have at least two rows with the same email and who and
lastlogin and createdate

and way to go, hinting at null emails... as if group by wasn't tricky
enough, now you gotta bring up the issue of nulls?

i love nulls, but they are really hard for a lot of people to grasp

> I do not think this will work either because the GROUP BY
> forces all rows that have the same email to be reduced down
> into one row.

exactly -- i say collapsed, you say reduced   ;o)

key point:  the database gives you one row per group

the only actual column values you can access are the ones that are
guaranteed to be the same for every row in the group -- and that
would be the columns in the group by column list

you can also access column functions for the group

column functions are most often called aggregate functions, as they operate
on a set of values -- the column of values for the rows of the group

(scalar functions like substring, month, cast, and so on, operate on a
single, scalar value in a single row)

so definitely you want to group by email

the trick is to make the grouping a subquery, similar to where joshua was
going, but using IN instead of EXISTS

   select email, who,  lastlogin, createdate
          from users
    where email in
       (select email
           from users
        group by email
     having count(*) > 1)

the subquery returns all the duplicated emails, while the outer query
returns all the rows that have those emails

a good optimizer will do it in a single pass using a join, but don't you
try to write it as a join, you'll only hurt yourself

i think count(*) is faster, but count(email) -- and i'd want to test this
to make sure -- produces the same result, only it never returns a null

if there were tons of null emails in the table (and in our case there
aren't), this might be a good idea, as you wouldn't want the outer query to
return all the users without an email (assuming there was more than one),

but it would probably not occur to me to write count(email), i'd probably
put a filter in the subquery --

   select email, who,  lastlogin, createdate
          from users
    where email in
       (select email
           from users      where email is not null
        group by email
     having count(*) > 1)


More information about the thesite mailing list