[thelist] MySQL queries and AUTO_INCREMENT

Fri Apr 7 13:57:29 CDT 2006

Quoting Matt Warden <mwarden at gmail.com>:

> On 4/7/06, Mattias Thorslund <mattias at thorslund.us> wrote:
>> But it looks like some rows would be more likely to be chosen than
>> others. Consider if the IDs are [1, 2, 3, 100]. Then, picking a random
>> number between 1 and 100 will result displaying the row with ID 100 for
>> 97% of the time. Of course that's an extreme example but for each
>> missing number, the likely selection of the following record is much
>> larger than the others (one missing number doubles the likelihood of
>> selecting the next record, two missing numbers will triple it)...
>
> His disclaimer was for > 10,000 rows, and it doesn't sound like this
> is an application where uniformity is a huge issue.

Agreed. Probably won't be an issue, practically speaking. RAND() is 
much simpler
to implement.

>> What's the use of point 3? The (second) point 2 should return a row if
>> there are rows in the table.
>
> Only if you lock tables for writing in between these queries.

Anomalies would be pretty unlikely (they would happen only if someone 
managed to
delete the chosen picture in the microseconds between getting 
first/last ID and
selecting the random record). If we can assume that deletes are relatively
rare, this would not be a big issue, practically speaking.

But in principle I agree, the two queries should be made into one transaction,
either by lock/unlock, a transaction, or a stored procerdure, etc... Putting
them in a stored procedure (if the MySQL version supports it) would probably
give the best performance since there'd be just one trip to the database.

>> I agree with John Brooking: Use a separate sequence field, and update
>> it. Or, if the number of records will be not too large, use ORDER BY
>> RAND() LIMIT 1.
>
> I would suggest this either way. I'm not sure issuing 3 queries over
> the network is more efficient than using this method (although, I
> haven't done any testing).

Database server and web server are likely the same box here, so I 
assume there's
no huge latency. Besides, I only see the need for two queries.

But all in all this isn't much to argue over.  In practice, I would try
whichever approach that seems easy to implement and see whether the 
results are
acceptable to the purposes of the application...

Mattias