Bold dream
Imagination is limitless. So is stupidity.

Getting a random row from a relational database

January 22nd 2010 in Python

Problem

From time to time one needs to fetch a random row from a table in a db.

Solution

Take one

The “obvious” solution is to order by RAND() and get the first:

SELECT * FROM users ORDER BY RAND() LIMIT 1; /* Don't do that */

It does the job, problem solved!

Well, no, not exactly! While it works, the performance will be awful for any table of significant size.

Take two

There is another, simple, but efficient, method to get a random row:

  1. Fetch the number of rows using COUNT().
    SELECT COUNT(*) FROM users;
  2. Get a random positive number, that is less than (but not equal) to that count.
  3. Select the row using OFFSET (on MySQL – there are ways to do the same on all major RDBMS)
    SELECT * FROM users LIMIT 1 OFFSET :rand

    where :rand is the number you computed.

While this uses two queries – both of them are very efficient.

Here is the concrete code on how to do that in Django (using a custom manager):

class UsersManager(models.Manager):
    def random(self):
        count = self.aggregate(count=Count('id'))['count']
        random_index = randint(0, count - 1)
        return self.all()[random_index]
 
class User(models.Model):
    objects = UsersManager()
    # ... [fields] ...

Usage is as simple as

User.objects.random()

Just for the record – the Django way to shoot yourself in the foot do ORDER BY RAND() is

User.objects.order_by('?')  # If you do that in production - you may as well forget about sleep!

Proof

Here is the evidence that doing ORDER BY RAND() is very slow.

For the test we will have a simple Django model and a few of scripts – one that populates the database with bulk data, one that fetches a user using ORDER BY RAND() and one that fetches it using a second query. We populate the db with one million records and get a random one.

Here are the results (done using Python 2.6.4 and MySQL 5.1/SQLite 3.6.16 on Ubuntu 9.10):

Using the naive method by doing ORDER BY RAND():

$ time python naive.py
#       MySQL        SQLite
real	0m4.519s     0m2.061s # seconds?!?! 
user	0m0.152s     0m1.952s
sys	0m0.028s     0m0.056s

(executing the query directly results in times close to those).

Using the smart method with two queries:

$ time python smart.py 
#       MySQL        SQLite
real	0m0.367s     0m0.339s # much better
user	0m0.156s     0m0.284s
sys	0m0.032s     0m0.052s

The more records you add to the table, the slower the naive method will get, while the smart one will run at about the same speed.

If you need to get a random record out of a filtered set (using WHERE) is’s basically the same. You just need to add the WHERE clause to both queries.

Details

Below is the code that is used to make the benchmark.

This is the models.py file of a simple Django app.

from random import randint
from django.db import models
from django.db.models import Count
 
class UserManager(models.Manager):
    def random(self):
        count = self.aggregate(ids=Count('id'))['ids']
        random_index = randint(0, count - 1)
        return self.all()[random_index]
 
    def random_naive(self):
        return self.all().order_by('?')[0]
 
class User(models.Model):
    username = models.CharField(max_length=128)
    password = models.CharField(max_length=128)
 
    objects = UserManager()

The bluk.py script that will fill the db with junk.

import os
import settings
 
os.environ['DJANGO_SETTINGS_MODULE'] = os.path.join(
    os.path.dirname(__file__), 'settings')
 
from users.models import User
if __name__ == '__main__':
    for i in range(0, 1000000):
        u = User(username="user{0}".format(i), password="pass{0}".format(i))
        u.save()

This is naive.py – the script that uses the slow method.

import os
import settings
 
os.environ['DJANGO_SETTINGS_MODULE'] = os.path.join(
    os.path.dirname(__file__), 'settings')
 
from users.models import User
 
if __name__ == '__main__':
    u = User.objects.random_naive()
    print(u.username)

And smart.py that uses an additional query.

import os
import settings
 
os.environ['DJANGO_SETTINGS_MODULE'] = os.path.join(
    os.path.dirname(__file__), 'settings')
 
from users.models import User
 
if __name__ == '__main__':
    u = User.objects.random()
    print(u.username)

6 comments to...
“Getting a random row from a relational database”
Avatar
Travis L

In your solution, part 1, you do a SELECT COUNT(*) rather than SELECT COUNT( col_name )? I’ve always believed that COUNT(*) is slower than counting 1 column.


Avatar
Saiyine

Don’t you think a benchmark with just one query it’s a bit naive? What about doing it with a thousand querys?


Avatar
Emil Vladev

@Travis: In my experience there is no real difference between both – databases usually optimize COUNT(*) (I think that in MyISAM the row number is stored with the table and is not computed)

@Saiyine: Even with one query the difference is evident.


Avatar
TheQuux

@emil: Count(*) and Count(column) are actually different.

SELECT COUNT(*) FROM table;

gives you the number of rows in table, whereas

SELECT COUNT(column) FROM table;

is equivalent to:

SELECT COUNT(*) FROM table WHERE column IS NOT NULL;

So, the first would be fast if your database keeps track of the number of rows in the table, whereas the second would need to consult an index or actually run a sequential scan over the table. Incidentally, figuring out how many rows are in the table with PostgreSQL takes right about as long as the initial “select * from table order by rand() limit 1″, so it doesn’t leave you any better off.


Avatar
Emil Vladev

@TheQuux: Thanks for the clarification. About ORDER BY RAND() in PostgreSQL being the same – this is not right. ORDER BY RAND() will have to generate a second copy of the table, sort it and then return a record, while SELECT COUNT(*) will need to check the rows but that’s about it. If you have a better method I would love to here it – this one is far from perfect.


Avatar
Saiyine

I’ve done the tests myself, and I can confirm your results.

Statistics on getting a random row from a table:
http://www.saiyine.com/post.901.php




required



required - won't be displayed


Your Comment:

Just finished watching this presentation of Guy Kawasaki. Simple advice – whatever you do for a living – Watch It!

Also, if you can, grab a copy of his book The Art of the Start. It’s the best 200 pages of entrepreneur advices you can get.

Previous Entry

I work on a simple Django Q&A app and decided that Questions and Answers should be on the same page in the admin. Django already provides that using inlines, but after using it a strange error started appearing (usually on server restart or code reload).

type object ‘AnswerInline’ has no attribute ‘date_hierarchy’

After a bit of digging, [...]

Next Entry

Blogroll