Getting a random row from a relational database
Problem
From time to time one needs to fetch a random row from a table in a db.
Solution
Take one
The “obvious” solution is to order by RAND() and get the first:
SELECT * FROM users ORDER BY RAND() LIMIT 1; /* Don't do that */
It does the job, problem solved!
Well, no, not exactly! While it works, the performance will be awful for any table of significant size.
Take two
There is another, simple, but efficient, method to get a random row:
- Fetch the number of rows using
COUNT().SELECT COUNT(*) FROM users;
- Get a random positive number, that is less than (but not equal) to that count.
- Select the row using
OFFSET(on MySQL – there are ways to do the same on all major RDBMS)SELECT * FROM users LIMIT 1 OFFSET :rand
where
:randis the number you computed.
While this uses two queries – both of them are very efficient.
Here is the concrete code on how to do that in Django (using a custom manager):
class UsersManager(models.Manager): def random(self): count = self.aggregate(count=Count('id'))['count'] random_index = randint(0, count - 1) return self.all()[random_index] class User(models.Model): objects = UsersManager() # ... [fields] ...
Usage is as simple as
User.objects.random()
Just for the record – the Django way to shoot yourself in the foot do ORDER BY RAND() is
User.objects.order_by('?') # If you do that in production - you may as well forget about sleep!
Proof
Here is the evidence that doing ORDER BY RAND() is very slow.
For the test we will have a simple Django model and a few of scripts – one that populates the database with bulk data, one that fetches a user using ORDER BY RAND() and one that fetches it using a second query. We populate the db with one million records and get a random one.
Here are the results (done using Python 2.6.4 and MySQL 5.1/SQLite 3.6.16 on Ubuntu 9.10):
Using the naive method by doing ORDER BY RAND():
$ time python naive.py # MySQL SQLite real 0m4.519s 0m2.061s # seconds?!?! user 0m0.152s 0m1.952s sys 0m0.028s 0m0.056s
(executing the query directly results in times close to those).
Using the smart method with two queries:
$ time python smart.py # MySQL SQLite real 0m0.367s 0m0.339s # much better user 0m0.156s 0m0.284s sys 0m0.032s 0m0.052s
The more records you add to the table, the slower the naive method will get, while the smart one will run at about the same speed.
If you need to get a random record out of a filtered set (using WHERE) is’s basically the same. You just need to add the WHERE clause to both queries.
Details
Below is the code that is used to make the benchmark.
This is the models.py file of a simple Django app.
from random import randint from django.db import models from django.db.models import Count class UserManager(models.Manager): def random(self): count = self.aggregate(ids=Count('id'))['ids'] random_index = randint(0, count - 1) return self.all()[random_index] def random_naive(self): return self.all().order_by('?')[0] class User(models.Model): username = models.CharField(max_length=128) password = models.CharField(max_length=128) objects = UserManager()
The bluk.py script that will fill the db with junk.
import os import settings os.environ['DJANGO_SETTINGS_MODULE'] = os.path.join( os.path.dirname(__file__), 'settings') from users.models import User if __name__ == '__main__': for i in range(0, 1000000): u = User(username="user{0}".format(i), password="pass{0}".format(i)) u.save()
This is naive.py – the script that uses the slow method.
import os import settings os.environ['DJANGO_SETTINGS_MODULE'] = os.path.join( os.path.dirname(__file__), 'settings') from users.models import User if __name__ == '__main__': u = User.objects.random_naive() print(u.username)
And smart.py that uses an additional query.
import os import settings os.environ['DJANGO_SETTINGS_MODULE'] = os.path.join( os.path.dirname(__file__), 'settings') from users.models import User if __name__ == '__main__': u = User.objects.random() print(u.username)
“Getting a random row from a relational database”