mysql手冊中存在rand()命令,能獲取到隨機(jī)行, 并使用limit 10 只采取其中幾行。
SELECT id FROM user ORDER BY RAND() LIMIT 10;
數(shù)據(jù)量小于1000行的時候,上面的 sql 執(zhí)行的快。但是當(dāng)數(shù)據(jù)大于10000行, 排序的開銷就變得很重。上面的操作中,我們在排序完就把幾乎所有的行都丟掉了。
只要我們有一個數(shù)字主鍵,我們可以有更好的方式去實現(xiàn)這個功能,不需要對所有數(shù)據(jù)進(jìn)行排序。
在上面的例子中, 我們假設(shè) id 從1開始, 并且在1和 id 的最大值之間是連續(xù)的。
通過應(yīng)用程序解決問題
可以在應(yīng)用程序中計算隨機(jī)id, 簡化整個計算。
SELECT MAX(id) FROM user;
## 在應(yīng)用程序中生成區(qū)間內(nèi)的隨機(jī)數(shù):random-id
SELECT name FROM user WHERE id = <random-id>
由于MAX(id) == COUNT(id),我們只是生成1和 max (id) 之間的隨機(jī)數(shù), 并將其傳遞到數(shù)據(jù)庫中檢索隨機(jī)行。
第一個select語句是NO-OP,并一直在被優(yōu)化。第二個是針對常量的 eq 速度也很快。
通過數(shù)據(jù)庫解決問題
# 生成一個隨機(jī)ID
> SELECT RAND() * MAX(id) FROM user;
+------------------+
| RAND() * MAX(id) |
+------------------+
| 689.37582507297 |
+------------------+
# 返回值是double,但是我們需要的是 int
> SELECT CEIL(RAND() * MAX(id)) FROM user;
+-------------------------+
| CEIL(RAND() * MAX(id)) |
+-------------------------+
| 1000000 |
+-------------------------+
# 返回值是 int,分析性能
> EXPLAIN
SELECT CEIL(RAND() * MAX(id)) FROM random;
+----+-------------+-------+-------+------+-------------+
| id | select_type | table | type | rows | Extra |
+----+-------------+-------+-------+------+-------------+
| 1 | SIMPLE | random| index |1000000| Using index |
+----+-------------+-------+-------+------+-------------+
## 全表掃描?由于使用 MAX()函數(shù)了,導(dǎo)致優(yōu)化丟失。
> EXPLAIN
SELECT CEIL(RAND() * (SELECT MAX(id) FROM random));
+----+-------------+-------+------+------+------------------------------+
| id | select_type | table | type | rows | Extra |
+----+-------------+-------+------+------+------------------------------+
| 1 | PRIMARY | NULL | NULL | NULL | No tables used |
| 2 | SUBQUERY | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+------+------------------------------+
## 子查詢可以將性能損失挽回
通過上面的 sql 已經(jīng)能夠生成隨機(jī) id, 但如何獲得行?
> EXPLAIN
SELECT name
FROM user
WHERE id = (SELECT CEIL(RAND() *
(SELECT MAX(id)
FROM user));
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
| 1 | PRIMARY | user | ALL | NULL | NULL | NULL | NULL | 1000000 | Using where |
| 3 | SUBQUERY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
> show warnings;
+-------+------+------------------------------------------+
| Level | Code | Message |
+-------+------+------------------------------------------+
| Note | 1249 | Select 2 was reduced during optimization |
+-------+------+------------------------------------------+
上面的方法是最明顯的, 但也是最錯誤的做法。原因是:where子查詢中的select為外部select每一行都會執(zhí)行。具體解釋參考:sql語句嵌套查詢性能低
要找一種方法,保證random-id只生成一次:
SELECT name
FROM user JOIN
(SELECT CEIL(RAND() *
(SELECT MAX(id)
FROM user)) AS id
) AS r2
USING (id);
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table | type | rows | Extra |
+----+-------------+------------+--------+------+------------------------------+
| 1 | PRIMARY | <derived2> | system | 1 | |
| 1 | PRIMARY | user | const | 1 | |
| 2 | DERIVED | NULL | NULL | NULL | No tables used |
| 3 | SUBQUERY | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+
內(nèi)部select生成一個常量臨時表, join 只在單行上執(zhí)行。沒有使用排序,沒有通過應(yīng)用程序,查詢的大多數(shù)部分都被優(yōu)化了。
非連續(xù)數(shù)據(jù)
刪除一些行,構(gòu)造ID非連續(xù)的記錄。
SELECT name
FROM random AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(id)
FROM random)) AS id)
AS r2
WHERE r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1;
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table | type | rows | Extra |
+----+-------------+------------+--------+------+------------------------------+
| 1 | PRIMARY | <derived2> | system | 1 | |
| 1 | PRIMARY | r1 | range | 689 | Using where |
| 2 | DERIVED | NULL | NULL | NULL | No tables used |
| 3 | SUBQUERY | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+
join現(xiàn)在獲取所有大于或等于我們隨機(jī)值的ID,如果不能直接匹配則選擇鄰居。 但是一旦找到一行,就停止執(zhí)行(LIMIT 1)。根據(jù)索引(ORDER BY id ASC)讀取行。 當(dāng)使用 >= 而不是a = 時,我們可以擺脫CEIL并以更少的工作獲得相同的結(jié)果。
平等分配
當(dāng)我們的ID分布不再相等時,我們選擇的行也不是真正隨機(jī)的。
> select * from holes;
+----+----------------------------------+----------+
| id | name | accesses |
+----+----------------------------------+----------+
| 1 | d12b2551c6cb7d7a64e40221569a8571 | 107 |
| 2 | f82ad6f29c9a680d7873d1bef822e3e9 | 50 |
| 4 | 9da1ed7dbbdcc6ec90d6cb139521f14a | 132 |
| 8 | 677a196206d93cdf18c3744905b94f73 | 230 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa | 481 |
+----+----------------------------------+----------+
RAND方法會生成9到15之類的ID,這些ID都會導(dǎo)致id 16被選為下一個更高的數(shù)字。
這個問題沒有真正的解決方案,但是由于你的數(shù)據(jù)大多是不變的,你可以添加一個映射表,將行號映射到id:
> create table holes_map ( row_id int not NULL primary key, random_id int not null);
> SET @id = 0;
> INSERT INTO holes_map SELECT @id := @id + 1, id FROM holes;
> select * from holes_map;
+--------+-----------+
| row_id | random_id |
+--------+-----------+
| 1 | 1 |
| 2 | 2 |
| 3 | 4 |
| 4 | 8 |
| 5 | 16 |
+--------+-----------+
row_id現(xiàn)在再次是連續(xù),我們可以再次運行隨機(jī)查詢
SELECT name FROM holes
JOIN (SELECT r1.random_id
FROM holes_map AS r1
JOIN (SELECT (RAND() *
(SELECT MAX(row_id)
FROM holes_map)) AS row_id)
AS r2
WHERE r1.row_id >= r2.row_id
ORDER BY r1.row_id ASC
LIMIT 1) as rows ON (id = random_id);
1000次提取后,我們再次看到平均分布:
> select * from holes;
+----+----------------------------------+----------+
| id | name | accesses |
+----+----------------------------------+----------+
| 1 | d12b2551c6cb7d7a64e40221569a8571 | 222 |
| 2 | f82ad6f29c9a680d7873d1bef822e3e9 | 187 |
| 4 | 9da1ed7dbbdcc6ec90d6cb139521f14a | 195 |
| 8 | 677a196206d93cdf18c3744905b94f73 | 207 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa | 189 |
+----+----------------------------------+----------+
維護(hù)連續(xù)的表
DROP TABLE IF EXISTS r2;
CREATE TABLE r2 (
id SERIAL,
name VARCHAR(32) NOT NULL UNIQUE
);
DROP TABLE IF EXISTS r2_equi_dist;
CREATE TABLE r2_equi_dist (
id SERIAL,
r2_id bigint unsigned NOT NULL UNIQUE
);
當(dāng)我們在r2中更改某些內(nèi)容時,我們希望r2_equi_dist也會更新。
DELIMITER $$
DROP TRIGGER IF EXISTS tai_r2$$
CREATE TRIGGER tai_r2
AFTER INSERT ON r2 FOR EACH ROW
BEGIN
DECLARE m BIGINT UNSIGNED DEFAULT 1;
SELECT MAX(id) + 1 FROM r2_equi_dist INTO m;
SELECT IFNULL(m, 1) INTO m;
INSERT INTO r2_equi_dist (id, r2_id) VALUES (m, NEW.id);
END$$
DELIMITER ;
DELETE FROM r2;
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
SELECT * FROM r2;
+----+----------------------------------+
| id | name |
+----+----------------------------------+
| 1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
| 2 | a09a3959d68187ce48f4fe7e388926a9 |
| 3 | 4e1897cd6d326f8079108292376fa7d5 |
| 4 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
+----+-------+
INSERT非常簡單,DELETE操作我們必須更新equi-dist-id以保持id的連續(xù)設(shè)置:
DELIMITER $$
DROP TRIGGER IF EXISTS tad_r2$$
CREATE TRIGGER tad_r2
AFTER DELETE ON r2 FOR EACH ROW
BEGIN
DELETE FROM r2_equi_dist WHERE r2_id = OLD.id;
UPDATE r2_equi_dist SET id = id - 1 WHERE r2_id > OLD.id;
END$$
DELIMITER ;
DELETE FROM r2 WHERE id = 2;
SELECT * FROM r2;
+----+----------------------------------+
| id | name |
+----+----------------------------------+
| 1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
| 3 | 4e1897cd6d326f8079108292376fa7d5 |
| 4 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
| 1 | 1 |
| 2 | 3 |
| 3 | 4 |
+----+-------+
update操作需要維護(hù)外鍵約束:
DELIMITER $$
DROP TRIGGER IF EXISTS tau_r2$$
CREATE TRIGGER tau_r2
AFTER UPDATE ON r2 FOR EACH ROW
BEGIN
UPDATE r2_equi_dist SET r2_id = NEW.id WHERE r2_id = OLD.id;
END$$
DELIMITER ;
UPDATE r2 SET id = 25 WHERE id = 4;
SELECT * FROM r2;
+----+----------------------------------+
| id | name |
+----+----------------------------------+
| 1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
| 3 | 4e1897cd6d326f8079108292376fa7d5 |
| 25 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
| 1 | 1 |
| 2 | 3 |
| 3 | 25 |
+----+-------+
一次多行
如果要返回多行,您可以:
- 多次執(zhí)行查詢
- 編寫執(zhí)行查詢的存儲過程并將結(jié)果存儲在臨時表中
存儲過程
存儲過程為你了程序語言結(jié)構(gòu):
- 循環(huán)
- 控制結(jié)構(gòu)
- 程序
...
對于此任務(wù),我們只需要一個循環(huán):
ELIMITER $$
DROP PROCEDURE IF EXISTS get_rands$$
CREATE PROCEDURE get_rands(IN cnt INT)
BEGIN
DROP TEMPORARY TABLE IF EXISTS rands;
CREATE TEMPORARY TABLE rands ( rand_id INT );
loop_me: LOOP
IF cnt < 1 THEN
LEAVE loop_me;
END IF;
INSERT INTO rands
SELECT r1.id
FROM random AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(id)
FROM random)) AS id)
AS r2
WHERE r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1;
SET cnt = cnt - 1;
END LOOP loop_me;
END$$
DELIMITER ;
CALL get_rands(4);
SELECT * FROM rands;
+---------+
| rand_id |
+---------+
| 133716 |
| 702643 |
| 112066 |
| 452400 |
+---------+
性能
我們有3個不同的查詢來解決我們的問題:
- Q1. ORDER BY RAND()
- Q2. RAND() * MAX(ID)
- Q3. RAND() * MAX(ID) + ORDER BY ID
Q1預(yù)計成本為N * log2(N),Q2和Q3幾乎恒定。
我們用N行(一千到一百萬)填充表格并執(zhí)行每次查詢1000次。
100 1.000 10.000 100.000 1.000.000
Q1 0:00.718s 0:02.092s 0:18.684s 2:59.081s 58:20.000s
Q2 0:00.519s 0:00.607s 0:00.614s 0:00.628s 0:00.637s
Q3 0:00.570s 0:00.607s 0:00.614s 0:00.628s 0:00.637s
正如您所看到的那樣,簡單的ORDER BY RAND()已經(jīng)落后于表中僅100 行的優(yōu)化查詢。
更詳細(xì)的查詢分析,請看:analyzing-complex-queries.
參考