MySQL 中隨機(jī)選擇10條記錄

原文地址

mysql手冊中存在rand()命令,能獲取到隨機(jī)行, 并使用limit 10 只采取其中幾行。

SELECT id FROM user ORDER BY RAND() LIMIT 10;

數(shù)據(jù)量小于1000行的時候,上面的 sql 執(zhí)行的快。但是當(dāng)數(shù)據(jù)大于10000行, 排序的開銷就變得很重。上面的操作中,我們在排序完就把幾乎所有的行都丟掉了。

只要我們有一個數(shù)字主鍵,我們可以有更好的方式去實現(xiàn)這個功能,不需要對所有數(shù)據(jù)進(jìn)行排序。

在上面的例子中, 我們假設(shè) id 從1開始, 并且在1和 id 的最大值之間是連續(xù)的。

通過應(yīng)用程序解決問題

可以在應(yīng)用程序中計算隨機(jī)id, 簡化整個計算。

SELECT MAX(id) FROM user;
## 在應(yīng)用程序中生成區(qū)間內(nèi)的隨機(jī)數(shù):random-id
SELECT name FROM user WHERE id = <random-id>

由于MAX(id) == COUNT(id),我們只是生成1和 max (id) 之間的隨機(jī)數(shù), 并將其傳遞到數(shù)據(jù)庫中檢索隨機(jī)行。

第一個select語句是NO-OP,并一直在被優(yōu)化。第二個是針對常量的 eq 速度也很快。

通過數(shù)據(jù)庫解決問題

# 生成一個隨機(jī)ID
> SELECT RAND() * MAX(id) FROM user;
+------------------+
| RAND() * MAX(id) |
+------------------+
|  689.37582507297 |
+------------------+
# 返回值是double,但是我們需要的是 int

> SELECT CEIL(RAND() * MAX(id)) FROM user;
+-------------------------+
| CEIL(RAND() * MAX(id)) |
+-------------------------+
|                    1000000  |
+-------------------------+
# 返回值是 int,分析性能

> EXPLAIN 
   SELECT CEIL(RAND() * MAX(id)) FROM random;
+----+-------------+-------+-------+------+-------------+
| id | select_type | table | type  | rows | Extra       |
+----+-------------+-------+-------+------+-------------+
|  1 | SIMPLE      | random| index |1000000| Using index |
+----+-------------+-------+-------+------+-------------+
## 全表掃描?由于使用 MAX()函數(shù)了,導(dǎo)致優(yōu)化丟失。

> EXPLAIN 
   SELECT CEIL(RAND() * (SELECT MAX(id) FROM random));
+----+-------------+-------+------+------+------------------------------+
| id | select_type | table | type | rows | Extra                        |
+----+-------------+-------+------+------+------------------------------+
|  1 | PRIMARY     | NULL  | NULL | NULL | No tables used               |
|  2 | SUBQUERY    | NULL  | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+------+------------------------------+
## 子查詢可以將性能損失挽回

通過上面的 sql 已經(jīng)能夠生成隨機(jī) id, 但如何獲得行?

> EXPLAIN
SELECT name
  FROM user
 WHERE id = (SELECT CEIL(RAND() *
                         (SELECT MAX(id)
                            FROM user));
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows    | Extra                        |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
|  1 | PRIMARY     | user   | ALL  | NULL          | NULL | NULL    | NULL | 1000000 | Using where                  |
|  3 | SUBQUERY    | NULL   | NULL | NULL          | NULL | NULL    | NULL |    NULL | Select tables optimized away |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
> show warnings;
+-------+------+------------------------------------------+
| Level | Code | Message                                  |
+-------+------+------------------------------------------+
| Note  | 1249 | Select 2 was reduced during optimization |
+-------+------+------------------------------------------+

上面的方法是最明顯的, 但也是最錯誤的做法。原因是:where子查詢中的select為外部select每一行都會執(zhí)行。具體解釋參考:sql語句嵌套查詢性能低

要找一種方法,保證random-id只生成一次:

SELECT name
  FROM user JOIN
       (SELECT CEIL(RAND() *
                    (SELECT MAX(id)
                       FROM user)) AS id
        ) AS r2
       USING (id);
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table      | type   | rows | Extra                        |
+----+-------------+------------+--------+------+------------------------------+
|  1 | PRIMARY     | <derived2> | system |    1 |                              |
|  1 | PRIMARY     | user       | const  |    1 |                              |
|  2 | DERIVED     | NULL       | NULL   | NULL | No tables used               |
|  3 | SUBQUERY    | NULL       | NULL   | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+

內(nèi)部select生成一個常量臨時表, join 只在單行上執(zhí)行。沒有使用排序,沒有通過應(yīng)用程序,查詢的大多數(shù)部分都被優(yōu)化了。

非連續(xù)數(shù)據(jù)

刪除一些行,構(gòu)造ID非連續(xù)的記錄。

SELECT name
  FROM random AS r1 JOIN
       (SELECT (RAND() *
                     (SELECT MAX(id)
                        FROM random)) AS id)
        AS r2
 WHERE r1.id >= r2.id
 ORDER BY r1.id ASC
 LIMIT 1;
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table      | type   | rows | Extra                        |
+----+-------------+------------+--------+------+------------------------------+
|  1 | PRIMARY     | <derived2> | system |    1 |                              |
|  1 | PRIMARY     | r1         | range  |  689 | Using where                  |
|  2 | DERIVED     | NULL       | NULL   | NULL | No tables used               |
|  3 | SUBQUERY    | NULL       | NULL   | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+

join現(xiàn)在獲取所有大于或等于我們隨機(jī)值的ID,如果不能直接匹配則選擇鄰居。 但是一旦找到一行,就停止執(zhí)行(LIMIT 1)。根據(jù)索引(ORDER BY id ASC)讀取行。 當(dāng)使用 >= 而不是a = 時,我們可以擺脫CEIL并以更少的工作獲得相同的結(jié)果。

平等分配

當(dāng)我們的ID分布不再相等時,我們選擇的行也不是真正隨機(jī)的。

> select * from holes;
+----+----------------------------------+----------+
| id | name                             | accesses |
+----+----------------------------------+----------+
|  1 | d12b2551c6cb7d7a64e40221569a8571 |      107 |
|  2 | f82ad6f29c9a680d7873d1bef822e3e9 |       50 |
|  4 | 9da1ed7dbbdcc6ec90d6cb139521f14a |      132 |
|  8 | 677a196206d93cdf18c3744905b94f73 |      230 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa |      481 |
+----+----------------------------------+----------+

RAND方法會生成9到15之類的ID,這些ID都會導(dǎo)致id 16被選為下一個更高的數(shù)字。

這個問題沒有真正的解決方案,但是由于你的數(shù)據(jù)大多是不變的,你可以添加一個映射表,將行號映射到id:

> create table holes_map ( row_id int not NULL primary key, random_id int not null);
> SET @id = 0;
> INSERT INTO holes_map SELECT @id := @id + 1, id FROM holes;
> select * from holes_map;
+--------+-----------+
| row_id | random_id |
+--------+-----------+
|      1 |         1 |
|      2 |         2 |
|      3 |         4 |
|      4 |         8 |
|      5 |        16 |
+--------+-----------+

row_id現(xiàn)在再次是連續(xù),我們可以再次運行隨機(jī)查詢

SELECT name FROM holes
  JOIN (SELECT r1.random_id
         FROM holes_map AS r1
         JOIN (SELECT (RAND() *
                      (SELECT MAX(row_id)
                         FROM holes_map)) AS row_id)
               AS r2
        WHERE r1.row_id >= r2.row_id
        ORDER BY r1.row_id ASC
        LIMIT 1) as rows ON (id = random_id);

1000次提取后,我們再次看到平均分布:

> select * from holes;
+----+----------------------------------+----------+
| id | name                             | accesses |
+----+----------------------------------+----------+
|  1 | d12b2551c6cb7d7a64e40221569a8571 |      222 |
|  2 | f82ad6f29c9a680d7873d1bef822e3e9 |      187 |
|  4 | 9da1ed7dbbdcc6ec90d6cb139521f14a |      195 |
|  8 | 677a196206d93cdf18c3744905b94f73 |      207 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa |      189 |
+----+----------------------------------+----------+

維護(hù)連續(xù)的表

DROP TABLE IF EXISTS r2;
CREATE TABLE r2 (
  id SERIAL,
  name VARCHAR(32) NOT NULL UNIQUE
);

DROP TABLE IF EXISTS r2_equi_dist;
CREATE TABLE r2_equi_dist (
  id SERIAL,
  r2_id bigint unsigned NOT NULL UNIQUE
);

當(dāng)我們在r2中更改某些內(nèi)容時,我們希望r2_equi_dist也會更新。

DELIMITER $$
DROP TRIGGER IF EXISTS tai_r2$$
CREATE TRIGGER tai_r2
 AFTER INSERT ON r2 FOR EACH ROW
BEGIN
  DECLARE m BIGINT UNSIGNED DEFAULT 1;

  SELECT MAX(id) + 1 FROM r2_equi_dist INTO m;
  SELECT IFNULL(m, 1) INTO m;
  INSERT INTO r2_equi_dist (id, r2_id) VALUES (m, NEW.id);
END$$
DELIMITER ;

DELETE FROM r2;

INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );

SELECT * FROM r2;
+----+----------------------------------+
| id | name                             |
+----+----------------------------------+
|  1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
|  2 | a09a3959d68187ce48f4fe7e388926a9 |
|  3 | 4e1897cd6d326f8079108292376fa7d5 |
|  4 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
|  1 |     1 |
|  2 |     2 |
|  3 |     3 |
|  4 |     4 |
+----+-------+

INSERT非常簡單,DELETE操作我們必須更新equi-dist-id以保持id的連續(xù)設(shè)置:

DELIMITER $$
DROP TRIGGER IF EXISTS tad_r2$$
CREATE TRIGGER tad_r2
 AFTER DELETE ON r2 FOR EACH ROW
BEGIN
  DELETE FROM r2_equi_dist WHERE r2_id = OLD.id;
  UPDATE r2_equi_dist SET id = id - 1 WHERE r2_id > OLD.id;
END$$
DELIMITER ;

DELETE FROM r2 WHERE id = 2;

SELECT * FROM r2;
+----+----------------------------------+
| id | name                             |
+----+----------------------------------+
|  1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
|  3 | 4e1897cd6d326f8079108292376fa7d5 |
|  4 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
|  1 |     1 |
|  2 |     3 |
|  3 |     4 |
+----+-------+

update操作需要維護(hù)外鍵約束:

DELIMITER $$
DROP TRIGGER IF EXISTS tau_r2$$
CREATE TRIGGER tau_r2
 AFTER UPDATE ON r2 FOR EACH ROW
BEGIN
  UPDATE r2_equi_dist SET r2_id = NEW.id WHERE r2_id = OLD.id;
END$$
DELIMITER ;

UPDATE r2 SET id = 25 WHERE id = 4;

SELECT * FROM r2;
+----+----------------------------------+
| id | name                             |
+----+----------------------------------+
|  1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
|  3 | 4e1897cd6d326f8079108292376fa7d5 |
| 25 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
|  1 |     1 |
|  2 |     3 |
|  3 |    25 |
+----+-------+

一次多行

如果要返回多行,您可以:

  • 多次執(zhí)行查詢
  • 編寫執(zhí)行查詢的存儲過程并將結(jié)果存儲在臨時表中

存儲過程

存儲過程為你了程序語言結(jié)構(gòu):

  • 循環(huán)
  • 控制結(jié)構(gòu)
  • 程序
    ...

對于此任務(wù),我們只需要一個循環(huán):

ELIMITER $$
DROP PROCEDURE IF EXISTS get_rands$$
CREATE PROCEDURE get_rands(IN cnt INT)
BEGIN
  DROP TEMPORARY TABLE IF EXISTS rands;
  CREATE TEMPORARY TABLE rands ( rand_id INT );

loop_me: LOOP
    IF cnt < 1 THEN
      LEAVE loop_me;
    END IF;

    INSERT INTO rands
       SELECT r1.id
         FROM random AS r1 JOIN
              (SELECT (RAND() *
                            (SELECT MAX(id)
                               FROM random)) AS id)
               AS r2
        WHERE r1.id >= r2.id
        ORDER BY r1.id ASC
        LIMIT 1;

    SET cnt = cnt - 1;
  END LOOP loop_me;
END$$
DELIMITER ;

CALL get_rands(4);
SELECT * FROM rands;
+---------+
| rand_id |
+---------+
|  133716 |
|  702643 |
|  112066 |
|  452400 |
+---------+

性能

我們有3個不同的查詢來解決我們的問題:

  • Q1. ORDER BY RAND()
  • Q2. RAND() * MAX(ID)
  • Q3. RAND() * MAX(ID) + ORDER BY ID

Q1預(yù)計成本為N * log2(N),Q2和Q3幾乎恒定。
我們用N行(一千到一百萬)填充表格并執(zhí)行每次查詢1000次。

   100        1.000      10.000     100.000    1.000.000
Q1  0:00.718s  0:02.092s  0:18.684s  2:59.081s  58:20.000s
Q2  0:00.519s  0:00.607s  0:00.614s  0:00.628s   0:00.637s
Q3  0:00.570s  0:00.607s  0:00.614s  0:00.628s   0:00.637s

正如您所看到的那樣,簡單的ORDER BY RAND()已經(jīng)落后于表中僅100 行的優(yōu)化查詢。

更詳細(xì)的查詢分析,請看:analyzing-complex-queries.


參考

  1. MySQL select 10 random rows from 600K rows fast
  2. ORDER BY RAND()
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容