如何用 SQL 求出「都关注了某100人」和「都收藏了某100帖子」的所有用户?
『回复列表(165|隐藏机器人聊天)』
@老虎会游泳,哦,我最后还是用 Python 了,因为加上不能重复的功能后,tcc
很慢了,范围改大一点儿,更是没眼看
不如直接python
:
from datetime import datetime
from random import seed, randint
seed(int(datetime.fromisoformat('2022-08-11 20:00:00').timestamp()))
nums = set()
for uid in range(1, 10000000 + 1):
nums.clear()
size = randint(15, 25)
while len(nums) < size:
nums.add(randint(1, 10000))
print('\n'.join(f'{num},{uid}' for num in sorted(nums)))
SQLite
建表a
、表b
,结构都是(num INT, uid INT, PRIMARY KEY(num, uid)) WITHOUT ROWID
Python
为每个表生成并导入约两亿行数据(一千万个uid
, num
范围[0, 255]
)
给每个表喂 1, 2, ..., 10
个(之前未用过)数据,每个方法测试两次(第一次没有缓存,第二次可利用缓存)。如:
方法一,a
表需要匹配16
,b
表需匹配88
方法一,a
表需要匹配16
,b
表需匹配88
(这次会利用操作系统缓存)
方法二,a
表34
,b
表127
方法二,a
表34
,b
表127
(利用缓存)
方法一,a
表251, 93
,b
表19, 29
方法一,a
表251, 93
,b
表19, 29
(缓存)
……
SQL
模板:
SELECT count(*)
FROM (SELECT uid FROM a WHERE num = 10)
JOIN (SELECT uid FROM b WHERE num = 60) USING(uid)
JOIN (SELECT uid FROM a WHERE num = 20) USING(uid)
JOIN (SELECT uid FROM b WHERE num = 70) USING(uid)
(见下面代码)
关系除法。但我测试了一下,3分钟都没结束,直接放弃。。
每个表参数个数 | 方法一结果个数 | 方法二结果个数 | 方法一用时/秒 | 方法二用时/秒 |
---|---|---|---|---|
1 | 61215 | 60618 | 50.37 | 33.373 |
2 | 364 | 344 | 78.277 | 65.338 |
3 | 3 | 1 | 80.83 | 98.5 |
4 | 0 | 0 | 76.423 | 132.935 |
5 | 0 | 0 | 76.119 | 161.369 |
6 | 0 | 0 | 80.988 | 195.302 |
7 | 0 | 0 | 78.959 | 228.609 |
8 | 0 | 0 | 80.601 | 258.55 |
9 | 0 | 0 | 76.376 | 143.318 |
10 | 0 | 0 | 77.137 | 158.763 |
每个表参数个数 | 方法一结果个数 | 方法二结果个数 | 方法一用时/s | 方法二用时/s |
---|---|---|---|---|
1 | 61215 | 60618 | 0.454 | 2.753 |
2 | 364 | 344 | 0.539 | 3.485 |
3 | 3 | 1 | 0.521 | 4.954 |
4 | 0 | 0 | 0.513 | 6.599 |
5 | 0 | 0 | 0.489 | 8.049 |
6 | 0 | 0 | 0.497 | 9.688 |
7 | 0 | 0 | 0.512 | 11.084 |
8 | 0 | 0 | 0.517 | 12.429 |
9 | 0 | 0 | 0.502 | 6.906 |
10 | 0 | 0 | 0.497 | 7.516 |
DB_FILE='test.db'
seq 0 255 | shuf | for count in {1..10}; do
readarray -n $((count*2)) -t arr
IFS=$'\n'
for times in {1..2}; do
echo "Index: $count, Times: $times, Method: 1"
time sed \
-e '2,$s/^.*$/JOIN (#&) USING(uid)/' \
-e '1s/^.*$/SELECT count(*) FROM (#&)/' \
-e 's/#/SELECT uid FROM # WHERE num = /' \
-e '1~2s/#/a/; 2~2s/#/b/' <<<"${arr[*]}" |
sqlite3 -echo "$DB_FILE"
done
readarray -n "$count" -t arr1
readarray -n "$count" -t arr2
IFS=','
for times in {1..2}; do
echo "Index: $count, Times: $times, Method: 2"
time sqlite3 -echo "$DB_FILE" <<-EOF
WITH
query_a(num) AS (
SELECT value
FROM json_each('[${arr1[*]}]')
),
query_b(num) AS (
SELECT value
FROM json_each('[${arr2[*]}]')
),
find_a(uid) AS (
SELECT a.uid
FROM a
JOIN query_a USING(num)
GROUP BY a.uid
HAVING count(*) = (SELECT count(*) FROM query_a)
),
find_b(uid) AS (
SELECT b.uid
FROM b
JOIN query_b USING(num)
GROUP BY b.uid
HAVING count(*) = (SELECT count(*) FROM query_b)
)
SELECT count(*)
FROM find_a
JOIN find_b USING(uid)
EOF
done
done
re.findall(
r'^Count: (?P<index>\d+), Times: (?P<times>\d+), SQL: (?P<method>\d+)$.*?'
r'^(?P<result>\d+)$\s*?'
r'^real\s*?(?P<min>\d+)m(?P<sec>[\d.]+)s$',
data,
re.S | re.M,
)
SELECT * FROM 员工信息表 WHERE
员工ID IN (SELECT 员工ID FROM 员工技能表 WHERE 技能 = '技能1')
AND 员工ID IN (SELECT 员工ID FROM 员工技能表 WHERE 技能 = '技能2')
AND 员工ID IN (SELECT 员工ID FROM 员工手机表 WHERE 手机 = '手机1')
AND 员工ID IN (SELECT 员工ID FROM 员工手机表 WHERE 手机 = '手机2');
它和
SELECT count(*)
FROM (SELECT uid FROM a WHERE num = 10)
JOIN (SELECT uid FROM b WHERE num = 60) USING(uid)
JOIN (SELECT uid FROM a WHERE num = 20) USING(uid)
JOIN (SELECT uid FROM b WHERE num = 70) USING(uid)
不一样,你没有“员工信息表”,而且JOIN
和AND IN
也不等价。
@无名啊,为什么我的方案可能会更快?
因为按照「都关注了某100人」和「都收藏了某100帖子」这种数量级,表a
和表b
的行数应该远多于那个主键唯一的“员工信息表”。所以在“员工信息表”里不断排除范围可能比让其他表互相做JOIN更快。
而且
SELECT * FROM 员工信息表 WHERE
员工ID IN (SELECT 员工ID FROM 员工技能表 WHERE 技能 = '技能1')
AND 员工ID IN (SELECT 员工ID FROM 员工技能表 WHERE 技能 = '技能2')
AND 员工ID IN (SELECT 员工ID FROM 员工手机表 WHERE 手机 = '手机1')
AND 员工ID IN (SELECT 员工ID FROM 员工手机表 WHERE 手机 = '手机2');
和
SELECT * FROM 员工信息表 WHERE
员工ID IN (SELECT DISTINCT 员工ID FROM 员工技能表 WHERE 技能 = '技能1')
AND 员工ID IN (SELECT DISTINCT 员工ID FROM 员工技能表 WHERE 技能 = '技能2')
AND 员工ID IN (SELECT DISTINCT 员工ID FROM 员工手机表 WHERE 手机 = '手机1')
AND 员工ID IN (SELECT DISTINCT 员工ID FROM 员工手机表 WHERE 手机 = '手机2');
以及
SELECT count(*)
FROM (SELECT uid FROM a WHERE num = 10)
JOIN (SELECT uid FROM b WHERE num = 60) USING(uid)
JOIN (SELECT uid FROM a WHERE num = 20) USING(uid)
JOIN (SELECT uid FROM b WHERE num = 70) USING(uid)
和
SELECT count(*)
FROM (SELECT DISTINCT uid FROM a WHERE num = 10)
JOIN (SELECT DISTINCT uid FROM b WHERE num = 60) USING(uid)
JOIN (SELECT DISTINCT uid FROM a WHERE num = 20) USING(uid)
JOIN (SELECT DISTINCT uid FROM b WHERE num = 70) USING(uid)
之间,可能存在性能差异(取决于优化器聪不聪明)。
因为 表结构是a(num INT, uid INT, PRIMARY KEY(num, uid))
,不会出现 num = 10
时,有多个重复的 uid
?自然也没必要DISTINCT
?
SELECT count(*)
FROM (SELECT DISTINCT uid FROM a WHERE num = 10)
JOIN (SELECT DISTINCT uid FROM b WHERE num = 60) USING(uid)
JOIN (SELECT DISTINCT uid FROM a WHERE num = 20) USING(uid)
JOIN (SELECT DISTINCT uid FROM b WHERE num = 70) USING(uid)
补充:
这个SQL
应该也是同理
SELECT * FROM 员工信息表 WHERE
员工ID IN (SELECT DISTINCT 员工ID FROM 员工技能表 WHERE 技能 = '技能1')
AND 员工ID IN (SELECT DISTINCT 员工ID FROM 员工技能表 WHERE 技能 = '技能2')
AND 员工ID IN (SELECT DISTINCT 员工ID FROM 员工手机表 WHERE 手机 = '手机1')
AND 员工ID IN (SELECT DISTINCT 员工ID FROM 员工手机表 WHERE 手机 = '手机2');