正文
今儿发现AWS有一个mysql5.6.40数据库从库(2核16G,最大IOPS 2000)的CPU很不正常,如图:
先查看了一下mysql的总体情况:
1
2
3
4
5
6
7
8
9Uptime: 10471685 Threads: 894 Questions: 58466385559 Slow queries: 2466 Opens: 321642 Flush tables: 1 Open tables: 2000 Queries per second avg: 5583.283 #Questions是已执行的由客户端发出的语句
MySQL [easyip]> show status where `variable_name` = 'Threads_running';
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| Threads_running | 2 |
+-----------------+-------+
1 row in set (0.00 sec)
跑到里用show full processlist
一看,发现几乎都是sleep,只有偶尔才会刷出来一两条Query。
使用tcpdump -s 65535 -x -nn -q -tttt -c 500000 -i any port 3306 > mysql.txt
抓一个500M大小的包,然后wget percona.com/get/pt-query-digest
和chmod u+x pt-query-digest
来安装pt-query-digest
,一会我们用它分析一下这个包。
插播一句,如果使用pt-query-digest
的时候爆Can't locate Digest/MD5.pm in @INC
这样的错误,请执行yum install perl-Digest-MD5.x86_64
。
然后执行./pt-query-digest --type tcpdump --watch-server='mysql的IP地址:3306' mysql.txt >3306.log
,打开这个3306.log
一探究竟:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63# 91.2s user time, 4.1s system time, 38.76M rss, 203.35M vsz
# Current date: Wed Sep 11 13:25:02 2019
# Hostname: ip-10-10-153-64.eu-central-1.compute.internal
# Files: mysql.txt
# Overall: 107.03k total, 96 unique, 2.94k QPS, 2.89x concurrency ________
# Time range: 2019-09-11 13:22:21.774093 to 13:22:58.211771
# Attribute total min max avg 95% stddev median
# ============ ======= ======= ======= ======= ======= ======= =======
# Exec time 105s 0 73ms 984us 3ms 2ms 424us
# Rows affecte 0 0 0 0 0 0 0
# Query size 19.41M 18 1.59k 190.15 346.17 109.64 158.58
# Warning coun 0 0 0 0 0 0 0
# Profile
# Rank Query ID Response time Calls R/Call V/M Item
# ==== ============================= ============= ===== ====== ===== ====
# 1 0x17439B73926DAFE4B1F7504E... 62.6518 59.5% 62408 0.0010 0.01 SELECT macuserinfo pcs_sys_user
# 2 0x221ABFF7BE95D0B83774ECE7... 7.7099 7.3% 4374 0.0018 0.01 SELECT device_channel_config_info
# 3 0x8D042CE60A67A30092CD7CFD... 6.9497 6.6% 4978 0.0014 0.01 SELECT push_client_info pcs_user_dev_link pcs_sys_user
# 4 0xEBBF1DEAACA1353A4DD445F3... 4.5713 4.3% 3952 0.0012 0.00 SELECT pcs_adm_dev_storage_strategy
# 5 0x49348052DEA8C471AD64477C... 4.1641 4.0% 5227 0.0008 0.00 SELECT pcs_user_dev_link pcs_sys_user
# 6 0xF96D9A42E1EFE0A84717B273... 3.8453 3.7% 4970 0.0008 0.00 SELECT user_authorize_info
# 7 0x40FC0DACAACA50F36466B100... 2.7108 2.6% 4132 0.0007 0.00 SELECT pcs_adm_dev_storage_strategy
# 8 0x24A6818C09D884D5B2DD04FB... 2.2897 2.2% 2095 0.0011 0.01 SELECT device_config_info macuserinfo device_config_info
# 9 0x6E5549936C772E40EBAE1FEE... 1.8566 1.8% 3355 0.0006 0.00 SELECT macinfo
# 10 0x238C2BAB786EE59102CF0654... 1.7066 1.6% 2077 0.0008 0.01 SELECT pcs_user_dev_link pcs_sys_user
# 11 0x327C32C0EBADC5D17957E04C... 0.9152 0.9% 1771 0.0005 0.00 SELECT device_config_info
# 12 0x7190F54D663D68CFE1680408... 0.6966 0.7% 1021 0.0007 0.00 SELECT pcs_adm_dev_storage_strategy
# MISC 0xMISC 5.2609 5.0% 6669 0.0008 0.0 <84 ITEMS>
# Query 1: 2.71k QPS, 1.72x concurrency, ID 0x17439B73926DAFE4B1F7504E191657C8 at byte 287906107
# This item is included in the report because it matches --limit.
# Scores: V/M = 0.01
# Time range: 2019-09-11 13:22:21.774093 to 13:22:58.211613
# Attribute pct total min max avg 95% stddev median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count 58 62408
# Exec time 59 63s 171us 69ms 1ms 3ms 2ms 424us
# Rows affecte 0 0 0 0 0 0 0 0
# Query size 50 9.88M 159 179 166.01 158.58 0.63 158.58
# Warning coun 0 0 0 0 0 0 0 0
# String:
# Hosts 10.10.153.64
# Query_time distribution
# 1us
# 10us
# 100us ################################################################
# 1ms #############
# 10ms #
# 100ms
# 1s
# 10s+
# Tables
# SHOW TABLE STATUS LIKE 'macuserinfo'\G
# SHOW CREATE TABLE `macuserinfo`\G
# SHOW TABLE STATUS LIKE 'pcs_sys_user'\G
# SHOW CREATE TABLE `pcs_sys_user`\G
# EXPLAIN /*!50100 PARTITIONS*/
select b.id As user_id, b.PROJECT As project, a.user AS user from macuserinfo a, pcs_sys_user b where a.user=b.username and a.devsequence='4L05B49PAZE5806' limit 1;\G
# Query 2: 120.17 QPS, 0.21x concurrency, ID 0x221ABFF7BE95D0B83774ECE7A62FCB96 at byte 319100665
...
...
...
从文件里可以看出这个select语句在本次分析中总的时间占比高达60%,但是单次调用耗时极少,不过QPS(每秒执行的查询次数)却高达2.71k QPS,这个应该是CPU高的原因。
执行一下explain看一下:
然后有用show columns from pcs_sys_user;
、show index from pcs_sys_user;
和show columns from macuserinfo;
、show index from macuserinfo;
查看一下是否有漏建索引,发现没有,比如:
这种就是典型的“QPS高导致CPU使用率高”的场景:查询语句比较简单、执行效率高、SQL优化余地小。一般解决的方法就是从应用架构、实例规格等方面入手:
- 升级实例规格,增加CPU资源;
- 做了分库分表,将查询压力分担到多个RDS实例上;
- 使用批量操作,将多个操作合并为一次请求,但此种方式需要考虑是否可以一次批量的数据有多大,避免造成慢sql;
后续措施
如上文所说,这个数据库里的sleep的进程也太多了,大量的sleep进程无法及时释放,也拖累系统性能,不过也不能把这个指设置的过小,否则可能会遭遇到MySQL has goneaway
之类的问题,默认的等待时间是28000秒,如下:
1
2
3
4
5
6
7MySQL [easyip]> show global variables like 'wait_timeout';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wait_timeout | 28800 |
+---------------+-------+
1 row in set (0.00 sec)
改成3600就够用了…
注意!本文中的例子里where条件是经常变化的,所以这种场合没有打开query_cache
。对于经常修改的表,使用查询缓存可能会加大副本滞后,因为缓存已锁定且会频繁刷新。而对于查询数据比较静态、查询重复度高、查询结果集小于3MB的场景,才考虑开启查询缓存。
最后,本次故障排查感谢arstercz大神的鼎力相助!
参考资料
https://docs.aws.amazon.com/zh_cn/AmazonRDS/latest/UserGuide/CHAP_Troubleshooting.html
https://aws.amazon.com/cn/blogs/china/pt-query-digest-rds-mysql-slow-searchnew/
https://jin-yang.github.io/post/mysql-monitor.html
https://tech.kujiale.com/shu-ju-ku-cpushi-yong-lu-100-pai-cha-ji/
http://mysql.taobao.org/monthly/2015/05/02/ (强力推荐!)