Fixing a Ceph cluster PG problem: HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
After replacing the patch cords on a test cluster, I found an error in Ceph.
$ ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 2.2e is active+clean+inconsistent, acting [8]
1 scrub errors
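When more than one PG is affected, it can help to pull the PG IDs out of the health report instead of reading them by eye. The following is a minimal sketch: the function name `inconsistent_pgs` is made up for this example, and it assumes the report has the same line layout as the output above (`pg <id> is <state>, ...`).

```shell
# Hypothetical helper: print the IDs of PGs whose state contains "inconsistent".
# Reads `ceph health detail` output on stdin; assumes lines of the form
#   pg <id> is <state>, acting [...]
inconsistent_pgs() {
    awk '$1 == "pg" && $3 == "is" && $4 ~ /inconsistent/ { print $2 }'
}
```

Usage would be `ceph health detail | inconsistent_pgs`, which for the output above should print `2.2e`.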
Check the Ceph logs for entries about this PG:
$ grep 2.2e /var/log/ceph/*
/var/log/ceph/ceph.audit.log:2017-04-24 15:38:46.124303 mon.2 192.168.2.120:6789/0 103771 : audit [INF] from='client.? 192.168.2.101:0/2439225090' entity='client.admin' cmd=[{"prefix": "pg repair", "pgid": "2.2e"}]: dispatch
/var/log/ceph/ceph.log:2017-04-24 10:10:24.576558 osd.3 192.168.2.100:6804/4914 4445 : cluster [INF] 2.2e deep-scrub starts
/var/log/ceph/ceph.log:2017-04-24 10:10:37.434117 osd.3 192.168.2.100:6804/4914 4446 : cluster [ERR] 2.2e shard 8: soid 2:74433408rbd_data.a12342ae8944a.000000000000071f:head candidate had a read error
/var/log/ceph/ceph.log:2017-04-24 10:10:48.940079 osd.3 192.168.2.100:6804/4914 4447 : cluster [ERR] 2.2e deep-scrub 0 missing, 1 inconsistent objects
/var/log/ceph/ceph.log:2017-04-24 10:10:48.940085 osd.3 192.168.2.100:6804/4914 4448 : cluster [ERR] 2.2e deep-scrub 1 errors
/var/log/ceph/ceph.log:2017-04-24 15:38:46.717506 osd.3 192.168.2.100:6804/4914 4459 : cluster [INF] 2.2e repair starts
/var/log/ceph/ceph.log:2017-04-24 15:38:56.741299 osd.3 192.168.2.100:6804/4914 4460 : cluster [ERR] 2.2e shard 8: soid 2:74433408rbd_data.a12342ae8944a.000000000000071f:head candidate had a read error
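A plain `grep 2.2e` treats the dot as a regex wildcard and also returns the informational lines. A slightly stricter filter, sketched below, matches the PG ID as a fixed string and keeps only `[ERR]` entries. The name `pg_errors` is hypothetical, and the sketch assumes the PG ID appears surrounded by spaces, as it does in the cluster log lines above.

```shell
# Hypothetical helper: show only [ERR] log lines that mention a given PG.
# Usage: pg_errors <pgid> [file...]   (reads stdin when no files are given)
pg_errors() {
    pgid=$1
    shift
    # -F: match the PG ID literally, so "." is not a wildcard
    # -h: do not prefix matches with file names
    grep -hF -- " $pgid " "$@" | grep -F '[ERR]'
}
```

For example: `pg_errors 2.2e /var/log/ceph/ceph.log`.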
Start the repair of the placement group:
$ ceph pg repair 2.2e
instructing pg 2.2e on osd.3 to repair
A few seconds later, the PG has been repaired successfully and the cluster is back to a healthy state.
$ ceph health detail
HEALTH_OK
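Instead of re-running the health check by hand, the wait can be scripted. This is only a sketch: `wait_for_health_ok` is a hypothetical name, the attempt count and delay are arbitrary defaults, and it assumes `ceph health` prints a line starting with `HEALTH_OK` once the cluster has recovered.

```shell
# Hypothetical helper: poll `ceph health` until it reports HEALTH_OK.
# Usage: wait_for_health_ok [attempts] [delay_seconds]
# Returns 0 once the cluster is healthy, 1 if it never got there.
wait_for_health_ok() {
    attempts=${1:-30}
    delay=${2:-2}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        case "$(ceph health)" in
            HEALTH_OK*) return 0 ;;
        esac
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

For example, `ceph pg repair 2.2e && wait_for_health_ok 30 2` would wait up to about a minute for the repair to finish.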
Check the PG information:
$ ceph pg 2.2e query
{
    "state": "active+clean",
    "snap_trimq": "[]",
    "epoch": 516,
    "up": [
        3,
        8
    ],
    "acting": [
        3,
        8
    ],
    "actingbackfill": [
        "3",
        "8"
    ],
Check the log once more and confirm that the problem has been fixed:
$ grep 2.2e /var/log/ceph/*
/var/log/ceph/ceph-osd.3.log:2017-04-24 15:38:46.717501 7f2070c50700 0 log_channel(cluster) log [INF] : 2.2e repair starts
/var/log/ceph/ceph-osd.3.log:2017-04-24 15:38:56.741297 7f2070c50700 -1 log_channel(cluster) log [ERR] : 2.2e shard 8: soid 2:74433408rbd_data.a12342ae8944a.000000000000071f:head candidate had a read error
/var/log/ceph/ceph-osd.3.log:2017-04-24 15:39:07.692446 7f206e44b700 -1 log_channel(cluster) log [ERR] : 2.2e repair 0 missing, 1 inconsistent objects
/var/log/ceph/ceph-osd.3.log:2017-04-24 15:39:07.752099 7f206e44b700 -1 log_channel(cluster) log [ERR] : 2.2e repair 1 errors, 1 fixed
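The line that matters here is the summary `2.2e repair 1 errors, 1 fixed`: the number of errors found equals the number fixed. As a rough sketch, that comparison can be automated. The function name `repair_fully_fixed` is invented for this example, and it assumes the summary line has exactly the wording shown in the log above.

```shell
# Hypothetical helper: succeed only if the last repair summary for a PG
# reports as many fixed errors as it found.
# Reads log lines on stdin; expects "<pgid> repair N errors, M fixed".
repair_fully_fixed() {
    pgid=$1
    awk -v pg="$pgid" '
        {
            for (i = 1; i + 5 <= NF; i++)
                if ($i == pg && $(i+1) == "repair" &&
                    $(i+3) == "errors," && $(i+5) == "fixed") {
                    errors = $(i+2); fixed = $(i+4); seen = 1
                }
        }
        # Exit 0 only when a summary was seen and every error was fixed
        END { exit !(seen && errors == fixed) }
    '
}
```

For example: `grep -hF ' 2.2e ' /var/log/ceph/ceph-osd.3.log | repair_fully_fixed 2.2e && echo "repair complete"`. Running `ceph pg deep-scrub 2.2e` afterwards is another way to confirm that no inconsistencies remain.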
It is good practice to use LACP on servers, so that unplugging a single link does not cut the node off from the network.
But since this is a test cluster, errors like this are not unusual.