AvengerMoJo

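Currently living in Taiwan, previously lived in Beijing and Dallas, born in Hong Kong. Open source engineer, software consultant, serial entrepreneur, loves cooking, 博雅 Mentor.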

Introduce SUSE Enterprise Storage 6 (part 4)

Troubleshooting in day 2:

No matter how robust a system is, we still need to monitor it, and from time to time an accident may happen. Logs and debug messages are the basic tools administrators and application developers use to understand what is happening in the cluster. However, extensive logging has a performance impact on the cluster. The log level can be adjusted from 0 to 20, with 20 being the most verbose. In a lot of performance testing we may even turn off logging or authorization to gain a better understanding of the impact of logging and debugging. During normal operation, though, moderate logging is needed. The level can also be adjusted at run time, so when an accident happens it can be turned up for more detail. SES also provides a system tool, supportconfig, that lets system engineers collect the system configuration and debug logs. The collected information can then be sent back to the developers to help them understand the cluster incident.

supportconfig is also the standard support tool of SUSE Linux Enterprise Server, so if you are interested in how it works, take a look at the SLES 15 SP1 Administration Guide. Beware that it may take a while to run and may need a lot of disk space.

$ supportconfig
/var/log/nts__.txz

Depending on which service is having trouble, the admin needs to log in to the node running that daemon to access its log file.

/var/log/ceph
ceph-mon.NODENAME.log
ceph-mgr.NODENAME.log
ceph-osd.X.log
ceph-mds.NODENAME.log
ceph-client.rgw.NODENAME.log
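
As a small illustration of the naming pattern above, the following shell sketch builds the expected log path for a daemon. The function name ceph_log_path and the LOG_DIR variable are hypothetical helpers made up for this example, and the pattern is simply the one listed above; a cluster with customized logging may write elsewhere.

```shell
#!/bin/sh
# Hypothetical helper: print the log file path for a Ceph daemon,
# following the naming pattern listed above.
# LOG_DIR is an assumption of this sketch and can be overridden.
LOG_DIR="${LOG_DIR:-/var/log/ceph}"

ceph_log_path() {
    # $1 = daemon type (mon, mgr, osd, mds, client.rgw)
    # $2 = node name, or the OSD id for osd daemons
    printf '%s/ceph-%s.%s.log\n' "$LOG_DIR" "$1" "$2"
}
```

On the daemon node, something like tail -f "$(ceph_log_path osd 0)" would then follow the log of osd.0.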

To set a different log level for an OSD, either use ceph tell from the admin node, or ssh into the daemon node and use ceph daemon.

$ ceph tell osd.0 config set debug_osd 20
$ ceph daemon osd.0 config show | grep debug_osd
$ ceph daemon osd.0 config set debug_osd 0/20

Sometimes you will see a setting such as debug-mon 0/10 in the config file: the first number, 0, is the log file level and the second, 10, is the in-memory log level.

$ ceph tell mon. config set debug_mon 1/10
$ ceph daemon mon. config show | grep debug_mon
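
To make the two-number notation concrete, here is a minimal shell sketch that splits a debug setting into its two parts. The function name split_debug_level is hypothetical, and treating a single number as applying to both levels is an assumption based on how Ceph documents the shorthand.

```shell
#!/bin/sh
# Hypothetical helper: split a Ceph debug setting such as "1/10" into
# its log-file level and its in-memory level.
split_debug_level() {
    case "$1" in
        */*)
            file_level=${1%%/*}   # part before the slash: log file level
            mem_level=${1#*/}     # part after the slash: in-memory level
            ;;
        *)
            # Assumption: a single value sets both levels to the same number.
            file_level=$1
            mem_level=$1
            ;;
    esac
    printf 'file=%s memory=%s\n' "$file_level" "$mem_level"
}
```

For example, split_debug_level 0/20 reports a log file level of 0 and an in-memory level of 20, which is the setting used for osd.0 earlier.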

Maybe your hard drive has a physical error and you need to understand what is wrong. You should keep a baseline record from when you first installed the system. Note that a smartctl long test may take anywhere from a couple of minutes to a couple of hours.

$ hdparm -tT /dev/sdX
$ smartctl -a /dev/sdX
$ smartctl -t long /dev/sdX
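
One way to keep the baseline mentioned above is to save the smartctl -a report at install time and diff it later. This is only a sketch: the function names, the BASELINE_DIR location and the SMARTCTL override are all assumptions made for this example, not part of SES or smartmontools.

```shell
#!/bin/sh
# SMARTCTL and BASELINE_DIR are overridable conveniences of this sketch.
SMARTCTL="${SMARTCTL:-smartctl}"
BASELINE_DIR="${BASELINE_DIR:-/root/smart-baseline}"

baseline_file() {
    # Map a device path to a file name, e.g. /dev/sda -> $BASELINE_DIR/dev_sda.txt
    printf '%s/%s.txt\n' "$BASELINE_DIR" "$(printf '%s' "${1#/}" | tr '/' '_')"
}

save_baseline() {
    # Store the full SMART report for the device given as $1.
    mkdir -p "$BASELINE_DIR" && $SMARTCTL -a "$1" > "$(baseline_file "$1")"
}

compare_to_baseline() {
    # Show what changed since the baseline was taken.
    # diff exits non-zero when the current report differs.
    $SMARTCTL -a "$1" | diff "$(baseline_file "$1")" -
}
```

Expect some lines, such as timestamps and power-on hours, to differ on every run; the interesting changes are in error counters and reallocated sectors.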

When the cluster is in trouble, we may also want to disable scrubbing temporarily and re-enable it afterwards.

$ ceph osd set noscrub
$ ceph osd unset noscrub
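
Since it is easy to forget the unset step, the pair of commands above can be wrapped around whatever troubleshooting work needs to run. The following is a minimal sketch; the with_noscrub function and the CEPH override variable are hypothetical names for this example.

```shell
#!/bin/sh
# CEPH can be overridden (for example CEPH=echo) for a dry run; this
# variable is a convenience of this sketch, not a Ceph feature.
CEPH="${CEPH:-ceph}"

with_noscrub() {
    # Pause scrubbing, run the given command, then re-enable scrubbing
    # whether or not the command succeeded.
    $CEPH osd set noscrub || return 1
    "$@"
    status=$?
    $CEPH osd unset noscrub
    return "$status"
}
```

For example, with_noscrub ceph pg dump summary runs the dump while scrubbing is paused and returns the exit status of the wrapped command.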

Before repairing an OSD, keep in mind that a PG may also be the cause of the problem. We can examine the PGs as follows:

$ ceph pg dump summary
$ ceph pg dump pools
$ ceph pg dump_json
$ ceph pg dump | less

Then you can start repairing individual PGs accordingly.

$ ceph pg repair
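
ceph pg repair takes the ID of the PG to repair. As a hedged sketch of how to find candidates, the helper below reads ceph health detail style text and prints a repair command for each PG reported as inconsistent, without running anything. The function name suggest_pg_repairs is hypothetical, and the line format it matches is an assumption noted in the comment.

```shell
#!/bin/sh
# Hypothetical helper: read `ceph health detail` style text on stdin and
# print (not run) a repair command for each PG reported as inconsistent.
# Assumed line format:
#   pg 1.2f is active+clean+inconsistent, acting [0,1,2]
suggest_pg_repairs() {
    awk '$1 == "pg" && $3 == "is" && $4 ~ /inconsistent/ {
        print "ceph pg repair " $2
    }'
}
```

Running ceph health detail | suggest_pg_repairs then gives a list of commands to review before executing any of them by hand.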

License: CC BY-NC-ND 2.0
