Well hello again, I have just learned that the host that recently had both NVMe drives fail during drive replacement now has new problems: the filesystem reports permanent data errors affecting the databases of both the Matrix server and the Telegram bridge.
I have just rented a new machine and am about to restore the database snapshot from July 26th, just in case. All the troubleshooting of the recent days was very exhausting; however, I will try to do, or at least prepare, this within the upcoming hours.
Update
After a rescan the errors have gone away; however, the drives logged errors too. The question now is whether the data integrity should be trusted.
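For anyone in the same spot: a scrub after clearing is the usual way to find out whether the pool still holds bad data. A rough sketch of the commands involved (the pool name `tank` is a placeholder; adjust to your setup):

```shell
# Show which datasets/files are affected by the permanent errors
zpool status -v tank

# Clear the error counters, then re-read and re-checksum every block
zpool clear tank
zpool scrub tank

# Watch scrub progress; if it finishes with 0 errors, the on-disk
# data matched its checksums on this pass
watch zpool status tank
```

A clean scrub only proves the data currently on disk matches its checksums; it says nothing about whether the drives will keep behaving.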
Status August 1st
Well … good question … optimizations were made last night, the restore was successful, and … we are back to debugging outgoing federation :(
The new hardware will also be a bit more powerful … and yes, I have not forgotten that I wanted to update that database. It's just that I was busy debugging federation problems.
References
- federation issues after restore: https://github.com/matrix-org/synapse/issues/16025
- why we had to restore initially: https://text.tchncs.de/tchncs/about-the-matrix-incident-on-july-26-2023
Thanks for putting in the work. Is there anything we can help you with? From what I understood the domain is German – is the server in Germany as well? I'm located in Germany and do sysadmin work. Fighting with hosting companies is part of my job. ;) Let me know if I can do anything. Have a good one!
Thank you :) Well, I am not sure if there was something to fight over, except maybe some sort of refund … for now it seems to be fine on the new machine. – Yes, I am from Germany; however, I think it's a Helsinki DC from Hetzner.
You're very welcome. Hetzner is generally a good host afaik; it does depend on the configuration, I suppose. Are you using a shared VPS or something else? If the storage is guaranteed (as in not custom hardware), they are technically responsible for its condition. A host I'm working with (also located at Hetzner, but in Falkenstein) does two backups a day, which also prevents having to revert far back.
On Hetzner it's all dedicated servers – out goes an AX51-NVMe, in comes an AX102. They tried a connector-cable swap to bring the NVMe(s) back to life; I was wondering if this could have something to do with the SMART errors logged and the temporary zpool errors. However, I think the CPU upgrade at least is very welcomed by the Matrix server 😅
Hm. In that case I'm not sure what their obligations are. It's very rare that I hear of NVMes downright failing.
If your SMART error rates start going up, that is a clear indicator that something is going to happen. I have a graph on my server showing the error rates. Actually, there is a "bad sectors" or "reallocated sectors" reading that should be more telling. Once they go up, it's critical, I think.
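Rather than eyeballing raw smartctl dumps, the handful of counters discussed here can be watched programmatically. A minimal Python sketch (field names taken from the logs pasted in this thread; the thresholds are my own illustrative assumptions, not vendor advice):

```python
import re

# Fields worth alerting on, with illustrative thresholds:
WATCHED = {
    "Percentage Used": 90,                 # NVMe wear indicator; 100%+ means past rated endurance
    "Media and Data Integrity Errors": 0,  # any increase is a bad sign
    "Error Information Log Entries": 0,    # controller-logged errors
}

def parse_counters(smart_text):
    """Return {field: int} for the watched fields found in a smartctl report."""
    counters = {}
    for field in WATCHED:
        m = re.search(rf"{re.escape(field)}:\s+([\d,]+)", smart_text)
        if m:
            counters[field] = int(m.group(1).replace(",", ""))
    return counters

def alerts(smart_text):
    """List the watched fields whose value exceeds its threshold."""
    return [f for f, v in parse_counters(smart_text).items() if v > WATCHED[f]]
```

Run against the old drive's log above, this would flag "Percentage Used" (255 > 90) while the error counters stay quiet, which matches the story: worn out, but no logged media errors before the failure.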
I didn't even know you also ran a Matrix server. I recently started looking into Matrix, but I can't really say anything yet. Is it federated as well? Or do you need to make a new account for each one?
Dang, the old host was deleted from the monitoring – however, looking at at least one SMART report from my emails, there were no errors logged before the drives gave up on life during replacement. They just had a ton read/written, and the "Percentage Used" counter was at 255% (even though r/w and age were not equal between the two, which is weird, and one reason why I wanted to have at least one replaced in the first place). This is the one that had more:
```
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x04
Temperature:                        53 Celsius
Available Spare:                    98%
Available Spare Threshold:          10%
Percentage Used:                    255%
Data Units Read:                    7,636,639,249 [3.90 PB]
Data Units Written:                 2,980,551,083 [1.52 PB]
Host Read Commands:                 87,676,174,127
Host Write Commands:                28,741,297,023
Controller Busy Time:               705,842
Power Cycles:                       7
Power On Hours:                     17,437
Unsafe Shutdowns:                   1
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               53 Celsius
Temperature Sensor 2:               64 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
```
The new ones, where the zpool errors happened, look like this:
```
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        24 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    3%
Data Units Read:                    122,135,021 [62.5 TB]
Data Units Written:                 31,620,076 [16.1 TB]
Host Read Commands:                 1,014,224,069
Host Write Commands:                231,627,064
Controller Busy Time:               3,909
Power Cycles:                       2
Power On Hours:                     117
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      4
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               24 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc    LBA  NSID    VS
  0          4     0  0x0000  0x8004  0x000      0     0     -
```
```
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        24 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    2%
Data Units Read:                    153,193,333 [78.4 TB]
Data Units Written:                 29,787,075 [15.2 TB]
Host Read Commands:                 1,262,977,843
Host Write Commands:                230,135,280
Controller Busy Time:               4,804
Power Cycles:                       11
Power On Hours:                     119
Unsafe Shutdowns:                   5
Media and Data Integrity Errors:    0
Error Information Log Entries:      14
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               24 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc    LBA  NSID    VS
  0         14     0  0x100d  0x8004  0x000      0     0     -
```
You're not telling me you're reading 62 TB in 117 hours, right? Right? xD The old ones were even in the petabytes.
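For what it's worth, the rates implied by those logs are high but not physically implausible for NVMe. A quick back-of-the-envelope check (decimal units, matching smartctl's TB/PB figures):

```python
def avg_mb_per_s(bytes_total, hours):
    """Average sustained rate in MB/s implied by a lifetime byte count."""
    return bytes_total / (hours * 3600) / 1e6

# New drive: 62.5 TB read over 117 power-on hours -> ~148 MB/s average
new_drive = avg_mb_per_s(62.5e12, 117)

# Old drive: 3.90 PB read over 17,437 power-on hours -> ~62 MB/s average
old_drive = avg_mb_per_s(3.90e15, 17_437)
```

So the new drive's average read rate is well within what a single NVMe can sustain; what stands out is that it accumulated that volume while supposedly mostly idle.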
Those numbers are just insane. I have worked with AI training and storage. I have never seen such numbers.
Well, I suppose that NVMe was very much EOL. Now I understand the behavior. This many operations in such a short time will put serious strain on your system. No wonder parts give up. Are you using a RAID config? Sorry if you already mentioned it.
I am not sure about those numbers on the new ones … it was one DB restore and a few hours of uptime … a scrub …, then I rsynced some stuff over, and since then the thing has been idle 🤷
Sample from the currently active system … I think at time of arrival it was 2+ TB written or something:
```
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        37 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    88,116,921 [45.1 TB]
Data Units Written:                 43,968,235 [22.5 TB]
Host Read Commands:                 689,015,212
Host Write Commands:                409,762,513
Controller Busy Time:               1,477
Power Cycles:                       4
Power On Hours:                     248
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               37 Celsius
Temperature Sensor 2:               46 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
```
Yes, it is federated – however, since there is no SSO on the Lemmy instance, you need to make a new account, just like you need new accounts between email providers. :) It is a different federation protocol, though: Matrix vs. ActivityPub.
Cool! Thanks! I will check it out.
I feel for you. Is #tchncs:tchncs.de still the correct toon when it's running fine again?
what do you mean?
Milan, in compensation for all this hassle you should take the next three Mondays off.
I like that idea