Well hello again, I have just learned that the host that recently had both NVMe drives fail during drive replacement now has new problems: the filesystem reports permanent data errors affecting the databases of both the Matrix server and the Telegram bridge.
I have just rented a new machine and am about to restore the database snapshot from July 26th, just in case. All the troubleshooting of the recent days was very exhausting; however, I will try to do, or at least prepare, this within the upcoming hours.
Update
After a rescan the errors have gone away; however, the drives logged errors too. The question now is whether the data integrity should be trusted.
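For anyone in the same spot: a scrub after clearing is the usual way to find out whether the pool still holds bad data. A rough sketch of the commands involved (the pool name `tank` is a placeholder; adjust to your setup):

```shell
# Show which datasets/files are affected by the permanent errors
zpool status -v tank

# Clear the error counters, then re-read and re-checksum every block
zpool clear tank
zpool scrub tank

# Watch scrub progress; if it finishes with 0 errors, the on-disk
# data matched its checksums on this pass
watch zpool status tank
```

A clean scrub only proves the data currently on disk matches its checksums; it says nothing about whether the drives will keep behaving.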
Status August 1st
Well … good question … optimizations were made last night, the restore was successful, and … we are back to debugging outgoing federation :(
The new hardware will also be a bit more powerful … and yes, I have not forgotten that I wanted to update that database. It's just that I was busy debugging federation problems.
References
- federation issues after restore: https://github.com/matrix-org/synapse/issues/16025
- why we had to restore initially: https://text.tchncs.de/tchncs/about-the-matrix-incident-on-july-26-2023
Thanks for putting in the work. Is there anything we can help you with? From what I understood the domain is German – is the server in Germany as well? I'm located in Germany and do sysadmin work. Fighting with hosting companies is part of my job. ;) Let me know if I can do anything. Have a good one!
Thank you :) Well, I am not sure if there was something to fight over, except maybe some sort of refund … for now it seems to be fine on the new machine. – Yes, I am from Germany; however, I think it's a Helsinki DC from Hetzner.
You're very welcome. Hetzner is generally a good host afaik; it does depend on the configuration, I suppose. Are you using a shared VPS or something else? If the storage is guaranteed (as in not custom hardware), they are technically responsible for its condition. A host I'm working with (also located at Hetzner, but in Falkenstein) does two backups a day, which also prevents having to revert far back.
On Hetzner it's all dedicated servers – out goes an AX51-NVMe, in comes an AX102. They tried a connector-cable swap to bring the NVMe(s) back to life; I was wondering if this could have something to do with the SMART errors logged and the temporary zpool errors. However, I think the CPU upgrade at least is very welcomed by the Matrix server 😅
Hm. In that case I'm not sure what their obligations are. It's very rare that I hear of NVMes downright failing.
If your SMART error rates start going up, that is a clear indicator that something is going to happen. I have a graph on my server showing the error rates. Actually, there is a "bad sectors" or "reallocated sectors" reading that should be more telling. Once they go up, it's critical, I think.
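Rather than eyeballing raw smartctl dumps, the handful of counters discussed here can be watched programmatically. A minimal Python sketch (field names taken from the logs pasted in this thread; the thresholds are my own illustrative assumptions, not vendor advice):

```python
import re

# Fields worth alerting on, with illustrative thresholds:
WATCHED = {
    "Percentage Used": 90,                 # NVMe wear indicator; 100%+ means past rated endurance
    "Media and Data Integrity Errors": 0,  # any increase is a bad sign
    "Error Information Log Entries": 0,    # controller-logged errors
}

def parse_counters(smart_text):
    """Return {field: int} for the watched fields found in a smartctl report."""
    counters = {}
    for field in WATCHED:
        m = re.search(rf"{re.escape(field)}:\s+([\d,]+)", smart_text)
        if m:
            counters[field] = int(m.group(1).replace(",", ""))
    return counters

def alerts(smart_text):
    """List the watched fields whose value exceeds its threshold."""
    return [f for f, v in parse_counters(smart_text).items() if v > WATCHED[f]]
```

Run against the old drive's log above, this would flag "Percentage Used" (255 > 90) while the error counters stay quiet, which matches the story: worn out, but no logged media errors before the failure.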
I didn't even know you also ran a Matrix server. I recently started looking into Matrix, but I can't really say anything yet. Is it federated as well? Or do you need to make a new account for each one?
Dang, the old host was deleted from the monitoring – however, looking at at least one SMART report from my emails, there were no errors logged before the drives gave up on life during replacement. They just had a ton read/written, and the "Percentage Used" counter was at 255% (even though r/w and age were not equal between the two, which is weird, and one reason why I wanted to have at least one replaced in the first place). This is the one that had more:
```
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x04
Temperature:                        53 Celsius
Available Spare:                    98%
Available Spare Threshold:          10%
Percentage Used:                    255%
Data Units Read:                    7,636,639,249 [3.90 PB]
Data Units Written:                 2,980,551,083 [1.52 PB]
Host Read Commands:                 87,676,174,127
Host Write Commands:                28,741,297,023
Controller Busy Time:               705,842
Power Cycles:                       7
Power On Hours:                     17,437
Unsafe Shutdowns:                   1
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               53 Celsius
Temperature Sensor 2:               64 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
```
The new ones, where the zpool errors happened, look like this:
```
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        24 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    3%
Data Units Read:                    122,135,021 [62.5 TB]
Data Units Written:                 31,620,076 [16.1 TB]
Host Read Commands:                 1,014,224,069
Host Write Commands:                231,627,064
Controller Busy Time:               3,909
Power Cycles:                       2
Power On Hours:                     117
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      4
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               24 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc    LBA  NSID    VS
  0          4     0  0x0000  0x8004  0x000      0     0     -
```
```
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        24 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    2%
Data Units Read:                    153,193,333 [78.4 TB]
Data Units Written:                 29,787,075 [15.2 TB]
Host Read Commands:                 1,262,977,843
Host Write Commands:                230,135,280
Controller Busy Time:               4,804
Power Cycles:                       11
Power On Hours:                     119
Unsafe Shutdowns:                   5
Media and Data Integrity Errors:    0
Error Information Log Entries:      14
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               24 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc    LBA  NSID    VS
  0         14     0  0x100d  0x8004  0x000      0     0     -
```
You're not telling me you're reading 62 TB in 117 hours, right? Right? xD The old ones were even in the petabytes.
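For what it's worth, the rates implied by those logs are high but not physically implausible for NVMe. A quick back-of-the-envelope check (decimal units, matching smartctl's TB/PB figures):

```python
def avg_mb_per_s(bytes_total, hours):
    """Average sustained rate in MB/s implied by a lifetime byte count."""
    return bytes_total / (hours * 3600) / 1e6

# New drive: 62.5 TB read over 117 power-on hours -> ~148 MB/s average
new_drive = avg_mb_per_s(62.5e12, 117)

# Old drive: 3.90 PB read over 17,437 power-on hours -> ~62 MB/s average
old_drive = avg_mb_per_s(3.90e15, 17_437)
```

So the new drive's average read rate is well within what a single NVMe can sustain; what stands out is that it accumulated that volume while supposedly mostly idle.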
Those numbers are just insane. I have worked with AI training and storage. I have never seen such numbers.
Well, I suppose that NVMe was very much EOL. Now I understand the behavior. This many operations in such a short time will put serious strain on your system. No wonder parts give up. Are you using a RAID config? Sorry if you already mentioned it.
I am not sure about those numbers on the new ones … it was one DB restore and a few hours of uptime … a scrub …, then I rsynced some stuff over, and since then the thing has been idle 🤷
Sample from the currently active system … I think at time of arrival it was 2+ TB written or something:
```
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        37 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    88,116,921 [45.1 TB]
Data Units Written:                 43,968,235 [22.5 TB]
Host Read Commands:                 689,015,212
Host Write Commands:                409,762,513
Controller Busy Time:               1,477
Power Cycles:                       4
Power On Hours:                     248
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               37 Celsius
Temperature Sensor 2:               46 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
```
Yes, it is federated – however, since there is no SSO on the Lemmy instance, you need to make a new account, just like you need new accounts between email providers. :) It is a different federation protocol, though: Matrix vs. ActivityPub.
Cool! Thanks! I will check it out.
I feel for you. Is #tchncs:tchncs.de still the correct toon when it's running fine again?
what do you mean?
Milan, in compensation for all this hassle you should take the next three Mondays off.
I like that idea