Well, hello again. I have just learned that the host that recently had both NVMe drives fail during drive replacement now has new problems: the filesystem reports permanent data errors affecting the databases of both the Matrix server and the Telegram bridge.

I have just rented a new machine and am about to restore the database snapshot from July 26, just in case. All the troubleshooting over the past few days was very exhausting; however, I will try to do this, or at least prepare it, within the next few hours.

Update

After a rescan the errors have gone away; however, the drives logged errors too. The question now is whether the data integrity can be trusted.

Status August 1st

Well … good question … optimizations were made last night, the restore was successful and … we are back to debugging outgoing federation :(


The new hardware will also be a bit more powerful … and yes, I have not forgotten that I wanted to update that database. It's just that I was busy debugging federation problems.


  • Haui@discuss.tchncs.de
    1 year ago

    Thanks for putting in the work. Is there anything we can help you with? From what I understood the domain is German; is the server in Germany as well? I'm located in Germany and do sysadmin work. Fighting with hosting companies is part of my job. ;) Let me know if I can do anything. Have a good one!

    • Milan@discuss.tchncs.deOPM
      1 year ago

      Thank you :) Well, I am not sure if there was something to fight over, except maybe some sort of refund … for now it seems to be fine on the new machine. And yes, I am from Germany; however, I think it's a Helsinki DC from Hetzner.

      • Haui@discuss.tchncs.de
        1 year ago

        You're very welcome. Hetzner is generally a good host afaik. It does depend on the configuration, I suppose. Are you using the shared VPS or something else? If the storage is guaranteed (as in not custom hardware), they are technically responsible for its condition. A host I'm working with (also located at Hetzner, but in Falkenstein) does 2 backups a day, which also prevents having to revert far back.

        • Milan@discuss.tchncs.deOPM
          1 year ago

          On Hetzner it's all dedicated servers: out goes an AX51-NVMe, in comes an AX102. They tried a connector cable swap to bring the NVMe(s) back to life. I was wondering if this could have something to do with the SMART errors logged and the temporary zpool errors. However, I think the CPU upgrade is at least very welcome for the Matrix server 😅

          • Haui@discuss.tchncs.de
            1 year ago

            Hm. In that case I'm not sure what their obligations are. It's very rare that I hear of NVMe drives downright failing.

            If your SMART error rates start going up, that is a clear indicator that something is gonna happen. I have a graph on my server showing the error rates. Actually, there is a "bad sectors" or "reallocated sectors" reading that should be more telling. Once they go up it's critical, I think.

            I didn't even know you also ran a Matrix server. I recently started looking into Matrix but I can't really say anything yet. Is it federated as well? Or do you need to make a new account for each one?

            • Milan@discuss.tchncs.deOPM
              1 year ago

              Dang, the old host was deleted from the monitoring. However, looking at at least one SMART report from my emails, there were no errors logged before the drives gave up on life during replacement. They just had a ton read/written and the used counter at 255% (even though read/write volume and age were not equal between the two drives; it's weird and one reason why I wanted to have at least one replaced in the first place). This is the one that had more:

              SMART/Health Information (NVMe Log 0x02)
              Critical Warning:                   0x04
              Temperature:                        53 Celsius
              Available Spare:                    98%
              Available Spare Threshold:          10%
              Percentage Used:                    255%
              Data Units Read:                    7,636,639,249 [3.90 PB]
              Data Units Written:                 2,980,551,083 [1.52 PB]
              Host Read Commands:                 87,676,174,127
              Host Write Commands:                28,741,297,023
              Controller Busy Time:               705,842
              Power Cycles:                       7
              Power On Hours:                     17,437
              Unsafe Shutdowns:                   1
              Media and Data Integrity Errors:    0
              Error Information Log Entries:      0
              Warning  Comp. Temperature Time:    0
              Critical Comp. Temperature Time:    0
              Temperature Sensor 1:               53 Celsius
              Temperature Sensor 2:               64 Celsius
              
              Error Information (NVMe Log 0x01, 16 of 64 entries)
              No Errors Logged
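
As a rough aid for reading the dump above, here is a minimal Python sketch that decodes the Critical Warning byte and the Percentage Used field. The bit meanings and the cap at 255 follow my reading of the NVMe SMART/Health log definition; the function names are hypothetical and not part of smartctl or any library.

```python
# Bit meanings of the "Critical Warning" byte in the NVMe SMART/Health log
# (Log 0x02), as I understand the NVMe specification. Assumption: only the
# five long-standing bits are listed here.
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature outside threshold",
    2: "NVM subsystem reliability degraded",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
}


def decode_critical_warning(value: int) -> list[str]:
    """Return the description of every warning bit set in `value`."""
    return [text for bit, text in CRITICAL_WARNING_BITS.items()
            if value & (1 << bit)]


def describe_percentage_used(value: int) -> str:
    """Explain the 'Percentage Used' field, which saturates at 255."""
    if value >= 255:
        return "255% or more (the field is capped at 255)"
    if value > 100:
        return f"{value}%: past the vendor-estimated endurance"
    return f"{value}% of the vendor-estimated endurance used"


if __name__ == "__main__":
    print(decode_critical_warning(0x04))   # value from the old drive
    print(describe_percentage_used(255))   # value from the old drive
    print(decode_critical_warning(0x00))   # value from the new drives
```

Under these assumptions, 0x04 on the old drive means the controller itself flagged degraded reliability, and 255% only says the wear estimate hit the top of the scale, not how far past it the drive really was.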
              

              The new ones, where the zpool errors happened, look like this:

              SMART/Health Information (NVMe Log 0x02)
              Critical Warning:                   0x00
              Temperature:                        24 Celsius
              Available Spare:                    100%
              Available Spare Threshold:          5%
              Percentage Used:                    3%
              Data Units Read:                    122,135,021 [62.5 TB]
              Data Units Written:                 31,620,076 [16.1 TB]
              Host Read Commands:                 1,014,224,069
              Host Write Commands:                231,627,064
              Controller Busy Time:               3,909
              Power Cycles:                       2
              Power On Hours:                     117
              Unsafe Shutdowns:                   0
              Media and Data Integrity Errors:    0
              Error Information Log Entries:      4
              Warning  Comp. Temperature Time:    0
              Critical Comp. Temperature Time:    0
              Temperature Sensor 1:               24 Celsius
              
              Error Information (NVMe Log 0x01, 16 of 256 entries)
              Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
                0          4     0  0x0000  0x8004  0x000            0     0     -
              
              SMART/Health Information (NVMe Log 0x02)
              Critical Warning:                   0x00
              Temperature:                        24 Celsius
              Available Spare:                    100%
              Available Spare Threshold:          5%
              Percentage Used:                    2%
              Data Units Read:                    153,193,333 [78.4 TB]
              Data Units Written:                 29,787,075 [15.2 TB]
              Host Read Commands:                 1,262,977,843
              Host Write Commands:                230,135,280
              Controller Busy Time:               4,804
              Power Cycles:                       11
              Power On Hours:                     119
              Unsafe Shutdowns:                   5
              Media and Data Integrity Errors:    0
              Error Information Log Entries:      14
              Warning  Comp. Temperature Time:    0
              Critical Comp. Temperature Time:    0
              Temperature Sensor 1:               24 Celsius
              
              Error Information (NVMe Log 0x01, 16 of 256 entries)
              Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
                0         14     0  0x100d  0x8004  0x000            0     0     -
              
              • Haui@discuss.tchncs.de
                1 year ago

                You're not telling me you're reading 62 TB in 117 hours, right? Right? xD The old ones were even in the petabytes.

                Those numbers are just insane. I have worked with AI training and storage. I have never seen such numbers.

                Well, I suppose that NVMe was very much EOL. Now I understand the behavior. This many operations in such a short time will put serious strain on your system. No wonder parts can give up. Are you using a RAID config? Sorry if you already mentioned it.
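
To put the "62 TB in 117 hours" figure into perspective, here is a minimal Python sketch that turns the SMART counters into an average read rate. It assumes one NVMe data unit is 1000 blocks of 512 bytes, which matches the TB values smartctl prints next to the counters. The function names are hypothetical, and the result is only a rough average: power-on hours are whole hours, and the counters may include activity from before the drive was put into service.

```python
BYTES_PER_DATA_UNIT = 512 * 1000  # assumption: 1 data unit = 1000 x 512 bytes


def terabytes(data_units: int) -> float:
    """Convert SMART data units to decimal terabytes."""
    return data_units * BYTES_PER_DATA_UNIT / 1e12


def average_rate_mb_s(data_units: int, power_on_hours: int) -> float:
    """Average transfer rate in MB/s over the drive's whole power-on time."""
    total_bytes = data_units * BYTES_PER_DATA_UNIT
    return total_bytes / (power_on_hours * 3600) / 1e6


if __name__ == "__main__":
    # "Data Units Read" and "Power On Hours" of the first new drive
    print(f"{terabytes(122_135_021):.1f} TB read")
    print(f"{average_rate_mb_s(122_135_021, 117):.0f} MB/s on average")
    # Same fields from the old drive's report
    print(f"{average_rate_mb_s(7_636_639_249, 17_437):.0f} MB/s on average")
```

With these inputs the new drive comes out at roughly 148 MB/s of sustained reads and the old one at roughly 62 MB/s over its lifetime.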

                • Milan@discuss.tchncs.deOPM
                  1 year ago

                  I am not sure about those numbers on the new ones … it was one DB restore and a few hours of uptime … a scrub … then I rsynced some stuff over, and since then the thing has been idle 🤷

                  Sample of the currently active system … I think at time of arrival it was at 2+ TB written or something:

                  SMART/Health Information (NVMe Log 0x02)
                  Critical Warning:                   0x00
                  Temperature:                        37 Celsius
                  Available Spare:                    100%
                  Available Spare Threshold:          10%
                  Percentage Used:                    0%
                  Data Units Read:                    88,116,921 [45.1 TB]
                  Data Units Written:                 43,968,235 [22.5 TB]
                  Host Read Commands:                 689,015,212
                  Host Write Commands:                409,762,513
                  Controller Busy Time:               1,477
                  Power Cycles:                       4
                  Power On Hours:                     248
                  Unsafe Shutdowns:                   0
                  Media and Data Integrity Errors:    0
                  Error Information Log Entries:      0
                  Warning  Comp. Temperature Time:    0
                  Critical Comp. Temperature Time:    0
                  Temperature Sensor 1:               37 Celsius
                  Temperature Sensor 2:               46 Celsius
                  
                  Error Information (NVMe Log 0x01, 16 of 64 entries)
                  No Errors Logged
                  
            • Milan@discuss.tchncs.deOPM
              1 year ago

              Yes, it is federated. However, since there is no SSO on the Lemmy instance, you need to make a new account, just like you need separate accounts with different email providers. :) Also, it is a different federation protocol: Matrix vs ActivityPub.