As in, when I watched YouTube tutorials, I often see YouTubers have a small widget on their desktop giving them an overview of their ram usage, security level, etc. What apps do you all use to track this?
The fastest way? Probably netdata
This. If you have more servers, you can also connect them all to a single UI where you can see all the info at once, with Netdata Cloud.
Just set this up yesterday. I used a parent node and then have all my vms point to that. Took like an hour to figure it out
Hey, did you use the cloud functionality or not? I’m trying to go all-local with a parent-child kind of capability, but so far I’ve been unable to.
The parent is still visible to the cloud portal. My understanding is the data all resides locally, but when you log in to their cloud portal, it connects to the parent to display the information. I’m still playing with it to confirm. My parent node shows all the child nodes on the local interface, but the cloud shows them all too.
I don’t know if I’ll keep running this. The child nodes are already complaining about increased write delays since I installed the agents on them.
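For anyone replicating the parent/child setup: streaming is configured in Netdata’s `stream.conf` on both sides. A minimal sketch, with a placeholder API key you’d generate yourself (e.g. with `uuidgen`) and a placeholder parent hostname:

```ini
# /etc/netdata/stream.conf on each CHILD
[stream]
    enabled = yes
    destination = parent.lan:19999
    api key = 11111111-2222-3333-4444-555555555555

# /etc/netdata/stream.conf on the PARENT
# (section name is the same API key, accepting those children)
[11111111-2222-3333-4444-555555555555]
    enabled = yes
```

On recent Netdata versions you can also reduce local disk writes on the children by setting `mode = ram` (or `none`) in the `[db]` section of their `netdata.conf`, since the parent keeps the history - which may help with the write-delay complaints above.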
I’ll look into this too. Thank you.
agreed … BY FAR the fastest. Easiest learning curve as well
Checkmk (Raw - the free version). Some setup aspects are a bit annoying (it wants to monitor every last ZFS dataset, and it takes too long to ‘ignore’ them one by one), but it does alert me to things that could cause issues, like the boot partition being almost full. I run it in a Docker container on my (primarily) file server.
I use this as well! Works well and has built-in intelligence for thresholds.
Netdata, I’ve meant to look into Grafana but it always seemed way too overcomplicated and heavy for my purposes. Maybe one day, though…
I thought the same thing, but it’s actually not bad; there are some pre-built dashboards you can import for common metrics from Linux, Windows, firewalls, etc.
netdata is much better though (IMHO)
I know that it needs a fix when my dad complains that he can’t watch TV and the rolling door doesn’t open in the morning.
I currently use the classic “Huh, seems slow; check basic things like disk usage and process CPU/RAM usage, then do a reboot to fix it for now.”
This is me. Can’t hurt to just do a reboot
Windows Server? )
I use btop, I use arch btw
Prometheus and grafana
Oh lord, I have so much info to give! For the setup, it’s running on Kubernetes 1.28.2, so YMMV. My monitoring stack is:
- Grafana – Dashboards
- Alertmanager – Alerting
- Prometheus – Time series Database
- Loki – Logs database
- Promtail – Log collector
- Mimir – Long-term metrics & logs storage
- Tempo – Datadog APM, but with Grafana; lets you track requests through a network of services. Invaluable for linking your reverse proxy to your apps, to your SSO, to your database…
- SMTP Relay – A homemade SMTP relay that eases setting up mail alerts, allows me to push mail through mailjet using my domain
- Node-exporter – exports metrics for the server
- Exportarr – exports metrics for sonarr/radarr etc
- pihole-exporter – exports pihole metrics for prometheus scraping
- smart-exporter – exports S.M.A.R.T metrics (for HDD health)
- ntfy – for notifications to my phone (other than mail)
The rest is pretty much the same: if the service exports Prometheus metrics by default, I use that and write a `ServiceMonitor` and a `Service` manifest for it. It usually looks like this:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: traefik
  labels:
    app.kubernetes.io/component: traefik
    app.kubernetes.io/instance: traefik
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: traefik
    app.kubernetes.io/part-of: traefik
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: traefik-metrics
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scheme: http
      tlsConfig:
        insecureSkipVerify: true
  namespaceSelector:
    matchNames:
      - traefik
---
apiVersion: v1
kind: Service
metadata:
  name: traefik-metrics
  namespace: traefik
  labels:
    app.kubernetes.io/name: traefik-metrics
spec:
  type: ClusterIP
  ports:
    - protocol: TCP
      name: metrics
      port: 8082
  selector:
    app.kubernetes.io/name: traefik
```
If the app doesn’t include a Prometheus endpoint, I just find an existing exporter for it; most popular apps have one, plus ready-made Grafana dashboards.
For alerting, I create a `PrometheusRule` object with the Prometheus query and the message to alert me (depending on the severity, it’s either a mail for low-to-medium severity incidents or a phone notification for high severity). I try to keep mails/notifications to a minimum: just alerts on load, CPU, RAM, and potential SMART errors.
I use Netdata for both dashboards and alerts. Works great and is easy to set up.
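Going back to the `PrometheusRule` approach mentioned above, here is a minimal sketch of such an object for a high-CPU alert. The names, labels, and threshold are illustrative, not taken from the poster’s setup:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-alerts
  labels:
    release: prometheus   # must match your Prometheus ruleSelector
spec:
  groups:
    - name: node.rules
      rules:
        - alert: HighCpuLoad
          # busy fraction = 1 - idle fraction, averaged per host
          expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
          for: 10m
          labels:
            severity: high
          annotations:
            summary: "CPU above 90% on {{ $labels.instance }} for 10 minutes"
```

The `severity` label is what Alertmanager routing can then use to decide between mail and phone notification.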
It’s not well liked, but I use Nagios Core for alerts and jump to Grafana, which has data in Prometheus, InfluxDB, and MySQL backends, for trends like CPU usage, hard drive temps, etc.
Alerts are much more important than fancy dashboards. You won’t be staring at your dashboard 24/7 and you probably won’t be staring at it when bad things happen.
Creating your alert set is not easy. Ideally, every problem you encounter should be preceded by a corresponding alert, and no alert should be a false positive (i.e. require no action). So if you either hit a problem without being alerted by your monitoring, or get an alert that requires no action, you should sit down and think carefully about what should change in your alerts.
As for tools - I recommend Prometheus+Grafana. No need for a separate Alertmanager, as many guides recommend; recent versions of Grafana have excellent built-in alerting. Don’t use those ready-made dashboards; start from scratch - you need to understand PromQL to set everything up efficiently. Start with a simple dashboard (and alerts!) just for generic server health (node exporter), then add exporters for your specific services, network devices (SNMP), remote hosts (blackbox), SSL certs, etc. Then write your own exporters for what you haven’t found :)
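To make that node exporter starting point concrete, here are a few common PromQL expressions for generic server health. The metric names are the standard node_exporter ones; the thresholds are just examples:

```promql
# CPU busy fraction per host (1.0 = 100% busy)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Available memory as a fraction of total
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Filesystem space below 10% left - a typical alert condition
node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
  / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
```

Each of these works both as a dashboard panel query and, with a threshold comparison, as an alert condition.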
When you’ve got a lot of variables, especially when dealing with a distributed system, that importance leans the other way. Visualization and analytics are practically required to debug and tune large systems.
I was looking at Loki+Grafana. Is Prometheus a replacement for Loki in this setup, and is it preferred?
No, they serve different purposes. Loki is for logs, Prometheus is for metrics. Grafana helps to visualize data from both.
What about InfluxDB? I hear that mentioned around Grafana a lot.
InfluxDB is just a storage. If you have a service that saves metrics to InfluxDB (IIRC, Proxmox can do that), Grafana can read it from there. Grafana can aggregate data from many sources, Prometheus+Loki+InfluxDB+even queries to arbitrary JSON APIs etc.
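On the “Grafana can aggregate many sources” point: data sources can also be provisioned declaratively instead of clicked together in the UI. A minimal sketch, where the names and URLs are placeholders for your own services:

```yaml
# /etc/grafana/provisioning/datasources/sources.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki:3100
  - name: InfluxDB
    type: influxdb
    url: http://influxdb:8086
    database: metrics   # v1-style setting; InfluxDB 2.x uses org/token fields instead
```

Grafana reads this directory at startup, which also makes the data source setup reproducible.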
Thank you for this. I think I need a deeper understanding of Prometheus. I’ll look into it. You are awesome
Good luck, if you get into it, you’ll be unable to stop. Perfecting your monitoring system is a kind of mania :)
One more piece of advice, for another kind of monitoring. When you are installing or configuring something on your server, it’s handy if you can monitor its resource usage in real time. That’s why I use MobaXterm as my terminal program. It has many drawbacks, and competitors such as XShell, RoyalTS or Tabby look better in many ways… but it has one killer feature: it shows a status bar with the current server load (CPU, RAM, disk usage, traffic) right below your SSH session, so you don’t have to switch to another window to see the effect of your actions. Has saved me a lot of potential headaches.
One thing about using Prometheus alerting is that it’s one less link in the chain that can break, and you can also keep your alerting configs in source control. So it’s a little less “click-ops,” but easier to reproduce if you need to rebuild it at a later date.
When you have several Prometheus instances (HA or in different datacenters), setting up separate AlertManagers for each of them is a good idea. But as OP is only beginning his journey to monitoring, I guess he will be setting up a single server with both Prometheus and Grafana on it. In this scenario a separate AlertManager doesn’t add reliability, but adds complexity.
As for source control, you can write a simple script using Grafana API to export alert rules (and dashboards as well) and push them to git. Not ideal, sure, but it will work.
Anyway, it’s never too late to go further and add AlertManager, Loki, Mimir and whatever else. But to flatten the learning curve I’d recommend starting with Grafana alerts that are much more user-friendly.
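A sketch of that export-to-git idea using the Grafana HTTP API. The `/api/search` and `/api/dashboards/uid/...` endpoints are the standard Grafana ones; the environment variable names and output directory are placeholders of my own:

```python
#!/usr/bin/env python3
"""Sketch: export Grafana dashboards to JSON files for committing to git.

Assumes a Grafana API token in GRAFANA_TOKEN and the base URL in GRAFANA_URL.
"""
import json
import os
import re
import urllib.request


def safe_filename(title: str) -> str:
    """Turn a dashboard title into a git-friendly filename."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{slug or 'dashboard'}.json"


def api_get(base_url: str, token: str, path: str):
    """GET a Grafana API path and decode the JSON response."""
    req = urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def export_dashboards(base_url: str, token: str, out_dir: str = "dashboards"):
    """Write every dashboard's JSON model to out_dir, one file per dashboard."""
    os.makedirs(out_dir, exist_ok=True)
    for item in api_get(base_url, token, "/api/search?type=dash-db"):
        dash = api_get(base_url, token, f"/api/dashboards/uid/{item['uid']}")
        path = os.path.join(out_dir, safe_filename(item["title"]))
        with open(path, "w") as f:
            json.dump(dash["dashboard"], f, indent=2, sort_keys=True)


if __name__ == "__main__":
    export_dashboards(os.environ["GRAFANA_URL"], os.environ["GRAFANA_TOKEN"])
```

Run it from cron or CI and follow with `git add`/`git commit` to get the version history. Not ideal, as said above, but it works.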
Alerts are much more important than fancy dashboards.
It depends. If you have to install a lot of stuff or manage a lot of things, it’s a good idea to have one; but if you mainly do maintenance and want something reliable, then yes, you should have alerts. For example, I don’t have a lot of things installed and don’t really care about reliability, so I do everything in the terminal. I use arch btw
Rainmeter if it’s directly on their desktop/background.
I use Uptime Kuma to monitor particular services and NetData for server performance. I then pipe the alerts through to Pushover
Grafana. Have alerts set up and get data with node exporter and cadvisor with some other containers giving some metrics.
I have alerts set up and they just ping me on a Discord server I set up: high CPU, high temps, low disk space, memory, things like that. Mostly I get high CPU or temp alerts, and that’s usually when Plex does its automated things at 4am.
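For anyone wanting the same Discord route: pushing an alert into a channel is just an HTTP POST of a JSON payload to a webhook URL. A minimal sketch, where the webhook URL is a placeholder you create under your server’s integration settings:

```python
import json
import urllib.request

# Placeholder - create one under Server Settings -> Integrations -> Webhooks
WEBHOOK_URL = "https://discord.com/api/webhooks/<id>/<token>"


def build_payload(host: str, metric: str, value: str) -> dict:
    """Format an alert as a Discord message payload."""
    return {"content": f":warning: **{host}**: {metric} is {value}"}


def send_alert(host: str, metric: str, value: str) -> None:
    """POST the alert to the Discord webhook."""
    data = json.dumps(build_payload(host, metric, value)).encode()
    req = urllib.request.Request(
        WEBHOOK_URL, data=data, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)  # Discord returns 204 No Content on success
```

Most alerting tools (Grafana, Alertmanager, Netdata) can also hit such a webhook directly, so a hand-rolled script like this is only needed for custom checks.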
Prometheus + Grafana, the same I use at my job.