I use some batch scripts in my Proxmox installation. They live in cron.hourly and cron.daily and check for viruses and the RAM/CPU load of my LXC containers. An email is sent when a condition is met.
What are your tips or solutions that avoid unnecessary load on disk I/O or CPU time? Let's keep it simple.
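For reference, a check like the one described can stay very cheap by reading /proc directly instead of spawning extra tools. Below is a minimal sketch, assuming a Linux host or container; the threshold values and function names are made up for illustration and are not taken from the scripts mentioned above.

```python
#!/usr/bin/env python3
"""Cheap load/memory check meant to run from cron (sketch, Linux only)."""
import os

# Hypothetical thresholds, chosen for illustration only.
LOAD_PER_CPU_LIMIT = 1.5   # 5-minute load average per CPU core
MEM_AVAILABLE_MIN = 0.10   # minimum fraction of RAM that should be available


def parse_loadavg(text):
    """Return the 5-minute load average from /proc/loadavg content."""
    return float(text.split()[1])


def parse_meminfo(text):
    """Return the fraction of memory available from /proc/meminfo content."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])  # value in kB
    return values["MemAvailable"] / values["MemTotal"]


def check(load5, cpus, mem_fraction):
    """Return a list of human-readable warnings; empty means all is fine."""
    warnings = []
    if load5 / cpus > LOAD_PER_CPU_LIMIT:
        warnings.append(f"high load: {load5:.2f} on {cpus} CPUs")
    if mem_fraction < MEM_AVAILABLE_MIN:
        warnings.append(f"low memory: {mem_fraction:.0%} available")
    return warnings


def main():
    # Two small reads from /proc; no subprocesses, no disk scans.
    with open("/proc/loadavg") as f:
        load5 = parse_loadavg(f.read())
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    for warning in check(load5, os.cpu_count() or 1, mem):
        print(warning)


if __name__ == "__main__":
    main()
```

The script prints only when a threshold is exceeded. Cron normally mails any output of a job to the address in MAILTO (if mail delivery is set up), so a silent run means no email.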
I was kind of the same, but I still collected metrics, because I just love graphs.
Over time I ended up setting alerts for failures I wish I had known about earlier. Some examples:
What do you use to collect these metrics?
I use Telegraf for most of the metrics.