How to Collect Kubernetes Node Metrics with Node-Exporter Using CronJobs

This article focuses on leveraging Kubernetes CronJobs to pull host metrics from all nodes in the cluster and save them as log files.

A few days ago I was asked to solve this problem as part of an interview process for an SRE position.

The problem is a nice way to brush up your monitoring skills, so I decided to share it with readers.

TL;DR

If you want a condensed version of this solution, head over to this GitHub repository here; it also includes all the code shown in this article.

Audience

This article is intended for Intermediate Kubernetes users with basic knowledge of Kubernetes Architecture, Kubernetes Networking, Containerization and Basic Linux Administration.

The article assumes that the readers can access remote VMs using SSH.

Some general knowledge of how applications are developed will also help with much of the terminology used here.

I also assume the reader understands the difference between push-based and pull-based metrics collection architectures.

Basic prior experience with observability/logging/monitoring stacks such as ELK, EFK, or Prometheus-Loki-Grafana is required.

Problem statement

Create a Kubernetes CronJob that pulls node metrics (CPU, memory, disk usage) and stores them in a file.

  • Every time the cron runs, it should create a new file.

  • The filename should have the current timestamp.

  • By default, the cron should run every minute, but the schedule should be configurable with minimal code changes.

  • Use a tool of your choice to collect and expose metrics; node-exporter is preferred.

  • The instances involved here are Kubernetes nodes themselves.

Expected Output:

  • The actual program code that pulls the metrics and writes them to a file.

  • The Dockerfile to containerize the code.

  • Kubernetes YAML manifests or a Helm chart.

  • A README file explaining the design, deployment and other details.

Note:

  • Pick the language of your choice to write the cron job; Bash is preferred.

  • Treat the output files as essential; they should be retained across pod restarts.

  • Deployment can be plain Kubernetes YAML, Helm charts, or Kustomize.

  • You can make necessary assumptions if required and document them.

  • Choose a local Kubernetes setup such as minikube or kind; another option is any cloud platform's Kubernetes flavour.

Prerequisites

  1. A K8s cluster

  2. An AWS S3 Bucket

  3. An AWS IAM User with Access Keys configured

I am using a k3d cluster, a very lightweight Kubernetes distribution based on Rancher K3s that runs a K8s cluster inside Docker.

The article uses a k3d cluster, but the solution can be implemented on any Kubernetes distribution with minimal configuration changes.

I have the following configuration:

  1. 1 Master Node (Control Plane)

  2. 2 Worker Nodes

$ kubectl get nodes
NAME                        STATUS   ROLES                  AGE    VERSION
k3d-multicluster-server-0   Ready    control-plane,master   7h1m   v1.28.8+k3s1
k3d-multicluster-agent-1    Ready    <none>                 7h1m   v1.28.8+k3s1
k3d-multicluster-agent-0    Ready    <none>                 7h1m   v1.28.8+k3s1

The Solution Architecture

To tackle this, we need an agent that runs on every node and continuously gathers metrics, so I decided to go with node-exporter, which is designed to expose metrics from a Linux host.

Since the targets here are the Kubernetes nodes themselves, we deploy node-exporter as a DaemonSet, which runs a node-exporter Pod on every node.

To expose the endpoints, I created a Service, which makes them easy to discover and scrape from the CronJob Pods.

[Optional] I also added a ServiceMonitor using the Prometheus Operator so that this solution can later be integrated with Prometheus, but it is not required here.

The CronJob itself runs a simple Python script that scrapes metrics from every node-exporter Pod by calling its /metrics endpoint.

To save the metrics as log files, I attached a PVC (PersistentVolumeClaim) to the CronJob Pods.

We will also push these logs to an AWS S3 bucket as a backup.
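For reference, every run produces one new file per node, named with the node name and the timestamp of that run; here is a minimal illustration of the naming scheme used by the script in Step 3 (the node name below is just an example):

import datetime

# One log file per node per run, named <node>-<YYYYmmdd-HHMMSS>.log
# (this mirrors the naming used by the script in Step 3).
node_name = "k3d-multicluster-agent-0"   # example node name
timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
print(f"{node_name}-{timestamp}.log")     # e.g. k3d-multicluster-agent-0-20240508-201015.log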

Step 1: [Optional] Install Prometheus Operator

This step is only needed if you want to use this configuration alongside Prometheus later without the hassle of configuring things yourselves.

File: operator-values.yml

This is a Helm values file that ensures only the operator and its CRDs are installed:

defaultRules:
  create: false
alertmanager:
  enabled: false
grafana:
  enabled: false
kubeApiServer:
  enabled: false
kubelet:
  enabled: false
kubeControllerManager:
  enabled: false
coreDns:
  enabled: false
kubeEtcd:
  enabled: false
kubeScheduler:
  enabled: false
kubeStateMetrics:
  enabled: false
nodeExporter:
  enabled: false
prometheus:
  enabled: false

Install the chart with:

helm upgrade --install prometheus-operator prometheus-community/kube-prometheus-stack --namespace node-exporter --values operator-values.yml --create-namespace

This will install the following CRDs on your cluster:

$ kubectl get crds | grep monitoring.coreos.com 
alertmanagerconfigs.monitoring.coreos.com        2024-05-07T11:49:39Z
alertmanagers.monitoring.coreos.com              2024-05-07T11:49:40Z
podmonitors.monitoring.coreos.com                2024-05-07T11:49:40Z
probes.monitoring.coreos.com                     2024-05-07T11:49:40Z
prometheusagents.monitoring.coreos.com           2024-05-07T11:49:40Z
prometheuses.monitoring.coreos.com               2024-05-07T11:49:40Z
prometheusrules.monitoring.coreos.com            2024-05-07T11:49:40Z
scrapeconfigs.monitoring.coreos.com              2024-05-07T11:49:40Z
servicemonitors.monitoring.coreos.com            2024-05-07T11:49:40Z
thanosrulers.monitoring.coreos.com               2024-05-07T11:49:41Z

Step 2: Configure an S3 Bucket for Storing the Logs

We are using an S3 bucket to push the logs. It serves as a backup in the event of data corruption on the node holding the Persistent Volume (PV).

Configure an AWS IAM user to allow access to the S3 bucket via the Python Boto3 library.

You should also configure Access Keys for this IAM user and grant it sufficient permissions to access the bucket.
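Before wiring the credentials into the cluster, you may want to sanity-check them with a quick boto3 script like the one below (a rough sketch; the bucket name and region are placeholders, and the access keys are read from your environment):

import boto3
from botocore.exceptions import ClientError

# Placeholders: substitute your own bucket and region.
BUCKET = "my-node-metrics-bucket"
REGION = "us-east-1"

# boto3 reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment.
s3 = boto3.client("s3", region_name=REGION)
try:
    s3.head_bucket(Bucket=BUCKET)  # confirms the bucket exists and is reachable with these credentials
    s3.put_object(Bucket=BUCKET, Key="connectivity-check.txt", Body=b"ok")
    print("Credentials and bucket access look fine.")
except ClientError as e:
    print(f"S3 access check failed: {e}")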

Step 3: Writing the CronJob Code

I chose Python because the Kubernetes Python client integrates very well with the cluster API, without the hassle of configuring kubectl inside the container.

The code is available in this directory of the linked repository here.

I have also included tests for this code, which you can run.

from kubernetes import client, config
from dotenv import load_dotenv
import os, sys
import requests
from typing import List, Dict
import asyncio
import datetime
import logging
import boto3


load_dotenv()
LOGS_DIR = os.path.join(os.curdir, "logs") 
ENV=os.environ.get("ENV", "cluster")
SERVICE_NAME=os.environ.get("SERVICE_NAME", "node-exporter")
NAMESPACE=os.environ.get("NAMESPACE", "default")
PORT=os.environ.get("PORT", "9100")
S3_BUCKET=os.environ.get("S3_BUCKET", "")
AWS_DEFAULT_REGION=os.environ.get("AWS_DEFAULT_REGION", "")
AWS_SECRET_ACCESS_KEY=os.environ.get("AWS_SECRET_ACCESS_KEY", "")
AWS_ACCESS_KEY_ID=os.environ.get("AWS_ACCESS_KEY_ID", "")
logging.basicConfig(level = logging.INFO)
logger=logging.getLogger(__name__)
logger.setLevel(logging.INFO)

if ENV=="dev":
    config.load_kube_config()
    logger.info(ENV)
    logger.info(SERVICE_NAME)
    logger.info(NAMESPACE)
    logger.info(f"KUBECONFIG={os.environ.get('KUBECONFIG', 'not set')}")
    if os.environ.get("KUBECONFIG", "") and os.path.exists(os.environ.get("KUBECONFIG")):
        logger.info("KUBECONFIG FOUND")
    else:
        logger.critical("Missing Kubeconfig")
else:
    config.load_incluster_config()


v1 = client.CoreV1Api()

def get_endpoint_info(service_name: str, namespace: str) -> List[Dict[str, str]]:
    """
    This function retrieves endpoint IP and nodeNames for a given Service Name.

    Args:
        service_name: The name of the Kubernetes service.

    Returns:
        A dictionary containing endpoint IP and corresponding node names.
    """

    endpoints = v1.read_namespaced_endpoints(service_name, namespace)
    endpoints_info = []
    for subset in endpoints.subsets:
        for address in subset.addresses:
            ip = address.ip
            nodeName = address.node_name
            endpoints_info.append({"ip": ip, "node_name": nodeName})
            logger.info(f"[INFO]Discovered {nodeName} at {ip}")

    return endpoints_info

def __get_logs_data(ip: str, port: str) -> str:
    """
    Makes a GET request to the provided URL (http://{ip}:{port}/metrics) and returns the response.

    Args:
        ip (str): The IP address of the Node-Exporter Pod.
        port (str): The port number on which the Node-Exporter Pod is listening.

    Returns:
        str: The body of the /metrics response; the process exits if the request fails.
    """

    url = f"http://{ip}:{port}/metrics"
    try:
        response = requests.get(url, timeout=10)  # avoid hanging indefinitely on an unresponsive endpoint
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        logger.error(f"Error fetching metrics: {e}")
        sys.exit(1)

def upload_to_s3_bucket(bucket_name: str, filepath: str):
    if not bucket_name:
        return
    try:
        s3_resource = boto3.resource(
            's3', 
            region_name = AWS_DEFAULT_REGION, 
            aws_access_key_id = AWS_ACCESS_KEY_ID,
            aws_secret_access_key = AWS_SECRET_ACCESS_KEY
        )

        # Use a context manager so the file handle is closed after the upload
        with open(filepath, 'rb') as body:
            s3_resource.Bucket(bucket_name).put_object(
                Key = filepath,
                Body = body
            )
    except Exception as e:
        logger.fatal(f"Error while uploading to s3 {e}")
        sys.exit(1)


def fetch_and_store_logs_data(ip: str, node_name: str, port: str) -> None:
    """
    Fetch Logs and Write data to log files
    Args:
        ip (str): IP Address of the Node-Exporter Pod
        node_name (str): Name of K8s node
        port (str): The port number on which the Node-Exporter Pod is listening.
    Returns:
        None
    """
    data = __get_logs_data(ip, port)
    time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    try:
        filepath = os.path.join(LOGS_DIR, f"{node_name}-{time}.log")
        with open(filepath, "w+") as f:
            f.write(data)
        upload_to_s3_bucket(S3_BUCKET, filepath)        
    except Exception as e:
        logger.error(f"Failed to write or upload log file: {e}")
        sys.exit(1)

async def pull_logs_data(endpoints_info: List[Dict[str, str]]):
    loop = asyncio.get_running_loop()
    tasks = []
    for endpoint in endpoints_info:
        tasks.append(loop.run_in_executor(None, fetch_and_store_logs_data, endpoint["ip"], endpoint["node_name"], PORT))
    await asyncio.gather(*tasks)

if __name__ == '__main__':
    try:
        endpoints = get_endpoint_info(service_name=SERVICE_NAME, namespace=NAMESPACE)
        asyncio.run(pull_logs_data(endpoints))
    except Exception as e:
        logger.fatal(f"Fatal Error Occured: {e}")
        sys.exit(1)
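The tests themselves are not reproduced here, but below is a minimal sketch of what one such test might look like (using pytest and unittest.mock; the test name and fixtures are illustrative, and importing app assumes ENV=dev with a reachable kubeconfig, since the module loads the Kubernetes config at import time):

# test_app.py -- illustrative only; the real tests live in the linked repository.
import os
from unittest.mock import MagicMock, patch

os.environ["ENV"] = "dev"  # make app.py take the kubeconfig path at import
import app                 # assumes a valid kubeconfig is available


def test_get_endpoint_info_returns_ip_and_node_name():
    # Build a fake Endpoints object shaped like the Kubernetes client's response.
    address = MagicMock(ip="172.20.0.2", node_name="k3d-multicluster-agent-0")
    subset = MagicMock(addresses=[address])
    fake_endpoints = MagicMock(subsets=[subset])

    with patch.object(app.v1, "read_namespaced_endpoints", return_value=fake_endpoints):
        info = app.get_endpoint_info("node-exporter", "node-exporter")

    assert info == [{"ip": "172.20.0.2", "node_name": "k3d-multicluster-agent-0"}]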

Dockerfile for the Image

Let's Dockerize the code. We will make sure we do not use the root user for running the scripts inside the container.

FROM python:3.11.9-alpine3.19 as build


RUN mkdir -p /app/logs \
    && addgroup app \
    && adduser -D -G app -h /app app \
    && chown -R app:app /app

WORKDIR /app

USER app

COPY --chown=app:app requirements.txt .

RUN python -m pip install --no-cache-dir -r ./requirements.txt

COPY --chown=app:app ./app.py ./app.py

ENTRYPOINT [ "python", "app.py" ]

Step 4: Deploy Node Exporter

Step 4.1: Create a DaemonSet

A DaemonSet is a Kubernetes Controller that runs a Pod on all (or some) nodes in the Kubernetes Cluster.

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: node-exporter
  name: node-exporter
  namespace: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
      labels:
        app: node-exporter
    spec:
      containers:
      - args:
        - --web.listen-address=0.0.0.0:9100
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        image: quay.io/prometheus/node-exporter:v0.18.1
        imagePullPolicy: IfNotPresent
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          name: metrics
          protocol: TCP
        resources:
          limits:
            cpu: 200m
            memory: 50Mi
          requests:
            cpu: 100m
            memory: 30Mi
        volumeMounts:
        - mountPath: /host/proc
          name: proc
          readOnly: true
        - mountPath: /host/sys
          name: sys
          readOnly: true
      hostNetwork: true
      hostPID: true
      restartPolicy: Always
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - effect: NoExecute
        operator: Exists
      volumes:
      - hostPath:
          path: /proc
          type: ""
        name: proc
      - hostPath:
          path: /sys
          type: ""
        name: sys

On applying the manifest, you will see the following Pods in your cluster:

$ kubectl get po -n node-exporter
NAME                                                   READY   STATUS    RESTARTS       AGE
node-exporter-ssmnh                                    1/1     Running   3 (179m ago)   30h
node-exporter-rtjvz                                    1/1     Running   3 (179m ago)   30h
node-exporter-cfglb                                    1/1     Running   3 (179m ago)   30h
prometheus-operator-kube-p-operator-78b5f48bfd-zm88h   1/1     Running   3 (179m ago)   32h

Step 4.2: Create a Service

Why create a Service in the first place? Can't you access the Pods directly by their IP addresses inside the cluster? You would be partially correct.

All you really need are the endpoints to connect to the Pods in the DaemonSet.

The short answer is that a Service makes it easy to discover and manage those endpoints in the Kubernetes cluster. Here we will create a Service of type ClusterIP for the Pods.

---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: node-exporter
  name: node-exporter
  namespace: node-exporter
spec:
  ports:
  - name: node-exporter
    port: 9100
    protocol: TCP
    targetPort: 9100
  selector:
    app: node-exporter
  sessionAffinity: None
  type: ClusterIP

You can check the endpoints with this command:

$ kubectl get endpoints node-exporter -n node-exporter
NAME            ENDPOINTS                                         AGE
node-exporter   172.20.0.2:9100,172.20.0.3:9100,172.20.0.4:9100   4h55m

The CronJob does not need to access the Service itself. Rather, it has to scrape logs from each Pod that runs node-exporter, which is why we query the endpoints directly instead of the Service.

Each Pod exposes a /metrics endpoint, which can be tested by calling one of the endpoint IPs directly:

curl http://172.20.0.2:9100/metrics

The response is a plain-text document in the Prometheus exposition format, containing several hundred metric samples covering CPU, memory, disk, network and more.
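If you only want to eyeball a few metric families instead of the full dump, a small script like the following works when run from inside the cluster (the endpoint IP is one of the addresses listed above and will differ in your cluster):

import requests

# One of the endpoint IPs from `kubectl get endpoints node-exporter -n node-exporter`.
URL = "http://172.20.0.2:9100/metrics"

# A few metric families that cover the CPU / memory / disk requirement.
INTERESTING = (
    "node_cpu_seconds_total",
    "node_memory_MemAvailable_bytes",
    "node_filesystem_avail_bytes",
)

resp = requests.get(URL, timeout=10)
resp.raise_for_status()
for line in resp.text.splitlines():
    if line.startswith(INTERESTING):   # str.startswith accepts a tuple of prefixes
        print(line)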

Step 4.3: [Optional] Create a ServiceMonitor

This step is only needed if you have installed the Prometheus Operator as discussed above and later want to deploy Prometheus.

---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: node-exporter
    serviceMonitorSelector: node-exporter
  name: node-exporter
  namespace: node-exporter
spec:
  endpoints:
  - honorLabels: true
    interval: 30s
    path: /metrics
    targetPort: 9100
  jobLabel: node-exporter
  namespaceSelector:
    matchNames:
    - node-exporter
  selector:
    matchLabels:
      app: node-exporter

Step 5: Let's Deploy Our CronJob

Step 5.1: Create a Service Account for our CronJob Pods

We will create a ServiceAccount, a Role and a RoleBinding to allow read-only access to Endpoints objects in the namespace where the node-exporter Pods are running.

Service Account

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-exporter-cron-job-account
  namespace: node-exporter

Role

---          
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: node-exporter
  name: pull-logs-role
rules:
- apiGroups: [""] # Endpoints live in the core API group
  resources: ["endpoints"]
  verbs: ["get", "list", "watch"]

Role Binding

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: node-exporter-cron-job-account-controller
  namespace: node-exporter
subjects:
- kind: ServiceAccount
  name: node-exporter-cron-job-account
  namespace: node-exporter
  apiGroup: ""
roleRef:
  kind: Role
  name: pull-logs-role
  apiGroup: rbac.authorization.k8s.io

Step 5.2: Configure Secrets and Configurations

This way of configuring Secrets is not very secure; consider using a secrets manager in production.

Secrets

apiVersion: v1
kind: Secret
metadata:
  namespace: node-exporter
  name: node-exporter-s3-secrets
stringData: # accepts plain-text values (the data: field would require base64-encoded values)
  AWS_ACCESS_KEY_ID: <YOUR AWS ACCESS KEY>
  AWS_DEFAULT_REGION: <REGION WHERE BUCKET EXISTS >
  AWS_SECRET_ACCESS_KEY: <YOUR AWS SECRET ACCESS KEY>
  S3_BUCKET: <YOUR AWS S3 BUCKET>
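The manifest above uses stringData, which accepts plain-text values. If you prefer the data: field instead, each value must be base64-encoded first, for example:

import base64

# Secrets declared under data: must hold base64-encoded values;
# stringData: (used above) accepts plain text and encodes it for you.
value = "AKIAEXAMPLEKEY"  # placeholder, not a real key
encoded = base64.b64encode(value.encode()).decode()
print(encoded)            # paste this into the data: field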

ConfigMap

---
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: node-exporter
  name: node-exporter-cron-job-config
data:
  ENV: cluster
  NAMESPACE: node-exporter #Namespace where Node Exporter Pods are deployed
  PORT: "9100"
  SERVICE_NAME: node-exporter #Service which exposes endpoints of node-exporter pods

Step 5.3: Create the Persistent Volume and Persistent Volume Claims

We will use a PersistentVolumeClaim, backed by a dynamically provisioned PersistentVolume, to store our logs on a node.

Here I am using the local-path storage class (provisioner rancher.io/local-path) on k3d. Change this configuration according to the storage class/CSI driver you are using on your cluster.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  namespace: node-exporter
  name: node-exporter-cron-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: local-path

Step 5.4: Create the CronJob

The CronJob will schedule a Pod every 5 minutes, configured via the cron expression */5 * * * *. To change the interval (for example, to every minute, as the problem statement suggests by default), just edit the schedule field.

apiVersion: batch/v1
kind: CronJob
metadata:
  namespace: node-exporter
  name: node-exporter-cron
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 10
      template:
        spec:
          containers:
          - name: node-exporter-cron-job
            image: hmx098/node-exporter-scraper:latest
            imagePullPolicy: Always
            env:
              - name: ENV
                valueFrom:
                  configMapKeyRef:
                    key: ENV
                    name: node-exporter-cron-job-config
              - name: NAMESPACE
                valueFrom:
                  configMapKeyRef:
                    key: NAMESPACE
                    name: node-exporter-cron-job-config
              - name: PORT
                valueFrom:
                  configMapKeyRef:
                    key: PORT
                    name: node-exporter-cron-job-config
              - name: SERVICE_NAME
                valueFrom:
                  configMapKeyRef:
                    key: SERVICE_NAME
                    name: node-exporter-cron-job-config
              - name: AWS_ACCESS_KEY_ID
                valueFrom:
                  secretKeyRef:
                    key: AWS_ACCESS_KEY_ID
                    name: node-exporter-s3-secrets
              - name: AWS_SECRET_ACCESS_KEY
                valueFrom:
                  secretKeyRef:
                    key: AWS_SECRET_ACCESS_KEY
                    name: node-exporter-s3-secrets
              - name: AWS_DEFAULT_REGION
                valueFrom:
                  secretKeyRef:
                    key: AWS_DEFAULT_REGION
                    name: node-exporter-s3-secrets
              - name: S3_BUCKET
                valueFrom:
                  secretKeyRef:
                    key: S3_BUCKET
                    name: node-exporter-s3-secrets

            volumeMounts:
              - name: logs-volume
                mountPath: /app/logs

          volumes:
            - name: logs-volume
              persistentVolumeClaim:
                claimName: node-exporter-cron-pvc

          serviceAccountName: node-exporter-cron-job-account
          automountServiceAccountToken: true
          restartPolicy: OnFailure
          terminationGracePeriodSeconds: 10

Verify the deployment of the CronJob:

$ kubectl get cronjobs -n node-exporter
NAME                 SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
node-exporter-cron   */5 * * * *   False     0        115s            27m

Wait for the Pods to come up as scheduled by the CronJob; you will find that they exit after a successful run.

Accessing the PVs containing the Logs

Now we need to access these logs on our nodes. I am using k3d, so my nodes are Docker containers, but on production clusters the nodes are generally VMs into which you may need to SSH.

Determining the node where our Persistent Volume is provisioned

Execute the following commands to get the name of the node where the PV is provisioned.

Step 0: Get the name of the PV

$ kubectl get pvc -n node-exporter
NAME                     STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
node-exporter-cron-pvc   Pending                                     local-path     2s
$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                  STORAGECLASS            REASON   AGE
pvc-9f6288f3-75d2-4c4b-b39a-28d823ac2252   1Gi        RWO            Delete           Bound    node-exporter/node-exporter-cron-pvc   local-path

Step 1: Get the Node and the Mount Point of the Volume on the Node.

We can see that in this case the PV is provisioned on the control-plane node k3d-multicluster-server-0 under the directory /var/lib/rancher/k3s/storage/pvc-9f6288f3-75d2-4c4b-b39a-28d823ac2252_node-exporter_node-exporter-cron-pvc:

$ kubectl describe pv pvc-9f6288f3-75d2-4c4b-b39a-28d823ac2252 
Name:              pvc-9f6288f3-75d2-4c4b-b39a-28d823ac2252
Labels:            <none>
Annotations:       local.path.provisioner/selected-node: k3d-multicluster-server-0
                   pv.kubernetes.io/provisioned-by: rancher.io/local-path
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      local-path
Status:            Bound
Claim:             node-exporter/node-exporter-cron-pvc
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          1Gi
Node Affinity:     
  Required Terms:  
    Term 0:        kubernetes.io/hostname in [k3d-multicluster-server-0]
Message:           
Source:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rancher/k3s/storage/pvc-9f6288f3-75d2-4c4b-b39a-28d823ac2252_node-exporter_node-exporter-cron-pvc
    HostPathType:  DirectoryOrCreate
Events:            <none>
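If you prefer to do this lookup programmatically, the same kubernetes Python client used by the CronJob code can read the PV; here is a small sketch (the PV name is the one from the previous step, and the annotation/field names are those shown in the describe output above):

from kubernetes import client, config

config.load_kube_config()  # run from your workstation with access to the cluster
v1 = client.CoreV1Api()

pv_name = "pvc-9f6288f3-75d2-4c4b-b39a-28d823ac2252"  # from `kubectl get pv`
pv = v1.read_persistent_volume(pv_name)

# local-path PVs record the chosen node in an annotation and the data directory in hostPath.
node = pv.metadata.annotations.get("local.path.provisioner/selected-node")
path = pv.spec.host_path.path if pv.spec.host_path else None
print(f"node: {node}")
print(f"path: {path}")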

Step 2: SSH into the node (in my case the k3d node is a Docker container, so I had to use docker exec instead).

/ # cd /var/lib/rancher/k3s/storage/
/var/lib/rancher/k3s/storage # ls
pvc-9f6288f3-75d2-4c4b-b39a-28d823ac2252_node-exporter_node-exporter-cron-pvc
/var/lib/rancher/k3s/storage # cd pvc-9f6288f3-75d2-4c4b-b39a-28d823ac2252_node-exporter_node-exporter-cron-pvc/
/var/lib/rancher/k3s/storage/pvc-9f6288f3-75d2-4c4b-b39a-28d823ac2252_node-exporter_node-exporter-cron-pvc # ls
k3d-multicluster-agent-0-20240508-201015.log  k3d-multicluster-agent-1-20240508-202012.log
k3d-multicluster-agent-0-20240508-201507.log  k3d-multicluster-agent-1-20240508-202506.log
k3d-multicluster-agent-0-20240508-202012.log  k3d-multicluster-server-0-20240508-201015.log
k3d-multicluster-agent-0-20240508-202505.log  k3d-multicluster-server-0-20240508-201507.log
k3d-multicluster-agent-1-20240508-201015.log  k3d-multicluster-server-0-20240508-202012.log
k3d-multicluster-agent-1-20240508-201507.log  k3d-multicluster-server-0-20240508-202505.log

Step 3: Let's access the logs

cat k3d-multicluster-agent-0-20240508-201015.log | less

Output:

Viewing Logs

Accessing the Logs inside the S3 Console

Viewing Logs In S3 Console

Issues with this architecture

Although this architecture has an S3 backup configured, it still has single-point-of-failure problems that need to be addressed.

  • You can get rate-limited by S3 (although hitting the current limit of 3,500 PUT/COPY/POST/DELETE requests per second per prefix is difficult, let's consider the hypothetical case) if your CronJob runs at very short intervals and uploads lots of small files.

  • At the same time, keep in mind that a Persistent Volume backed by local storage, once provisioned on a node, is bound to that node, unless you use network-attached storage on an on-prem cluster or something like Amazon EBS/EFS on EKS, GCE Persistent Disk on GKE, and so on.

This means that if the data on that node is destroyed, the logs for all nodes are lost, which defeats the very purpose of logging and metrics collection.

This can be made more resilient using the following approaches:

  1. On EKS, prefer the EBS/EFS CSI drivers, which can provision EBS/EFS volumes that can be attached to other EC2 instances; logs can then be accessed via EFS mounts for EFS volumes, and via multi-attach with EC2 instances for EBS volumes.

  2. Similarly, on GKE, prefer the GCE Persistent Disk CSI driver.

  3. On on-prem Kubernetes, use NFS mounts.

  4. Another option for HA is to use distributed-storage CSI drivers such as those provided by OpenEBS and Longhorn.