Table of contents
- Story:
- Prerequisites:
- Collect the log file from Cron job
- Custom Metric: Setting Up the Textfile Collector
- Configure Cron Jobs
- Verify Custom Metrics
- Create a dashboard
- Set up an email notification for alerts
- Setting up Grafana Gmail SMTP settings
- Prometheus Alert Rule for Cron Jobs
- Grafana Alert Rule for Cron Jobs
- Testing the Setup
Story:
It all started when I decided to monitor my cron jobs and set up email alerts using Grafana and Prometheus. Everything seemed smooth until I encountered an unexpected hurdle—my alerts weren’t going through. After hours of troubleshooting, I realized my system couldn’t connect to the internet. Debugging network issues felt like unravelling a mystery, from checking DNS settings to fixing a broken gateway configuration. The moment I sent a successful test email was incredibly rewarding—it proved that persistence pays off. This journey taught me not just technical skills but also the value of problem-solving and patience.
Prerequisites:
Install Prometheus
Install Grafana
Install Node Exporter
Collect the log file from Cron job
Prepare Log Directory and Metrics Script
Create a Log Directory & provide the appropriate permissions: This directory will store the logs for your cron jobs.
sudo mkdir -p /var/log/cron_jobs
sudo chmod 777 /var/log/cron_jobs
Create the Metrics Script: Create a script to monitor cron jobs:
sudo vi /usr/local/bin/cron_job_metrics.sh
Add the following content:
#!/bin/bash

log_dir="/var/log/cron_jobs"
metric_file="/var/lib/node_exporter/textfile_collector/cron_jobs.prom"

# Clear or create the metric file
> "$metric_file"

# List of cron jobs to monitor
declare -A cron_jobs=(
  ["job1"]="/path/to/your_script1.sh"
  ["job2"]="/path/to/your_script2.sh"
)

for job in "${!cron_jobs[@]}"; do
  log_file="$log_dir/${job}.log"
  if [ ! -f "$log_file" ]; then
    echo "cron_job_status{job=\"$job\"} 0" >> "$metric_file"  # Log file missing
  elif tail -n 1 "$log_file" | grep -q "error"; then
    echo "cron_job_status{job=\"$job\"} 0" >> "$metric_file"  # Last run had an error
  else
    echo "cron_job_status{job=\"$job\"} 1" >> "$metric_file"  # Last run was successful
  fi
done
Make the script executable:
sudo chmod +x /usr/local/bin/cron_job_metrics.sh
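Node Exporter's documentation recommends writing textfile-collector files atomically: write to a temporary file in the same directory, then mv it over the real one. The script above truncates cron_jobs.prom first and then appends line by line, so a scrape that lands mid-run can see only some of the jobs. Below is a minimal sketch of the same checks with an atomic write and HELP/TYPE lines added. The function name write_cron_metrics and its two optional arguments are my own hypothetical additions, and job1/job2 are the same placeholder jobs as above.

```shell
#!/bin/bash
# Sketch: same checks as cron_job_metrics.sh, but the .prom file is replaced atomically.
# write_cron_metrics is a hypothetical helper; job1/job2 are placeholder job names.
write_cron_metrics() {
  local log_dir="${1:-/var/log/cron_jobs}"
  local metric_file="${2:-/var/lib/node_exporter/textfile_collector/cron_jobs.prom}"
  # The temp file sits in the same directory, so mv is a simple rename.
  # Its name does not end in .prom, so the textfile collector ignores it.
  local tmp_file="${metric_file}.$$"
  local -a jobs=(job1 job2)
  local job log_file status

  {
    echo "# HELP cron_job_status 1 if the last line of the job's log has no error, else 0."
    echo "# TYPE cron_job_status gauge"
    for job in "${jobs[@]}"; do
      log_file="$log_dir/${job}.log"
      if [ ! -f "$log_file" ]; then
        status=0   # log file missing
      elif tail -n 1 "$log_file" | grep -q "error"; then
        status=0   # last logged line mentions an error
      else
        status=1   # last run looks successful
      fi
      echo "cron_job_status{job=\"$job\"} $status"
    done
  } > "$tmp_file"

  # Swap the finished file into place in one step.
  mv "$tmp_file" "$metric_file"
}
```

Calling write_cron_metrics with no arguments uses the same paths as cron_job_metrics.sh.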
So far, we have set up the collection of cron job logs into a .prom file for Node Exporter. Now let’s tell Node Exporter to collect these custom metrics.
Custom Metric: Setting Up the Textfile Collector
Textfile collector:
The textfile collector is a feature of the Prometheus Node Exporter that lets you add custom metrics by writing them to files. Node Exporter reads the metrics from every file ending in .prom in the configured directory and exposes them to Prometheus alongside its own metrics. Read more about it here: https://www.robustperception.io/monitoring-directory-sizes-with-the-textfile-collector/#more-1230
Steps to configure the textfile collector in Node Exporter:
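Create a directory for custom metrics and change its ownership to the user that runs the Prometheus services, in my case prometheus:
sudo mkdir -p /var/lib/node_exporter/textfile_collector
sudo chown -R prometheus:prometheus /var/lib/node_exporter/textfile_collector
Node Exporter only needs read access to this directory, but the user whose crontab runs cron_job_metrics.sh needs write access, so make sure that user can write here (or run the script from root's crontab).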
Edit the Node Exporter service file and add the textfile directory flag to its ExecStart command:
--collector.textfile.directory=/var/lib/node_exporter/textfile_collector
Create a file named cron_jobs.prom in that directory; this is where our metrics will be written. If you skip this step, don’t worry: our script will create it.
After making the changes in the service file, reload the systemd daemon, restart Node Exporter, and check its status:
sudo systemctl daemon-reload
sudo systemctl restart node_exporter
sudo systemctl status node_exporter
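If you would rather not edit the installed unit file directly, a systemd drop-in override is one alternative. The sketch below writes one; it assumes the unit is called node_exporter.service and the binary lives at /usr/local/bin/node_exporter, so adjust both to your installation. Because the override replaces the whole ExecStart line, copy over any other flags your original unit passes to Node Exporter. The helper name write_node_exporter_dropin is hypothetical.

```shell
#!/bin/bash
# Sketch: write a systemd drop-in that adds the textfile collector flag.
# ASSUMPTIONS: unit name node_exporter.service, binary at /usr/local/bin/node_exporter.
write_node_exporter_dropin() {
  local dropin_dir="${1:-/etc/systemd/system/node_exporter.service.d}"
  local textfile_dir="${2:-/var/lib/node_exporter/textfile_collector}"
  mkdir -p "$dropin_dir"
  # The empty ExecStart= clears the command inherited from the main unit;
  # without it, systemd refuses a second ExecStart= for a non-oneshot service.
  cat > "$dropin_dir/textfile.conf" <<EOF
[Service]
ExecStart=
ExecStart=/usr/local/bin/node_exporter --collector.textfile.directory=$textfile_dir
EOF
}
```

Run it as root, then run sudo systemctl daemon-reload and sudo systemctl restart node_exporter. sudo systemctl edit node_exporter opens the same kind of drop-in interactively if you prefer an editor.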
Configure Cron Jobs
For simplicity, I am creating 2 scripts which will run every minute.
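script1.sh
#!/bin/bash
echo "This is from script1"
echo $(date)
script2.sh
#!/bin/bash
echo "this is from script 2"
echo $(date)
echo "error"
Make both scripts executable so cron can run them: chmod +x script1.sh script2.sh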
Now that the scripts are ready, let’s create the cron jobs.
Set Up Cron Jobs with Logging:
Edit your crontab:
crontab -e
When creating the cron jobs, make sure you use the correct path to each script file, and append each job's output to /var/log/cron_jobs/<job name>.log (for example job1.log), because that is the file the metrics script reads. Also note that I am adding cron_job_metrics.sh to the same crontab so that it collects the logs and writes them into the .prom file every minute. If you need help writing cron expressions, visit https://crontab.guru/ or ask ChatGPT.
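The crontab itself is not reproduced above, so here is one possible layout, as a sketch. It assumes the placeholder script paths from cron_job_metrics.sh and appends each job's stdout and stderr to /var/log/cron_jobs/<job>.log, which is where the metrics script looks. The print_cron_entries helper is hypothetical; it only prints the lines and does not touch your crontab.

```shell
#!/bin/bash
# Sketch: print crontab lines matching the log layout cron_job_metrics.sh expects.
# ASSUMPTIONS: placeholder script paths; logs go to <log_dir>/<job>.log.
# All three entries fire in the same minute, so the metrics script may read a log
# before that minute's run has written to it; the status then reflects the previous run.
print_cron_entries() {
  local log_dir="${1:-/var/log/cron_jobs}"
  cat <<EOF
* * * * * /path/to/your_script1.sh >> $log_dir/job1.log 2>&1
* * * * * /path/to/your_script2.sh >> $log_dir/job2.log 2>&1
* * * * * /usr/local/bin/cron_job_metrics.sh
EOF
}
```

To append these lines to your current crontab: { crontab -l 2>/dev/null; print_cron_entries; } | crontab -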
Now let’s check the logs and metrics in their respective files.
Logs: the logs are being written to /var/log/cron_jobs/
Metrics:
cat /var/lib/node_exporter/textfile_collector/cron_jobs.prom
So we are receiving the metrics successfully.
Verify Custom Metrics
To verify the custom metric, visit localhost:9100/metrics (9100 is Node Exporter's default port) and search for the metric we created.
We can also query the same metric in Prometheus by opening localhost:9090 in the browser.
job2 has the value 0 because we added an echo "error" statement to script2, and that line ends up as the last line of the job2.log file. As per the cron_job_metrics.sh script, if the last line of a job's log contains "error", the cron_job_status metric for that job is 0.
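To spot failing jobs from the command line instead of the browser, a small filter over Node Exporter's text output is enough. This sketch (failing_cron_jobs is a hypothetical name) reads Prometheus text format on stdin and prints the job label of every cron_job_status sample whose value is 0. It expects the label exactly as the script writes it, so it is meant for Node Exporter's /metrics page rather than Prometheus query results.

```shell
#!/bin/bash
# Sketch: list cron jobs whose cron_job_status sample is 0.
# Input: Prometheus text exposition format on stdin (e.g. Node Exporter's /metrics).
failing_cron_jobs() {
  # Keep only cron_job_status samples (this drops # HELP / # TYPE lines),
  # then, for each sample with value 0, print the value of its job="..." label.
  grep '^cron_job_status{' |
    awk '$2 == 0 { match($1, /job="[^"]*"/); print substr($1, RSTART + 5, RLENGTH - 6) }'
}
```

For example, curl -s http://localhost:9100/metrics | failing_cron_jobs should print job2 with the two example scripts above.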
Create a dashboard
- Create a dashboard by selecting Prometheus as the data source and cron_job_status as the metric.
Set up an email notification for alerts
To set up email alerts, follow the steps below.
To send the alert emails we are using Gmail’s SMTP server, so first we need to create an app password for our Gmail account. A good guide to creating a Gmail app password is this article: https://itsupport.umd.edu/itsupport?id=kb_article_view&sysparm_article=KB0015112 Follow the steps it describes. I have created one named grafana-alert.
Setting up Grafana Gmail SMTP settings
Now that we have an app password, let’s modify the Grafana configuration file:
sudo vi /etc/grafana/grafana.ini
[smtp]
enabled = true
host = smtp.gmail.com:587
user = sachin.bmp@gmail.com
# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
password = <your gmail app password>
skip_verify = true
from_address = admin@grafana.localhost
from_name = Grafana
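Note: Gmail shows the app password with spaces between groups of characters; remove those spaces before pasting it here.
After saving the file, restart Grafana so the SMTP settings take effect:
sudo systemctl restart grafana-server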
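If you prefer not to hand-edit the password, the sketch below prints an [smtp] section with all whitespace removed from the app password and, following the comment in the sample config, wraps it in triple quotes when it contains # or ;. The render_smtp_section helper is hypothetical. It mirrors the settings above except that it leaves out skip_verify = true, which turns off TLS certificate checks and should not be needed for smtp.gmail.com.

```shell
#!/bin/bash
# Sketch: print a grafana.ini [smtp] section for Gmail.
# Arguments: $1 = Gmail address, $2 = app password as Gmail displays it.
render_smtp_section() {
  local user="$1"
  local password
  # Strip the spaces Gmail shows between the groups of the app password.
  password=$(printf '%s' "$2" | tr -d '[:space:]')
  # grafana.ini treats # and ; as comment markers, so wrap such passwords in triple quotes.
  case "$password" in
    *'#'*|*';'*) password="\"\"\"${password}\"\"\"" ;;
  esac
  cat <<EOF
[smtp]
enabled = true
host = smtp.gmail.com:587
user = $user
password = $password
from_address = admin@grafana.localhost
from_name = Grafana
EOF
}
```

For example, render_smtp_section you@gmail.com 'abcd efgh ijkl mnop' prints a block you can paste over the [smtp] section of /etc/grafana/grafana.ini.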
Prometheus Alert Rule for Cron Jobs
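Update Prometheus Alert Rules
Edit your alert rules file to include a rule that applies to all cron_job_status metrics whose value equals 0:
sudo vi /etc/prometheus/alert_rules.yml
The rules file must also be listed under rule_files in /etc/prometheus/prometheus.yml; otherwise Prometheus will not load it.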
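groups:
  - name: cron-job-alerts
    rules:
      - alert: CronJobFailure
        expr: cron_job_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cron Job Failure Detected"
          description: |
            The cron job '{{ $labels.exported_job }}' has failed. Status: {{ $value }}. Please investigate.
Prometheus attaches its own job label (the job_name from your scrape config) to every scraped series, so the job label written by cron_job_metrics.sh conflicts with it and is stored as exported_job, unless honor_labels: true is set for that scrape job. That is why the description uses {{ $labels.exported_job }} to name the failing cron job.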
Reload Prometheus Configuration
After saving the updated rule file, restart Prometheus:
sudo systemctl restart prometheus
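Prometheus only evaluates rule files that prometheus.yml lists under rule_files. The sketch below (ensure_rule_file is a hypothetical helper, and /etc/prometheus/alert_rules.yml is an assumed path) checks for an uncommented reference and appends a rule_files block only when the key is missing entirely. If the key already exists, as it does in the default prometheus.yml, it asks you to add the entry by hand rather than rewriting YAML with shell tools.

```shell
#!/bin/bash
# Sketch: make sure prometheus.yml loads the alert rules file.
# ASSUMPTION: rules live in /etc/prometheus/alert_rules.yml.
ensure_rule_file() {
  local config="${1:-/etc/prometheus/prometheus.yml}"
  local rules="${2:-/etc/prometheus/alert_rules.yml}"
  # Ignore commented-out lines when looking for an existing reference.
  if grep -v '^[[:space:]]*#' "$config" | grep -qF "$rules"; then
    echo "already referenced: $rules"
  elif grep -q '^rule_files:' "$config"; then
    # A rule_files key exists; editing its list in place is left to you.
    echo "add '  - $rules' under the existing rule_files: key in $config" >&2
    return 1
  else
    # No rule_files key yet: appending a new top-level key keeps the YAML valid.
    printf '\nrule_files:\n  - %s\n' "$rules" >> "$config"
    echo "added rule_files entry for $rules"
  fi
}
```

Afterwards, promtool check config /etc/prometheus/prometheus.yml validates the configuration together with the rule files it references, before you restart Prometheus.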
Grafana Alert Rule for Cron Jobs
Create a Panel for All Jobs
Go to Grafana and create a new dashboard.
Add a panel to visualize the metric for all cron jobs.
Query:
cron_job_status
Set Up Alert for All Jobs
Go to the Alert tab in the panel settings.
Click Create Alert Rule.
Set the condition to trigger when any job fails:
WHEN max() OF query (A, 1m, now) IS BELOW 1
Grafana will apply this condition to every cron_job_status series, so the alert fires when any job's status stays below 1 (that is, at 0) for the whole last minute.
Save the Panel
Save the dashboard with the alert rule configured.
Testing the Setup
Simulate Failures for Multiple Jobs
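Write the following to your cron_jobs.prom file to simulate failures for multiple jobs:
echo 'cron_job_status{job="job1"} 0' > /var/lib/node_exporter/textfile_collector/cron_jobs.prom
echo 'cron_job_status{job="job2"} 0' >> /var/lib/node_exporter/textfile_collector/cron_jobs.prom
Keep in mind that the crontab entry for cron_job_metrics.sh rewrites this file every minute, which can reset job1 to 1 before the rule's 1m "for" period has elapsed. Comment that entry out while testing and restore it afterwards.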
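Because Node Exporter can scrape the file between the two echo commands above, a scrape may briefly see only job1. A minimal sketch that writes all simulated values at once via a temporary file and mv (simulate_failures is a hypothetical helper):

```shell
#!/bin/bash
# Sketch: overwrite a .prom file with cron_job_status 0 for every job name given.
# The temp file + mv means Node Exporter never reads a half-written file.
simulate_failures() {
  local metric_file="$1"; shift
  local tmp_file="${metric_file}.$$"   # not *.prom, so the collector skips it
  local job
  : > "$tmp_file"
  for job in "$@"; do
    echo "cron_job_status{job=\"$job\"} 0" >> "$tmp_file"
  done
  mv "$tmp_file" "$metric_file"
}
```

For example: simulate_failures /var/lib/node_exporter/textfile_collector/cron_jobs.prom job1 job2, run as a user that can write to that directory.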
Verify Alerts
Wait for Prometheus to scrape the metrics and Grafana to evaluate the alert rule.
Ensure alerts are triggered for both job1 and job2 in Prometheus and Grafana. Check your email inbox for detailed notifications.