System Design Series: Observability and SRE without Prometheus and Grafana

Ritesh Shergill
6 min read · Jan 8, 2024

System stability and scalability are two of the most desirable properties of a system that serves its clients reliably and with 99.99% availability.

To guarantee that a system stays reliable, observability across its interconnected software systems, services and APIs is an important tenet.

What is Observability?

Observability is a property of a system that has been designed, built, tested, deployed, operated, monitored, maintained, and evolved in acknowledgment of the following facts:

  • No complex system is ever fully healthy.
  • Distributed systems are pathologically unpredictable.
  • It’s impossible to predict the myriad states of partial failure various parts of the system might end up in.
  • Failure needs to be embraced at every phase, from system design to implementation, testing, deployment, and, finally, operation.
  • Ease of debugging is a cornerstone for the maintenance and evolution of robust systems.

Observability includes

👁‍🗨Logs, metrics, and traces

👁‍🗨Instrumentation, such as CPU or memory usage

👁‍🗨Alerting and notification

Another important aspect of governing systems is SRE.

What is SRE?

Site reliability engineering is about the design and development of scalable, distributed, and reliable computing systems.

SRE involves deploying tools for observability to improve the predictability and interoperability of systems. The entire basis of SRE is to identify issues in a timely manner with enough information about the origin, cause and potential fix for the issue.

SRE ensures that maintaining systems is less painful and allows engineering teams to focus less on problem solving and more on system building.

Now that we know about SRE and observability, let's get to the crux of this article: how I applied my knowledge of SRE to build a reliable network of interconnected systems with observability across the grid.

Preamble

The standard way to implement logging and monitoring within microservices is the following -

Prometheus scrapes metrics from the services and exporters, and this information is then visualized in Grafana dashboards.

Multiple Grafana dashboards can be built for Engineering teams, QA teams, Business analysts and Stakeholders to check metrics or KPIs relevant to them.

This is standard fare for SRE and observability. Tools like New Relic are pretty good at providing this tooling out of the box.
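To give a sense of what that standard setup looks like, here is a minimal sketch of a Python service exposing metrics for Prometheus to scrape, using the prometheus_client library; the metric names and the port are illustrative assumptions, not taken from the teams' actual setup.

import random
import time
from prometheus_client import start_http_server, Counter, Gauge

# Illustrative metrics; names and semantics are assumptions for this sketch
REQUEST_COUNT = Counter('app_requests_total', 'Total number of requests handled')
QUEUE_DEPTH = Gauge('app_queue_depth', 'Current depth of the processing queue')

if __name__ == '__main__':
    # Expose a /metrics endpoint on port 8000 for Prometheus to scrape
    start_http_server(8000)
    while True:
        REQUEST_COUNT.inc()                     # count a simulated request
        QUEUE_DEPTH.set(random.randint(0, 50))  # report a simulated queue depth
        time.sleep(5)

Prometheus is then pointed at this endpoint via a scrape job, and Grafana reads the resulting time series for its dashboards.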

The Problem statement

A few teams had already created this framework for monitoring their microservice applications. This was feasible because they were relatively modern platforms deployed on containers, so integrating the out-of-the-box tooling was straightforward.

Adjacent to these modernized applications sat some CLI-based, script-based and monolithic applications that were created a long time ago on frameworks like Spring 3. These were WAR-based applications deployed on VMs, with business logic to validate, transform and batch-load data, and basic exception logging into log files on the machine.

Applying tooling to these disparate applications seemed challenging, and so I was tasked, along with my team, with coming up with a solution.

The requirement was to come up with some sort of mechanism that could collect and collate data from all these disparate applications and somehow push it to the same Grafana dashboards we already had.

The Solution

The first thought I had was inspired by what people say at the end of an interview, when you are waiting for feedback:

“Don’t call us, we will call you”

Therefore, my obvious thought was to install agents on all these disparate machines, which would gather data straight from the horse's mouth and push it to a central repository for further processing.

The agent would be a Python program that would do the following:

1️⃣ Collect application metrics from log files,

2️⃣ Instrumentation from the VM,

3️⃣ Detect error conditions and system down time,

4️⃣ Encapsulate all this information in specific business error formats

And finally,

Push to a central orchestration machine that would gather data from all these sources and push it onto a central repository. In this case, that repository was Splunk.

The Grafana ecosystem would then use a Splunk plugin to pull data for these applications into the dashboards for visualization.

This is what the target architecture looked like:

Agents scrape the log files, instrumentation, application health metrics (if any were exposed through endpoints), etc., create a format for the logs and push the information to the orchestrator. The agents would run periodically on a Cron, waking up, reading info from the VM, pushing it to the orchestrator and then going back to sleep.
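To make the flow concrete, here is a minimal sketch of one wake-push-sleep cycle of such an agent; the orchestrator URL, agent ID and payload fields are illustrative assumptions, and the actual collection logic is closer to the sample scripts further below.

import json
import socket
import time
import urllib.request

# Hypothetical orchestrator endpoint; the real URL came from the agent's config
ORCHESTRATOR_URL = "http://orchestrator.internal:8080/ingest"
AGENT_ID = "agent-001"  # assumed identifier assigned at install time

def collect():
    # Placeholder for the real collection logic (log parsing, VM instrumentation,
    # error detection) shown in the sample scripts later in this article
    return {"status": "UP", "errors": []}

def push(payload):
    # POST the collected data to the orchestrator as JSON
    body = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        ORCHESTRATOR_URL, data=body,
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status

if __name__ == "__main__":
    # One wake cycle: cron triggers the script, it collects, pushes and exits
    payload = {
        "agent_id": AGENT_ID,
        "host_ip": socket.gethostbyname(socket.gethostname()),
        "timestamp": int(time.time()),
        "metrics": collect(),
    }
    push(payload)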

The orchestrator was a collection of VMs maintained in a load-balanced cluster. The orchestration services had only one job: collect the logs and other information, attach the agent ID, the timestamp and the IP address of the machine the agent pushed the message from, and finally wrap the input into category-coded messages to be logged into the Splunk central repository.
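As a rough illustration, an orchestration service along these lines would receive the agent's message, wrap it and forward it to Splunk. The Flask endpoint, the category-coding rule and the Splunk HTTP Event Collector details below are assumptions for the sketch, not our production code.

import time
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumed Splunk HTTP Event Collector endpoint and token for this sketch
SPLUNK_HEC_URL = "https://splunk.internal:8088/services/collector/event"
SPLUNK_TOKEN = "<hec-token>"

def categorize(metrics):
    # Simplified category coding; the real rules mapped business error formats
    return "P1" if metrics.get("errors") else "INFO"

@app.route("/ingest", methods=["POST"])
def ingest():
    agent_msg = request.get_json(force=True)
    event = {
        "event": {
            "agent_id": agent_msg.get("agent_id"),
            "source_ip": request.remote_addr,   # IP the agent pushed from
            "received_at": int(time.time()),
            "category": categorize(agent_msg.get("metrics", {})),
            "payload": agent_msg,
        },
        "sourcetype": "legacy_agent",
    }
    # Forward the category-coded message to the Splunk central repository
    requests.post(SPLUNK_HEC_URL, json=event,
                  headers={"Authorization": f"Splunk {SPLUNK_TOKEN}"},
                  timeout=10)
    return jsonify({"status": "accepted"}), 202

if __name__ == "__main__":
    app.run(port=8080)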

Batch jobs periodically checked for agents that had stopped sending logs after their first registration (genesis), which would signify that the machine was either down or unable to send messages. In that case an alert was raised with the agent ID, the machine's IP address and a severity that Splunk would decipher as a P1, and Grafana would reflect this.

Machines that wouldn’t send data for a while were considered decommissioned or could be marked as such in our Repo.
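The heartbeat check itself boils down to comparing last-seen timestamps against a threshold, roughly as sketched below; the thresholds and the alert fields are illustrative assumptions.

import time

# Assumed thresholds for this sketch
DOWN_THRESHOLD_SECONDS = 15 * 60             # silent for 15 minutes -> raise a P1
DECOMMISSION_THRESHOLD_SECONDS = 7 * 86400   # silent for a week -> candidate for decommissioning

def check_agents(last_seen):
    """last_seen maps agent_id -> (ip_address, last_push_epoch_seconds)."""
    now = time.time()
    alerts, decommission_candidates = [], []
    for agent_id, (ip_address, last_push) in last_seen.items():
        silence = now - last_push
        if silence > DECOMMISSION_THRESHOLD_SECONDS:
            decommission_candidates.append(agent_id)
        elif silence > DOWN_THRESHOLD_SECONDS:
            alerts.append({
                "severity": "P1",          # Splunk treats this as a P1
                "agent_id": agent_id,
                "ip_address": ip_address,
                "silent_for_seconds": int(silence),
            })
    return alerts, decommission_candidates

if __name__ == "__main__":
    # Example: an agent that last pushed an hour ago triggers a P1 alert
    sample = {"agent-001": ("10.0.0.12", time.time() - 3600)}
    print(check_agents(sample))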

This is a sample Python script to read and decipher log files (very similar to what we wrote for our agents):

import re

# Define a function to extract information from a log line
def extract_log(log_line):
    # Customize this regular expression pattern to match your log format.
    # Here, we assume a simple format where each entry starts with a timestamp
    # followed by a message, e.g. "2024-01-08 10:15:32 Batch load completed".
    pattern = r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(.*)$'  # log pattern to extract
    match = re.match(pattern, log_line)

    if match:
        timestamp = match.group(1)
        message = match.group(2)
        return timestamp, message
    else:
        return None, None

# Open your log file for reading
log_file_path = 'path/to/your/logfile.log'
with open(log_file_path, 'r') as log_file:
    for line in log_file:
        timestamp, message = extract_log(line)
        if timestamp and message:
            # Process the extracted information here
            print(f"Timestamp: {timestamp}, Message: {message}")

# The 'with' block closes the file automatically, so no explicit close() is needed

This is a sample Python script to read the VM's information:

import psutil
import socket

def get_local_vm_info():
    try:
        # Get the VM's IP address
        hostname = socket.gethostname()
        ip_address = socket.gethostbyname(hostname)

        # Get system specifications
        cpu_info = f"{psutil.cpu_count(logical=False)} physical cores, {psutil.cpu_count(logical=True)} logical cores"
        memory_info = f"{psutil.virtual_memory().total / (1024 ** 3):.2f} GB"

        # Display VM information
        print(f"Local VM Name: {hostname}")
        print(f"Local VM IP Address: {ip_address}")
        print(f"CPU: {cpu_info}")
        print(f"Memory: {memory_info}")

    except Exception as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    get_local_vm_info()

There were some machines that exchanged data with each other, so to accurately represent those connections I proposed using my open-source Graph Cache framework to maintain what we called a 'Reachability matrix'. This was nothing but a periodic snapshot of the system to check whether services could call each other or whether there was some problem in calling a service.

My article on Graph Cache:

https://medium.com/@riteshshergill/graph-cache-caching-data-in-n-dimensional-structures-1fc077155154

The graph cache is a directed acyclic graph that stores information in nodes and edges. Nodes store data points and edges store metadata about the relationship, very similar to a graph database.

By simply querying the graph cache with an Agent Id or IP address we could decipher which services were interconnected or unable to send data to each other. This information was updated in the graph cache by the orchestrator as it was the single source of truth for all agent data.
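Conceptually, the reachability query worked along these lines; the sketch below uses a plain adjacency map instead of the actual Graph Cache API, purely to illustrate the idea.

# Adjacency map: edge metadata records the last observed call between two agents.
# This is a simplified stand-in for the Graph Cache, not its real API.
reachability = {
    "agent-001": {"agent-002": {"reachable": True,  "last_checked": "2024-01-08T10:00:00Z"},
                  "agent-003": {"reachable": False, "last_checked": "2024-01-08T10:00:00Z"}},
    "agent-002": {"agent-003": {"reachable": True,  "last_checked": "2024-01-08T10:00:00Z"}},
}

def unreachable_from(agent_id, graph):
    """Return the downstream services an agent can no longer call."""
    edges = graph.get(agent_id, {})
    return [target for target, meta in edges.items() if not meta["reachable"]]

print(unreachable_from("agent-001", reachability))  # ['agent-003']

The orchestrator updated these edges on every push, so the matrix always reflected the latest snapshot of which services could reach each other.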

In this manner, we were able to create a comprehensive monitoring solution for applications where we couldn't apply any out-of-the-box tooling directly (such as New Relic, Prometheus, etc.).

Conclusion

If you are deploying to the cloud, it will be an extremely rare occasion when you need to build a solution like this.

But for an on-premises setup where you provision the infrastructure yourself, such use cases can arise, and there you must apply yourself and create homebrew solutions similar to the one described above.

Follow me Ritesh Shergill

for more articles on

👨‍💻 Tech

👩‍🎓 Career advice

📲 User Experience

🏆 Leadership

I also do

✅ Career Guidance counselling — https://topmate.io/ritesh_shergill/149890

✅ Mentor Startups as a Fractional CTO — https://topmate.io/ritesh_shergill/193786
