Aiman Ismail

Scrape cAdvisor using Grafana Alloy

Wed, 10 Jul 2024 17:45:00 +0800

I was having some issues figuring out how to scrape cAdvisor metrics using Grafana Alloy. After googling I came across this k8s-monitoring helm chart and inside there is a configuration for scraping the built-in cAdvisor on the k8s kubelet.

I run Alloy as a single-pod Deployment that scrapes all the nodes in the cluster. Here’s the config that I used to get the metrics:

prometheus.remote_write "default" {
  endpoint {
    url = "https://mimir.example.com/api/v1/push"
  }
}

discovery.kubernetes "nodes" {
  role = "node"
}

discovery.relabel "cadvisor" {
  targets = discovery.kubernetes.nodes.targets

  rule {
    replacement  = "/metrics/cadvisor"
    target_label = "__metrics_path__"
  }
}

prometheus.scrape "cadvisor" {
  job_name          = "integrations/kubernetes/cadvisor"
  targets           = discovery.relabel.cadvisor.output
  scheme            = "https"
  scrape_interval   = "60s"
  bearer_token_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"

  tls_config {
    insecure_skip_verify = true
  }

  forward_to = [prometheus.remote_write.default.receiver]
}

Alloy cadvisor exporter

Alloy provides the prometheus.exporter.cadvisor component that can be used to start a new cAdvisor on the nodes. This is not required if the kubelet running on your nodes already runs cAdvisor, which is the case for me on EKS running on Bottlerocket.

The hidden cost of running your own observability stack

Mon, 24 Jun 2024 13:00:00 +0800

At my latest $job, I was tasked with setting up the LGTM stack (Loki, Grafana, Tempo, Mimir) for observability. Fast forward a few months, and I noticed a hidden aspect of running the stack that I was not expecting: the network cost, specifically the network transfer cost for cross-AZ traffic. At one point we were paying more than $100 per day just for cross-AZ network traffic.

Cross AZ Traffic Amplification

While investigating where the traffic was coming from, I compared the load balancer “Processed Bytes” metric with the Cost Explorer usage for cross-AZ traffic and noticed that the traffic I was actually charged for was about 10x the value reported by the load balancer. It baffled me a bit and made me step back and take a deeper look at the possible points where I’m getting charged:

Collector to load balancer node
load balancer node to ingress controller pod
ingress controller pod to distributor
distributor to ingester

Collector to Load Balancer Node: client routing policy

In my setup, the services are exposed through a load balancer and given a DNS name like loki.example.com. The collectors are configured to send the telemetry data to that URL. Here is my first mistake: I didn’t enable “Availability Zone affinity” for the client routing policy. When enabled, this routes traffic from the collector to the load balancer node in the same AZ, which avoids cross-AZ charges.

The load balancer node will then forward the traffic to the ingress controller pod in the same AZ.

Ingress controller pod to distributor: Kubernetes Topology Aware Routing

From the load balancer node, the traffic is forwarded to the k8s pods through a k8s Service. The default behavior of a Service in k8s is to spread connections across all ready pods without considering which AZ they are in. This means that traffic from the ingress controller pod to the distributor pods will often cross AZs. If you have 3 distributor pods, one in each AZ, roughly 2 out of 3 connections will be routed to a pod in a different AZ.

To keep the traffic from crossing AZs, we can use the Kubernetes Topology Aware Routing feature. The downside is that we need to have at least 3 pods in each AZ, but compute is cheaper than cross-AZ traffic in my use case since I’m using spot instances through Karpenter and getting up to a 70% discount on the node price.

Distributor to Ingester: no workaround

This is the only part I haven’t solved. In the LGTM stack, the distributors use an internal discovery mechanism to get the IPs of the ingesters instead of going through a Kubernetes Service. This means that we cannot use Kubernetes Topology Aware Routing here.

To make things worse, depending on the replication_factor configuration, each distributor might be sending the logs to multiple ingesters, each one multiplying the cross AZ cost that we have to pay.
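A rough sketch of the arithmetic behind this amplification, under assumptions that do not come from the measurements above: destinations are spread evenly across the AZs, each zone-unaware hop picks a destination without looking at the zone, and compression and protocol overhead are ignored. The function names and the example numbers are hypothetical.

```python
def cross_az_fraction(num_azs: int) -> float:
    """Chance that a zone-unaware hop lands in a different AZ,
    assuming destinations are spread evenly across the AZs."""
    return (num_azs - 1) / num_azs


def expected_cross_az_bytes(ingested_bytes: float, num_azs: int,
                            zone_unaware_hops: int,
                            replication_factor: int) -> float:
    """Expected cross-AZ bytes for data entering the stack.

    zone_unaware_hops counts the hops before the distributor that ignore
    zones (for example collector -> load balancer, ingress -> distributor).
    The distributor -> ingester hop is counted replication_factor times
    because each write is sent to that many ingesters.
    """
    per_hop = ingested_bytes * cross_az_fraction(num_azs)
    return per_hop * (zone_unaware_hops + replication_factor)


if __name__ == "__main__":
    # Hypothetical input: 100 GB ingested, 3 AZs, 2 zone-unaware hops,
    # replication_factor of 3.
    print(expected_cross_az_bytes(100, 3, 2, 3))
```

Fixing a hop (AZ affinity, topology aware routing) removes it from zone_unaware_hops, while the replication term stays as long as the ingesters are spread across AZs.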

Special use case: getting logs from external source

Besides the use case above, our company also has another use case where the logs come from external sources instead of from our internal network. In this case, I actually managed to eliminate the cross-AZ cost completely by deploying the LGTM stack in just one AZ. The load balancer is also configured to use only one subnet, in that same AZ.

Conclusion

All of the above factors might explain why the amount I was charged for cross-AZ traffic is 10x bigger than the amount received at the load balancer. I outlined some of the possible points where the cross-AZ charges come from and how to fix them. Hope it helps!

Using Steampipe + DuckDB for VPC Flow Logs Analysis

Fri, 26 Jan 2024 20:00:00 +0800

As a so-called Tech Janitor, I’ve been tasked with cleaning up one of our AWS accounts at work, and that account has a bunch of EC2 instances that no one knows the purpose of. So, I’ve decided to use one of AWS’s features, VPC Flow Logs, to first identify which EC2 instances are still being used and which are not.

Setting up the VPC Flow Logs and query using DuckDB

For our purpose, I’ve set up VPC flow logs to send all the traffic data to an S3 bucket that we’ll refer to as vpc-flow-logs-bucket in this post. The flow logs are stored in Parquet format for querying later using DuckDB.

Once the flow log files are sent to S3, I’ll be able to query them using DuckDB. To do that we will need to install the aws and httpfs extensions.

From DuckDB shell:

> INSTALL aws;
> INSTALL httpfs;

We also need to load our AWS credentials into DuckDB. Luckily, DuckDB has a built-in command to do that:

> CALL load_aws_credentials();
┌──────────────────────┬──────────────────────────┬──────────────────────┬───────────────┐
│ loaded_access_key_id │ loaded_secret_access_key │ loaded_session_token │ loaded_region │
│ varchar │ varchar │ varchar │ varchar │
├──────────────────────┼──────────────────────────┼──────────────────────┼───────────────┤
│ <redacted> │ <redacted> │ │ eu-west-1 │
└──────────────────────┴──────────────────────────┴──────────────────────┴───────────────┘

This will look for your AWS credentials based on the standard AWS credentials file location. If you have multiple profiles in your credentials file, you can specify which profile to use by passing the profile name as an argument to the load_aws_credentials function.

Now it’s time to load our VPC flow logs from S3 into a table in DuckDB. You can replace the year/month/day/hour with the actual date and hour of the flow logs that you want to load or use * for any or all of them to load all the flow logs. I’ll be loading all the flow log records into a table flow_logs in DuckDB.

This might take a while since DuckDB will have to download the Parquet files from S3 and load them into memory. It took several minutes to finish loading in my case.

> CREATE TABLE flow_logs AS SELECT * FROM read_parquet('s3://vpc-flow-logs-bucket/AWSLogs/<aws-account-id>/vpcflowlogs/<region>/<year>/<month>/<day>/<hour>/*.parquet');

Now we can see that the flow log records only contain the network interface ID (ENI) of the EC2 instance, not the EC2 instance ID or name itself. That won’t be enough for my use case since I want to identify which traffic is flowing to which EC2 instance. Therefore, we need to correlate the ENI with the EC2 instance ID, and here’s where Steampipe comes in.

Steampipe: directly query your APIs from SQL

Steampipe is a tool that allows you to query APIs from SQL. It supports a lot of different APIs from AWS, GCP, Azure, Github, etc. You can also write your own plugins to support other APIs. I’ll be using it to query my AWS account for the EC2 instance ID and name based on the ENI ID from the VPC flow logs.

Life before Steampipe

Usually, to do the things I’m about to show below, I’d pull the data from AWS using the aws-cli and then massage it using jq/yq/awk/sed, or maybe Python if I’m desperate. Then I’d use some other tools to visualize it or export it to CSV. With Steampipe, pulling the data from AWS is simple, and using SQL to correlate the data with other information sources is a breeze.

Steampipe is just PostgreSQL

Under the hood, Steampipe runs PostgreSQL. It even allows you to run it as a standalone instance in the background and connect to it from any third-party tool that can connect to a PostgreSQL instance. Here’s where it gets interesting: DuckDB can connect to any PostgreSQL database through its postgres extension and query it as if the data inside that database lived in DuckDB. This means that we can use Steampipe as a data source for DuckDB and access all of the AWS resource data available in Steampipe.

Setting up Steampipe and DuckDB connection

To run Steampipe in service mode, you’ll need to run the following command to start the PostgreSQL instance and get the credentials for connecting to it:

$ steampipe service start
Database:
Host(s): 127.0.0.1, ::1, 2606:4700:110:8818:e17b:f78c:6c52:dccb, 172.16.0.2, 2001:f40:909:8e2:207a:634a:2070:d99d, 2001:f40:909:8e2:1cdb:75da:2a70:4b05, 192.168.100.23, 127.0.2.3, 127.0.2.2
Port: 9193
Database: steampipe
User: steampipe
Password: ********* [use --show-password to reveal]
Connection string: postgres://steampipe@127.0.0.1:9193/steampipe

Then inside DuckDB shell, you can connect to the Steampipe PostgreSQL instance using the following command:

> ATTACH 'dbname=steampipe user=steampipe password=23e2_4853_bd96 host=127.0.0.1 port=9193' AS steampipe (TYPE postgres);
> use steampipe.aws;
> SHOW tables;
┌─────────────────────────────────────────────┐
│ name │
│ varchar │
├─────────────────────────────────────────────┤
│ aws_accessanalyzer_analyzer │
│ aws_account │
│ aws_account_alternate_contact │
...

Now we can see all the tables from the Steampipe PostgreSQL instance. For my use case I’ll be using the aws_ec2_network_interface table, which contains both the network interface ID (ENI) and the EC2 instance ID, so I can JOIN it with the VPC flow log records to map the records to the EC2 instance ID.

JOIN-ing it all together

Here’s an example query that will give me the count of all incoming traffic to the instances grouped by the port number:

-- Count incoming flow log records per instance and destination port.
select
  i.title,
  fl.dstport,
  count(fl.dstaddr) as traffic
from aws_ec2_network_interface ni
left join memory.flow_logs fl on fl.interface_id = ni.network_interface_id
left join aws_ec2_instance i on i.instance_id = ni.attached_instance_id
where
  -- keep only records whose destination is the instance itself (incoming traffic)
  fl.dstaddr = i.private_ip_address
group by i.title, fl.dstport
order by traffic desc, fl.dstport asc;

From this information I’ll be able to guess which service is running on those instances and take the next step towards migrating or deprecating the instances.
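To try out the shape of this join without an AWS account, here is a small sketch that uses Python’s built-in sqlite3 module in place of DuckDB and Steampipe. The three tables only carry the columns the join needs, and every row is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE aws_ec2_network_interface (network_interface_id TEXT, attached_instance_id TEXT);
    CREATE TABLE aws_ec2_instance (instance_id TEXT, title TEXT, private_ip_address TEXT);
    CREATE TABLE flow_logs (interface_id TEXT, dstaddr TEXT, dstport INTEGER);

    -- Made-up rows, for illustration only.
    INSERT INTO aws_ec2_network_interface VALUES ('eni-1', 'i-1');
    INSERT INTO aws_ec2_instance VALUES ('i-1', 'legacy-web', '10.0.0.5');
    INSERT INTO flow_logs VALUES
        ('eni-1', '10.0.0.5', 443),
        ('eni-1', '10.0.0.5', 443),
        ('eni-1', '10.0.0.5', 22),
        ('eni-1', '10.0.9.9', 5432);  -- outgoing traffic, filtered out below
""")

# Map each flow log record to its instance through the ENI, keep only
# records whose destination is the instance itself, then count per port.
QUERY = """
    SELECT i.title, fl.dstport, count(*) AS traffic
    FROM aws_ec2_network_interface ni
    JOIN flow_logs fl ON fl.interface_id = ni.network_interface_id
    JOIN aws_ec2_instance i ON i.instance_id = ni.attached_instance_id
    WHERE fl.dstaddr = i.private_ip_address
    GROUP BY i.title, fl.dstport
    ORDER BY traffic DESC, fl.dstport ASC
"""

rows = conn.execute(QUERY).fetchall()
for title, port, traffic in rows:
    print(title, port, traffic)
```

The record with destination 10.0.9.9 is dropped because it does not match the instance’s private IP, which is how the query separates incoming traffic from outgoing traffic.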

Conclusion

It is kinda mindblowing that I can do all this using SQL. Both Steampipe and DuckDB are great products and the flexibility of those tools allows me to pick and choose the best tool for the job. I first came across Steampipe in one of the podcasts that I listen to but haven’t really used it much. Now, after having the opportunity to use it to solve one of my problems, I’ll definitely pay more attention to it to make my tech janitor life easier in the future ;)

Terraform modules: be opinionated

Fri, 08 Dec 2023 17:21:59 +0800

Modules are containers for multiple resources that are used together.

Terraform modules are a way to bundle a bunch of Terraform resources into one group. Although not explicitly mentioned in the definition, a module is also a way to provide an abstraction over the resources inside it and only expose the inputs and outputs that are relevant to the users of the module.

Terraform Module as an Abstraction Layer

A Terraform resource usually tends to be generic in that it allows you to configure it in multiple ways through the input arguments that it accepts. For example, the aws_mskconnect_connector resource has 3 options for log delivery: CloudWatch Logs, Kinesis Data Firehose, or S3. In most cases, your module shouldn’t expose all 3 options to your users. Your organization probably already has a standard place where you send your logs, for example S3, from where they get forwarded to another log search service for later use.

Therefore, your module should only expose the S3 option and leave out CloudWatch Logs and Kinesis Data Firehose. By doing this you take the choice away from the users, so they don’t have to think about which one to use. Your Terraform code can be a lot simpler too.

Module Abstraction in Action

Here’s an example module code when including all 3 options:

variable "log_delivery_s3" {
  type = object({
    bucket_arn = string
    prefix     = optional(string, "mskconnect-logs")
  })

  default = null
}

variable "log_delivery_cloudwatch_logs" {
  type = object({
    log_group = string
  })

  default = null
}

variable "log_delivery_firehose" {
  type = object({
    delivery_stream = string
  })

  default = null
}

resource "aws_mskconnect_connector" "example" {
  name = "example"

  # Other required arguments are omitted to keep the example short.

  log_delivery {
    worker_log_delivery {
      # Each dynamic block is rendered only when its variable is set.
      dynamic "s3" {
        for_each = var.log_delivery_s3 != null ? [1] : []

        content {
          enabled    = true
          bucket_arn = var.log_delivery_s3.bucket_arn
          prefix     = var.log_delivery_s3.prefix
        }
      }

      dynamic "cloudwatch_logs" {
        for_each = var.log_delivery_cloudwatch_logs != null ? [1] : []

        content {
          enabled   = true
          log_group = var.log_delivery_cloudwatch_logs.log_group
        }
      }

      dynamic "firehose" {
        for_each = var.log_delivery_firehose != null ? [1] : []

        content {
          enabled         = true
          delivery_stream = var.log_delivery_firehose.delivery_stream
        }
      }
    }
  }
}

And here’s an example when we only offer S3 option:

variable "log_delivery_s3" {
  type = object({
    bucket_arn = string
    prefix     = optional(string, "mskconnect-logs")
  })

  default = null
}

resource "aws_mskconnect_connector" "example" {
  name = "example"

  log_delivery {
    worker_log_delivery {
      s3 {
        enabled = var.log_delivery_s3 != null
        # try() avoids an error when the variable is left at its null default
        bucket_arn = try(var.log_delivery_s3.bucket_arn, null)
        prefix     = try(var.log_delivery_s3.prefix, null)
      }
    }
  }
}

See how much simpler and shorter the code becomes? No need to use those dynamic blocks just to conditionally add or remove the log delivery option blocks anymore.

But I might need to use that other option in the future…

YAGNI.

Access internal kubernetes services from anywhere using Tailscale

Tue, 12 Sep 2023 02:05:00 +0800

I recently set up a local Kubernetes cluster in my home network to play with, and one of the issues that I faced is that it is hard to access the services inside the cluster from my laptop. I don’t have a load balancer in my setup, so every time I want to access a service from my laptop, I have to run kubectl port-forward first before using the localhost address to access it. It works but it’s annoying.

Usually in cloud environments like AWS, you would set up an ingress controller that provisions a load balancer for you, and use that load balancer to expose the services inside your cluster to the internet using Ingress resources. Your incoming traffic from the internet is then routed through the load balancer into your cluster and onto your pods. Unfortunately, you don’t get the same thing when hosting your cluster locally, outside of a cloud environment. You have to manually configure your network to allow access from the internet.

So what options do we have?

One way to do it is to use a Service of type NodePort, which opens a port on the host so you can reach the pods using the host IP. This allows access to services inside the cluster from your local network but still not from the internet. To allow access from the internet, you’ll have to open a port on your router and forward it to the host IP and port of the NodePort service.

I’m not a fan of opening my home network to the internet. Home routers are infamous for being vulnerable and easily exploitable. I don’t want mine to be part of a new legion of botnets that will break a new record for the biggest DDoS attack. Using the NodePort service type is also not that great. With a NodePort service, you’ll have to specify the node IP to access along with the assigned port, and your traffic will always enter the cluster through that node. There are more reasons why NodePort is a bad idea on StackOverflow.

What other option do we have? Tailscale!

Tailscale Subnet Router

Tailscale is a mesh VPN built on top of WireGuard. I’ve been using it for a long time for accessing my personal servers at home while I’m outside and I love it. It is so simple to set up that you don’t have to know any networking magic to use it. Tailscale will create a peer-to-peer network from your client to your other Tailscale devices, and it is also really smart in figuring out a way to punch a hole through your home network (see the Resources section) to connect to the internet, so you don’t have to open a port on your home router anymore. Bot legion problem solved!

One of the ways you can use Tailscale is by configuring a Tailscale node as a subnet router. Usually, when you have 10 devices in your network, you’ll have to install Tailscale on each of those devices to connect them to your VPN network. With a subnet router, only one Tailscale node in that network is enough, as long as that subnet router node has network access to all the devices in that network. You’ll have to configure your subnet router to advertise the route of the internal cluster network that it sits in as a CIDR range, e.g. 10.43.0.0/16, so that other devices outside of that network know to go through the subnet router when they want to reach an IP address in that CIDR range.

ELI5: subnet router advertisement

You’re a postman trying to deliver a parcel. Your parcel’s destination is set to unit A-1-2-3 in the TRX Exchange 106 building. You’ve never been to TRX before so you don’t know which floor the office is actually on, but you notice a big signboard at the reception saying “Come here if you have a parcel for units A-1-0-0 to A-9-9-9”. So, you go to the reception and the nice lady there gives you directions to office unit A-1-2-3 so you can deliver your parcel.

The reception here is like our subnet router. All the traffic meant for the network have to go through the subnet router first, then they’re passed through to the actual packet destination.
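The signboard check can be sketched with Python’s ipaddress module. This is only an illustration of matching a destination against an advertised CIDR range, not how Tailscale implements routing; the routing table and the router name k8s-subnet-router are hypothetical.

```python
import ipaddress

# Hypothetical routing table: advertised CIDR range -> node that routes it.
ADVERTISED_ROUTES = {
    ipaddress.ip_network("10.43.0.0/16"): "k8s-subnet-router",
}


def next_hop(destination: str):
    """Return the subnet router responsible for destination, or None."""
    addr = ipaddress.ip_address(destination)
    for network, router in ADVERTISED_ROUTES.items():
        if addr in network:
            return router
    return None


if __name__ == "__main__":
    print(next_hop("10.43.170.163"))  # inside 10.43.0.0/16
    print(next_hop("192.168.1.10"))   # not covered by any advertised route
```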

I don’t want to remember all these IP addresses

With a subnet router, you can now reach any of the services inside the cluster using the ClusterIP of that service, but IP addresses are not human-friendly and you don’t want to (you can’t!) memorize the IP addresses of all the services inside the cluster. So, now we need something that will map our IP addresses to a human-friendly format. Sounds familiar? We can use DNS records.

You can definitely create DNS records manually and map them to the ClusterIP of each of your services. That’s what I did for testing when validating this setup, actually. At scale, that won’t work tho. You don’t want to be the one to manually go to your DNS provider and create the records one by one. Luckily, in kubernetes there is an application called external-dns.

external-dns to the rescue

external-dns is an application that runs inside your kubernetes cluster and it periodically queries the kubernetes API for the list of all Service and Ingress resources. From the list, it checks whether it should create a DNS record for the resources based on the resource annotation. It supports a lot of DNS providers like AWS Route53, Cloudflare, Google Cloud DNS and more. For my setup I’m using Cloudflare.

By default, external-dns will only create DNS records for Ingress resources or Services of type LoadBalancer. For my setup, since I’m self-hosting the cluster inside my home network and don’t have access to a load balancer, I have to add an extra configuration parameter to external-dns so that it will also create DNS records for Services of type ClusterIP. On the Service resource itself, external-dns usually looks for the external-dns.alpha.kubernetes.io/hostname annotation, but since we’re using it with ClusterIP, I have to change it to external-dns.alpha.kubernetes.io/internal-hostname.
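As a sketch of what such an annotated Service could look like, the snippet below builds the manifest as a Python dictionary and prints it as JSON, which kubectl accepts as well as YAML. The service name, namespace, selector label, and port 9090 are assumptions for illustration; only the annotation key comes from the setup described above.

```python
import json


def internal_service_manifest(name: str, namespace: str,
                              hostname: str, port: int) -> dict:
    """Build a ClusterIP Service manifest carrying the internal-hostname annotation."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {
                # external-dns reads this annotation for ClusterIP services
                "external-dns.alpha.kubernetes.io/internal-hostname": hostname,
            },
        },
        "spec": {
            "type": "ClusterIP",
            "selector": {"app": name},  # hypothetical pod label
            "ports": [{"port": port, "targetPort": port}],
        },
    }


if __name__ == "__main__":
    manifest = internal_service_manifest(
        "prometheus", "monitoring", "prometheus.k8s.pokgak.xyz", 9090
    )
    print(json.dumps(manifest, indent=2))
```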

Tailscale + external-dns = ❤️

➜ dig prometheus.k8s.pokgak.xyz +short
10.43.170.163

With those changes applied, all new Service resources in my kubernetes cluster that have the annotation will get a DNS record on Cloudflare. Now, if I try to resolve the name of a service inside the cluster, it returns an internal ClusterIP. Combined with the Tailscale subnet router we’ve configured earlier, you can now access services inside your cluster from any of your Tailscale devices, from any part of the world.

With Tailscale, you also get an additional layer of authentication. Only users in your Tailscale network can access the exposed services. Others might be able to guess what you have running in your cluster from your DNS records, but they won’t be able to access it since all the IPs are private IPs.

For the next part, I’m looking into exposing some services inside my cluster fully to the internet, without having to be in the Tailscale network. Tailscale Funnel is supposed to do just that, but I still haven’t tested whether it works with services inside kubernetes.

Resources

How Tailscale Works: explanation on how Tailscale uses Wireguard to create a mesh VPN network architecture.
How NAT traversal works: recommended read even if you’re not a networking geek. You’ll learn a thing or two about networking for sure.
Full configuration for external-dns helm chart
external-dns annotation for the services
Tailscale subnet router inside kubernetes
k8s manifest for my subnet router

On Public Speaking

Fri, 08 Sep 2023 21:37:00 +0800

I had the chance to give a talk at my local DevOps meetup last July and recently, did another one at a Tech Talk event hosted by my current employer. Both talks are related to OPA but this blog post will be more of a personal reflection for me just to document these moments in my career.

There are two common things that I noticed about myself from giving the talk and I’m gonna describe them here.

The First Common Thing

Firstly, on the days leading up to the event I would slowly get agitated as the reality of having to give the talk crept up on me. I tend to procrastinate when faced with a big assignment like this. I will be so productive and do everything, from playing a new game, thinking about a new project, and cleaning my house to cooking and eating, except preparing and finishing the slides for the talk.

On the day of the talk itself, my productivity will be out of the window. I’d be reading, or doing something else but the thought of having to do the talk later is always on the back of my head. A few hours before the talk, I’m still polishing my slides, making last-minute changes, and going through the slides a couple more times in my head but around two hours before, I start distracting myself and doing other stuff. I’ll go talk to my colleagues, read up on unrelated topics, and even get the time to catch up on my novels. This is when things are getting real for me.

Usually at these events, there will be food served before the event starts. For the first one, they had pizza and for the second event, they served rice with chicken curry and some kuih. They look really good tbh but I definitely won’t be touching those. I’m already struggling enough to keep my nerves together, and the thought of having the taste of the food at the back of my throat when talking later will throw me into full-blown panic mode. It’s like my senses are heightened and I get so sensitive to my surroundings that even keeping up a conversation with others makes me feel overwhelmed. My best companion at this moment is a bottle of mineral water: it’s tasteless, and I can fiddle with the bottle cap to distract myself.

It might feel like I’m exaggerating here, but I can tell you that me before and me after finishing the talk are so different that it feels like we’re two totally different people.

The Second Common Thing

When the host of the event invites me to the stage, I’ll bring my laptop, set it up, and connect it to the projector. Then, holding the mic, I’ll take a moment to look over the audience and take a deep breath.

Once I start talking, it’s like a switch flipped and all that nervousness is gone.

Just like how I’d practiced, I would go through the slides, following the flow that I’ve set when putting the slides together. I like my presentation to have a story. I think it is more engaging to the audience and it is also a lot easier for me to remember what to say. I probably need to work on my tempo still (I blazed through 40+ slides in 20+ minutes 😆) but at least I wasn’t too nervous and blanked out mid-speaking. In the most recent talk that I gave, I even managed to slip in a joke during my introduction. Quite proud of that one. Hah!

What’s the magic?

I’m not sure if there’s any science behind this, but one factor that I think contributed to this flip in the switch is my confidence in the topic. Before giving my first talk, I had been reading, thinking, and playing with the technology for about a year on and off. I won’t say that I mastered the topic, but at least for the part that I’m sharing, I feel like I have something to give that the audience can benefit from.

For my second talk, I’ve been involved in the development process from day one — together with my team members, of course. So, at this point, I can say that I know the topic quite well.

It’s a journey

My journey with public speaking started a long time ago. I used to be so nervous waiting for my turn to introduce myself to the class on the first day of school. Having all the attention on me, even for a brief moment when I barged into a conversation with my friend group during our lepak session at the mamak, could make my ears turn red.

My first time speaking in front of a large audience was in a high school public speaking competition when I was 13. I was chosen because my English was pretty good for the class, not because I was good at public speaking, mind you. Being the inexperienced public speaker that I was, I printed my speech text on a big A4 paper and brought it to the stage, holding it at full size like that, not folding the paper whatsoever.

I didn’t manage to memorize the whole thing so I kept looking at the text during my speech. After I finished, my friends told me they noticed how nervous I was from how the A4 paper in my hand was shaking so badly during the whole speech. I’ll always remember that one.

Final Note

I can gladly say that I’ve improved a lot since then. Being exposed to these situations countless times made me realize that it is natural to feel nervous and I know that I’ll come out better after going through it.

For those who might struggle with the same issue, just know that others are going through the same thing too and most importantly, take the risk and put yourself out there and eventually you’ll get to the point where you’ll overcome that fear.

Special thanks to the organizers for giving me the chance to put myself out there and my colleagues and friends for supporting me.

Interview Series: Explain How Kubernetes DNS works

Wed, 24 May 2023 23:00:00 +0800

This will be the first in my interview questions series. I’ll compile interesting questions that I got from my experience interviewing for DevOps/SRE roles in Malaysia.

Calling a service by its cluster-internal DNS

We’ll go from the highest to the lowest level in this journey. So let’s go through the scenario a bit: you have two services, foo and bar. Those two services live in the same namespace app in your cluster. Now, inside the code of service foo, it makes an HTTP request to service bar. Probably something like so:

http.get("https://bar/")

What happens behind the scenes from when the request is made to service bar until the response is received back by service foo?

What is that weird DNS format?

You might’ve noticed that we’re calling the service bar using a weird name. Instead of the usual something.com domain, we’re just using bar directly. How is this possible?

Kubernetes allows you to call other services by using the Service resource name directly. This works because the short name gets expanded into the full DNS name before it is looked up. When you make a request to bar, the DNS resolver inside the pod notices that the name is not “complete” (not fully qualified), so it appends the rest of the domain name based on the configuration that was given to it and sends the result to the cluster DNS server. If the service is running inside the namespace app, bar becomes bar.app.svc.cluster.local.

The suffix used for this automatic completion is called a “search domain”. In our example the first search domain is configured as app.svc.cluster.local. So, whenever the service makes a call to bar, the resolver appends the search domain and tries to resolve the resulting domain name.

How (and where) is this configured?

Every pod in kubernetes has a file /etc/resolv.conf that is configured by the kubelet when starting the pod. This file tells the pod where to find the DNS server inside the cluster and what to use as the search domains. Here’s an example of the file (source):

nameserver 10.32.0.10
search <namespace>.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
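To see how the search and ndots settings interact, here is a small sketch of the order in which a resolver would try names. It follows the common glibc-style behavior as I understand it (other resolver implementations can differ), and the function name candidate_names is made up for this example.

```python
def candidate_names(name: str, search: list[str], ndots: int) -> list[str]:
    """Return the fully qualified names a resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as already fully qualified.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, the bare name last.
    return expanded + [absolute]


if __name__ == "__main__":
    search = ["app.svc.cluster.local", "svc.cluster.local", "cluster.local"]
    for candidate in candidate_names("bar", search, ndots=5):
        print(candidate)
```

With ndots:5, a short name like bar has fewer than five dots, so the search domains are tried first and bar.app.svc.cluster.local is the first lookup.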

Which IP will be returned by the DNS query?

The DNS query will return a virtual service IP. Why virtual? It’s because this IP doesn’t actually point to a pod that runs our services.

In kubernetes, pods can come and go at any time, which also means that their IPs change all the time. How do we know then where to send our requests? The Service resource is used to abstract away the dynamic nature of pod IPs and provide a consistent IP that your application can send requests to.

How does the service IP map to pod IPs?

The Service resource always comes with its pair, the Endpoints (or EndpointSlice) resource. This resource tracks the pod IPs and also records which pod IPs are ready to receive traffic. This information can be queried using the kubernetes API.

On the node where the pod runs, there is a program called kube-proxy that runs and updates the routing to map from service IP to pod IP. This routing can be done in multiple ways but currently the default is using iptables.

When does this routing happen?

When a request is first sent from the application code, its destination is set to the service IP, but before the request is sent out over the network, iptables modifies the destination and changes the service IP to a pod IP. If there are multiple pods behind a service, iptables picks one of the ready pod IPs for each new connection (at random in the default iptables mode, rather than in strict round-robin order). Once the destination IP is changed, the packet is sent out over the network.

How do you know which node to send the packet to?

A kubernetes cluster can contain a lot of nodes, so sending the packet to the correct node is important. For that, the network (for example the routers in your network) needs to know which node hosts each pod IP. If you set up your own cluster à la kubernetes-the-hard-way, you might need to configure these routes yourself, but if you’re running kubernetes on top of a cloud provider, they will usually do this setup for you and you don’t have to do anything here.

Once that is sorted, your packet can reach the correct node, where it is delivered to the correct pod based on the destination pod IP in the packet header. The response is then sent to the source pod IP found in the request packet header.

The response is now sent back to the source node. All done?

Not yet. There’s one last thing to do. Remember that when we originally sent the request, iptables rewrote the destination from the service IP to a pod IP? For the response packet to be accepted by the requesting pod, the pod IP that we rewrote needs to be converted back to the service IP.

This is needed because, as far as the application knows, it sent a request to the service IP and not to the pod IP. If it suddenly receives a response from a pod IP that it doesn’t know about, it will just drop the response. So iptables has to remember what it did before (the kernel’s connection tracking keeps this state) and convert the pod IP on the response packet back to the service IP. Finally, our foo service can receive the response that it wants from the bar service.
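To make the two rewrites concrete, here is a toy Python model of the request and response path described above. It is only a sketch: the IP addresses are made up, the `ServiceNat` class is hypothetical, and real connection tracking keys on the full connection tuple (including ports and protocol) rather than on a pair of IPs.

```python
import random


class ServiceNat:
    """Toy model of the destination rewrite and its reversal (not real iptables)."""

    def __init__(self, endpoints):
        # service IP -> list of ready pod IPs (what kube-proxy learns from the API)
        self.endpoints = endpoints
        # (client IP, pod IP) -> service IP, remembered for the reply path
        self.conntrack = {}

    def outgoing(self, packet):
        """Rewrite the destination from the service IP to one of the pod IPs."""
        pods = self.endpoints.get(packet["dst"])
        if not pods:
            return packet  # not a service IP, leave the packet untouched
        pod_ip = random.choice(pods)
        self.conntrack[(packet["src"], pod_ip)] = packet["dst"]
        return {**packet, "dst": pod_ip}

    def incoming(self, packet):
        """Rewrite the reply's source from the pod IP back to the service IP."""
        service_ip = self.conntrack.get((packet["dst"], packet["src"]))
        if service_ip is None:
            return packet
        return {**packet, "src": service_ip}


# Hypothetical addresses: one service IP backed by two pods.
nat = ServiceNat({"10.96.0.10": ["10.244.1.5", "10.244.2.7"]})
request = nat.outgoing({"src": "10.244.0.3", "dst": "10.96.0.10"})
reply = nat.incoming({"src": request["dst"], "dst": request["src"]})
print(request["dst"], reply["src"])
```

The application only ever sees the service IP: it sends to `10.96.0.10` and the reply appears to come from `10.96.0.10`, even though a pod IP was used on the wire.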

Instrumenting CI Pipelines using otel-cli

Sat, 08 Apr 2023 12:18:00 +0800

Why?

Why not? Get the whole picture of what is happening in your pipeline. Get notified when something is taking longer than it should.

How?

Use otel-cli, a standalone Go binary that can create OpenTelemetry traces and send them to a tracing backend using the OTLP protocol.

OpenTelemetry?

https://opentelemetry.io/

Tracing Backend?

You collect traces from your application using the OpenTelemetry SDK. To visualize the relationship between the traces, you’ll have to send them to a tracing backend, which provides a UI for exploring your traces. Examples of tracing backends:

Self-hosted:

Grafana Tempo
Jaeger
ElasticSearch

Paid:

Honeycomb
Datadog
Grafana Cloud
ElasticSearch Cloud

OTLP Protocol

The OpenTelemetry Protocol (OTLP) specification describes the encoding, transport, and delivery mechanism of telemetry data between telemetry sources, intermediate nodes such as collectors and telemetry backends.

https://opentelemetry.io/docs/reference/specification/protocol/otlp/

otel-cli?

OpenTelemetry (OTel) provides SDKs for many languages to create traces from your application, but in CI pipelines you’re usually using a shell scripting language like Bash, which is not currently supported by any OTel SDK. Therefore, we need a tool to create these traces for us.

otel-cli is a tool that does exactly that. It generates a trace ID and a span ID, and sends the traces in the expected format.
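To give a feel for what “generating a trace ID and span ID” means, here is a minimal Python sketch. It assumes the ID sizes and the `traceparent` layout from the W3C Trace Context specification; the function names are my own and otel-cli’s internals may well differ.

```python
import secrets


def new_trace_id():
    # 16 random bytes, rendered as 32 lowercase hex characters
    return secrets.token_hex(16)


def new_span_id():
    # 8 random bytes, rendered as 16 lowercase hex characters
    return secrets.token_hex(8)


def traceparent(trace_id, span_id, sampled=True):
    # W3C Trace Context layout: version-traceid-spanid-flags
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


trace_id = new_trace_id()
root = new_span_id()
child = new_span_id()

# Both spans share the trace ID, which is what ties them into one trace.
print(traceparent(trace_id, root))
print(traceparent(trace_id, child))
```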

How to use otel-cli?

The simplest way to start using it is first to set the OTEL_EXPORTER_OTLP_ENDPOINT value to tell otel-cli which backend to send our traces to.

Starting a local tracing backend server

otel-cli has a server subcommand that you can use to run a simple tracing backend on your local machine. You can run the following command in another terminal to start the server:

otel-cli server tui

Setting the tracing backend endpoint

Now that we have a server running locally to send our traces to, let’s tell otel-cli to send all the traces that it generates to this local server:

export OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317

Here we send it to localhost on port 4317. Port 4317 is the default port when sending traces using gRPC.

Sending our first trace

You can use the exec subcommand to wrap a command with otel-cli. It will automatically set the start and end time to calculate the run duration for the command:

otel-cli exec --service my-service --name "My First Trace" echo "HELLO WORLD"

You should then see a new line in the other terminal, where we ran otel-cli server tui just now.

Conclusion

In this article, I showed you the simplest way to use otel-cli. To get more valuable information from your traces, you’ll usually need to add nested spans to your trace. They help break down the execution of your program into smaller units that can be inspected. For more advanced examples, refer to the otel-cli examples.

How to use OPA policy from an S3 bucket when using Atlantis policy check

Wed, 16 Nov 2022 22:00:00 +0800

Atlantis is an application for collaborating on a Terraform code base using pull requests, and one of its features is running conftest to test a set of defined OPA policies. At the time of writing, Atlantis only supports local sources, i.e. the local filesystem, as the source of the policies. In this article, I’ll show an example of how to use an S3 bucket as the source instead.

Custom workflow and `run` step

Atlantis supports custom workflows to override the default commands that it runs and, as part of that feature, it supports defining custom commands to run as steps in each stage. We will be ~~abusing~~ using this feature to override the default conftest command that Atlantis uses and specify our policy through the --update flag of conftest.

conftest `--update` flag

By using --update you can tell conftest to pull the policy first every time it runs the tests. We will be using an S3 bucket as our source, but before we can pull from S3, you have to make sure that wherever the Atlantis server is running, it has access to the bucket and permission to pull objects from it. In my case, Atlantis is running as a StatefulSet inside a Kubernetes cluster, so I have already configured the IAM permissions it needs to access the bucket.

conftest uses the go-getter package underneath to pull these policies, so technically it should also be possible to pull from other sources that go-getter supports, not just S3.

Result

Combining both of the features described above, here’s an example of a simplified repo config that I use:

# minimal config for brevity; you might need to configure more options to make Atlantis work properly
repos:
  - id: github.com/$ORG/$REPO
    workflow: custom

workflows:
  custom:
    policy_check:
      steps:
        - show # important: don't skip this step
        - run: conftest test $SHOWFILE --update s3::https://s3-us-east-1.amazonaws.com/$BUCKET_NAME/policy

policies:
  policy_sets:
    - name: policy-from-s3
      path: /home/atlantis/policy
      source: local

In the example above, under the workflows key, I’m defining a custom workflow named custom, and inside that workflow I’m overriding the default policy_check steps with my own. My custom policy_check steps consist of the show step and the custom run step. The show step is crucial, since this is when Atlantis runs terraform show to convert your Terraform planfile to a JSON-formatted file.

When using the custom run step, Atlantis stores the path to this JSON-formatted file in the variable $SHOWFILE, so you can see that my conftest command uses $SHOWFILE to run conftest against that file. Optional: if you want to run conftest against the Terraform files too, you can add *.tf after $SHOWFILE and it will include all the *.tf files in that project directory.

Next comes the --update flag. To specify the S3 bucket, I’m using the URL format defined by the go-getter package, replacing $BUCKET_NAME with the name of the bucket that I have configured with the correct permissions and network access. This is how I structured the files inside the S3 bucket: I put all the OPA policies inside a policy folder, since conftest complained when I put the policies directly at the root of the bucket. YMMV.

$BUCKET_NAME/
├─ policy/
│ ├─ stop_it.rego
│ ├─ dont_kill_server.rego

After defining our custom workflow, we can specify it as the default workflow for a repo. This is done by setting the repos[].workflow value to the name of our custom workflow; in my case it’s custom.

Next, as part of using the policy check feature in Atlantis, you are required to set the policies values. You can refer to the docs for the full configuration required. Inside the policies key, there is a required policy_sets key that specifies where Atlantis can find the OPA policies to use when running conftest. Usually this is a folder on the local filesystem that already contains the policies, but since we’re using the --update flag, we just need to specify any folder on the local filesystem that is writable by the Atlantis user. You can see in the example above that I’m using /home/atlantis/policy.

Conclusion

That’s all you need to configure to make Atlantis pull policies from S3 (and include Terraform source files in your conftest run). Shoutout to a DoorDash engineering blog post that briefly mentioned pulling their policies from S3 and made me curious how to do the same with Atlantis. You can mention me on Twitter (@pokgak73) if this article has helped you. That would most definitely make my day :)

Getting started with OPA and conftest

Sat, 12 Nov 2022 13:00:00 +0800

I started using OPA at my $dayJob recently and there are some parts that I think are not intuitive for beginners to grok.

If you want to play around with the rules and input, you can use the Rego Playground. It’s super useful and I used it to test out policies and test my hypothesis when playing around with Rego language.

Rego is a declarative language used to write OPA policies. OPA is the engine that takes in the policies written in Rego and evaluates them, producing a set of documents called “rules”. You can use OPA and the Rego language directly to write policies for your config files, but using conftest makes the DX much better. conftest builds on top of OPA and provides some extra functionality that makes OPA easier to use.

Rego Basics

The Rego language focuses on querying the input to look for a given condition. If the input satisfies the query, then it will produce the document.

Variable Assignments

Variable assignment in Rego works the same as in other languages. The expression foo := "hello" assigns the value "hello" to the variable foo.

One difference in Rego is that it implicitly assigns the value true to the document if the given condition evaluates to true. In the example below, there are two ways to write the Rego expression. Since Rego implicitly assigns the value true, we can also omit the explicit := true.

foo := "hello"
# first way: explicitly assigns `true` to `result` when condition is satisfied
result := true if foo == "hello"
# Rego implicitly assigns `true` to `result` when condition is satisfied
result if foo == "hello"

Let’s take this to the next level. Most of the time, the condition you’re checking is not as straightforward as comparing a value against a static value. You might also need to evaluate intermediate expressions and save their values in variables to improve readability. In the previous example we only used a one-liner for the rule body, but you can also write a more complex rule body inside curly braces, as in the “More complex policies” section below.

Declarative Rego Language

The Rego language is declarative and useful to query data structures for any value. Consider the following example (Rego playground link):

Let’s assume our input is an array of objects, each containing the keys “id” and “name”. In this policy we’re checking that the objects don’t have any forbidden value for “name”.

forbidden_names := ["foobar", "john"]
user_forbidden if input.users[i].name == forbidden_names[j]

This code would look something like this in Python:

users = [{"name": "foo"}, {"name": "bar"}]
forbidden_names = ["foobar", "john"]

# Compare every user name against every forbidden name.
matches = []
for i in range(len(users)):
    for j in range(len(forbidden_names)):
        matches.append(users[i]["name"] == forbidden_names[j])

# True if at least one comparison matched.
user_forbidden = any(matches)

For both versions, user_forbidden evaluates to true if one of the user names is included in the forbidden_names list. In the Python code, we used for loops with the any() function to check whether any of the comparisons is true. In the Rego code, we don’t have to write any for loop or iterate through the user list: forbidden_names[j] means “for any of the values in forbidden_names”. So in our Rego code, we essentially tell OPA: if any of the values in input.users is the same as any of the values in forbidden_names, then set the value of user_forbidden to true.

In this case, since we are not using the indexes i and j anywhere else in the policy, we can simplify it further by using _ (underscore) as the index instead. _ is like a throwaway value: we don’t care about the index, we just care whether one of the values is the same in input.users and forbidden_names.

user_forbidden if input.users[_].name == forbidden_names[_]
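For comparison, the closest Python counterpart to this one-line rule is probably a generator expression inside any(). This is only a sketch with made-up input data; like the Rego rule, it asks “does any pair match?” without naming the indexes.

```python
# Sample data (made up for illustration).
users = [{"name": "foo"}, {"name": "john"}]
forbidden_names = ["foobar", "john"]

# True as soon as any user name equals any forbidden name.
user_forbidden = any(
    user["name"] == name
    for user in users
    for name in forbidden_names
)

print(user_forbidden)  # True, because "john" is forbidden
```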

More complex policies

Until now our policies have all been simple one-liners, but Rego also supports writing the rule body across multiple lines. In the example below, we add an exception: the previous rule doesn’t apply to the user with id == 5. So if one of our users is named john but has id == 5, user_forbidden won’t evaluate to true. Note that we use the same index i when accessing the name and id properties. This means we are referring to the same user. If we used _ or a different index for name and id, the two expressions could be satisfied by different users, and the rule could evaluate to true even though john has id == 5.

If any of the expressions inside the rule body evaluates to false or undefined then it will stop evaluating the rule body and return undefined for user_forbidden.

forbidden_names := ["foobar", "john"]
user_forbidden if {
  input.users[i].name == forbidden_names[_]
  input.users[i].id != 5
  false
  print("this will not be printed")
}
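The difference between reusing the index i and using independent indexes can be sketched in Python. This is an approximation of the semantics with made-up users, and it leaves aside the deliberate false line in the rule above.

```python
forbidden_names = ["foobar", "john"]
users = [{"id": 5, "name": "john"}, {"id": 1, "name": "bar"}]

# Same index i in both expressions: both conditions must hold for ONE user.
same_user = any(
    u["name"] in forbidden_names and u["id"] != 5
    for u in users
)

# Independent indexes (like using _ twice): each condition may be
# satisfied by a DIFFERENT user.
different_users = (
    any(u["name"] in forbidden_names for u in users)
    and any(u["id"] != 5 for u in users)
)

print(same_user, different_users)  # False True
```

With the shared index, john is exempt because his own id is 5. With independent indexes, john satisfies the name check and bar satisfies the id check, so the exception silently stops working.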

Using conftest

Previously we used arbitrary names for our rules, but conftest looks for a few specific rule names so that it can detect failed rules and include them in its output. Conftest picks up any rule named deny, warn, or violation, and shows a summary in its output.

➜ tree conftest
conftest
├── input.json
└── policy
    └── names.rego

# input.json
{
  "users": [
    {
      "id": 1,
      "name": "john"
    },
    {
      "id": 2,
      "name": "bar"
    },
    {
      "id": 3,
      "name": "foobar"
    }
  ]
}

# policy/names.rego
package main

import future.keywords.contains
import future.keywords.if

deny contains msg if {
  forbidden_names := ["john"]
  name := input.users[_].name
  name == forbidden_names[_]
  msg := sprintf("username %v is not allowed", [name])
}

warn contains msg if {
  id := input.users[_].id
  id == 2
  msg := sprintf("id %v is not allowed", [id])
}

Run conftest against our input file:

➜ conftest test input.json --policy policy/
WARN - input.json - main - id 2 is not allowed
FAIL - input.json - main - username john is not allowed
2 tests, 0 passed, 0 warnings, 2 failures, 0 exceptions

Note the output here: the deny rule is reported as FAIL when its body is satisfied, while the warn rule is reported as WARN. conftest takes the output values from the OPA engine and formats them for us, which makes them easier to interpret or integrate with other tools. You can also change the output format by passing the --output flag. I like the github output since it prints the results in a format that GitHub Actions understands and surfaces errors in the GitHub UI appropriately. You can also output JSON, which is great if you want to process the results with tools like jq.

➜ conftest test --help
[...]
-o, --output string Output format for conftest results - valid options are: [stdout json tap table junit github] (default "stdout")

JSON output:

➜ conftest test input.json --output json
[
  {
    "filename": "input.json",
    "namespace": "main",
    "successes": 0,
    "warnings": [
      {
        "msg": "id 2 is not allowed"
      }
    ],
    "failures": [
      {
        "msg": "username john is not allowed"
      }
    ]
  }
]
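If jq is not at hand, the same JSON can be processed with a few lines of Python. This is a minimal sketch that assumes the output has exactly the shape shown above; the `summarize` helper is my own, not part of conftest.

```python
import json

# The JSON output from above, pasted in as a string for the example.
raw = """
[
  {
    "filename": "input.json",
    "namespace": "main",
    "successes": 0,
    "warnings": [{"msg": "id 2 is not allowed"}],
    "failures": [{"msg": "username john is not allowed"}]
  }
]
"""


def summarize(results):
    """Collect (level, filename, message) tuples from conftest JSON output."""
    rows = []
    for result in results:
        for level in ("warnings", "failures"):
            # A key may be missing or null when there is nothing to report.
            for item in result.get(level) or []:
                rows.append((level, result["filename"], item["msg"]))
    return rows


rows = summarize(json.loads(raw))
for level, filename, msg in rows:
    print(f"{level}: {filename}: {msg}")

# Useful for deciding whether a wrapper script should fail.
has_failures = any(level == "failures" for level, _, _ in rows)
```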

parsers: using other format as input files

Until now all our input has been in JSON format, but conftest also has built-in parsers that can automatically detect the input format and convert it to JSON for us. As of this moment, here is the list of valid parsers: [cue dockerfile edn hcl1 hcl2 hocon ignore ini json jsonnet properties spdx toml vcl xml yaml dotenv].

Here is an example using HCL2 code, as used by Terraform:

# input.tf
resource "aws_imaginary_resource" "this" {
  name            = "this"
  instance_type   = "r5.4xlarge"
  security_groups = ["12345", "45678"]
}

resource "aws_imaginary_resource" "that" {
  name          = "that"
  instance_type = "t3.medium"
  ingress {
    port = 1234
    cidr = ["0.0.0.0/0"]
  }
}

We can use conftest parse to see how conftest will parse the Terraform file and then write our policy based on the parsed input.

➜ conftest parse input.tf
{
  "resource": {
    "aws_imaginary_resource": {
      "that": {
        "ingress": {
          "cidr": [
            "0.0.0.0/0"
          ],
          "port": 1234
        },
        "instance_type": "t3.medium",
        "name": "that"
      },
      "this": {
        "instance_type": "r5.4xlarge",
        "name": "this",
        "security_groups": [
          "12345",
          "45678"
        ]
      }
    }
  }
}
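Before writing a Rego policy against this structure, it can help to walk it once by hand. The Python sketch below does that for a hypothetical rule of my own choosing, “no ingress open to 0.0.0.0/0”; it assumes the parsed shape shown above, where resources are nested as type, then name, then body.

```python
# The parsed output from above, as a Python dict.
parsed = {
    "resource": {
        "aws_imaginary_resource": {
            "that": {
                "ingress": {"cidr": ["0.0.0.0/0"], "port": 1234},
                "instance_type": "t3.medium",
                "name": "that",
            },
            "this": {
                "instance_type": "r5.4xlarge",
                "name": "this",
                "security_groups": ["12345", "45678"],
            },
        }
    }
}


def open_ingress(parsed):
    """Return the names of resources whose ingress allows 0.0.0.0/0."""
    found = []
    for resources in parsed.get("resource", {}).values():
        for name, body in resources.items():
            # Resources without an ingress block simply have no CIDRs to check.
            cidrs = body.get("ingress", {}).get("cidr", [])
            if "0.0.0.0/0" in cidrs:
                found.append(name)
    return sorted(found)


print(open_ingress(parsed))  # ['that']
```

A Rego version of the same rule would follow the same path through the document: resource, then the resource type, then the resource name, then ingress.cidr.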

Conclusion

I’ve only just got started with OPA, but considering its flexibility when used with conftest, I feel like you can use it for a lot of use cases. The ability to separate the policy logic from your application is powerful, and the declarative nature of the Rego language also helps simplify the policy a lot, as demonstrated in my comparison with the Python code above.

Instrumenting a Slack bot with OpenTelemetry

Sat, 13 Aug 2022 17:43:00 +0800

Note: I’m using pseudocode in the code example in this article to keep the article brief. Please refer to the official Slack and OpenTelemetry documentation for the actual code.

I’ve talked about the basics of OpenTelemetry in my previous article. In this one, I’ll explain more on how we’re integrating OpenTelemetry with our Slack-based application.

At the end of this article, this is roughly what the span lifetime and the events created will look like:

Slack BoltJS Socket Mode

Instead of standard HTTP request/response, we’re using BoltJS with socket mode. This gives us the advantage of not having to expose the application publicly to accept requests from Slack, but it also means that we cannot just use the auto-instrumentation for HTTP developed by the community.

Socket mode uses WebSocket to establish connection to Slack and exchange messages through that connection. There is no official auto-instrumentation support for the ws library that is used by BoltJS socket mode but I found opentelemetry-instrumentation-ws, a 3rd-party library for ws library auto-instrumentation.

I spent a few days integrating it into our application and in the end concluded that the auto-instrumentation provided by opentelemetry-instrumentation-ws is too low-level. Our goal is to track user interactions with the application: when they use the bot, which option they choose, what they were trying to do, and whether the interaction ends successfully or with an error. The library, however, only created spans when a new connection was established between our application and Slack, with no spans or events for user interactions.

So, the conclusion? We’ll instrument the application manually.

Creating and Ending Spans

Since this application is used company-wide, it’s highly likely that multiple users will be using it in parallel. To track user interactions independent from each other, we’ll also need separate spans for each user.

I decided to go with an object spanStore storing the user spans. Like a singleton pattern, a new span will be created for that user if it doesn’t exist yet in spanStore, otherwise it will just return the existing user span.

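const spanStore = {}

// Return the user's active span, creating one on first use.
function getUserSpan(username) {
  if (username in spanStore) {
    return spanStore[username]
  }

  // startSpan() is pseudocode for the actual OTel tracer call
  const span = startSpan(username)
  spanStore[username] = span

  return span
}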

Now that we have a function to create the span, when in the lifetime of the incoming event do we create it? Ideally as early as possible, before anything else, so that we can track everything. BoltJS supports setting a global middleware that is called before the event handler functions are called. This is where I call the getUserSpan() function above. For the first event for that user it creates a new span, and for subsequent events it just returns the existing span.

Next, when do you end the span? Due to how the application works, we’re assuming that each user can only have one session at a time and that a finishing event is always triggered when the user finishes their interaction with the application. Based on that, I wrote an event listener that responds to this finishing event by calling the OTel function to end the span and removing the span from the spanStore object above.

app.event({id: 'finishing_event'}, async ({username}) => {
 span = getUserSpan(username)
 span.end()
 removeSpanFromSpanStore(span)
})

With these, we have a separate span for each user for the whole duration of their interaction with the application. With only one span, though, we don’t yet have insight into what the user is doing or which actions they take, so we’ll need a way to track user actions.

With Slack BoltJS, we can trigger a listener function on every user interaction. I wrote a function that creates a new span event using the user input id as the event name. I also pass in the whole payload so that we can see the payload of user actions later when debugging issues. With this added as another global middleware, we’re now creating a new event for every user action.

app.action('callback_id', async ({username, action_id, payload}) => {
 span = getUserSpan(username)
 span.addEvent(action_id, {payload})
})
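Putting the three pieces above together, the span lifecycle can be sketched in plain Python. The `Span` class here is a hypothetical stand-in rather than the real OTel SDK class, and the user and event names are made up; the point is only to show the create-on-first-use, add-event, and end-and-remove steps.

```python
import time


class Span:
    """Minimal stand-in for an OTel span (not the real SDK class)."""

    def __init__(self, name):
        self.name = name
        self.start_time = time.monotonic()
        self.end_time = None
        self.events = []

    def add_event(self, name, attributes=None):
        self.events.append((name, attributes or {}))

    def end(self):
        self.end_time = time.monotonic()


span_store = {}


def get_user_span(username):
    # Create the span on the user's first event, reuse it afterwards.
    if username not in span_store:
        span_store[username] = Span(username)
    return span_store[username]


def finish_user_span(username):
    # End the span and drop it so the next session starts a fresh one.
    span = span_store.pop(username, None)
    if span is not None:
        span.end()
    return span


first = get_user_span("alice")
first.add_event("button_clicked", {"value": "deploy"})
same = get_user_span("alice")        # returns the existing span
finished = finish_user_span("alice")  # ends it and removes it from the store
print(same is first, finished.end_time is not None, "alice" in span_store)
```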

Confession

I’m actually not convinced that my way of doing this is correct. One of the reasons is that, since I use one root span for a user’s whole interaction, I’m also tracking the time the user takes to perform the next action. From our perspective, this makes the span duration kind of useless, since it includes factors that we can’t control (the time users take to perform the next action).

Instead of one root span and creating new span events for every user interaction, maybe a new span for each interaction, linked to the previous span would be better since we only track the duration that we are in control of, not how long the user takes to click a button.

Nevertheless, since I’ve already implemented it this way, let’s see how it turns out. As the saying goes, you either die a hero, or you live long enough to see yourself become the villain.

OpenTelemetry Basics

Sat, 13 Aug 2022 17:43:00 +0800

I got to work on integrating OpenTelemetry in an application that our team maintains recently so I’m starting a series documenting my learnings throughout this journey.

A little background info on the application I’m working on: it’s a Slack chatbot written in TypeScript using BoltJS. Our goal is to know how many users are using our Slack bot, with a breakdown of the percentage of successful and failed interactions. When an error happens, we also want to know exactly what the user did and the state of the application that caused the error. Based on my reading, that last sentence is exactly what observability promises, so that’s why we’re giving it a try.

OpenTelemetry can be divided into three categories: tracing, logging, and metrics; but I’ll be focusing on tracing in this series.

Tracing Primers

To get started you should know some basic concepts about tracing.

Traces, Spans

A trace consists of multiple spans, and a span is a unit of work with a start and end time. In a span, you can create events that mark when something happened during the lifetime of the span.

A span can also have nested spans, which are called child spans. The parent span usually represents some abstract unit of work, like the lifetime of an HTTP request from when it hits the application until the response is sent. Child spans can be used to get more detail on the operations done during the lifetime of that parent span, e.g. an API call to another service to fetch more information.
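As a mental model, the relationship between spans, events, and child spans can be written down as a couple of Python dataclasses. This is a simplified sketch of the concepts with made-up names and timestamps, not the actual OpenTelemetry data model.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Something that happened at a point in time inside a span."""
    name: str
    timestamp: float


@dataclass
class Span:
    """A unit of work with a start and end time."""
    name: str
    start: float
    end: float
    events: list = field(default_factory=list)
    children: list = field(default_factory=list)

    @property
    def duration(self):
        return self.end - self.start


# Parent span: the whole lifetime of an HTTP request.
request = Span("GET /users", start=0.0, end=0.25)

# Child span: an API call made while handling that request.
api_call = Span("fetch profile", start=0.05, end=0.2)
api_call.events.append(Event("retry", timestamp=0.1))
request.children.append(api_call)

print(request.duration, [child.name for child in request.children])
```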

Span attributes, Status, Errors

To add context to the spans, you can set custom attributes. Ideally, you want to send all the information that will help when debugging your application in the future, so that you don’t have to modify the code and add more attributes later, when you notice an issue and realize that you don’t have enough information to debug it.

If your application encounters an error, you can set the span status to ERROR and also add the stack trace to the span for use in debugging. By default, the span status is Unset, so you only need to set it explicitly when something goes wrong.

Span Exporter

After the span ends, you’ll want to send it to a backend service that will store and process it so that you can use it later. The sending is done by OTel exporters. There are multiple backends available that accept OTel traces as input, such as Jaeger and Zipkin, but for my testing I’m using Honeycomb with the OTLP Collector.

Debugging

For debugging, there’s also the ConsoleSpanExporter, which prints your spans to the console instead of sending them anywhere. I find this very useful for getting fast feedback on what is being sent, but it’s hard to do analysis with it, so in a production environment you should configure the exporter to use another backend instead.

Automatic vs Manual Instrumentation

Now we got the basics out of the way, let’s look at how you can start adding spans to your application to build traces.

The easiest way to get started is auto instrumentation, which automatically injects code into the libraries you’re using (HTTP, DNS, and so on) to create spans and events. In Node.js, this can be done by installing the @opentelemetry/auto-instrumentations-node npm package. This package pulls in several other packages to automatically instrument your application.

This is a nice onboarding experience, but I got overwhelmed by the amount of data sent by these auto instrumentation packages. Therefore, I recommend that you start with manual instrumentation instead.

With manual instrumentation, you’re forced to be intentional with the data that you’re sending to the backend. With this I get to decide which information I want to send over and already have in mind what I want to do with it and which information I would like to gain from it.

Initialization

Whatever approach you end up with for the instrumentation, you’ll want to make sure that you initialize the OTel libraries at the start of your application. This is required because if you start them later, your application might already be handling requests before the OTel libraries are initialized, causing it to miss some requests or, worse, encounter errors.

The recommended way to do it is to use the -r flag from the node command:

-r, --require module: Preload the specified module at startup. Follows require()’s module resolution rules. module may be either a path to a file, or a Node.js module name.

So in your package.json you’ll have to add that to your start command:

"scripts": {
  "start": "node -r ./tracing.js app.js"
}

If you’re using Typescript like me, you’ll want to use the NODE_OPTIONS shell variable to specify the flag instead:

"scripts": {
  "start": "NODE_OPTIONS='-r ./tracing.js' ts-node app.ts"
}

NodeSDK vs NodeTracerProvider Confusion

One thing that confused me is how different the code for initializing auto instrumentation is from the code for manual instrumentation.

This is the code provided by Honeycomb to use auto instrumentation. The key there is the getNodeAutoInstrumentations() function, which registers all the supported auto instrumentation libraries. One more thing to note is that it uses the NodeSDK class.

// tracing.js
("use strict");

const { NodeSDK } = require("@opentelemetry/sdk-node");
const { getNodeAutoInstrumentations } = require("@opentelemetry/auto-instrumentations-node");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-proto");

// The Trace Exporter exports the data to Honeycomb and uses
// the environment variables for endpoint, service name, and API Key.
const traceExporter = new OTLPTraceExporter();

const sdk = new NodeSDK({
 traceExporter,
 instrumentations: [getNodeAutoInstrumentations()]
});

sdk.start()

On the other hand, this is the code example from opentelemetry.io to start manual instrumentation. Notice that it’s not using the NodeSDK class anymore and you need to create the Resource and NodeTracerProvider objects and configure it yourself.

const opentelemetry = require("@opentelemetry/api");
const { Resource } = require("@opentelemetry/resources");
const { SemanticResourceAttributes } = require("@opentelemetry/semantic-conventions");
const { NodeTracerProvider } = require("@opentelemetry/sdk-trace-node");
const { registerInstrumentations } = require("@opentelemetry/instrumentation");
const { ConsoleSpanExporter, BatchSpanProcessor } = require("@opentelemetry/sdk-trace-base");

// Optionally register automatic instrumentation libraries
registerInstrumentations({
 instrumentations: [],
});

const resource = Resource.default().merge(
  new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: "service-name-here",
    [SemanticResourceAttributes.SERVICE_VERSION]: "0.1.0",
  })
);

const provider = new NodeTracerProvider({
 resource: resource,
});
const exporter = new ConsoleSpanExporter();
const processor = new BatchSpanProcessor(exporter);
provider.addSpanProcessor(processor);

provider.register();

TBH I’m still not clear on the difference between using NodeSDK and manually configuring the NodeTracerProvider. When using NodeSDK, does the NodeTracerProvider get configured automatically?

How and when to start tracing?

To start manually instrumenting your application, you’ll have to create a root span. A root span is the first span you create once the request enters your application.

Now, if you have a normal HTTP request/response-based application, it is easy to figure out where to start and end your root spans. All your incoming requests will most likely be handled by a controller and each endpoint will be handled by a method. In this type of application, your root span can be started once the request hits the application in the method in your controller and ends before you send the response.

During the lifetime of that request, you can create child spans to track other work done while processing the request. There’s only one entry point for requests, and exiting the entry point means the request is finished. If your application encounters an error during execution, it can set the span status to ERROR and add the stack trace to the span.
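The shape of such a root span can be sketched without any OTel library at all. In the Python sketch below everything is hypothetical: `root_span`, `handle_request`, and the `finished_spans` list standing in for an exporter are names I made up, and the attribute keys are only illustrative. It shows the span starting when the request enters the handler, ending when the handler exits, and being marked ERROR with a stack trace if an exception escapes.

```python
import time
import traceback
from contextlib import contextmanager

finished_spans = []  # stand-in for an exporter


@contextmanager
def root_span(name):
    span = {"name": name, "start": time.monotonic(), "status": None, "attributes": {}}
    try:
        yield span
    except Exception:
        # Record the failure on the span, then let the error propagate.
        span["status"] = "ERROR"
        span["attributes"]["exception.stacktrace"] = traceback.format_exc()
        raise
    finally:
        # The span ends when the handler exits, whether it succeeded or not.
        span["end"] = time.monotonic()
        finished_spans.append(span)


def handle_request(path):
    with root_span(f"GET {path}") as span:
        span["attributes"]["http.target"] = path
        if path == "/boom":
            raise ValueError("something went wrong")
        return "ok"


handle_request("/users")
try:
    handle_request("/boom")
except ValueError:
    pass

print([s["status"] for s in finished_spans])  # [None, 'ERROR']
```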

Conclusion

Once you’ve managed to create spans, set attributes, and export them to a backend, you’re pretty much done with the basics of instrumenting your application. Go ahead and add more traces to your application!

Generating HCL files using the hclwrite library

Sun, 19 Sep 2021 12:25:34 +0800

HCL is the language used in HashiCorp products such as Terraform and Packer. Usually HCL files are written by hand, but if you want to write or modify them programmatically, you can use hclwrite, a library written in Go.

This blog post is divided into two parts. The first part shows how to create a new block from scratch and save it to a file. This is the foundation for the second part, where we will modify an existing HCL file while making sure its formatting is preserved and no careless changes are made.

I won’t give a full explanation of HCL file syntax because it can be found on this page.

Part 1: Create a new block from scratch

In part 1, we will learn how to:

create a new block
add attributes to the block
save the created block to a file

For the first example, we will try to generate the HCL block below:

resource "github_membership" "user" {
  username = "github_username"
  role     = "member"
}

This is the code needed to generate that block:

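// mlId and githubUsername are variables holding the block label and the GitHub username.
newMemberBlock := hclwrite.NewBlock("resource", []string{"github_membership", mlId})
body := newMemberBlock.Body()
body.SetAttributeValue("username", cty.StringVal(githubUsername))
// The role must be "member" to match the target block above.
body.SetAttributeValue("role", cty.StringVal("member"))

// Put the block into a new, empty file and write the formatted result to disk.
f := hclwrite.NewEmptyFile()
f.Body().AppendBlock(newMemberBlock)
f.Body().AppendNewline()
ioutil.WriteFile("data/result_members.tf", hclwrite.Format(f.Bytes()), 0644)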

Creating the block

First we need a block to hold the rest of the content. It can be created with the hclwrite.NewBlock() function. The first parameter is the block type name, followed by the labels for the block. For the example block above, the type name we need is “resource” and the labels are “github_membership” and “user”.

Next we can start filling in the block we just created. In the example above, the block contains the attributes “username” and “role” with their respective values. We can set attributes on the block’s body with the SetAttributeValue() function.

The attribute name is a plain string, but for the attribute value hclwrite uses the cty library (pronounced: see-tie) so that the value keeps the correct type once processing is done. To pass in a string value, we can use cty.StringVal(), which converts a regular Go string into the equivalent cty value.

Saving the block to a file

f := hclwrite.NewEmptyFile()
f.Body().AppendBlock(newMemberBlock)
f.Body().AppendNewline()
ioutil.WriteFile("data/result_members.tf", hclwrite.Format(f.Bytes()), 0644)

That completes the first step, creating the block in code. Next, we need to save the block to a file. To keep things simple, this time we start from a new, empty file, which we get from hclwrite.NewEmptyFile(). This function effectively gives us a blank canvas to fill with the blocks we build.

To add a block to the file, we cannot add it directly to the File object returned by NewEmptyFile. All content in a file has to go into the file’s Body, which we can access through the Body() function.

Next, we add the block we prepared in the previous section to that Body using AppendBlock. To keep the output tidy, we also add an empty line at the end of the file using AppendNewline.

Finally, to save everything we have generated to a file, we can use ioutil.WriteFile(). We pass in the file content by converting it to bytes. hclwrite also has a Format function that makes sure the file follows the recommended formatting for HCL files. After that, you can check the generated HCL file at the location passed to WriteFile.

Part 2: Modifying an existing block

In part 2, we will learn how to:

read and parse an existing HCL file
find the part we want to change
add the desired change using Tokens
tell a Traversal apart from a Value

The file we want to produce looks like this:

module "team_itsm_team" {
  source = "../../modules/github/team_nx"

  team_name = "ITSM Team"

  members = [
    github_membership.kasan.username,
    github_membership.mismail.username, // *we want to add this line
  ]
}

content, _ := ioutil.ReadFile("data/" + pod + ".tf")
f, _ := hclwrite.ParseConfig(content, "", hcl.InitialPos)

block := f.Body().FirstMatchingBlock("module", []string{"team_" + pod + "_team"})

oldMembers := block.Body().GetAttribute("members").Expr().BuildTokens(nil)
newEntry := hclwrite.NewExpressionAbsTraversal(
    hcl.Traversal{
        hcl.TraverseRoot{Name: "github_membership"},
        hcl.TraverseAttr{Name: mlId},
        hcl.TraverseAttr{Name: "username"},
    },
).BuildTokens(nil)

newMembers := append(
    oldMembers[:len(oldMembers)-2],
    &hclwrite.Token{Type: hclsyntax.TokenNewline, Bytes: []byte{'\n'}},
)
newMembers = append(newMembers, newEntry...)
newMembers = append(newMembers, hclwrite.Tokens{
    &hclwrite.Token{Type: hclsyntax.TokenComma, Bytes: []byte{','}},
    &hclwrite.Token{Type: hclsyntax.TokenNewline, Bytes: []byte{'\n'}},
    &hclwrite.Token{Type: hclsyntax.TokenCBrack, Bytes: []byte{']'}},
}...)

block.Body().SetAttributeRaw("members", newMembers)
ioutil.WriteFile("data/result_itsm.tf", hclwrite.Format(f.Bytes()), 0644)

Reading and parsing an existing HCL file

This time we won’t start with an empty file; instead we take an existing HCL file.

content, _ := ioutil.ReadFile("data/" + pod + ".tf")
f, _ := hclwrite.ParseConfig(content, "", hcl.InitialPos)

We use ReadFile to read the whole file. It returns the content as a []byte, which we pass to hclwrite.ParseConfig(). This function is responsible for understanding the existing syntax of the HCL file, which lets us modify the file precisely. It returns an hclwrite.File object, just like hclwrite.NewEmptyFile() in part 1. It also returns diagnostics, which the example discards with _; real code should check them.

Finding the part we want to change

There are several ways to find the specific part we want to change. One of them is the FirstMatchingBlock() function. We specify the type of the block we are looking for, followed by the labels on that block.

block := f.Body().FirstMatchingBlock("module", []string{"team_" + pod + "_team"})

Adding the desired change

oldMembers := block.Body().GetAttribute("members").Expr().BuildTokens(nil)
newEntry := hclwrite.NewExpressionAbsTraversal(
    hcl.Traversal{
        hcl.TraverseRoot{Name: "github_membership"},
        hcl.TraverseAttr{Name: mlId},
        hcl.TraverseAttr{Name: "username"},
    },
).BuildTokens(nil)

newMembers := append(
    oldMembers[:len(oldMembers)-2],
    &hclwrite.Token{Type: hclsyntax.TokenNewline, Bytes: []byte{'\n'}},
)
newMembers = append(newMembers, newEntry...)
newMembers = append(newMembers, hclwrite.Tokens{
    &hclwrite.Token{Type: hclsyntax.TokenComma, Bytes: []byte{','}},
    &hclwrite.Token{Type: hclsyntax.TokenNewline, Bytes: []byte{'\n'}},
    &hclwrite.Token{Type: hclsyntax.TokenCBrack, Bytes: []byte{']'}},
}...)

block.Body().SetAttributeRaw("members", newMembers)

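We get the attribute we want to change through GetAttribute(). The value of this attribute is an expression, and to modify it we need to convert it to Tokens. The code above drops the last two tokens of the old list (the final newline and the closing bracket), appends a newline and the new entry followed by a comma, and then puts the newline and closing bracket back.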
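To make the token splice easier to follow, here is a minimal sketch in plain Python that mimics it with a list of strings in place of hclwrite tokens. It is only an illustration: the function name append_list_entry and the string “tokens” are hypothetical stand-ins, not part of hclwrite, and it assumes, like the Go code, that the list ends with a newline followed by the closing bracket.

```python
def append_list_entry(old_tokens, new_entry_tokens):
    """Return a new token list with new_entry_tokens added as the last list item.

    Assumes old_tokens ends with a newline token followed by "]".
    Strings stand in for hclwrite tokens; this is not the hclwrite API.
    """
    # Drop the final newline and the closing bracket.
    head = old_tokens[:-2]
    # Start a new line, add the entry, then restore the comma, newline, and bracket.
    return head + ["\n"] + new_entry_tokens + [",", "\n", "]"]


old = ["[", "\n", "github_membership.kasan.username", ",", "\n", "]"]
new = append_list_entry(old, ["github_membership", ".", "mismail", ".", "username"])
print("".join(new))
```

Running it prints the list with the new member on its own line before the closing bracket, which is the shape that SetAttributeRaw receives in the Go version.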

What is a Token?

TODO

The difference between a Traversal and a literal Value

A Traversal is used to refer to another variable in the HCL file. A literal value does not refer to any other part of the file or project; it stands on its own.

Saving the modified file

ioutil.WriteFile("data/result_itsm.tf", hclwrite.Format(f.Bytes()), 0644)

Conclusion

Manipulating HCL files with the hclwrite library is more complex than editing them by hand, but if this is something you have to do every day, it may be worth spending a few days building this automation so that the same task no longer needs manual intervention from you.

Auto-updating Covid-19 graphs using GitHub Actions

Thu, 02 Sep 2021 09:56:10 +0800

In my previous blog post I explained how I made an animated graph of the vaccination progress in the states of Malaysia. I also shared the basics of using GitHub Actions. In this blog post I want to explain how I use GitHub Actions to automatically update the graph every day with the latest data released by CITF Malaysia.

Full configuration

Before I start explaining, here is the GitHub Actions workflow file that I use:

name: Update Graphs

on:
  push:
    branches: [main]
  workflow_dispatch:
  schedule:
    - cron: "0 0 * * *" # Run workflow everyday at 12 AM

jobs:
  vax-count-by-state:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          token: ${{ secrets.GITHUB_TOKEN }}

      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"

      - name: Cache pip
        uses: actions/cache@v2
        with:
          # This path is specific to Ubuntu
          path: ~/.cache/pip
          # Look to see if there is a cache hit for the corresponding requirements file
          key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}
          restore-keys: |
            ${{ runner.os }}-pip-
            ${{ runner.os }}-

      - name: Install dependencies
        run: pip3 install -r requirements.txt

      - name: Fetch latest data & generate new graph
        run: python3 main.py

      - id: get-date
        run: echo "::set-output name=value::$(date --iso-8601)"

      - uses: stefanzweifel/git-auto-commit-action@v4
        with:
          commit_message: "bot: update graph for ${{ steps.get-date.outputs.value }}"

Sections

I will split my explanation into several parts:

Auto-update schedule
Updating the graph
Committing & pushing the update to the repository

Part 1: Auto-update schedule

In this part I will show how I set up GitHub Actions to run the update automatically every day.

In my explanation of the GitHub Actions basics, I mentioned that a workflow can be triggered by various events from GitHub. One of the supported events is running the workflow on a set schedule. For this we need the schedule keyword under the top-level on keyword, as in the example below:

on:
  schedule:
    - cron: "0 0 * * *" # Run workflow everyday at 12 AM

The schedule keyword accepts a schedule in cron syntax. If you know your way around a UNIX or Linux system, you may already know cron. For those who don’t, a cron expression has 5 fields separated by at least one whitespace character such as a space or tab. From left to right the fields mean the following, with the accepted values in square brackets next to each:

minute [0 to 59]
hour [0 to 23]
day of the month [1 to 31]
month of the year [1 to 12]
day of the week [0 to 6], starting with 0=Sunday, 1=Monday, and so on up to 6=Saturday

The special value * can be used to mean every value for that field. In my workflow file the cron schedule is “0 0 * * *”, which means, “Run this workflow at 0:00 (midnight) on every day of the month, in every month, regardless of the day of the week”. GitHub Actions evaluates schedules in UTC, so this is midnight UTC rather than local time. Cron syntax can be confusing at times, so I recommend the crontab.guru site for checking and experimenting with it.
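If it helps to see the five fields at work, here is a small Python sketch that checks whether a given moment matches a cron expression. It is a simplified illustration and not how GitHub or cron itself is implemented: the function names are made up, and it only understands *, single numbers, and comma-separated lists (no ranges or step values), treating every field as a condition that must hold.

```python
from datetime import datetime


def field_matches(field, value):
    """Check one cron field; supports "*", single numbers, and comma lists only."""
    if field == "*":
        return True
    return value in {int(part) for part in field.split(",")}


def cron_matches(expression, moment):
    """Return True if a simplified 5-field cron expression matches the datetime."""
    minute, hour, day_of_month, month, day_of_week = expression.split()
    # Python counts Monday as 0, while cron counts Sunday as 0.
    cron_weekday = (moment.weekday() + 1) % 7
    return (
        field_matches(minute, moment.minute)
        and field_matches(hour, moment.hour)
        and field_matches(day_of_month, moment.day)
        and field_matches(month, moment.month)
        and field_matches(day_of_week, cron_weekday)
    )


print(cron_matches("0 0 * * *", datetime(2021, 9, 2, 0, 0)))   # midnight -> True
print(cron_matches("0 0 * * *", datetime(2021, 9, 2, 9, 56)))  # 09:56 -> False
```

For anything beyond a quick sanity check, crontab.guru remains the easier tool.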

Part 2: Updating the graph

In the previous blog post I explained the code I use to generate the new animated graph, so we will use the same script here. However, before running the Python script to generate the graph from the new data, we need to set up all the software the script requires on the GitHub Actions runner.

For that I use [actions/setup-python] to set up Python on the runner and then install the other dependencies. Only the last step shown here is where the work actually happens. Here is the code:

- uses: actions/checkout@v2
  with:
    token: ${{ secrets.GITHUB_TOKEN }}

- uses: actions/setup-python@v2
  with:
    python-version: "3.8"

- name: Cache pip
  uses: actions/cache@v2
  with:
    # This path is specific to Ubuntu
    path: ~/.cache/pip
    # Look to see if there is a cache hit for the corresponding requirements file
    key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}
    restore-keys: |
      ${{ runner.os }}-pip-
      ${{ runner.os }}-

- name: Install dependencies
  run: pip3 install -r requirements.txt

- name: Fetch latest data & generate new graph
  run: python3 main.py

Part 3: Committing and pushing the update to the repository

After the new graph has been created from the latest data in the CITF-public repo, it is ready but not yet shown on the website https://pokgak.github.io/citf-graphs because it has not been committed to the repository.

Normally I would commit manually and push to GitHub, but since the whole process above runs automatically in GitHub Actions, that is no longer an option. So I use the stefanzweifel/git-auto-commit-action Action to commit automatically. Here is the part of my workflow file that shows how this Action is used:

- uses: stefanzweifel/git-auto-commit-action@v4
  with:
    commit_message: "bot: update graph for ${{ steps.get-date.outputs.value }}"

As you can see, using this Action is simple. We just use the uses keyword to indicate that we want to use an external Action in this workflow file, followed by the name of the Action. The list of available Actions can be browsed in the GitHub Actions Marketplace. On top of that, you can also write your own Actions!
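The commit message relies on steps.get-date.outputs.value, which the get-date step in the full workflow sets with the ::set-output command. GitHub has since deprecated that command in favour of appending name=value lines to the file named by the GITHUB_OUTPUT environment variable. Below is a minimal sketch of how a Python step could set the same output that way. The helper name set_output is made up for this example, and printing the line when GITHUB_OUTPUT is missing is just my choice to make it runnable outside a runner.

```python
import os
from datetime import date


def set_output(name, value):
    """Append a step output as a name=value line to the GITHUB_OUTPUT file."""
    output_path = os.environ.get("GITHUB_OUTPUT")
    line = f"{name}={value}"
    if output_path is None:
        # Not running inside GitHub Actions; just show what would be written.
        print(line)
        return line
    with open(output_path, "a", encoding="utf-8") as output_file:
        output_file.write(line + "\n")
    return line


# Same ISO 8601 date format (YYYY-MM-DD) that `date --iso-8601` prints.
set_output("value", date.today().isoformat())
```

A step that runs this script with id: get-date would expose the date as steps.get-date.outputs.value, the same expression used in the commit message above.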

Conclusion

Since setting up this workflow, I no longer have to manually make sure the graphs I produce at pokgak/citf-graphs are kept up to date with the latest data; everything happens automatically. Since then, GitHub also shows me as active every day, even though in reality it’s all just a bot :p

What is Cloud Computing?

Wed, 25 Aug 2021 23:47:19 +0800

The word “cloud” is probably familiar to young people today, but it mostly refers to “cloud storage”, that is, online data storage services. This time I want to explain the concept of cloud computing from a programmer’s point of view.

What is the cloud?

‘The cloud is just someone else’s computer’ is a meme you will find all over the internet. And that is really what it is: the photos you upload to iCloud, the Twitter website you visit every day, and the videos you watch on YouTube are all hosted on computers lined up in large data centers around the world.

Okay, but the cloud is more than a pile of computers covering dozens of football fields. There are a few important characteristics that make something cloud computing.

Characteristics of Cloud Computing

These are, for me, the main characteristics of cloud computing:

on-demand resources
short feedback cycle
infinitely scalable
global availability

On-Demand Resources

This is the main advantage of using cloud computing services. Any resource needed to deploy your application, whether a server, a storage system, or a database, can usually be provisioned in under 30 minutes. So once the application has been built, it can be ready for deployment in a very short time.

In the article Alkisah Syarikat A (“The Tale of Company A”) I tried to illustrate how this process used to work. Many tasks had to be done before a resource was ready to use. Cloud computing takes these tasks over from Company A.

Short Feedback Cycle

One benefit of being able to provision resources on demand is the ability to react to and resolve capacity shortages right away. Previously, provisioning new resources took a long time, so resources for an application were always provisioned larger than needed, just so there was room to grow before new resources had to be ordered.

With short provisioning times, servers no longer need to be provisioned beyond what is required. If the application suddenly gets high traffic, new resources to handle the load can be provisioned right away.

Infinitely Scalable

Hyperscaler cloud platforms are said to have infinite resources. In reality, no resource can be called infinite because the world’s natural resources will run out someday. Infinite here means that these hyperscaler platforms can grow faster than the combined resource demand of all their users.

Global Availability

In theory, a resource deployed from a data center in Malaysia can serve requests from all over the world. In practice, users on the opposite side of the world from where the application is hosted will feel the effect, mainly as slower responses.

So, to grow an application to a global scale, companies need to set up servers around the world so that user requests can be served from the data center closest to them. Technologies such as edge computing exist precisely to reduce the time needed to serve user requests by placing servers as close to users as possible.

But setting up servers in a new location is not easy for a company, especially without a physical presence in that country. Matters such as regulation and bill payments become more complicated because they involve cross-border transactions.

Hyperscaler cloud platforms such as AWS, Azure, Oracle, and others make this process easier. They take care of all the physical aspects. As a customer, we only need to pay for the resources, without having to think about the rest.

Conclusion

In my view, the four characteristics above are the defining traits of a cloud computing platform. By understanding them, we can use cloud computing as a tool to solve the problems we face in delivering our applications.

This understanding is very important because without knowing how cloud computing can help solve the problems we face, we are like a cow led by the nose, simply following whatever trend others feed us. In the end, time and money are wasted without any benefit to us.

Interactive animation based on CITF data using Plotly

Wed, 25 Aug 2021 04:59:55 +0800

I enjoy browsing the r/dataisbeautiful subreddit and looking at the graphs other Reddit users make. One of my favorite kinds is the animated graph that changes as time progresses, so that we can see how the data develops from start to finish.

A recent post on that subreddit with this kind of graph is the one below, which shows vaccination rates for a selection of countries around the world (sadly Malaysia is not included):

I used to think animations like this were complicated to make, but when CITF launched a public repo on GitHub for Malaysia’s vaccination data, I decided to try recreating this style of visualization with that data.

Next I will explain the steps needed to produce a visualization like the one below. For reference, the full code I used is available here.

Data Cleaning

In projects that involve data like this, the data can come from many sources and in many shapes. That is why the first step is usually data cleaning. The goal of this step is to end up with data in a suitable format that can be used directly in the next step without any extra processing.

I was lucky this time because the data provided by CITF Malaysia is already in CSV format, which is easy to read with pandas, a library for manipulating data in Python. CITF does not offer a public REST API for fetching the data, so I had to pull it from GitHub. This approach is less suitable if you want to filter the data before downloading it, but for my purposes it is good enough.

from io import StringIO
import pandas as pd
import requests

STATE_DATA_URL = "https://raw.githubusercontent.com/CITF-Malaysia/citf-public/main/vaccination/vax_state.csv"
df = pd.read_csv(StringIO(requests.get(STATE_DATA_URL).text))

The read_csv function takes the data fetched from GitHub and converts it into a DataFrame, the format used by the pandas library. A DataFrame is two-dimensional, much like an Excel sheet. It has rows and columns that hold data and offers functions to manipulate that data (merge, split, transpose, etc.) easily. Here is the code I used to prepare the raw data for visualization:

state_data = (
    df.set_index(["date", "state"])
    .loc[:, ["cumul_partial", "cumul_full", "cumul"]]
    .rename(columns={"cumul_partial": "partially_vaxed", "cumul_full": "fully_vaxed"})
    .sort_values(by="cumul", ascending=False)
    .sort_index(level="date", sort_remaining=False)
    .reset_index()
)

In short,

set_index: I set the “date” and “state” columns as the DataFrame index, which I use later to separate the vaccination data by date and state
loc: select only the columns I want
rename: give the columns new names that are easier to understand
sort_values: sort all vaccination data by the cumulative total (“cumul”), highest first
sort_index: sort all vaccination data by date
reset_index: turn the index columns from step 1 back into regular columns that can be used normally
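The sort_values followed by sort_index combination appears intended to order the rows by date while keeping the states within each date ranked by cumulative total. Here is a rough sketch of that idea in plain Python rather than pandas, using a handful of made-up rows (not real CITF numbers). It relies on Python’s list sort being stable, so sorting by date second keeps the earlier ranking within each date.

```python
# Invented sample rows for illustration only; these are not CITF figures.
rows = [
    {"date": "2021-08-01", "state": "Johor", "cumul": 300},
    {"date": "2021-08-02", "state": "Johor", "cumul": 350},
    {"date": "2021-08-01", "state": "Selangor", "cumul": 900},
    {"date": "2021-08-02", "state": "Selangor", "cumul": 950},
]

# Step 1: rank every row by cumulative total, highest first.
by_count = sorted(rows, key=lambda row: row["cumul"], reverse=True)
# Step 2: stable sort by date, so the ranking from step 1 survives within each date.
by_date = sorted(by_count, key=lambda row: row["date"])

for row in by_date:
    print(row["date"], row["state"], row["cumul"])
```

The output lists both states for the first date, largest total first, and then both states for the second date in the same order.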

To learn more about the functions I used here, refer to the pandas API Reference.

Data Visualization with Plotly

Plotly is a library that offers functions to make it easier to produce interactive visualizations. It is available for Python, R, and JavaScript. I had the chance to use Plotly in Python to produce the visualizations for my bachelor thesis, and in my experience it is very easy to experiment and produce attractive graphs with this library.

A particularly nice feature of Plotly is Plotly Express. For most visualization needs, Plotly Express is smart enough to interpret the data it is given and produce the visualization you want. Here is the code I used to produce the animated graph shown at the start of this blog post:

fig = px.bar(
    state_data,
    x="state",
    y=["partially_vaxed", "fully_vaxed"],
    animation_frame="date",
    animation_group="state",
    labels={"value": "Total vaccinated", "state": "", "variable": "Dose Type"},
    title="Vaccination Count in Malaysia by State",
)

As you may have noticed, I only use a single function from Plotly Express, bar, which produces a bar chart. As its first argument, I pass the vaccination data that has been cleaned and converted to a DataFrame. Through the x and y parameters, I specify which columns of the DataFrame are used for the X axis and the Y axis of the graph.

Next, to produce the animation, I use the animation_frame parameter with the “date” column as its value. With this parameter, Plotly produces one graph for each value in that column. So when I use the “date” column, Plotly produces one graph for each date in the vaccination data. To create the animation, these graphs are ordered by date and shown like a slideshow. The end result is the progress of the vaccination rate over time.

The animation_frame parameter is enough to animate the vaccination progress, but the animation looks choppy rather than smooth. That is why I also use the animation_group parameter. With this parameter, Plotly tries to smooth the transition between two consecutive graphs generated from the animation_frame column. In a bar chart, Plotly shows a bar moving when its position changes. With this, the animation becomes smoother.
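To picture what these two parameters do, here is a rough sketch in plain Python. It is not how Plotly is implemented, just the idea: the rows are split into one frame per date, and within each frame the state name is the key that lets a bar be matched with the same bar in the next frame. The rows are made-up numbers for illustration.

```python
from collections import defaultdict

# Invented sample rows (date, state, total); not real CITF figures.
rows = [
    ("2021-08-01", "Johor", 300),
    ("2021-08-01", "Selangor", 900),
    ("2021-08-02", "Johor", 350),
    ("2021-08-02", "Selangor", 950),
]

# animation_frame="date": one frame (one bar chart) per distinct date.
frames = defaultdict(dict)
for day, state, total in rows:
    # animation_group="state": the state identifies the same bar across frames.
    frames[day][state] = total

ordered_days = sorted(frames)
for previous, current in zip(ordered_days, ordered_days[1:]):
    for state, total in frames[current].items():
        change = total - frames[previous][state]
        print(f"{previous} -> {current}: {state} grows by {change}")
```

Without the grouping key there would be no way to tell which bar in one frame corresponds to which bar in the next, and that correspondence is what a smooth transition needs.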

Finally, the labels and title parameters are used to set more reader-friendly labels for the legend and axes, as well as the title of the graph.

Conclusion

I am very happy with this animated graph because I learned how to produce a type of graph I have liked for a long time. That said, although the animation makes this graph look nicer than other graphs, I admit that what I have made is more of an exercise in using the Plotly library itself. There are still many aspects that could be improved to convey information through graphs accurately and effectively.

To access all the code I have shown here, see the pokgak/citf-graphs repository on GitHub. I have also set up a schedule so that the graph is updated every day using GitHub Actions. A blog post on how I did that is coming.

Introduction to GitHub Actions

Tue, 24 Aug 2021 04:49:42 +0800

GitHub Actions (GA) is an automation service that GitHub offers to all its users. If you have a public repository on GitHub, you can start using GitHub Actions right now without paying anything!

How do I get started with GitHub Actions?

To start using GitHub Actions, go to any public repository you own and open the Actions tab.

If you have never set up a workflow in that repository, you will see a selection of ready-made templates for various kinds of projects. As a beginner, I suggest you start with the barebones template on offer.

You can use a local editor on your own computer, but GitHub also offers an online editor that checks the format of your workflow file live as you type. GitHub will highlight any mistakes that would cause your workflow to fail. The online editor also shows brief documentation on the workflow file syntax at the side, so you do not have to keep switching tabs while writing your workflow file.

Anatomy of a GitHub Actions workflow file

I have mentioned the “workflow file” several times in the previous paragraphs but have not yet explained what it is. GitHub Actions uses workflow files to define how the automation is carried out. The file is written in YAML. One important thing I want to highlight here is that YAML is whitespace-sensitive, which means the indentation in your workflow file must use spaces (not tabs) and be consistent; two spaces per level is the common convention.

Before we begin, this is the final content of our example workflow file:

jobs:
  job-pertama:
    runs-on: ubuntu-latest
    steps:
      - run: echo Hello, world!

      - name: Selamat tinggal dunia
        run: echo Bye, world!

      - uses: actions/checkout@v2

Follow my explanation below to understand what happens when this workflow runs.

Keywords in a GitHub Actions workflow file

Your workflow file has two required top-level keywords: on and jobs.

The `on` keyword

An important characteristic of GitHub Actions is that your workflow has to be started by “triggers”. Almost any activity you can do manually on GitHub can be used as a trigger for your workflow. For example, you can set a workflow to run when someone pushes code to the repo, or when a new pull request is opened. This is how you set up both examples:

on:
  push:
  pull_request:

The on keyword indicates that all the keywords below it are events on which your workflow should run. push means that when someone pushes code to your repo, GitHub Actions runs the workflow file. pull_request means that when someone opens a new pull request (PR) in your repository, the workflow file runs.

Both the push and pull_request keywords also accept sub-keywords to filter more specifically when the workflow should run. One of these is branches, which limits the trigger to pushes or pull requests targeting the listed branches. You can also filter by the location of the changed files in the repo using the paths sub-keyword.

There are many more keywords you can use to trigger your workflow; if you are interested, go to this page and this one to read more.

The `jobs` keyword

Okay, we have defined when the workflow should run using the on keyword. Next we define what the workflow should do using the jobs keyword. A workflow must have at least one job. To create a new job, you can use any word as its id, as long as it contains no spaces. For example:

jobs:
  job-pertama:
    runs-on: ubuntu-latest

Here, job-pertama (“first job”) is the id of our job. Next, every job has to specify the environment it runs in. GitHub Actions offers Windows, Linux, and macOS platforms that you can use for free. The full list of supported versions is on this page. Here I use ubuntu-latest, which means the job runs on the latest Ubuntu platform (at the time of writing, Ubuntu 20.04).

With the platform set, it is time to list what our workflow should do. For that we need the steps keyword. Like the jobs keyword, steps contains a list of entries, one for each thing we want to run.

Each step starts with the - symbol. In YAML syntax, this indicates that all the keywords under one - belong together. The run keyword runs a command as if you were in a terminal on the platform chosen with the runs-on keyword earlier.

jobs:
  job-pertama:
    runs-on: ubuntu-latest
    steps:
      - run: echo Hello, world!

      - name: Selamat tinggal dunia
        run: echo Bye, world!

Dalam contoh di atas, saya telah menetapkan step itu untuk run command echo. Command ini akan print perkataan selepas itu ke terminal anda, dalam kes ini anda akan melihat “Hello, world” di log result workflow anda nanti. Dalam contoh di atas juga, saya telah menetapkan workflow ini untuk run command echo tapi kali ini dengan perkataan lain pula. Selain daripada keyword run, setiap step juga boleh ditetapkan dengan keyword-keyword lain seperti name, id dan pelbagai lagi. Senarai penuh boleh anda lihat di halaman ini. Fungsi simbol - di sini adalah untuk membantu mengumpul semua keyword yang berkaitan dengan step itu. Setiap simbol - bermakna satu step dalam job itu.

Keyword `uses`

We have seen how to run any command using the run keyword. For some tasks, relying on commands alone may limit what you can do. For that reason, GitHub Actions can also call code that lives outside your workflow file. This code can come from the same repo or from another developer's repo on GitHub.

These Actions can be written in several ways, either in JavaScript or as a Docker container. GitHub also provides a marketplace where you can find Actions suitable for your workflow file. GitHub itself maintains a few essential Actions, such as checkout, which checks out your git repo when the workflow runs, and setup-node, which sets up your Node/JavaScript environment.

To use an Action, you need the uses keyword followed by the name of the Action you want to use. Most Actions also define their own inputs (passed through the with keyword) that control how the Action runs.

jobs:
  job-pertama:
    runs-on: ubuntu-latest
    steps:
      - run: echo Hello, world!

      - name: Selamat tinggal dunia
        run: echo Bye, world!

      - uses: actions/checkout@v2

In the example above, I use GitHub's actions/checkout Action to git checkout my repo when the workflow runs. The @v2 at the end indicates which version of the Action I want to use. The versions an Action offers can be checked on its Releases page.
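As a small sketch of how an Action's own settings are passed in, the fragment below adds the setup-node Action with its node-version input under the with keyword. The version "14" and the final node --version step are assumptions chosen for illustration, not something a real project must use.

```yaml
jobs:
  job-pertama:
    runs-on: ubuntu-latest
    steps:
      # check out the repo so later steps can see its files
      - uses: actions/checkout@v2

      # inputs for an Action go under the with keyword
      - uses: actions/setup-node@v2
        with:
          node-version: "14"   # assumed version, pick the one you need

      # confirm which Node version was installed
      - run: node --version
```

Which inputs are available differs per Action, so check the README of the Action you are using.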

Conclusion

I have used Jenkins and Bitbucket Pipelines, and in my experience GitHub Actions is far better than both of those CI/CD products. The GitHub Actions documentation provided by GitHub is very thorough. The page I referred to most while learning GitHub Actions was Workflow Syntax. The other pages in this Reference are also very helpful once you want to start doing more advanced things with GitHub Actions.

Some of the automation I have built with GitHub Actions includes running unit tests on every pushed commit, and automatically checking and fixing pull request titles that do not meet our criteria. I have also used a GitHub Actions workflow to take a DB dump from a server and upload it straight to S3. In my view, GitHub Actions is very compelling and there is a lot you can do with it.
