<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Aiman Ismail</title>
    <link>https://pokgak.xyz</link>
    <description>Articles by Aiman Ismail</description>
    <atom:link href="https://pokgak.xyz/index.xml" rel="self" type="application/rss+xml"/>
    <item>
      <title>Scale down your instances, cut down your AWS bills</title>
      <link>https://pokgak.xyz/articles/scale-down-your-instances/</link>
      <guid>https://pokgak.xyz/articles/scale-down-your-instances/</guid>
      <pubDate>Tue, 29 Apr 2025 06:14:00 GMT</pubDate>
      <description>&lt;p&gt;I joined a webinar from Platformatic on best practices for running Node.js services in production, and one of the suggestions was to give each pod one full CPU core. We did exactly that and all went well: our latency metrics seemed to improve, case closed. Little did I know that at the end of the month I would get hit by a 3x increase in our monthly AWS bill. So I started optimizing.&lt;/p&gt;
&lt;h2&gt;Know your workload patterns&lt;/h2&gt;
&lt;p&gt;Our company mainly serves restaurants, so the traffic to our services has a predictable pattern: it starts ramping up around breakfast time and peaks around lunch and dinner. We have a few customers that operate until midnight, but they&amp;#39;re the minority, so during those hours incoming traffic to our services is at its minimum. I noticed this pattern as a possible optimization area when I first joined but never got to it until now.&lt;/p&gt;
&lt;p&gt;Our workloads run on AWS EKS, mainly on statically provisioned managed node groups with no cluster-autoscaler configured to add or remove nodes. There is also no HorizontalPodAutoscaler (HPA) set up for our Pods. This decision was made mainly to simplify operations: just add more replicas and more nodes when the pods are getting overwhelmed by requests. It worked at a smaller scale since the AWS bill was still manageable, but we went through a growth phase recently and our infra costs jumped too. We figured that we had to tackle this now instead of pushing it back for later.&lt;/p&gt;
&lt;h2&gt;Know your tools&lt;/h2&gt;
&lt;p&gt;When talking about autoscaling in Kubernetes-land, there&amp;#39;s the HorizontalPodAutoscaler (HPA). HPA allows you to scale your pods based on CPU, memory or, in more recent versions, any custom metric. Custom metrics autoscaling works, but the UX leaves much to be desired. That&amp;#39;s why I chose to use KEDA instead.&lt;/p&gt;
&lt;p&gt;KEDA builds on top of HPA, providing a simplified interface. You also get more options to scale on metrics from almost anywhere using the available &lt;a href=&quot;https://keda.sh/docs/2.17/scalers/&quot;&gt;Scalers&lt;/a&gt;. For my use case, I&amp;#39;ll be using the &lt;a href=&quot;https://keda.sh/docs/2.17/scalers/cron/&quot;&gt;Cron&lt;/a&gt; and &lt;a href=&quot;https://keda.sh/docs/2.17/scalers/prometheus/&quot;&gt;Prometheus&lt;/a&gt; scalers.&lt;/p&gt;
&lt;p&gt;Another reason to use KEDA is that it supports scaling pods down to 0 and back up. This is called &lt;a href=&quot;https://keda.sh/docs/2.17/concepts/scaling-deployments/#activating-and-scaling-thresholds&quot;&gt;Activation&lt;/a&gt; in KEDA, where the replica count goes from 0 to 1 and vice versa. This doesn&amp;#39;t apply to our use case since we still have to keep some pods running at midnight, but if your workload allows it, this will definitely give you more savings.&lt;/p&gt;
&lt;h3&gt;Setting up the cron scaler&lt;/h3&gt;
&lt;p&gt;As mentioned above, we have a regular and predictable traffic pattern for our workloads: it starts at 7AM and peaks during lunch and dinner time. So the cron scaler fits this use case perfectly. We set a schedule to scale up our services during those hours and, for the rest of the day, scale them back down to the minimum possible. KEDA uses a custom resource called &lt;a href=&quot;https://keda.sh/docs/2.17/reference/scaledobject-spec/&quot;&gt;ScaledObject&lt;/a&gt; to define the autoscaling.&lt;/p&gt;
&lt;p&gt;In my setup, I&amp;#39;ll configure it to scale based on the following rules:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;7AM - 11AM: 10 replicas&lt;/li&gt;
&lt;li&gt;11AM - 2PM: 20 replicas&lt;/li&gt;
&lt;li&gt;2PM - 6PM: 10 replicas&lt;/li&gt;
&lt;li&gt;6PM - 10PM: 20 replicas&lt;/li&gt;
&lt;li&gt;outside of those hours, scale down to minimum replica count which is 1&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is what the manifest looks like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-yaml&quot;&gt;&lt;span class=&quot;hljs-attr&quot;&gt;apiVersion:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;keda.sh/v1alpha1&lt;/span&gt;
&lt;span class=&quot;hljs-attr&quot;&gt;kind:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;ScaledObject&lt;/span&gt;
&lt;span class=&quot;hljs-attr&quot;&gt;metadata:&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;name:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;backend-service&lt;/span&gt;
&lt;span class=&quot;hljs-attr&quot;&gt;spec:&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;maxReplicaCount:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;30&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;minReplicaCount:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;scaleTargetRef:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;apiVersion:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;apps/v1&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;kind:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;Deployment&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;name:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;backend-service&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;triggers:&lt;/span&gt;
    &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-attr&quot;&gt;metadata:&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;desiredReplicas:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;10&amp;#x27;&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;start:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;7&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;end:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;11&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;timezone:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;Asia/Singapore&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;type:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;cron&lt;/span&gt;
    &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-attr&quot;&gt;metadata:&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;desiredReplicas:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;20&amp;#x27;&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;start:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;11&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;end:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;14&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;timezone:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;Asia/Singapore&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;type:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;cron&lt;/span&gt;
    &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-attr&quot;&gt;metadata:&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;desiredReplicas:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;10&amp;#x27;&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;start:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;14&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;end:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;18&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;timezone:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;Asia/Singapore&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;type:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;cron&lt;/span&gt;
    &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-attr&quot;&gt;metadata:&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;desiredReplicas:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;20&amp;#x27;&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;start:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;18&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;end:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;22&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;*&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;timezone:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;Asia/Singapore&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;type:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;cron&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;h3&gt;Combining cron and prometheus scalers&lt;/h3&gt;
&lt;p&gt;Neither the cron nor the Prometheus scaler is enough on its own. If I&amp;#39;m using cron alone, what if one day the traffic is suddenly higher than usual? This is where the Prometheus scaler comes in. It looks at the actual metrics and scales accordingly.&lt;/p&gt;
&lt;p&gt;Then you might think, why not just use the Prometheus scaler on its own? It depends. If your traffic always grows slowly and gradually then yeah, it might work: new pods starting up can catch up with the incoming traffic. But during rush hours the traffic can increase really fast, and to avoid waiting for our pods to scale up, which might take some time, we decided to pre-scale our pods during the expected rush hours. With this setup, the Prometheus scaler can add more pods if the configured cron autoscaling is not enough.&lt;/p&gt;
&lt;h3&gt;Setting up the prometheus scalers&lt;/h3&gt;
&lt;p&gt;For the Prometheus scaler, we decided to scale our pods based on the response latency of the service. This is based on the &lt;a href=&quot;https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/&quot;&gt;RED method&lt;/a&gt;. The &amp;quot;D&amp;quot; in &amp;quot;RED&amp;quot; stands for Duration, or latency. This is a great metric for measuring the performance of our service since it directly correlates with what the user experiences when using that service. High latency means your customer needs to wait longer, which is bad.&lt;/p&gt;
&lt;p&gt;This latency metric, however, does not come out of the box. In our setup, we generate it from the traces emitted by our service, which is instrumented using &lt;a href=&quot;https://opentelemetry.io/&quot;&gt;OpenTelemetry (otel)&lt;/a&gt;. All the traces are sent to Tempo, our backend for storing traces, and Tempo generates the metric &lt;code&gt;traces_spanmetrics_latency&lt;/code&gt; from them. Generating the metrics on Tempo is out of scope for this article, but you can refer to the &lt;a href=&quot;https://grafana.com/docs/tempo/latest/metrics-generator/span_metrics/&quot;&gt;Tempo docs&lt;/a&gt;.&lt;/p&gt;
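&lt;p&gt;As a rough sketch of what such a latency-based trigger could look like (not our exact production config), the entry below would sit under &lt;code&gt;triggers&lt;/code&gt; next to the cron entries. The Prometheus address, the &lt;code&gt;_bucket&lt;/code&gt; suffix, the &lt;code&gt;service&lt;/code&gt; label, the p95 quantile and the 0.5s threshold are all assumptions for illustration; adjust them to whatever your Tempo setup actually exposes.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-yaml&quot;&gt;triggers:
  # ... the cron triggers from the manifest above stay as they are ...
  - type: prometheus
    # Compare the query result itself against the threshold instead of
    # dividing it by the replica count (which the AverageValue default does).
    metricType: Value
    metadata:
      # Hypothetical in-cluster Prometheus address, replace with your own.
      serverAddress: http://prometheus.monitoring.svc:9090
      # p95 latency over the last 2 minutes. Metric suffix and label name
      # are assumptions about how the span metrics are exposed.
      query: |
        histogram_quantile(0.95, sum(rate(traces_spanmetrics_latency_bucket{service=&amp;quot;backend-service&amp;quot;}[2m])) by (le))
      # Illustrative target: start adding pods once p95 goes above 0.5s.
      threshold: &amp;quot;0.5&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When several triggers are defined, the underlying HPA follows the highest replica count any of them asks for, so a trigger like this can only add pods on top of the cron schedule, never take them away.&lt;/p&gt;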
&lt;h2&gt;Issues arising from autoscaling&lt;/h2&gt;
&lt;p&gt;If you think that once the autoscaling is rolled out all is good, then you&amp;#39;re dead wrong, and so was I. The first day we rolled out the autoscaling, we monitored the service closely for any increase in errors, and error it did. First, there was simply not enough capacity: the replica counts we originally configured for autoscaling were too low. Easy fix, just add more capacity. Then, we were scaling up too late and scaling down too early. Also an easy fix, just move the scale-up time earlier and the scale-down time later.&lt;/p&gt;
&lt;h3&gt;Connections being terminated prematurely&lt;/h3&gt;
&lt;p&gt;The not so obvious one, though, is that every time we scaled down by half we saw a lot of errors coming from the service. Those errors were mostly related to connections being terminated prematurely. There are two ways to reduce this, but for now I&amp;#39;ll explain the one we took, which is configuring the HPA behavior.&lt;/p&gt;
&lt;p&gt;The problem is that HPA by default scales down too fast for us. The service scales down quickly, then notices that it doesn&amp;#39;t have enough resources and scales back up, rinse and repeat. In autoscaling this behavior is called &amp;quot;flapping&amp;quot; and we don&amp;#39;t want that. We want our service to be stable. This is the modification I made to the ScaledObject above: adding &lt;code&gt;stabilizationWindowSeconds&lt;/code&gt; and a policy that removes only a couple of pods at a time.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-yaml&quot;&gt;&lt;span class=&quot;hljs-attr&quot;&gt;apiVersion:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;keda.sh/v1alpha1&lt;/span&gt;
&lt;span class=&quot;hljs-attr&quot;&gt;kind:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;ScaledObject&lt;/span&gt;
&lt;span class=&quot;hljs-attr&quot;&gt;metadata:&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;name:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;backend-service&lt;/span&gt;
&lt;span class=&quot;hljs-attr&quot;&gt;spec:&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;advanced:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;horizontalPodAutoscalerConfig:&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;behavior:&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;scaleDown:&lt;/span&gt;
          &lt;span class=&quot;hljs-attr&quot;&gt;stabilizationWindowSeconds:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;600&lt;/span&gt; &lt;span class=&quot;hljs-comment&quot;&gt;# wait 10 minutes before scaling down&lt;/span&gt;
          &lt;span class=&quot;hljs-attr&quot;&gt;policies:&lt;/span&gt;
            &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-attr&quot;&gt;type:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;Pods&lt;/span&gt;
              &lt;span class=&quot;hljs-attr&quot;&gt;value:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;
              &lt;span class=&quot;hljs-attr&quot;&gt;periodSeconds:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;180&lt;/span&gt;  &lt;span class=&quot;hljs-comment&quot;&gt;# remove two pods every 3 minutes&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;h4&gt;HPA config: stabilizationWindowSeconds&lt;/h4&gt;
&lt;p&gt;KEDA allows you to configure the underlying HPA object directly from the ScaledObject. First, we&amp;#39;ll configure the HPA &lt;a href=&quot;https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#stabilization-window&quot;&gt;stabilization window&lt;/a&gt; so that it looks at the last 10 minutes of recommendations by the HPA and only applies the highest value. This means that if, within the last 10 minutes, your HPA recommended scaling down from 10 to 8 and then a few minutes later to 5, it will only scale down to 8 and not directly to 5. It has to wait until the recommendation to scale down to 8 falls outside the window, and only then will it scale down to 5.&lt;/p&gt;
&lt;h4&gt;HPA config: pod scale down policy&lt;/h4&gt;
&lt;p&gt;HPA allows configuring the &lt;a href=&quot;https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#scaling-policies&quot;&gt;scaling policies&lt;/a&gt; separately for scaling up and down. By default, HPA can scale down by up to 100% of the current replicas every 15s. This means that if you have 20 replicas and the recommendation drops to 10, HPA will terminate half the pods at once. If the recommendation then drops further to 1, it can remove the remaining 9 pods in a single step as well.&lt;/p&gt;
&lt;p&gt;In the snippet above, we made it less aggressive by allowing only 2 pods to be removed every 3 minutes. We&amp;#39;ve seen a massive reduction in the number of connections being terminated. I started with 1 pod per minute but found that too fast, so I changed it to the current setting.&lt;/p&gt;
&lt;h2&gt;Scaling down your nodes using Karpenter&lt;/h2&gt;
&lt;p&gt;After the pods have been scaled down, your Kubernetes nodes will be running underutilized. You can use any cluster autoscaler of your choice to scale down underutilized nodes, but in my case I used Karpenter since I&amp;#39;m already running on EKS and Karpenter was originally built for it. There were fewer surprises in this part, though I do plan to write more on running Karpenter in production; hopefully it will come out soon. Ping me on my socials if you&amp;#39;re reading this 3 months later and it is still not out (random deadline for myself lol).&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;After this whole autoscaling exercise we reduced our AWS spending on the EC2 instances used by our EKS clusters by approximately 50%. This is a huge amount for us and my boss was definitely happy (promotion soon?). Hope this helps anyone going on this journey :)&lt;/p&gt;
&lt;p&gt;On how to tackle the disconnect issues, you can also configure graceful shutdown for your pods. I&amp;#39;ll link to this detailed article from learnk8s on how to do &lt;a href=&quot;https://learnk8s.io/graceful-shutdown&quot;&gt;graceful shutdown in kubernetes&lt;/a&gt;.&lt;/p&gt;
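&lt;p&gt;One common ingredient of graceful shutdown is a short &lt;code&gt;preStop&lt;/code&gt; delay, so the pod keeps serving while it is being removed from the Service endpoints. Below is a minimal sketch of the relevant part of a Deployment. The container name and the 15s/45s values are hypothetical, it assumes the image ships a &lt;code&gt;sleep&lt;/code&gt; binary, and your application still needs to handle SIGTERM and finish in-flight requests on its own.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-yaml&quot;&gt;spec:
  template:
    spec:
      # Total time the pod gets before it is force-killed.
      # Must cover the preStop sleep plus the app shutdown time.
      terminationGracePeriodSeconds: 45
      containers:
        - name: backend-service
          lifecycle:
            preStop:
              exec:
                # Delay SIGTERM so load balancers and proxies have time
                # to stop sending new requests to this pod.
                command: [&amp;quot;sleep&amp;quot;, &amp;quot;15&amp;quot;]&lt;/code&gt;&lt;/pre&gt;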
</description>
    </item>
    <item>
      <title>Eliminating cross-AZ traffic cost on AWS</title>
      <link>https://pokgak.xyz/articles/eliminating-cross-az-traffic-cost/</link>
      <guid>https://pokgak.xyz/articles/eliminating-cross-az-traffic-cost/</guid>
      <pubDate>Sat, 14 Dec 2024 09:23:06 GMT</pubDate>
      <description>&lt;p&gt;Imagine going through your AWS bills and noticing that &lt;strong&gt;APS1-DataTransfer-Regional-Bytes&lt;/strong&gt; is 1/3 of your monthly AWS cost. After reading a bit more on &lt;a href=&quot;https://docs.aws.amazon.com/cur/latest/userguide/cur-data-transfers-charges.html&quot;&gt;how AWS charges for network traffic&lt;/a&gt; you know that this is referring to the cost incurred when your traffic crosses an availability zone (AZ). This article will go through what you can do to eliminate this cross-AZ traffic cost.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;: this is not necessarily the best practice. I try my best to highlight the caveats and tradeoffs you&amp;#39;re making in this article, but please use your own judgement before making the changes.&lt;/p&gt;
&lt;h2&gt;The Traffic Flow&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/images/cross-az-traffic.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;In this article I will use the above diagram as our example network flow. Our example scenario uses the AWS Network Load Balancer (NLB) as the load balancer (LB), which routes the traffic to the ingress-nginx pods running inside our cluster. Finally, the ingress-nginx pods route the traffic to the backend services serving the API.&lt;/p&gt;
&lt;p&gt;The top and bottom parts of the diagram show how the traffic flows between AZs. Once you have implemented the steps I describe below, you should be able to eliminate the cross-AZ traffic in your network.&lt;/p&gt;
&lt;h2&gt;Incoming traffic to Load Balancer Nodes&lt;/h2&gt;
&lt;p&gt;When you provision a load balancer on AWS, you get a domain name that resolves to the IP addresses of your LB nodes. Depending on where the DNS resolution happens, this might be the first contributor to your cross-AZ traffic cost.&lt;/p&gt;
&lt;h3&gt;Resolving LB Nodes IP from the Internet&lt;/h3&gt;
&lt;p&gt;There is nothing we can do if the source of the traffic is from the internet. The DNS will resolve to one of the IP addresses of the LB and we wouldn&amp;#39;t be charged for it as there is no AZ yet here.&lt;/p&gt;
&lt;h3&gt;Resolving LB Nodes IP from within your AWS Network&lt;/h3&gt;
&lt;p&gt;If the traffic originates from within your AWS network, it is possible that the resolved LB node IP is not in the same AZ as the source of the traffic. To avoid this, the LB has an option to set the client routing policy so that DNS resolves to LB node IPs within the same AZ.&lt;/p&gt;
&lt;p&gt;There are 3 options to choose from:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;AZ affinity: queries may resolve to other zones if there are no healthy load balancer IP addresses in their own zone&lt;/li&gt;
&lt;li&gt;Partial AZ affinity: 85% of client DNS queries will favor load balancer IP addresses in their own Availability Zone, remaining resolves to any zone&lt;/li&gt;
&lt;li&gt;Any Availability Zone (default)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In my opinion, it is safe to always set the LB to use the AZ affinity policy because DNS queries will automatically resolve to healthy LB IPs in other zones when the one in the same zone is down.&lt;/p&gt;
&lt;h2&gt;Load Balancer Nodes to ingress-nginx pods&lt;/h2&gt;
&lt;p&gt;Once the traffic reaches the LB node, the LB has to decide which target to send the traffic to. To improve reliability, we have the option to enable cross-zone load balancing. With this enabled, the LB node can also send traffic to targets in other zones. It is disabled by default.&lt;/p&gt;
&lt;p&gt;There is a tradeoff between reliability and cost here. If cross-zone load balancing is disabled, you have to make sure that the targets in each zone are healthy. If there are no healthy targets in a zone, the request might fail. This means your service can be less reliable, but if you enable cross-zone load balancing you might incur cross-AZ traffic cost. So do your research and decide what&amp;#39;s best for you.&lt;/p&gt;
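&lt;p&gt;If the NLB is managed by the AWS Load Balancer Controller through the ingress-nginx Service (an assumption, your NLB may be provisioned differently), the two load balancer settings discussed so far could be expressed as load balancer attributes in a single annotation. This is a sketch only: the Service name, namespace, selector and port are hypothetical and other annotations are left out.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-yaml&quot;&gt;apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
  annotations:
    # Keep cross-zone load balancing off and make DNS prefer
    # LB node IPs in the same AZ as the client.
    service.beta.kubernetes.io/aws-load-balancer-attributes: load_balancing.cross_zone.enabled=false,dns_record.client_routing_policy=availability_zone_affinity
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: https
      port: 443
      targetPort: https&lt;/code&gt;&lt;/pre&gt;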
&lt;h2&gt;ingress-nginx pods to backend services pods&lt;/h2&gt;
&lt;h3&gt;Kubernetes 1.31 Service traffic distribution&lt;/h3&gt;
&lt;p&gt;After the ingress-nginx pods, the traffic will be routed to the backend services pods. Kubernetes version 1.31 introduced a new &lt;a href=&quot;https://kubernetes.io/docs/concepts/services-networking/service/#traffic-distribution&quot;&gt;traffic distribution mechanism&lt;/a&gt; for Service resources which influences how traffic is routed to your pods. You can now set the Service &lt;code&gt;.spec.trafficDistribution&lt;/code&gt; to &lt;code&gt;PreferClose&lt;/code&gt; to route the traffic to endpoints that are &amp;quot;topologically proximate&amp;quot;. The details of what that means depend on the implementation, but for kube-proxy this means sending the traffic to endpoints within the same zone when available. This means we can achieve our goal here of avoiding the cross-AZ traffic cost. To understand more please refer to the &lt;a href=&quot;https://kubernetes.io/docs/reference/networking/virtual-ips/#traffic-distribution&quot;&gt;Kubernetes documentation&lt;/a&gt; and &lt;a href=&quot;https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/4444-service-traffic-distribution#preferclose&quot;&gt;KEP-4444&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;CAVEAT&lt;/strong&gt;: as mentioned in the &lt;a href=&quot;https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/4444-service-traffic-distribution#risks-and-mitigations&quot;&gt;Risks and Mitigation&lt;/a&gt; section of KEP-4444, enabling this feature might cause the pods in one AZ to get overloaded if the originating traffic is skewed towards one AZ.&lt;/p&gt;
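&lt;p&gt;For illustration, a Service with the traffic distribution preference set could look like the sketch below. The Service name, selector and ports are hypothetical; only the &lt;code&gt;trafficDistribution&lt;/code&gt; field is the part being discussed here.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-yaml&quot;&gt;apiVersion: v1
kind: Service
metadata:
  name: backend-service
spec:
  selector:
    app: backend-service
  ports:
    - port: 80
      targetPort: 8080
  # Prefer endpoints that are topologically close to the client,
  # which for kube-proxy means same-zone endpoints when available.
  trafficDistribution: PreferClose&lt;/code&gt;&lt;/pre&gt;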
&lt;h3&gt;ingress-nginx default routing behaviour&lt;/h3&gt;
&lt;p&gt;NOTE: please be extra careful with this section as this influences the routing of traffic to pods within your cluster. Do test this out on a non-production environment and make sure your workloads are still running fine before introducing it in production.&lt;/p&gt;
&lt;p&gt;That alone is not enough, though, due to how ingress-nginx works. In the Ingress configuration, you specify the Service that ingress-nginx should route the traffic to, but I was surprised to learn that, by default, ingress-nginx does not actually use the Service ClusterIP. Instead, it looks up the Endpoints of the Service to get the IPs of the pods behind it. Then, it distributes the traffic to the pod IPs using the round robin algorithm.&lt;/p&gt;
&lt;p&gt;To leverage the newly introduced traffic distribution feature I mentioned above, we need to make ingress-nginx route the traffic using the ClusterIP of the Service. To do this globally for the ingress controller, we can set &lt;code&gt;service-upstream: true&lt;/code&gt; in the ingress-nginx configmap. This alone is not enough, though, because by default ingress-nginx uses keepalives to hold long-lived connections between the ingress-nginx pods and the backend service pods to reduce resource usage.&lt;/p&gt;
&lt;p&gt;To configure ingress-nginx not to do this, you can add another config, &lt;code&gt;upstream-keepalive-requests: &amp;quot;0&amp;quot;&lt;/code&gt;, to the ingress-nginx configmap, but beware that this might increase the resource usage of your ingress-nginx pods as they now need to open a new connection for each incoming request.&lt;/p&gt;
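&lt;p&gt;Put together, the two settings from this section would end up in the controller ConfigMap roughly like this. The ConfigMap name and namespace below are assumptions, so use whatever your installation created; note that ConfigMap values have to be strings.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-yaml&quot;&gt;apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  # Route to the Service ClusterIP instead of the individual pod IPs.
  service-upstream: &amp;quot;true&amp;quot;
  # Keepalive setting from the paragraph above.
  upstream-keepalive-requests: &amp;quot;0&amp;quot;&lt;/code&gt;&lt;/pre&gt;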
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;With the above steps, you should be able to avoid cross-AZ traffic in your AWS network for workloads hosted using NLB, EKS and ingress-nginx. Since this optimization might affect your production system, please make sure you test properly before rolling it out.&lt;/p&gt;
</description>
    </item>
    <item>
      <title>Exploring AWS ALB for EKS</title>
      <link>https://pokgak.xyz/articles/exploring-alb-ingress-controller/</link>
      <guid>https://pokgak.xyz/articles/exploring-alb-ingress-controller/</guid>
      <pubDate>Thu, 12 Dec 2024 09:07:00 GMT</pubDate>
      <description>&lt;p&gt;Recently, &lt;a href=&quot;https://pokgak.xyz/articles/we-got-ddosed/&quot;&gt;my company got DDoS&amp;#39;ed&lt;/a&gt; and during the attack I noticed that the first thing that went down was the ingress-nginx pods. With this information, I wondered whether I could eliminate ingress-nginx and rely on AWS load balancers directly for routing the traffic to our services. AWS load balancers come in two flavors: the L4 Network Load Balancer (NLB) and the L7 Application Load Balancer (ALB). The NLB doesn&amp;#39;t support the same HTTP routing features as ingress-nginx as it operates on L4 only, so we&amp;#39;ll be looking at the ALB in this article.&lt;/p&gt;
&lt;h2&gt;Current Setup&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/images/lb-nlb-ingress-nginx.png&quot; alt=&quot;AWS NLB with ingress-nginx&quot;&gt;&lt;/p&gt;
&lt;p&gt;Our current setup uses an AWS NLB per country to accept connections from the internet. The NLB then forwards the requests to a target group consisting of all the ingress-nginx controller pods. The ingress-nginx pods then route the traffic from the NLB to the backend services based on the configured Ingress resources.&lt;/p&gt;
&lt;p&gt;For argument&amp;#39;s sake, let&amp;#39;s assume that AWS NLB is reliable and can scale infinitely. Our bottleneck then is the ingress-nginx pods. We can set up the ingress-nginx pods to autoscale based on the traffic, but I&amp;#39;ve seen issues with connections getting disrupted when the autoscaling happens. What if we could eliminate this bottleneck altogether and just route directly from the AWS LB to our backend services?&lt;/p&gt;
&lt;h2&gt;Proposed Setup&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/images/lb-alb.png&quot; alt=&quot;AWS ALB without ingress-nginx&quot;&gt;&lt;/p&gt;
&lt;p&gt;Our proposed setup uses the ALB directly and routes traffic to the backend services through its listener rules. By default the AWS Load Balancer Controller will provision one ALB for each Ingress resource, but we can share an ALB across multiple Ingresses by specifying the same &lt;code&gt;alb.ingress.kubernetes.io/group.name&lt;/code&gt;. In my case there will be one group name for each country, resulting in a separate ALB for each. Each Ingress resource will create separate listener rules based on the host and path configuration specified in the Ingress.&lt;/p&gt;
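&lt;p&gt;As a sketch of the grouping described above, an Ingress that joins a shared ALB could look like this. The group name, host, Service name and port are hypothetical; every Ingress carrying the same &lt;code&gt;group.name&lt;/code&gt; value would add its rules to the same ALB.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-yaml&quot;&gt;apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: orders-api
  annotations:
    # All Ingresses with this group name share one ALB.
    alb.ingress.kubernetes.io/group.name: country-my
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
    # Each host/path pair becomes a listener rule on the shared ALB.
    - host: orders.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: orders-api
                port:
                  number: 80&lt;/code&gt;&lt;/pre&gt;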
&lt;h2&gt;ALB Quotas and Limitations&lt;/h2&gt;
&lt;p&gt;Since we now rely on AWS ALB directly to route requests to our backend services, we must pay attention to the &lt;a href=&quot;https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-limits.html&quot;&gt;limitations&lt;/a&gt; set by AWS. For ALB there is a soft limit of 100 rules per ALB. Assuming we need one listener rule per backend service, the maximum number of backend services that can be served by one ALB is 100. You can work around this by grouping backend services by team or other attributes, with each group getting its own ALB.&lt;/p&gt;
&lt;h2&gt;Pricing: ALB vs NLB&lt;/h2&gt;
&lt;p&gt;This is the part that I&amp;#39;m most interested in: how much more will it cost us if we migrate to ALB?&lt;/p&gt;
&lt;p&gt;Based on the &lt;a href=&quot;https://aws.amazon.com/elasticloadbalancing/pricing/&quot;&gt;AWS ELB pricing page&lt;/a&gt;, the per-hour cost of the LB instance itself is the same for ALB and NLB. What differs is the (N)LCU-hour cost: an ALB LCU-hour is around 30% more expensive than an NLB one. It might seem like that&amp;#39;s the only difference, but if you read further on the ELB pricing page you will notice that an LCU-hour for ALB does not measure the same things as one for NLB.&lt;/p&gt;
&lt;h3&gt;Rule Evaluations&lt;/h3&gt;
&lt;p&gt;For ALB, the LCU-hour has an extra dimension, which is rule evaluations. You get 10 free rules per ALB. The formula given for calculating this dimension is &lt;code&gt;Rule evaluations = Request rate * (Number of rules processed - 10 free rules)&lt;/code&gt;. Since the ALB gets a new rule for each Ingress, and each separate path defined in the Ingress spec also creates a new rule, your LCU-hour cost might increase the more Ingresses and paths you have in your cluster.&lt;/p&gt;
&lt;h3&gt;New Connections&lt;/h3&gt;
&lt;p&gt;Another difference is the included new connections count per second. For NLB there are different values depending on whether you&amp;#39;re using TCP, UDP, or TLS but for our comparison with ALB, let&amp;#39;s look at the TLS pricing.&lt;/p&gt;
&lt;p&gt;NLB with TLS includes 50 new TLS connections per second while ALB only includes half of that amount. We already saw that the ALB LCU-hour is around 30% more expensive than the NLB one, and on top of that, for the same number of new connections, ALB will consume more LCU-hours than NLB.&lt;/p&gt;
&lt;p&gt;To get a better comparison let&amp;#39;s calculate the cost per connection for each LCU-hour. For NLB this is $0.006/50 connections = $0.00012 and for ALB it&amp;#39;s $0.008/25 connections = $0.00032. So, comparing cost per connection, $0.00032/$0.00012, &lt;strong&gt;ALB is 2.7x more expensive than NLB&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;LCU-hour pricing&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;You are charged only on the dimension with the highest usage.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The LCU-hour is charged based on whichever measured dimension has the highest usage, so for ALB the bill depends on which is higher: the rule evaluations dimension or the new connections dimension above. Let&amp;#39;s say you don&amp;#39;t have that many services, so you don&amp;#39;t have many listener rules either and the rule evaluations dimension stays low. New connections then becomes the deciding dimension, and you will still be charged about 2.7x more than you would be with NLB.&lt;/p&gt;
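&lt;p&gt;To see how the highest dimension drives the bill, here is a minimal Python sketch. It only models the two ALB dimensions discussed above, the function names and the sample load are hypothetical, and the figure of 1000 rule evaluations per second per LCU is an assumption taken from the pricing page rather than from this article, so verify it before relying on the numbers.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-python&quot;&gt;# Per-unit allowances. 25 and 50 new connections/second come from the text
# above; 1000 rule evaluations/second is an assumption to double-check.
ALB_NEW_CONNECTIONS_PER_LCU = 25
ALB_RULE_EVALUATIONS_PER_LCU = 1000
NLB_NEW_CONNECTIONS_PER_NLCU = 50
ALB_LCU_PRICE = 0.008  # USD per LCU-hour
NLB_NLCU_PRICE = 0.006  # USD per NLCU-hour


def alb_hourly_cost(new_connections, rule_evaluations):
    """Bill only the dimension with the highest usage."""
    lcus = max(
        new_connections / ALB_NEW_CONNECTIONS_PER_LCU,
        rule_evaluations / ALB_RULE_EVALUATIONS_PER_LCU,
    )
    return lcus * ALB_LCU_PRICE


def nlb_hourly_cost(new_connections):
    """Only the new connections dimension is modelled here."""
    return new_connections / NLB_NEW_CONNECTIONS_PER_NLCU * NLB_NLCU_PRICE


# Hypothetical load: 100 new TLS connections/second, few rule evaluations.
alb = alb_hourly_cost(new_connections=100, rule_evaluations=500)
nlb = nlb_hourly_cost(new_connections=100)
print(f"ALB ${alb:.3f}/h, NLB ${nlb:.3f}/h, ratio {alb / nlb:.1f}x")&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this made-up load the new connections dimension dominates on the ALB side, so the ratio comes out at roughly 2.7x, in line with the per-connection comparison.&lt;/p&gt;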
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I started this thought experiment thinking it might be better for us to migrate to ALB and eliminate ingress-nginx, but after doing this research I think that for our current workload we are better off sticking with NLB + ingress-nginx. Hope this is useful for others too.&lt;/p&gt;
</description>
    </item>
    <item>
      <title>Docker's lesser known command: docker buildx bake</title>
      <link>https://pokgak.xyz/articles/docker-bake/</link>
      <guid>https://pokgak.xyz/articles/docker-bake/</guid>
      <pubDate>Fri, 15 Nov 2024 16:52:00 GMT</pubDate>
      <description>&lt;p&gt;Have you seen someone use &lt;code&gt;docker buildx bake&lt;/code&gt; before? Me neither... until I needed to build a multi-platform image for our services. In this blog post I&amp;#39;ll walk through the reason why I ended up being forced to use &lt;code&gt;docker buildx bake&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;I am on a journey to run our services on AWS Graviton, which uses arm64-based CPUs. To make this migration smooth I decided there will be a transitional period where both architectures might be running in separate environments. This means I need to make sure that we&amp;#39;re building both the &lt;code&gt;linux/amd64&lt;/code&gt; and &lt;code&gt;linux/arm64&lt;/code&gt; variants of the images.&lt;/p&gt;
&lt;p&gt;We&amp;#39;re using one monorepo per team for our services (don&amp;#39;t ask me how we got there). Each monorepo holds the codebases for multiple services side by side, and the Dockerfiles are also managed together in one folder. To avoid code duplication, we have one &lt;code&gt;base.Dockerfile&lt;/code&gt; that generates a local &lt;code&gt;base&lt;/code&gt; image, which is then referred to when building the other services.&lt;/p&gt;
&lt;p&gt;For reference, this is what the files look like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-Dockerfile&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;# ./cicd/docker/base.Dockerfile&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; node:&lt;span class=&quot;hljs-number&quot;&gt;21&lt;/span&gt;-alpine as base
...
&lt;span class=&quot;hljs-comment&quot;&gt;# installs all the common dependencies&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;RUN&lt;/span&gt;&lt;span class=&quot;language-bash&quot;&gt; pnpm install --production&lt;/span&gt;
...&lt;/code&gt;&lt;/pre&gt;&lt;pre&gt;&lt;code class=&quot;hljs language-Dockerfile&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;# ./cicd/docker/serviceA.Dockerfile&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; node:&lt;span class=&quot;hljs-number&quot;&gt;21&lt;/span&gt;-alpine
&lt;span class=&quot;hljs-keyword&quot;&gt;WORKDIR&lt;/span&gt;&lt;span class=&quot;language-bash&quot;&gt; /app&lt;/span&gt;

&lt;span class=&quot;hljs-comment&quot;&gt;# copies the dependencies from base image into current folder&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;COPY&lt;/span&gt;&lt;span class=&quot;language-bash&quot;&gt; --from=base /usr/src/app /app&lt;/span&gt;
...&lt;/code&gt;&lt;/pre&gt;&lt;pre&gt;&lt;code class=&quot;hljs language-yaml&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;# ./cicd/docker/docker-compose.yaml&lt;/span&gt;
&lt;span class=&quot;hljs-attr&quot;&gt;services:&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;base:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;build:&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;dockerfile:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;./cicd/docker/base.Dockerfile&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;serviceA:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;build:&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;dockerfile:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;./cicd/docker/serviceA.Dockerfile&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;serviceB:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;build:&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;dockerfile:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;./cicd/docker/serviceB.Dockerfile&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;When building the service we run the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;&lt;span class=&quot;hljs-variable&quot;&gt;$&lt;/span&gt; docker compose &lt;span class=&quot;hljs-operator&quot;&gt;-f&lt;/span&gt; ./cicd/docker/docker&lt;span class=&quot;hljs-literal&quot;&gt;-compose&lt;/span&gt;.yaml build base
&lt;span class=&quot;hljs-variable&quot;&gt;$&lt;/span&gt; docker compose &lt;span class=&quot;hljs-operator&quot;&gt;-f&lt;/span&gt; ./cicd/docker/docker&lt;span class=&quot;hljs-literal&quot;&gt;-compose&lt;/span&gt;.yaml build serviceA&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This will first build the &lt;code&gt;base&lt;/code&gt; image and then use that image to build &lt;code&gt;serviceA&lt;/code&gt;. It works without issue.&lt;/p&gt;
&lt;h2&gt;Adding Multi-platform support&lt;/h2&gt;
&lt;p&gt;The Docker Compose specification supports specifying the platforms we want to build for using the &lt;code&gt;platforms&lt;/code&gt; key. So, after adding that to our &lt;code&gt;docker-compose.yaml&lt;/code&gt;, it&amp;#39;ll now look like this with multi-platform support:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-yaml&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;# ./cicd/docker/docker-compose.yaml&lt;/span&gt;
&lt;span class=&quot;hljs-attr&quot;&gt;services:&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;base:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;build:&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;dockerfile:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;./cicd/docker/base.Dockerfile&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;platforms:&lt;/span&gt;
        &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;linux/amd64&lt;/span&gt;
        &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;linux/arm64&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;serviceA:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;build:&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;dockerfile:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;./cicd/docker/serviceA.Dockerfile&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;platforms:&lt;/span&gt;
        &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;linux/amd64&lt;/span&gt;
        &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;linux/arm64&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;serviceB:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;build:&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;dockerfile:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;./cicd/docker/serviceB.Dockerfile&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;platforms:&lt;/span&gt;
        &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;linux/amd64&lt;/span&gt;
        &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;linux/arm64&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Now let&amp;#39;s run the same &lt;code&gt;docker-compose build&lt;/code&gt; command like before and we get...&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;[+] Building &lt;span class=&quot;hljs-number&quot;&gt;0.0&lt;/span&gt;s (&lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt;/&lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt;)
Multi-&lt;span class=&quot;hljs-keyword&quot;&gt;platform&lt;/span&gt; build &lt;span class=&quot;hljs-keyword&quot;&gt;is&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;not&lt;/span&gt; supported &lt;span class=&quot;hljs-keyword&quot;&gt;for&lt;/span&gt; the docker driver.
Switch &lt;span class=&quot;hljs-keyword&quot;&gt;to&lt;/span&gt; a different driver, &lt;span class=&quot;hljs-keyword&quot;&gt;or&lt;/span&gt; turn &lt;span class=&quot;hljs-keyword&quot;&gt;on&lt;/span&gt; the containerd image store, &lt;span class=&quot;hljs-keyword&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;try&lt;/span&gt; again.
Learn more at https:&lt;span class=&quot;hljs-comment&quot;&gt;//docs.docker.com/go/build-multi-platform/&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Damn it.&lt;/p&gt;
&lt;h2&gt;Using the docker-container build driver&lt;/h2&gt;
&lt;p&gt;After some reading, I learnt that the default build driver when you use docker is the &lt;code&gt;docker&lt;/code&gt; driver which doesn&amp;#39;t have support for multi-platform images. You can read more on Docker build drivers on this &lt;a href=&quot;https://docs.docker.com/build/builders/drivers/&quot;&gt;page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I need to use the &lt;code&gt;docker-container&lt;/code&gt; driver which has support for building multi-platform images. To do that I need to create a new builder using the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;$ docker buildx create &lt;span class=&quot;hljs-attr&quot;&gt;--driver&lt;/span&gt; docker-&lt;span class=&quot;hljs-attribute&quot;&gt;container&lt;/span&gt; &lt;span class=&quot;hljs-attr&quot;&gt;--name&lt;/span&gt; multiplatform &lt;span class=&quot;hljs-attr&quot;&gt;--use&lt;/span&gt;
$ docker buildx install&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The first command creates the builder using the &lt;code&gt;docker-container&lt;/code&gt; driver and sets it as the default while the second command creates a shell alias so that I can just use &lt;code&gt;docker build&lt;/code&gt; instead of having to specify &lt;code&gt;docker buildx build&lt;/code&gt; on the CLI.&lt;/p&gt;
&lt;p&gt;Now, I should be able to run my docker-compose command right...?&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;# use `--builder multiplatform` to tell docker-compose to use the docker-container builder we just created&lt;/span&gt;
$ docker compose &lt;span class=&quot;hljs-operator&quot;&gt;-&lt;/span&gt;f &lt;span class=&quot;hljs-symbol&quot;&gt;./cicd/docker/docker-compose.yaml&lt;/span&gt; build &lt;span class=&quot;hljs-operator&quot;&gt;-&lt;/span&gt;-builder multiplatform serviceA
...
failed to &lt;span class=&quot;hljs-params&quot;&gt;solve:&lt;/span&gt; &lt;span class=&quot;hljs-params&quot;&gt;base:&lt;/span&gt; failed to resolve source metadata for docker.io&lt;span class=&quot;hljs-operator&quot;&gt;/&lt;/span&gt;library&lt;span class=&quot;hljs-operator&quot;&gt;/&lt;/span&gt;base:&lt;span class=&quot;hljs-params&quot;&gt;latest:&lt;/span&gt; pull access denied, repository does not exist &lt;span class=&quot;hljs-keyword&quot;&gt;or&lt;/span&gt; may require &lt;span class=&quot;hljs-params&quot;&gt;authorization:&lt;/span&gt; server &lt;span class=&quot;hljs-params&quot;&gt;message:&lt;/span&gt; &lt;span class=&quot;hljs-params&quot;&gt;insufficient_scope:&lt;/span&gt; authorization failed&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Now the builder cannot find the &lt;code&gt;base&lt;/code&gt; image that we built before. Why? This section from the Docker build drivers page answered it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Unlike when using the default docker driver, images built using other drivers aren&amp;#39;t automatically loaded into the local image store. If you don&amp;#39;t specify an output, the build result is exported to the build cache only.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I did try &lt;code&gt;--load&lt;/code&gt; and &lt;code&gt;--driver-opt default-load=true&lt;/code&gt; to automatically load the image into the local image store but long story short, it didn&amp;#39;t work. So what&amp;#39;s next?&lt;/p&gt;
&lt;h2&gt;Enter docker buildx bake&lt;/h2&gt;
&lt;p&gt;At this point, I had almost exhausted all my options and was just browsing through the Docker documentation in the hope of finding something, and I found it!&lt;/p&gt;
&lt;p&gt;At first the docker buildx bake command just looked like a different syntax for specifying the docker compose file to me, until one of the properties caught my eye: &lt;a href=&quot;https://docs.docker.com/build/bake/reference/#targetcontexts&quot;&gt;&lt;code&gt;target.contexts&lt;/code&gt;&lt;/a&gt;. It lets you pass in additional contexts, on top of the folder content context that we&amp;#39;re used to with a normal docker build, and those contexts can then be used in the Dockerfile.&lt;/p&gt;
&lt;p&gt;These are the things you can specify under the &lt;code&gt;target.contexts&lt;/code&gt; property:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Container image: &lt;code&gt;docker-image://alpine@sha256:0123456789&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Git URL: &lt;code&gt;https://github.com/user/proj.git&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;HTTP URL: &lt;code&gt;https://example.com/files&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Local directory: &lt;code&gt;../path/to/src&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Bake target: &lt;code&gt;target:base&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The first four are cool but the last one stood out to me: &lt;code&gt;Bake target: target:base&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Bake 101 crash course: remember in &lt;code&gt;docker-compose.yaml&lt;/code&gt; we have &lt;code&gt;services&lt;/code&gt;?&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-yaml&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;# ./cicd/docker/docker-compose.yaml&lt;/span&gt;
&lt;span class=&quot;hljs-attr&quot;&gt;services:&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;base:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;build:&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;dockerfile:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;./cicd/docker/base.Dockerfile&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;serviceA:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;build:&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;dockerfile:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;./cicd/docker/serviceA.Dockerfile&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;serviceB:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;build:&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;dockerfile:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;./cicd/docker/serviceB.Dockerfile&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;In Bake format, those services are called targets. Let&amp;#39;s recall my original &lt;code&gt;docker-compose.yaml&lt;/code&gt; file again, I have a &lt;code&gt;base&lt;/code&gt; service used to build the shared image used in the Dockerfiles of &lt;code&gt;serviceA&lt;/code&gt; and &lt;code&gt;serviceB&lt;/code&gt;. In other words, the &lt;code&gt;base&lt;/code&gt; service is also a &lt;strong&gt;target&lt;/strong&gt;. This means, I can pass it on as extra contexts to my build!&lt;/p&gt;
&lt;h2&gt;Putting it all together&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&quot;https://docs.docker.com/build/bake/reference/&quot;&gt;Bake specification&lt;/a&gt; allows you to write the file in 3 languages: HCL (Terraform, anyone?), JSON, and YAML (through docker-compose.yaml syntax). To be less disruptive I chose YAML. The CLI can parse existing docker-compose.yaml files as Bake files, but any Bake-specific syntax has to go under the &lt;code&gt;x-bake&lt;/code&gt; property. We&amp;#39;ll also move the &lt;code&gt;platforms&lt;/code&gt; property under &lt;code&gt;x-bake&lt;/code&gt;. This is what the final &lt;code&gt;docker-compose.yaml&lt;/code&gt; looks like in my case:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-yaml&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;# ./cicd/docker/docker-compose.yaml&lt;/span&gt;
&lt;span class=&quot;hljs-attr&quot;&gt;services:&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;base:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;build:&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;dockerfile:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;./cicd/docker/base.Dockerfile&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;x-bake:&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;platforms:&lt;/span&gt;
        &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;linux/amd64&lt;/span&gt;
        &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;linux/arm64&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;serviceA:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;build:&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;dockerfile:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;./cicd/docker/serviceA.Dockerfile&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;x-bake:&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;contexts:&lt;/span&gt;
            &lt;span class=&quot;hljs-attr&quot;&gt;base:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;target:base&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;platforms:&lt;/span&gt;
        &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;linux/amd64&lt;/span&gt;
        &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;linux/arm64&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;serviceB:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;build:&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;dockerfile:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;./cicd/docker/serviceB.Dockerfile&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;x-bake:&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;contexts:&lt;/span&gt;
            &lt;span class=&quot;hljs-attr&quot;&gt;base:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;target:base&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;platforms:&lt;/span&gt;
        &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;linux/amd64&lt;/span&gt;
        &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;linux/arm64&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;To run the build we need to use our friend &lt;code&gt;docker buildx bake&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;&lt;span class=&quot;hljs-symbol&quot;&gt;$&lt;/span&gt; docker buildx bake --&lt;span class=&quot;hljs-keyword&quot;&gt;file&lt;/span&gt; ./cicd/docker/docker-compose.yaml serviceA

DONE&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Finally, we managed to build a multi-platform image using docker buildx bake!&lt;/p&gt;
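&lt;p&gt;For comparison, here is a rough sketch of how the same three targets might be expressed in Bake&amp;#39;s JSON format, generated with a small Python script. The script, the helper name &lt;code&gt;service_target&lt;/code&gt;, and the output file name &lt;code&gt;docker-bake.json&lt;/code&gt; are assumptions for illustration and not part of the original setup; check the Bake reference before relying on the exact keys.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-python&quot;&gt;import json

PLATFORMS = ["linux/amd64", "linux/arm64"]


def service_target(name):
    """Build a Bake target that receives the base target as a named context."""
    return {
        "dockerfile": f"./cicd/docker/{name}.Dockerfile",
        "contexts": {"base": "target:base"},
        "platforms": PLATFORMS,
    }


bake_definition = {
    "target": {
        "base": {
            "dockerfile": "./cicd/docker/base.Dockerfile",
            "platforms": PLATFORMS,
        },
        "serviceA": service_target("serviceA"),
        "serviceB": service_target("serviceB"),
    }
}

# Print the JSON; redirect it to a file such as docker-bake.json to use it.
print(json.dumps(bake_definition, indent=2))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The resulting file would then be passed to &lt;code&gt;docker buildx bake&lt;/code&gt; with &lt;code&gt;--file&lt;/code&gt;, the same way as the compose file above.&lt;/p&gt;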
&lt;h2&gt;well ackshually&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/images/well-ackchyually.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;you can just use docker multi-stage build&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I know. You&amp;#39;re right. I chose not to do that to avoid making big changes to the code.&lt;/p&gt;
&lt;p&gt;Just let me suffer.&lt;/p&gt;
&lt;h2&gt;Recap&lt;/h2&gt;
&lt;p&gt;TLDR here&amp;#39;s what I had to do to get multi-platform build working when using multiple Dockerfiles:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Add the &lt;code&gt;platforms&lt;/code&gt; section to your service in &lt;code&gt;docker-compose.yaml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Set up a new Docker builder using the &lt;code&gt;docker-container&lt;/code&gt; driver&lt;/li&gt;
&lt;li&gt;Add the &lt;code&gt;base&lt;/code&gt; image as extra contexts using Bake &lt;code&gt;x-bake&lt;/code&gt; syntax&lt;/li&gt;
&lt;li&gt;Build the image using &lt;code&gt;docker buildx bake&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This use case is quite niche tbh and, like I&amp;#39;ve said, it is avoidable by using multi-stage builds, but it is what it is. There are other, more practical use cases for Bake described in this &lt;a href=&quot;https://docs.docker.com/guides/bake/&quot;&gt;guide&lt;/a&gt;. It&amp;#39;s worth a read IMO. Hope you&amp;#39;ve learnt something new as I have.&lt;/p&gt;
</description>
    </item>
    <item>
      <title>Using DuckDB to analyze NGINX logs</title>
      <link>https://pokgak.xyz/articles/duckdb-nginx-logs/</link>
      <guid>https://pokgak.xyz/articles/duckdb-nginx-logs/</guid>
      <pubDate>Sun, 27 Oct 2024 10:15:00 GMT</pubDate>
      <description>&lt;p&gt;As part of my recent &lt;a href=&quot;https://pokgak.xyz/articles/we-got-ddosed/&quot;&gt;DDoS mitigation effort&lt;/a&gt;, I had to go through millions of nginx logs to identify patterns that I can use to further improve our custom WAF rules on Cloudflare. This article shows what I did to be able to run some analysis on the logs using DuckDB.&lt;/p&gt;
&lt;h2&gt;Making the hard things easy&lt;/h2&gt;
&lt;h3&gt;Changing the log format&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;To solve a problem that is difficult, you must first make it easy.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The default nginx access logs format looks something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;&lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;&lt;span class=&quot;hljs-subst&quot;&gt;$remote_addr&lt;/span&gt; - &lt;span class=&quot;hljs-subst&quot;&gt;$remote_user&lt;/span&gt; [&lt;span class=&quot;hljs-subst&quot;&gt;$time_local&lt;/span&gt;] &amp;#x27;&lt;/span&gt;&lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;&amp;quot;&lt;span class=&quot;hljs-subst&quot;&gt;$request&lt;/span&gt;&amp;quot; &lt;span class=&quot;hljs-subst&quot;&gt;$status&lt;/span&gt; &lt;span class=&quot;hljs-subst&quot;&gt;$body_bytes_sent&lt;/span&gt; &amp;#x27;&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;&amp;quot;&lt;span class=&quot;hljs-subst&quot;&gt;$http_referer&lt;/span&gt;&amp;quot; &amp;quot;&lt;span class=&quot;hljs-subst&quot;&gt;$http_user_agent&lt;/span&gt;&amp;quot;&amp;#x27;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Parsing it is definitely possible using regex as shown by &lt;a href=&quot;https://www.alibabacloud.com/help/en/sls/user-guide/parse-nginx-logs&quot;&gt;this article&lt;/a&gt; but why bother when you have the option to change the format and make it easier to parse using DuckDB? To do so we will configure the ingress-nginx controller in our cluster to log in JSON format. You can set the custom log format using the &lt;code&gt;log-format-upstream&lt;/code&gt; config and set &lt;code&gt;log-format-escape-json&lt;/code&gt; to make sure that the variables are escaped properly for use as JSON values.&lt;/p&gt;
&lt;p&gt;This is the log format that I&amp;#39;m currently using:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;&amp;#x27;{&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;timestamp&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;&lt;span class=&quot;hljs-variable&quot;&gt;$time_iso8601&lt;/span&gt;&amp;quot;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;requestID&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;&lt;span class=&quot;hljs-variable&quot;&gt;$req_id&lt;/span&gt;&amp;quot;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;proxyUpstreamName&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;&lt;span class=&quot;hljs-variable&quot;&gt;$proxy_upstream_name&lt;/span&gt;&amp;quot;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;proxyAlternativeUpstreamName&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;&lt;span class=&quot;hljs-variable&quot;&gt;$proxy_alternative_upstream_name&lt;/span&gt;&amp;quot;&lt;/span&gt;,&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;upstreamStatus&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;&lt;span class=&quot;hljs-variable&quot;&gt;$upstream_status&lt;/span&gt;&amp;quot;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;upstreamAddr&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;&lt;span class=&quot;hljs-variable&quot;&gt;$upstream_addr&lt;/span&gt;&amp;quot;&lt;/span&gt;,&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;method&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;&lt;span class=&quot;hljs-variable&quot;&gt;$request_method&lt;/span&gt;&amp;quot;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;host&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;&lt;span class=&quot;hljs-variable&quot;&gt;$host&lt;/span&gt;&amp;quot;&lt;/span&gt;, &lt;span 
class=&quot;hljs-string&quot;&gt;&amp;quot;uri&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;&lt;span class=&quot;hljs-variable&quot;&gt;$uri&lt;/span&gt;&amp;quot;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;uriNormalized&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;&lt;span class=&quot;hljs-variable&quot;&gt;$uri_normalized&lt;/span&gt;&amp;quot;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;uriWithParams&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;&lt;span class=&quot;hljs-variable&quot;&gt;$request_uri&lt;/span&gt;&amp;quot;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;status&amp;quot;&lt;/span&gt;: $status,&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;requestSize&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;&lt;span class=&quot;hljs-variable&quot;&gt;$request_length&lt;/span&gt;&amp;quot;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;responseSize&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;&lt;span class=&quot;hljs-variable&quot;&gt;$upstream_response_length&lt;/span&gt;&amp;quot;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;userAgent&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;&lt;span class=&quot;hljs-variable&quot;&gt;$http_user_agent&lt;/span&gt;&amp;quot;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;remoteIp&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;&lt;span class=&quot;hljs-variable&quot;&gt;$remote_addr&lt;/span&gt;&amp;quot;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;referer&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;&lt;span class=&quot;hljs-variable&quot;&gt;$http_referer&lt;/span&gt;&amp;quot;&lt;/span&gt;, &lt;span 
class=&quot;hljs-string&quot;&gt;&amp;quot;latency&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;&lt;span class=&quot;hljs-variable&quot;&gt;$upstream_response_time&lt;/span&gt; s&amp;quot;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;protocol&amp;quot;&lt;/span&gt;:&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;&lt;span class=&quot;hljs-variable&quot;&gt;$server_protocol&lt;/span&gt;&amp;quot;&lt;/span&gt;}&amp;#x27;&lt;/code&gt;&lt;/pre&gt;&lt;h3&gt;Mapping customer IDs to generic placeholder&lt;/h3&gt;
&lt;p&gt;One thing I don&amp;#39;t know how to do with DuckDB is normalizing URIs. Given a URI path containing user ID 1234567, like &lt;code&gt;/users/1234567/info&lt;/code&gt;, how do I group by path while ignoring the middle section, as if all the URIs were &lt;code&gt;/users/:userId/info&lt;/code&gt;? I tried regex and patterns but couldn&amp;#39;t get it to work. If you know how to do it please DM me on Twitter, I&amp;#39;d really appreciate it.&lt;/p&gt;
&lt;p&gt;I found a way to do it in nginx instead. It works but it&amp;#39;s definitely not scalable. I use the &lt;code&gt;ngx_http_map_module&lt;/code&gt; module to match URIs with certain paths and then convert them to a generic version of that path.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;map&lt;/span&gt; &lt;span class=&quot;hljs-variable&quot;&gt;$uri&lt;/span&gt; &lt;span class=&quot;hljs-variable&quot;&gt;$uri_normalized&lt;/span&gt; {
    &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;~^/user/(.*)$&amp;quot;&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;/user/:ID&amp;quot;&lt;/span&gt;;
    &lt;span class=&quot;hljs-keyword&quot;&gt;default&lt;/span&gt; &lt;span class=&quot;hljs-variable&quot;&gt;$uri&lt;/span&gt;;
}&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This snippet does the following: it maps the variable &lt;code&gt;$uri&lt;/code&gt; to a new variable &lt;code&gt;$uri_normalized&lt;/code&gt;. When the &lt;code&gt;$uri&lt;/code&gt; value matches the regex &lt;code&gt;^/user/(.*)$&lt;/code&gt;, the new variable is set to &lt;code&gt;/user/:ID&lt;/code&gt;. If no regex matches, it defaults to the value of &lt;code&gt;$uri&lt;/code&gt;.&lt;/p&gt;
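&lt;p&gt;To make the matching rule concrete, here is a small Python sketch of the same logic. The function name and the sample paths are hypothetical; nginx evaluates the real map itself, so this is only an illustration.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-python&quot;&gt;import re

# Same pattern as the nginx map above (without the leading "~" marker).
USER_PATH = re.compile(r"^/user/(.*)$")


def normalize_uri(uri):
    """Return the generic path for user URIs, otherwise the URI unchanged."""
    if USER_PATH.match(uri):
        return "/user/:ID"
    return uri  # mirrors the "default $uri" branch


print(normalize_uri("/user/1234567/info"))  # /user/:ID
print(normalize_uri("/healthz"))  # /healthz&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that &lt;code&gt;(.*)&lt;/code&gt; swallows everything after &lt;code&gt;/user/&lt;/code&gt;, so any trailing segment such as &lt;code&gt;/info&lt;/code&gt; is dropped from the normalized value. Keeping it would need one map entry per path shape, which is part of why this approach does not scale well.&lt;/p&gt;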
&lt;p&gt;With those configured, you can proceed to ship the logs to your logging backend of choice to retrieve later.&lt;/p&gt;
&lt;h2&gt;Fetching the logs from Loki&lt;/h2&gt;
&lt;p&gt;I&amp;#39;m using Loki as my logging backend. Loki provides an API endpoint you can use to fetch the logs and to make things easier they also have a CLI tool called &lt;a href=&quot;https://grafana.com/docs/loki/latest/query/logcli/&quot;&gt;logcli&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To fetch all nginx logs from my production cluster, this is the command that I used:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;$ logcli query -oraw &lt;span class=&quot;hljs-attribute&quot;&gt;--from&lt;/span&gt;=&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;2024-10-24T00:00:00+08:00&amp;quot;&lt;/span&gt; &lt;span class=&quot;hljs-attribute&quot;&gt;--to&lt;/span&gt;=&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;2024-10-24T22:00:00+08:00&amp;quot;&lt;/span&gt; &lt;span class=&quot;hljs-attribute&quot;&gt;--part-path-prefix&lt;/span&gt;=&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;logs&amp;quot;&lt;/span&gt; &lt;span class=&quot;hljs-attribute&quot;&gt;--parallel-max-workers&lt;/span&gt;=100 &lt;span class=&quot;hljs-attribute&quot;&gt;--parallel-duration&lt;/span&gt;=15m &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;{namespace=&amp;quot;ingress-nginx&amp;quot;, cluster=&amp;quot;production&amp;quot;} | json | __error__=``&amp;#x27;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;-oraw&lt;/code&gt; is to set the output format of the logs. I&amp;#39;m using the raw format here so that I can get the same JSON input that I sent to Loki&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--from&lt;/code&gt; and &lt;code&gt;--to&lt;/code&gt; are self-explanatory, but I had some trouble finding a timestamp format that the tool accepts, and the official docs were quite confusing&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--part-path-prefix&lt;/code&gt; sets the prefix used to name the files when downloading multiple files in parallel&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--parallel-max-workers&lt;/code&gt; sets the maximum number of parallel workers. Note that the actual number of workers used also depends on how many tasks are available, which is determined by the parallel duration&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--parallel-duration&lt;/code&gt; is the time span covered by each file. Combined with the &lt;code&gt;--from&lt;/code&gt; and &lt;code&gt;--to&lt;/code&gt; options, it determines how many files will be created, e.g. for a 1-hour range with &lt;code&gt;--parallel-duration=15m&lt;/code&gt; there will be 4 files, each containing the logs from one 15-minute slice.&lt;/li&gt;
&lt;/ul&gt;
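&lt;p&gt;As a rough sanity check on the download, the expected number of part files can be compared with what actually landed on disk. This sketch uses DuckDB SQL (introduced in the next section) and assumes the 22-hour window and 15-minute parts from the command above, with the part files sitting in the current directory:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-sql&quot;&gt;-- Expected parts: minutes in the queried window divided by the part length in minutes.
select date_diff(&amp;#x27;minute&amp;#x27;,
                 timestamp &amp;#x27;2024-10-24 00:00:00&amp;#x27;,
                 timestamp &amp;#x27;2024-10-24 22:00:00&amp;#x27;) / 15 as expected_parts;

-- Part files actually present, matching the prefix passed to --part-path-prefix.
select count(*) as actual_parts from glob(&amp;#x27;logs_*.part&amp;#x27;);&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the two numbers differ, some slices may not have been written, so it is worth checking before loading anything.&lt;/p&gt;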
&lt;h2&gt;Loading the logs into DuckDB&lt;/h2&gt;
&lt;p&gt;I&amp;#39;m using the DuckDB CLI, but you can also do this with the embedded library in the language of your choice.&lt;/p&gt;
&lt;p&gt;To read the files we&amp;#39;ve pulled from Loki and create a table with it I used this command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;create table logs &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; select * &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-title function_ invoke__&quot;&gt;read_json&lt;/span&gt;(&lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;logs_*.part&amp;#x27;&lt;/span&gt;, format=&lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;auto&amp;#x27;&lt;/span&gt;, columns={&lt;span class=&quot;hljs-attr&quot;&gt;timestamp&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;TIMESTAMP&amp;#x27;&lt;/span&gt;, &lt;span class=&quot;hljs-attr&quot;&gt;method&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;VARCHAR&amp;#x27;&lt;/span&gt;, &lt;span class=&quot;hljs-attr&quot;&gt;host&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;VARCHAR&amp;#x27;&lt;/span&gt;, &lt;span class=&quot;hljs-attr&quot;&gt;uri&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;VARCHAR&amp;#x27;&lt;/span&gt;, &lt;span class=&quot;hljs-attr&quot;&gt;uriNormalized&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;VARCHAR&amp;#x27;&lt;/span&gt;, &lt;span class=&quot;hljs-attr&quot;&gt;status&lt;/span&gt;:&lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;INT&amp;#x27;&lt;/span&gt;, &lt;span class=&quot;hljs-attr&quot;&gt;remoteIp&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;VARCHAR&amp;#x27;&lt;/span&gt;, &lt;span class=&quot;hljs-attr&quot;&gt;is_authed&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;BOOLEAN&amp;#x27;&lt;/span&gt;, &lt;span class=&quot;hljs-attr&quot;&gt;userAgent&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;VARCHAR&amp;#x27;&lt;/span&gt;});&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;code&gt;read_json&lt;/code&gt; can automatically create all the columns based on the keys in the JSON files but I want it to treat certain columns as specific data types so 
that&amp;#39;s why I&amp;#39;m specifying the columns manually here.&lt;/p&gt;
&lt;p&gt;After loading the rows into the table, you can check the row count with this command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;&lt;span class=&quot;hljs-function&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;select&lt;/span&gt; &lt;span class=&quot;hljs-title&quot;&gt;count&lt;/span&gt;() &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; logs&lt;/span&gt;;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;In my case, a day&amp;#39;s worth of nginx logs comes to around 60 million rows. At idle, it&amp;#39;s using around 8 GB of RAM. If you have longer periods of logs to analyze, definitely opt for a beefier machine or else DuckDB will crash.&lt;/p&gt;
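&lt;p&gt;Before running heavier queries, it can help to confirm that the loaded rows cover the expected time range and to put a cap on memory use. This is a minimal sketch; the memory limit and spill directory are placeholder values picked for illustration, not settings I tuned for this dataset:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-sql&quot;&gt;-- Confirm the loaded rows cover the window that was requested from Loki.
select min(timestamp) as first_seen,
       max(timestamp) as last_seen,
       count(*)       as total_rows
from logs;

-- Placeholder values: cap memory use and name a directory DuckDB may spill to.
set memory_limit = &amp;#x27;8GB&amp;#x27;;
set temp_directory = &amp;#x27;/tmp/duckdb_spill&amp;#x27;;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Starting the CLI with a database file (for example &lt;code&gt;duckdb logs.duckdb&lt;/code&gt;, a hypothetical file name) instead of the default in-memory database also keeps the table on disk between sessions.&lt;/p&gt;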
&lt;h2&gt;Running queries&lt;/h2&gt;
&lt;p&gt;Now for the fun part. Here are some queries that I find useful:&lt;/p&gt;
&lt;h3&gt;Highest requests per minute grouped by IP&lt;/h3&gt;
&lt;p&gt;I use this information to pick the threshold for our Cloudflare rate limiting rule. By looking at the existing request rate, I reduce the chance of rate limiting our actual customers.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;select&lt;/span&gt;
        (hour(&lt;span class=&quot;hljs-type&quot;&gt;timestamp&lt;/span&gt;) + &lt;span class=&quot;hljs-number&quot;&gt;8&lt;/span&gt;) % &lt;span class=&quot;hljs-number&quot;&gt;24&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; hour,
        minute(&lt;span class=&quot;hljs-type&quot;&gt;timestamp&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; minute,
        remoteIp,
        count() &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; total
&lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; logs
&lt;span class=&quot;hljs-keyword&quot;&gt;where&lt;/span&gt; remoteIp &lt;span class=&quot;hljs-keyword&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;in&lt;/span&gt; (&lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;X.X.X.X&amp;#x27;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;Y.Y.Y.Y&amp;#x27;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;Z.Z.Z.Z&amp;#x27;&lt;/span&gt;) &lt;span class=&quot;hljs-comment&quot;&gt;--production NAT gateway IPs&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;and&lt;/span&gt; remoteIp &lt;span class=&quot;hljs-keyword&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;like&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;162.158.192.%&amp;#x27;&lt;/span&gt;   &lt;span class=&quot;hljs-comment&quot;&gt;--cloudflare IPs&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;group&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;by&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;all&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;order&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;by&lt;/span&gt; total &lt;span class=&quot;hljs-keyword&quot;&gt;desc&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;limit&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;5&lt;/span&gt;;&lt;/code&gt;&lt;/pre&gt;&lt;h3&gt;IP address with the highest number of requests to a particular host&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;select&lt;/span&gt; remoteIp, &lt;span class=&quot;hljs-built_in&quot;&gt;count&lt;/span&gt;() &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; total &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; logs &lt;span class=&quot;hljs-keyword&quot;&gt;where&lt;/span&gt; host = &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;subdomain.example.com&amp;#x27;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;group&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;by&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;all&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;order&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;by&lt;/span&gt; total &lt;span class=&quot;hljs-keyword&quot;&gt;desc&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;limit&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;10&lt;/span&gt;;&lt;/code&gt;&lt;/pre&gt;&lt;h3&gt;Checking the paths an IP has been sending requests to&lt;/h3&gt;
&lt;p&gt;I&amp;#39;m including the result of the query here since it shows that this particular IP is most likely using a scanner to find vulnerable endpoints on our service.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;select uri, &lt;span class=&quot;hljs-keyword&quot;&gt;count&lt;/span&gt;() as total &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; logs where remoteIp = &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;152.32.189.70&amp;#x27;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;group&lt;/span&gt; by all order by total desc limit &lt;span class=&quot;hljs-number&quot;&gt;20&lt;/span&gt;;
┌─────────────────────────────────────────────────────────────────┬───────┐
│                               uri                               │ total │
│                             varchar                             │ int64 │
├─────────────────────────────────────────────────────────────────┼───────┤
│ /                                                               │    &lt;span class=&quot;hljs-number&quot;&gt;10&lt;/span&gt; │
│ /favicon.ico                                                    │     &lt;span class=&quot;hljs-number&quot;&gt;6&lt;/span&gt; │
│ &lt;span class=&quot;hljs-regexp&quot;&gt;/api/u&lt;/span&gt;ser/ismustmobile                                          │     &lt;span class=&quot;hljs-number&quot;&gt;6&lt;/span&gt; │
│ &lt;span class=&quot;hljs-regexp&quot;&gt;/h5/&lt;/span&gt;                                                            │     &lt;span class=&quot;hljs-number&quot;&gt;4&lt;/span&gt; │
│ &lt;span class=&quot;hljs-regexp&quot;&gt;/m/&lt;/span&gt;                                                             │     &lt;span class=&quot;hljs-number&quot;&gt;4&lt;/span&gt; │
│ /api                                                            │     &lt;span class=&quot;hljs-number&quot;&gt;4&lt;/span&gt; │
│ &lt;span class=&quot;hljs-regexp&quot;&gt;/app/&lt;/span&gt;                                                           │     &lt;span class=&quot;hljs-number&quot;&gt;3&lt;/span&gt; │
│ &lt;span class=&quot;hljs-regexp&quot;&gt;/api/&lt;/span&gt;config                                                     │     &lt;span class=&quot;hljs-number&quot;&gt;3&lt;/span&gt; │
│ /leftDao.php?callback=jQuery183016740860980352856_1604309800583 │     &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt; │
│ &lt;span class=&quot;hljs-regexp&quot;&gt;/public/&lt;/span&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;static&lt;/span&gt;&lt;span class=&quot;hljs-regexp&quot;&gt;/home/&lt;/span&gt;js&lt;span class=&quot;hljs-regexp&quot;&gt;/moblie/&lt;/span&gt;login.js                          │     &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt; │
│ &lt;span class=&quot;hljs-regexp&quot;&gt;/static/&lt;/span&gt;home&lt;span class=&quot;hljs-regexp&quot;&gt;/css/&lt;/span&gt;feiqi-ee5401a8e6.css                           │     &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt; │
│ &lt;span class=&quot;hljs-regexp&quot;&gt;/client/&lt;/span&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;static&lt;/span&gt;&lt;span class=&quot;hljs-regexp&quot;&gt;/icon/&lt;/span&gt;hangqingicon.png                            │     &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt; │
│ &lt;span class=&quot;hljs-regexp&quot;&gt;/admin/&lt;/span&gt;webadmin.php?mod=&lt;span class=&quot;hljs-keyword&quot;&gt;do&lt;/span&gt;&amp;amp;act=login                            │     &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt; │
│ &lt;span class=&quot;hljs-regexp&quot;&gt;/static/im&lt;/span&gt;ages&lt;span class=&quot;hljs-regexp&quot;&gt;/auth/&lt;/span&gt;background.png                              │     &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt; │
│ &lt;span class=&quot;hljs-regexp&quot;&gt;/index/i&lt;/span&gt;ndex/home?business_id=&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;                                 │     &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt; │
│ &lt;span class=&quot;hljs-regexp&quot;&gt;/stage-api/&lt;/span&gt;common&lt;span class=&quot;hljs-regexp&quot;&gt;/configKey/&lt;/span&gt;all                                 │     &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt; │
│ &lt;span class=&quot;hljs-regexp&quot;&gt;/Public/&lt;/span&gt;home&lt;span class=&quot;hljs-regexp&quot;&gt;/common/&lt;/span&gt;js/index.js                                 │     &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt; │
│ &lt;span class=&quot;hljs-regexp&quot;&gt;/ws/i&lt;/span&gt;ndex/getTheLotteryInitList                                 │     &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt; │
│ &lt;span class=&quot;hljs-regexp&quot;&gt;/app/&lt;/span&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;static&lt;/span&gt;&lt;span class=&quot;hljs-regexp&quot;&gt;/picture/&lt;/span&gt;star.png                                    │     &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt; │
│ &lt;span class=&quot;hljs-regexp&quot;&gt;/resource/&lt;/span&gt;home&lt;span class=&quot;hljs-regexp&quot;&gt;/js/&lt;/span&gt;common.js                                     │     &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt; │
├─────────────────────────────────────────────────────────────────┴───────┤
│ &lt;span class=&quot;hljs-number&quot;&gt;20&lt;/span&gt; rows                                                       &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt; columns │
└─────────────────────────────────────────────────────────────────────────┘&lt;/code&gt;&lt;/pre&gt;&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This was an ad hoc task, and when I was told to do some analysis on the logs I immediately thought of a tool I&amp;#39;m familiar with, DuckDB. I&amp;#39;m aware there are better tools out there and we are currently evaluating OpenSearch Security Analytics to do this kind of detection automatically in the future. Hopefully, I can talk about it soon. If any of you data analysts or data engineers out there have better ways to do the things I&amp;#39;m doing, feel free to tweet me @pokgak73 on Twitter.&lt;/p&gt;
</description>
    </item>
    <item>
      <title>We got DDoSed</title>
      <link>https://pokgak.xyz/articles/we-got-ddosed/</link>
      <guid>https://pokgak.xyz/articles/we-got-ddosed/</guid>
      <pubDate>Sun, 27 Oct 2024 06:37:00 GMT</pubDate>
      <description>&lt;p&gt;Recently at $work we&amp;#39;ve been hit by a series of DDoS attacks. In this post, I&amp;#39;m gonna describe the steps we&amp;#39;ve taken to protect our services from these attacks in the future, and also what worked and what didn&amp;#39;t.&lt;/p&gt;
&lt;h2&gt;Detection&lt;/h2&gt;
&lt;p&gt;The first attack was around 10PM on a Sunday. I was at home at the time and was notified that our ingress-nginx pods were repeatedly crashing. I&amp;#39;ve seen this happen before and my initial thought was that our service was getting more customers that night, so I added more nodes to our cluster and increased the replica count for the ingress-nginx pods. That did nothing. Our pods were still crashing and customers still could not use our service.&lt;/p&gt;
&lt;p&gt;Then, one of my colleagues showed me the metrics for new connections to our load balancer. We&amp;#39;re getting 10x our usual traffic in a minute. That&amp;#39;s a DDoS for sure.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/ddos-lb-active-flow.png&quot; alt=&quot;New connections to load balancer&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Rate limiting from the ingress-controller&lt;/h2&gt;
&lt;p&gt;Ingress-nginx supports a whole set of &lt;a href=&quot;https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#rate-limiting&quot;&gt;rate limiting features&lt;/a&gt;, so we decided to start there, but we first needed to know which endpoint was being hit. For this, the logs from the ingress-nginx pods were really helpful. The logs were coming in fast, but we could see that most of the log lines contained the same hostname. So we added the &lt;code&gt;nginx.ingress.kubernetes.io/limit-rps&lt;/code&gt; annotation to the Ingress used by that hostname. Then we waited and monitored whether this annotation helped stabilize our crashing pods. It didn&amp;#39;t.&lt;/p&gt;
&lt;p&gt;From the logs, I could see that ingress-nginx was rate limiting the requests, but because it still had to accept and count every request before rejecting it, it could not keep up with the amount of incoming traffic. Rate limiting at the ingress-nginx level was useless because our infra was not big enough to even receive and block all the requests. We had the option of adding more replicas to our ingress-nginx pods, but that&amp;#39;s gonna be costly. So, we started looking for other solutions.&lt;/p&gt;
&lt;h2&gt;Cloudflare to the rescue&lt;/h2&gt;
&lt;p&gt;I&amp;#39;ve always known that people use Cloudflare to protect against DDoS attacks, but I&amp;#39;d never done it myself. After our initial response proved ineffective, I remembered that we were already using Cloudflare to point the domain to our load balancer. So why wasn&amp;#39;t Cloudflare blocking all this DDoS traffic?&lt;/p&gt;
&lt;p&gt;Our first mistake was that we didn&amp;#39;t &lt;a href=&quot;https://developers.cloudflare.com/dns/manage-dns-records/reference/proxied-dns-records/&quot;&gt;proxy the traffic through Cloudflare&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;After proxying the traffic through Cloudflare though, there is a delay before all the traffic is routed to the Cloudflare Anycast IPs because of the DNS TTL, which is 300 seconds by default. So, again, we waited... and the requests were still coming in after a while. Meanwhile, I enabled the &lt;a href=&quot;https://developers.cloudflare.com/fundamentals/reference/under-attack-mode/&quot;&gt;Under Attack mode&lt;/a&gt;, but honestly I&amp;#39;m not sure if this actually helped. After a while the requests were still not going down. It might take a while before Cloudflare can kick in to automatically mitigate the attack. In the meantime, we needed to do something.&lt;/p&gt;
&lt;h3&gt;Using the WAF Rules&lt;/h3&gt;
&lt;p&gt;On the WAF page on Cloudflare, there are three types of rules you can configure: Custom rules, Rate limiting rules and Managed rules.&lt;/p&gt;
&lt;h3&gt;Managed rules&lt;/h3&gt;
&lt;p&gt;After enabling the Under Attack mode, we also enabled the &lt;a href=&quot;https://developers.cloudflare.com/waf/managed-rules/&quot;&gt;Managed rules&lt;/a&gt; but they didn&amp;#39;t help much in our case. Managed rules block commonly used attacks, but in our case they didn&amp;#39;t block any of the DDoS traffic because the attack was targeting our API endpoint specifically, with request paths that weren&amp;#39;t covered by the managed ruleset. We left them turned on regardless since they might protect against other attacks in the future.&lt;/p&gt;
&lt;h3&gt;Custom rules: blocking by country&lt;/h3&gt;
&lt;p&gt;Since none of the above seemed to work, we needed a way to differentiate DDoS traffic from valid traffic. I know that we only operate in several countries in Southeast Asia. This means that all the traffic coming from outside those countries is from bots (read more below to see why this is not true). So, we added a Custom rule to only whitelist traffic coming from the countries that we operate in, and that worked, sorta. The requests hitting our LB dropped to around half, but that was still higher than usual and our ingress pods were still being overwhelmed.&lt;/p&gt;
&lt;p&gt;Then, we noticed from the Security Analytics page in Cloudflare that the requests still hitting the LB passed the rule because they were coming from one of the whitelisted countries. We could remove that country from our whitelist, but that would also mean blocking valid traffic from our customers in that country. So, we needed to come up with a new rule to block the DDoS traffic. What does a valid request have that the requests from the attacker don&amp;#39;t?&lt;/p&gt;
&lt;h3&gt;Custom rules: blocking using query params, headers, and user agent&lt;/h3&gt;
&lt;p&gt;We went through the nginx logs and noticed that the requests from the attacker were always using the same query params, so we created a new rule blocking requests with those query params and it worked! The requests hitting our LB dropped back to normal levels and we declared the incident finished. This didn&amp;#39;t last long tho. The next time we were hit with the attack, we noticed that the attacker was now using a different query param. Luckily, we had also put other rules in place.&lt;/p&gt;
&lt;p&gt;After the first attack, we analysed our valid requests and came up with other rules based on the &lt;strong&gt;Referer&lt;/strong&gt; and the &lt;strong&gt;X-Requested-With&lt;/strong&gt; headers. We also check the User-Agent and block the request if it&amp;#39;s similar to the ones seen in past attacks. So far this has been the most effective at blocking the attack. However, we know that this is not the final solution. If the attacker is determined enough, they can still look at valid requests and then spoof the values in their attack, but so far we haven&amp;#39;t seen this happen.&lt;/p&gt;
&lt;h3&gt;Custom rules: blocking known attacker IPs&lt;/h3&gt;
&lt;p&gt;After several rounds of attacks we noticed that the DDoS attacks were all coming from the same set of IPs, so we created a list of known attacker IPs on Cloudflare and blocked future requests coming from those IPs. Looking back, this rule was only effective for a little while. Once the attacker noticed that all their requests were blocked, they changed IPs, so the process just kept repeating over and over again.&lt;/p&gt;
&lt;h3&gt;Rate limiting rule&lt;/h3&gt;
&lt;p&gt;After we put in the custom rules, we also turned on the rate limiting rules. Unless you are on the Enterprise plan (it&amp;#39;s expensive XXXX), the rate limiting rule is pretty restricted. You can only rate limit by IP, but I think that is good enough. The rate limiting rules act as the last line of defense after a request passes all your other configured custom rules. Here is the &lt;a href=&quot;https://developers.cloudflare.com/waf/concepts/#rule-execution-order&quot;&gt;rule execution order&lt;/a&gt; for your reference.&lt;/p&gt;
&lt;p&gt;On the free plan, you get a 10-second counting period, but if you pay for other plans you&amp;#39;ll get more options. To me, a longer counting period helps avoid blocking valid requests. There might be a burst of activity from your users that causes an IP to hit the rate limit within 10 seconds, but measured over a longer period it&amp;#39;s still within a normal range. So, paying more is definitely worth it here.&lt;/p&gt;
&lt;p&gt;I definitely think that a rate limiting rule is a must if you&amp;#39;re fighting against DDoS. So, make sure to configure one.&lt;/p&gt;
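&lt;p&gt;One way to see the effect of the counting period on your own traffic is to compare each IP&amp;#39;s peak over 10-second and 60-second windows in existing access logs. This is a minimal sketch in DuckDB SQL, assuming the logs are already loaded into a table named &lt;code&gt;logs&lt;/code&gt; with &lt;code&gt;timestamp&lt;/code&gt; and &lt;code&gt;remoteIp&lt;/code&gt; columns (hypothetical names here; adjust them to your own schema):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-sql&quot;&gt;-- Requests per IP in fixed 10-second and 60-second buckets.
with w10 as (
    select remoteIp,
           time_bucket(interval &amp;#x27;10 seconds&amp;#x27;, timestamp) as bucket,
           count(*) as requests
    from logs
    group by all
),
w60 as (
    select remoteIp,
           time_bucket(interval &amp;#x27;60 seconds&amp;#x27;, timestamp) as bucket,
           count(*) as requests
    from logs
    group by all
),
-- Busiest bucket per IP for each window size.
peak10 as (select remoteIp, max(requests) as peak_10s from w10 group by remoteIp),
peak60 as (select remoteIp, max(requests) as peak_60s from w60 group by remoteIp)
select remoteIp, peak_10s, peak_60s
from peak10
join peak60 using (remoteIp)
order by peak_10s desc
limit 10;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An IP whose 60-second peak is far below six times its 10-second peak is bursty rather than sustained, which is the kind of client a short counting period risks blocking. The buckets here are fixed and aligned, so a burst that straddles a bucket boundary is undercounted; treat the numbers as a rough guide.&lt;/p&gt;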
&lt;h2&gt;Bonus&lt;/h2&gt;
&lt;h3&gt;Blocking our own IPs&lt;/h3&gt;
&lt;p&gt;This is one of those facepalm moments in my life. After the attacks passed, I spent some time exploring the events in Cloudflare Security Analytics to see if I could find any insights. From the requests that were not blocked by Cloudflare, I grouped the requests by IP and checked them. Requests from the top two IPs looked good: they had all the headers and referers we were expecting, but they were coming from a datacenter in Singapore. What made it more suspicious was that all the requests had user agents from mobile devices even though the IPs showed that they were coming from a datacenter. So, I informed my team and proceeded to block the requests. After a while, complaints started coming in saying our customers&amp;#39; requests were being blocked. One of my colleagues suspected that those were actually our IPs.&lt;/p&gt;
&lt;p&gt;We have NAT Gateways configured in our network, which means that requests originating from our own network will have one of the NAT Gateway IPs... and after comparing, they were indeed our NAT Gateway IPs. Apparently, one of our services proxies all the requests from the customers to another service, complete with all the headers and user agents. That&amp;#39;s why it was showing mobile device user agents even though the IP belonged to a datacenter.&lt;/p&gt;
&lt;p&gt;After this incident, we created a &lt;a href=&quot;https://developers.cloudflare.com/waf/tools/lists/&quot;&gt;list on Cloudflare&lt;/a&gt; containing all our known IPs and skip blocking for them to avoid confusion in the future.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/slack-suspicious-ips.png&quot; alt=&quot;Slack message&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Blocking accessibility bots (from US)&lt;/h3&gt;
&lt;p&gt;We also put in place a rate limiting rule for all the requests that were categorized as &lt;a href=&quot;https://radar.cloudflare.com/traffic/verified-bots&quot;&gt;Verified Bots&lt;/a&gt; by Cloudflare. After putting in all these rules, I try to regularly review the blocked requests to make sure they&amp;#39;re not false positives - valid requests that were blocked - to further optimize our rules. There was a bunch of requests coming from the US, which we already had a custom rule to block, but their user agent seemed a bit unusual - it contained the string &lt;code&gt;Google-Read-Aloud&lt;/code&gt;. After some googling, I found out that Google uses that user agent for their &lt;a href=&quot;https://developers.google.com/search/docs/crawling-indexing/read-aloud-user-agent&quot;&gt;text-to-speech feature&lt;/a&gt; for accessibility purposes.&lt;/p&gt;
&lt;p&gt;Having this user-agent does not necessarily mean that the requests are 100% valid because attackers can still &lt;a href=&quot;https://cheq.ai/blog/user-agent-spoofing/&quot;&gt;spoof their user agent&lt;/a&gt;. So, I&amp;#39;ll leave it up to you to decide if this is something that should be blocked but I think this is worth mentioning since in our fervor to prevent attackers from bringing down our systems we might also be hurting valid customers and affecting the accessibility of our service.&lt;/p&gt;
&lt;h3&gt;Silly mistake: external-dns&lt;/h3&gt;
&lt;p&gt;Fast forward a few days, we got hit again with a DDoS attack, but this time it was during lunch time, which is the peak hour for our customers. I thought, &amp;quot;Is the rule from last time not working anymore?&amp;quot;. We checked the Security Analytics page on Cloudflare and noticed that Cloudflare wasn&amp;#39;t blocking any requests even though we already had the rule from last time.&lt;/p&gt;
&lt;p&gt;I scratched my head for a bit wondering what we missed when my colleague pointed out that the DNS record for that domain wasn&amp;#39;t proxied. So, I turned it back on, but after a while it was turned back off. Then I remembered that we&amp;#39;re using &lt;a href=&quot;https://github.com/kubernetes-sigs/external-dns&quot;&gt;external-dns&lt;/a&gt; to automate the creation of our records on Cloudflare and it was configured to disable the proxy option. So, whenever we enabled the proxy option, external-dns reverted it, as it&amp;#39;s supposed to. We ended up turning off external-dns to make sure the proxy option would not get reverted. Read more below to know how you can turn on the proxy option on a per-Ingress basis.&lt;/p&gt;
&lt;h3&gt;Using annotation to proxy traffic through Cloudflare per Ingress&lt;/h3&gt;
&lt;p&gt;To enable proxying requests through Cloudflare on a per-Ingress basis when using external-dns, you can add the annotation &lt;code&gt;external-dns.alpha.kubernetes.io/cloudflare-proxied: &amp;quot;true&amp;quot;&lt;/code&gt; to the Ingress. If you have multiple Ingresses that all point to the same host, make sure to add the annotation to all of them. If not, external-dns will keep fighting itself, turning the proxy option on and off forever.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I&amp;#39;ve outlined several steps you can take if you&amp;#39;re facing DDoS attacks in the future. Despite the success in mitigating the attacks so far, we know that there is no forever solution to DDoS. We have to keep up with the attacker and play Whac-A-Mole until they get bored and stop the attacks. When fighting DDoS attacks, I find it helpful to log the requests and review them regularly to avoid false positives. It has been an eye-opening experience for me, and next time I&amp;#39;m asked how to protect against a DDoS, I can definitely say more than just &amp;quot;put it behind Cloudflare&amp;quot;.&lt;/p&gt;
</description>
    </item>
    <item>
      <title>Scrape cAdvisor using Grafana Alloy</title>
      <link>https://pokgak.xyz/articles/scrape-cadvisor-alloy/</link>
      <guid>https://pokgak.xyz/articles/scrape-cadvisor-alloy/</guid>
      <pubDate>Wed, 10 Jul 2024 09:45:00 GMT</pubDate>
      <description>&lt;p&gt;I was having some issues figuring out how to scrape cAdvisor metrics using Grafana Alloy. After googling I came across this k8s-monitoring helm chart and inside there is a configuration for scraping the built-in cAdvisor on the k8s kubelet.&lt;/p&gt;
&lt;p&gt;I ran Alloy as a single pod Deployment and it&amp;#39;ll scrape all the nodes in the cluster. Here&amp;#39;s the config that I used to get the metrics:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-hcl&quot;&gt;prometheus.remote_write &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;default&amp;quot;&lt;/span&gt; {
  endpoint {
    url &lt;span class=&quot;hljs-operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;https://mimir.example.com/api/v1/push&amp;quot;&lt;/span&gt;
  }
}

discovery.kubernetes &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;nodes&amp;quot;&lt;/span&gt; {
  role &lt;span class=&quot;hljs-operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;node&amp;quot;&lt;/span&gt;
}

discovery.relabel &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;cadvisor&amp;quot;&lt;/span&gt; {
  targets &lt;span class=&quot;hljs-operator&quot;&gt;=&lt;/span&gt; discovery.kubernetes.nodes.targets

  rule {
    replacement   &lt;span class=&quot;hljs-operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;/metrics/cadvisor&amp;quot;&lt;/span&gt;
    target_label  &lt;span class=&quot;hljs-operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;__metrics_path__&amp;quot;&lt;/span&gt;
  }
}

prometheus.scrape &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;cadvisor&amp;quot;&lt;/span&gt; {
  job_name   &lt;span class=&quot;hljs-operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;integrations/kubernetes/cadvisor&amp;quot;&lt;/span&gt;
  targets    &lt;span class=&quot;hljs-operator&quot;&gt;=&lt;/span&gt; discovery.relabel.cadvisor.output
  scheme     &lt;span class=&quot;hljs-operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;https&amp;quot;&lt;/span&gt;
  scrape_interval &lt;span class=&quot;hljs-operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;60s&amp;quot;&lt;/span&gt;
  bearer_token_file &lt;span class=&quot;hljs-operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;/var/run/secrets/kubernetes.io/serviceaccount/token&amp;quot;&lt;/span&gt;
  tls_config {
    insecure_skip_verify &lt;span class=&quot;hljs-operator&quot;&gt;=&lt;/span&gt; true
  }

  forward_to &lt;span class=&quot;hljs-operator&quot;&gt;=&lt;/span&gt; [prometheus.remote_write.default.receiver]
}&lt;/code&gt;&lt;/pre&gt;&lt;h3&gt;Alloy cadvisor exporter&lt;/h3&gt;
&lt;p&gt;Alloy provides the &lt;code&gt;prometheus.exporter.cadvisor&lt;/code&gt; component, which can be used to start a new cAdvisor on the nodes. This is not required if the kubelet running on your nodes already runs cAdvisor, which is the case for me on EKS running on Bottlerocket.&lt;/p&gt;
</description>
    </item>
    <item>
      <title>The hidden cost of running your own observability stack</title>
      <link>https://pokgak.xyz/articles/hidden-cost-lgtm/</link>
      <guid>https://pokgak.xyz/articles/hidden-cost-lgtm/</guid>
      <pubDate>Mon, 24 Jun 2024 05:00:00 GMT</pubDate>
      <description>&lt;p&gt;At my latest $job, I was tasked with setting up the LGTM stack (Loki, Grafana, Tempo, Mimir) for observability. Fast forward a few months, I noticed a hidden aspect of running the stack that I had not expected: the network cost, specifically the network transfer cost for cross AZ traffic. At one point we were paying more than $100 per day just for the cross AZ network traffic.&lt;/p&gt;
&lt;p&gt;Update: since this article was written, I&amp;#39;ve found out that the official Loki helm chart has a section addressing the cross AZ issue in the values file. It does not recommend running Loki across multiple AZs on the cloud.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note: This can be used to run Loki over multiple cloud provider availability zones however this is not currently
recommended as Loki is not optimized for this and cross zone network traffic costs can become extremely high
extremely quickly. Even with zone awareness enabled, it is recommended to run Loki in a single availability zone.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Cross AZ Traffic Amplification&lt;/h2&gt;
&lt;p&gt;While investigating where the traffic was coming from, I compared the load balancer &amp;quot;Processed Bytes&amp;quot; metric with the Cost Explorer usage for cross AZ traffic and noticed that the traffic I was actually charged for was about 10x the value reported by the load balancer. It baffled me a bit and made me step back and take a deeper look at the possible points where I&amp;#39;m getting charged.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Collector to load balancer node&lt;/li&gt;
&lt;li&gt;Load balancer node to ingress controller pod&lt;/li&gt;
&lt;li&gt;Ingress controller pod to distributor&lt;/li&gt;
&lt;li&gt;Distributor to ingester&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Collector to Load Balancer Node: client routing policy&lt;/h3&gt;
&lt;p&gt;In my setup, the services are exposed through a load balancer and given a DNS name like &lt;code&gt;loki.example.com&lt;/code&gt;. The collectors are configured to send the telemetry data to that URL. Here is my first mistake: I didn&amp;#39;t enable &amp;quot;Availability Zone affinity&amp;quot; for the client routing policy. When enabled, this will route traffic from the collector to the load balancer node in the same AZ, avoiding cross AZ traffic charges.&lt;/p&gt;
&lt;p&gt;The load balancer node will then forward the traffic to the ingress controller pod in the same AZ.&lt;/p&gt;
&lt;h3&gt;Ingress controller pod to distributor: Kubernetes Topology Aware Routing&lt;/h3&gt;
&lt;p&gt;From the load balancer node, the traffic will be forwarded to the k8s pod through the k8s service. By default, a k8s service spreads traffic across all of its pods, round-robin style, without considering which AZ they are in. This means that traffic from the ingress controller pod to the distributor pod will often cross AZs. If you have 3 distributor pods, one in each AZ, 2 out of 3 connections will be routed to a pod in a different AZ.&lt;/p&gt;
&lt;p&gt;To keep the traffic from crossing AZs, we can use the &lt;a href=&quot;https://kubernetes.io/docs/concepts/services-networking/topology-aware-routing/&quot;&gt;Kubernetes topology aware routing&lt;/a&gt; feature. The downside is that we need to have at least 3 pods in each AZ, but the extra compute is cheaper in my use case since I&amp;#39;m using spot instances through Karpenter and getting up to 70% discount on the node price.&lt;/p&gt;
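&lt;p&gt;To put a number on the effect described above, here is a minimal Python sketch. It assumes the pods are spread evenly over the AZs and that every pod is equally likely to be picked when topology aware routing is off; the function name &lt;code&gt;cross_az_fraction&lt;/code&gt; is made up for illustration and is not part of any Kubernetes tooling.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-python&quot;&gt;from fractions import Fraction


def cross_az_fraction(zones, topology_aware=False):
    """Expected share of requests that leave the caller's AZ.

    Assumption: backends are spread evenly over `zones` AZs and, without
    topology aware routing, each backend is equally likely to be picked.
    """
    if topology_aware:
        # Traffic is kept on backends in the caller's own AZ.
        return Fraction(0)
    return Fraction(zones - 1, zones)


print(cross_az_fraction(3))                       # 2/3
print(cross_az_fraction(3, topology_aware=True))  # 0&lt;/code&gt;&lt;/pre&gt;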
&lt;h3&gt;Distributor to Ingester: no workaround&lt;/h3&gt;
&lt;p&gt;This is the only part I haven&amp;#39;t solved. In the LGTM stack, the distributors use an &lt;a href=&quot;https://grafana.com/docs/loki/latest/get-started/hash-rings/#about-the-ingester-ring&quot;&gt;internal discovery mechanism&lt;/a&gt; to get the IPs of the ingesters. This means that we cannot use Kubernetes topology aware routing here.&lt;/p&gt;
&lt;p&gt;To make things worse, depending on the &lt;a href=&quot;https://grafana.com/docs/loki/latest/get-started/components/#replication-factor&quot;&gt;replication_factor&lt;/a&gt; configuration, each distributor might be sending the logs to multiple ingesters, each one multiplying the cross AZ cost that we have to pay.&lt;/p&gt;
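&lt;p&gt;As a rough back-of-the-envelope model of that multiplication, the sketch below estimates how much data crosses AZs on the distributor-to-ingester hop. It rests on a simplifying assumption that is mine, not Loki&amp;#39;s documented behaviour: every replica lands in a uniformly chosen AZ. The function name and the 100 GB figure are hypothetical.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-python&quot;&gt;def cross_az_gb(ingested_gb, replication_factor, zones):
    """Expected GB crossing AZs between distributors and ingesters.

    Assumption: each of the `replication_factor` copies lands in a
    uniformly chosen AZ, so (zones - 1) / zones of the copies leave the AZ.
    """
    copies_gb = ingested_gb * replication_factor
    return copies_gb * (zones - 1) / zones


# 100 GB of logs, 3 AZs: compare a single copy with replication_factor = 3.
print(cross_az_gb(100, 1, 3))
print(cross_az_gb(100, 3, 3))  # 200.0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Under these assumptions, a replication factor of 3 already sends twice the ingested volume across AZs on this hop alone.&lt;/p&gt;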
&lt;h2&gt;Special use case: getting logs from external source&lt;/h2&gt;
&lt;p&gt;Other than the above use case, our company also has another use case where the logs come from external sources instead of from our internal network. In this case, I actually managed to eliminate the cross AZ cost completely by deploying the LGTM stack in just one AZ. The load balancer is also configured to use only one subnet that is in the same AZ.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;All of the above factors might explain why the amount I was charged for cross AZ traffic is 10x bigger than the amount received at the load balancer. I outlined some of the possible points where the cross AZ charges are coming from and how to fix them. Hope it helps!&lt;/p&gt;
</description>
    </item>
    <item>
      <title>Tech Janitor: Linux Cookbook</title>
      <link>https://pokgak.xyz/articles/tech-janitor-linux-cookbook/</link>
      <guid>https://pokgak.xyz/articles/tech-janitor-linux-cookbook/</guid>
      <pubDate>Fri, 26 Jan 2024 19:14:00 GMT</pubDate>
      <description>&lt;p&gt;In this blog post I&amp;#39;ll be listing the tools that I frequently use in my day-to-day work, in a cookbook format: not just a random list of CLI tools, but grouped by the goals that I&amp;#39;m trying to achieve.&lt;/p&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Probing an instance from the outside&lt;/li&gt;
&lt;li&gt;Checking which service is running&lt;/li&gt;
&lt;li&gt;Checking the logs&lt;/li&gt;
&lt;li&gt;Monitor the system&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Probing an instance from the outside&lt;/h2&gt;
&lt;h3&gt;Check DNS record exists&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-bash&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;## dig &amp;lt;domain&amp;gt; &amp;lt;record-type&amp;gt;&lt;/span&gt;

&lt;span class=&quot;hljs-comment&quot;&gt;# when the record exists&lt;/span&gt;
$ dig pokgak.xyz +short
185.199.108.153
185.199.110.153
185.199.109.153
185.199.111.153

&lt;span class=&quot;hljs-comment&quot;&gt;# when the record doesn&amp;#x27;t exist&lt;/span&gt;
$ dig pokgak.abc +short
&lt;span class=&quot;hljs-comment&quot;&gt;# &amp;lt;empty&amp;gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;h3&gt;Check if a port is open&lt;/h3&gt;
&lt;p&gt;If you&amp;#39;re having trouble connecting to an instance, you can check if the port is open by using &lt;code&gt;telnet&lt;/code&gt;. This works for any protocol that runs over TCP, e.g. HTTP (80/443), SSH (22), Postgres (5432), Redis (6379), etc.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-bash&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;## telnet &amp;lt;host&amp;gt; &amp;lt;port&amp;gt;&lt;/span&gt;

&lt;span class=&quot;hljs-comment&quot;&gt;# when the port is open&lt;/span&gt;
$ telnet nuc 22
Trying 100.102.147.64...
Connected to nuc.hamster-nase.ts.net.
Escape character is &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;^]&amp;#x27;&lt;/span&gt;.
SSH-2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.6
^]
telnet&amp;gt; Connection closed.

&lt;span class=&quot;hljs-comment&quot;&gt;# when the port is closed&lt;/span&gt;
$ telnet nuc 80
Trying 100.102.147.64...
telnet: connect to address 100.102.147.64: Connection refused
telnet: Unable to connect to remote host&lt;/code&gt;&lt;/pre&gt;&lt;h2&gt;Checking which service is running&lt;/h2&gt;
&lt;h3&gt;Check which process is listening on any port&lt;/h3&gt;
&lt;p&gt;Sometimes you want to know which process is listening on a port. This can be done using &lt;code&gt;ss&lt;/code&gt;. Make sure to use &lt;code&gt;sudo&lt;/code&gt; or run it as root because it won&amp;#39;t show the process name if you don&amp;#39;t.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-bash&quot;&gt;$ &lt;span class=&quot;hljs-built_in&quot;&gt;sudo&lt;/span&gt; ss -ntlp
State    Recv-Q   Send-Q                     Local Address:Port        Peer Address:Port   Process
LISTEN   0        4096                      100.102.147.64:63177            0.0.0.0:*       &lt;span class=&quot;hljs-built_in&quot;&gt;users&lt;/span&gt;:((&amp;quot;tailscaled&amp;quot;,pid=&lt;span class=&quot;hljs-number&quot;&gt;771&lt;/span&gt;,fd=&lt;span class=&quot;hljs-number&quot;&gt;26&lt;/span&gt;))
LISTEN   0        128                              0.0.0.0:22               0.0.0.0:*       &lt;span class=&quot;hljs-built_in&quot;&gt;users&lt;/span&gt;:((&amp;quot;sshd&amp;quot;,pid=&lt;span class=&quot;hljs-number&quot;&gt;719&lt;/span&gt;,fd=&lt;span class=&quot;hljs-number&quot;&gt;3&lt;/span&gt;))
LISTEN   0        32                            10.45.81.1:53               0.0.0.0:*       &lt;span class=&quot;hljs-built_in&quot;&gt;users&lt;/span&gt;:((&amp;quot;dnsmasq&amp;quot;,pid=&lt;span class=&quot;hljs-number&quot;&gt;1047&lt;/span&gt;,fd=&lt;span class=&quot;hljs-number&quot;&gt;7&lt;/span&gt;))
LISTEN   0        4096                       127.0.0.53%lo:53               0.0.0.0:*       &lt;span class=&quot;hljs-built_in&quot;&gt;users&lt;/span&gt;:((&amp;quot;systemd-resolve&amp;quot;,pid=&lt;span class=&quot;hljs-number&quot;&gt;607&lt;/span&gt;,fd=&lt;span class=&quot;hljs-number&quot;&gt;14&lt;/span&gt;))
LISTEN   0        4096         [fd7a:115c:a1e0::a166:9340]:63177               [::]:*       &lt;span class=&quot;hljs-built_in&quot;&gt;users&lt;/span&gt;:((&amp;quot;tailscaled&amp;quot;,pid=&lt;span class=&quot;hljs-number&quot;&gt;771&lt;/span&gt;,fd=&lt;span class=&quot;hljs-number&quot;&gt;28&lt;/span&gt;))
LISTEN   0        128                                 [::]:22                  [::]:*       &lt;span class=&quot;hljs-built_in&quot;&gt;users&lt;/span&gt;:((&amp;quot;sshd&amp;quot;,pid=&lt;span class=&quot;hljs-number&quot;&gt;719&lt;/span&gt;,fd=&lt;span class=&quot;hljs-number&quot;&gt;4&lt;/span&gt;))&lt;/code&gt;&lt;/pre&gt;&lt;h3&gt;Check systemd service status&lt;/h3&gt;
&lt;p&gt;If you&amp;#39;re running a service using systemd, you can check the status of the service using &lt;code&gt;systemctl&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-bash&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;## systemctl status &amp;lt;service-name&amp;gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;h3&gt;Check if a process is running&lt;/h3&gt;
&lt;p&gt;This might be useful if you know what process/program name you&amp;#39;re looking for.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-base&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;## ps aux | grep &amp;lt;process-name&amp;gt;&lt;/span&gt;

$ ps aux | grep tailscale
root         771  0.3  0.2 1259760 43092 ?       Ssl  Jan25  11:45 /usr/sbin/tailscaled &lt;span class=&quot;hljs-attribute&quot;&gt;--state&lt;/span&gt;=/var/lib/tailscale/tailscaled.state &lt;span class=&quot;hljs-attribute&quot;&gt;--socket&lt;/span&gt;=/run/tailscale/tailscaled.sock &lt;span class=&quot;hljs-attribute&quot;&gt;--port&lt;/span&gt;=41641
pokgak     30489  0.0  0.0   7008  2304 pts/0    S+   12:43   0:00 grep &lt;span class=&quot;hljs-attribute&quot;&gt;--color&lt;/span&gt;=auto tailscal&lt;/code&gt;&lt;/pre&gt;&lt;h2&gt;Checking the logs&lt;/h2&gt;
&lt;h3&gt;Check systemd service logs&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;journalctl&lt;/code&gt; is a tool that can be used to check the logs of a systemd service. This is really useful when you&amp;#39;re trying to debug a systemd service that is not running or keeps failing.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-bash&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;## journalctl -fu &amp;lt;service-name&amp;gt;&lt;/span&gt;
$ journalctl -fu tailscaled&lt;/code&gt;&lt;/pre&gt;&lt;h3&gt;Check nginx logs&lt;/h3&gt;
&lt;p&gt;Some applications write their logs to a dedicated folder. On Linux, this is usually &lt;code&gt;/var/log/&amp;lt;app-name&amp;gt;&lt;/code&gt;. For example, nginx writes its logs to &lt;code&gt;/var/log/nginx&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;tail&lt;/code&gt; will get the last 10 lines of a file. You can use &lt;code&gt;-f&lt;/code&gt; to follow the file and print out the new lines that are added to the file.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-bash&quot;&gt;&lt;span class=&quot;hljs-built_in&quot;&gt;tail&lt;/span&gt; -f /var/log/nginx/access.log&lt;/code&gt;&lt;/pre&gt;&lt;h2&gt;Monitor the system&lt;/h2&gt;
&lt;h3&gt;Check CPU &amp;amp; Memory usage&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;htop&lt;/code&gt; is a tool that can be used to monitor the CPU and memory usage of a system. It&amp;#39;s like &lt;code&gt;top&lt;/code&gt; but with a better UI.&lt;/p&gt;
&lt;h3&gt;Check disk usage&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;df&lt;/code&gt; is a tool that can be used to check the disk usage of a system. &lt;code&gt;-h&lt;/code&gt; is used to make the output human readable. On Linux, your hard drive will usually be represented as &lt;code&gt;/dev/sda&lt;/code&gt;, &lt;code&gt;/dev/sdb&lt;/code&gt;, or another letter of the alphabet if you have more disks in the system. The number at the end of the device name is the partition number. For example, &lt;code&gt;/dev/sda1&lt;/code&gt; is the first partition of the first disk.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-bash&quot;&gt;$ &lt;span class=&quot;hljs-built_in&quot;&gt;df&lt;/span&gt; -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           1.6G  1.5M  1.6G   1% /run
efivarfs        128K   87K   37K  71% /sys/firmware/efi/efivars
/dev/sda2       457G   15G  419G   4% /
tmpfs           7.8G     0  7.8G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda1       1.1G  6.1M  1.1G   1% /boot/efi
tmpfs           1.6G  4.0K  1.6G   1% /run/user/1000&lt;/code&gt;&lt;/pre&gt;</description>
    </item>
    <item>
      <title>Using Steampipe + DuckDB for VPC Flow Logs Analysis</title>
      <link>https://pokgak.xyz/articles/steampipe-duckdb-flow-logs/</link>
      <guid>https://pokgak.xyz/articles/steampipe-duckdb-flow-logs/</guid>
      <pubDate>Fri, 26 Jan 2024 12:00:00 GMT</pubDate>
      <description>&lt;p&gt;As a so-called &lt;a href=&quot;https://x.com/tevanraj/status/1747920076203057273?s=20&quot;&gt;Tech Janitor&lt;/a&gt;, I&amp;#39;ve been tasked with cleaning up one of our AWS accounts at work, and that account has a bunch of EC2 instances whose purpose no one knows. So, I&amp;#39;ve decided to use one of AWS&amp;#39;s features, VPC Flow Logs, to first identify which EC2 instances are still being used and which are not.&lt;/p&gt;
&lt;h2&gt;Setting up the VPC Flow Logs and query using DuckDB&lt;/h2&gt;
&lt;p&gt;For our purpose, I&amp;#39;ve set up VPC flow logs to send all the traffic data to an S3 bucket that we&amp;#39;ll refer to as &lt;code&gt;vpc-flow-logs-bucket&lt;/code&gt; in this post. The flow logs are stored in Parquet format for querying later using &lt;a href=&quot;https://duckdb.org&quot;&gt;DuckDB&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Once the flow log files are sent to S3, I&amp;#39;ll be able to query them using DuckDB. To do that we will need to install the &lt;code&gt;aws&lt;/code&gt; and &lt;code&gt;httpfs&lt;/code&gt; extensions.&lt;/p&gt;
&lt;p&gt;From DuckDB shell:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;&lt;span class=&quot;hljs-meta prompt_&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;language-javascript&quot;&gt;&lt;span class=&quot;hljs-variable constant_&quot;&gt;INSTALL&lt;/span&gt; aws;&lt;/span&gt;
&lt;span class=&quot;hljs-meta prompt_&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;language-javascript&quot;&gt;&lt;span class=&quot;hljs-variable constant_&quot;&gt;INSTALL&lt;/span&gt; httpfs;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;We also need to load our AWS credentials into DuckDB. Luckily, DuckDB has a built-in command to do that:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;&lt;span class=&quot;hljs-operator&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;CALL&lt;/span&gt; load_aws_credentials();
┌──────────────────────┬──────────────────────────┬──────────────────────┬───────────────┐
│ loaded_access_key_id │ loaded_secret_access_key │ loaded_session_token │ loaded_region │
│       &lt;span class=&quot;hljs-type&quot;&gt;varchar&lt;/span&gt;        │         &lt;span class=&quot;hljs-type&quot;&gt;varchar&lt;/span&gt;          │       &lt;span class=&quot;hljs-type&quot;&gt;varchar&lt;/span&gt;        │    &lt;span class=&quot;hljs-type&quot;&gt;varchar&lt;/span&gt;    │
├──────────────────────┼──────────────────────────┼──────────────────────┼───────────────┤
│ &lt;span class=&quot;hljs-operator&quot;&gt;&amp;lt;&lt;/span&gt;redacted&lt;span class=&quot;hljs-operator&quot;&gt;&amp;gt;&lt;/span&gt;           │ &lt;span class=&quot;hljs-operator&quot;&gt;&amp;lt;&lt;/span&gt;redacted&lt;span class=&quot;hljs-operator&quot;&gt;&amp;gt;&lt;/span&gt;               │                      │ eu&lt;span class=&quot;hljs-operator&quot;&gt;-&lt;/span&gt;west&lt;span class=&quot;hljs-number&quot;&gt;-1&lt;/span&gt;     │
└──────────────────────┴──────────────────────────┴──────────────────────┴───────────────┘&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This will look for your AWS credentials based on the standard AWS credentials file location. If you have multiple profiles in your credentials file, you can specify which profile to use by passing the profile name as an argument to the &lt;code&gt;load_aws_credentials&lt;/code&gt; function.&lt;/p&gt;
&lt;p&gt;Now it&amp;#39;s time to load our VPC flow logs from S3 into a table in DuckDB. You can replace the &lt;code&gt;year/month/day/hour&lt;/code&gt; with the actual date and hour of the flow logs that you want to load or use &lt;code&gt;*&lt;/code&gt; for any or all of them to load all the flow logs. I&amp;#39;ll be loading all the flow log records into a table &lt;code&gt;flow_logs&lt;/code&gt; in DuckDB.&lt;/p&gt;
&lt;p&gt;This might take a while since DuckDB will have to download the Parquet files from S3 and load them into memory. It took several minutes to finish loading in my case.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;&amp;gt; CREATE TABLE flow_logs AS SELECT * &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; read_parquet(&amp;#x27;s3://vpc-flow-logs-bucket/AWSLogs/&lt;span class=&quot;hljs-variable&quot;&gt;&amp;lt;aws-account-id&amp;gt;&lt;/span&gt;/vpcflowlogs/&lt;span class=&quot;hljs-variable&quot;&gt;&amp;lt;region&amp;gt;&lt;/span&gt;/&lt;span class=&quot;hljs-variable&quot;&gt;&amp;lt;year&amp;gt;&lt;/span&gt;/&lt;span class=&quot;hljs-variable&quot;&gt;&amp;lt;month&amp;gt;&lt;/span&gt;/&lt;span class=&quot;hljs-variable&quot;&gt;&amp;lt;day&amp;gt;&lt;/span&gt;/&lt;span class=&quot;hljs-variable&quot;&gt;&amp;lt;hour&amp;gt;&lt;/span&gt;/*.parquet&amp;#x27;)&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Now we can see that the flow log records only contain the network interface ID (ENI) of the EC2 instance, not the EC2 instance ID or name itself. That won&amp;#39;t be enough for my use case since I want to identify which traffic is flowing to which EC2 instance. Therefore, we need to correlate the ENI with the EC2 instance ID, and here&amp;#39;s where Steampipe comes in.&lt;/p&gt;
&lt;h2&gt;Steampipe: directly query your APIs from SQL&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://steampipe.io&quot;&gt;Steampipe&lt;/a&gt; is a tool that allows you to query APIs from SQL. It supports a lot of different APIs from AWS, GCP, Azure, GitHub, etc. You can also write your own plugins to support other APIs. I&amp;#39;ll be using it to query my AWS account for the EC2 instance ID and name based on the ENI ID from the VPC flow logs.&lt;/p&gt;
&lt;h2&gt;Life before Steampipe&lt;/h2&gt;
&lt;p&gt;Usually, to do the things I&amp;#39;m about to show below, I&amp;#39;d pull the data from AWS using the aws-cli and then massage it using &lt;code&gt;jq/yq/awk/sed&lt;/code&gt; or, if I&amp;#39;m desperate, maybe Python. Then I&amp;#39;d use some other tools to visualize it or export it to CSV. With Steampipe, pulling the data from AWS is simple, and using SQL to correlate the data with other information sources is a breeze.&lt;/p&gt;
&lt;h2&gt;Steampipe is just PostgreSQL&lt;/h2&gt;
&lt;p&gt;Under the hood, Steampipe runs PostgreSQL. It even lets you run it as a standalone instance in the background and &lt;a href=&quot;https://steampipe.io/docs/query/third-party&quot;&gt;connect to it from any third-party tool&lt;/a&gt; that can connect to a PostgreSQL instance. Here&amp;#39;s where it gets interesting: DuckDB can &lt;a href=&quot;https://duckdb.org/docs/extensions/postgres.html&quot;&gt;connect to any PostgreSQL database&lt;/a&gt; through its PostgreSQL extension and query it as if the data inside that database lived in DuckDB itself. This means that we can use Steampipe as a data source for DuckDB and access all of the AWS resource data available in Steampipe.&lt;/p&gt;
&lt;h2&gt;Setting up Steampipe and DuckDB connection&lt;/h2&gt;
&lt;p&gt;To run Steampipe in service mode, run the following command to start the PostgreSQL instance and get the credentials for connecting to it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;&lt;span class=&quot;hljs-string&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;steampipe&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;service&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;start&lt;/span&gt;
&lt;span class=&quot;hljs-attr&quot;&gt;Database:&lt;/span&gt;

  &lt;span class=&quot;hljs-attr&quot;&gt;Host(s):&lt;/span&gt;            &lt;span class=&quot;hljs-number&quot;&gt;127.0&lt;/span&gt;&lt;span class=&quot;hljs-number&quot;&gt;.0&lt;/span&gt;&lt;span class=&quot;hljs-number&quot;&gt;.1&lt;/span&gt;&lt;span class=&quot;hljs-string&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;::1,&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;2606&lt;/span&gt;&lt;span class=&quot;hljs-string&quot;&gt;:4700:110:8818:e17b:f78c:6c52:dccb,&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;172.16&lt;/span&gt;&lt;span class=&quot;hljs-number&quot;&gt;.0&lt;/span&gt;&lt;span class=&quot;hljs-number&quot;&gt;.2&lt;/span&gt;&lt;span class=&quot;hljs-string&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;2001&lt;/span&gt;&lt;span class=&quot;hljs-string&quot;&gt;:f40:909:8e2:207a:634a:2070:d99d,&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;2001&lt;/span&gt;&lt;span class=&quot;hljs-string&quot;&gt;:f40:909:8e2:1cdb:75da:2a70:4b05,&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;192.168&lt;/span&gt;&lt;span class=&quot;hljs-number&quot;&gt;.100&lt;/span&gt;&lt;span class=&quot;hljs-number&quot;&gt;.23&lt;/span&gt;&lt;span class=&quot;hljs-string&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;127.0&lt;/span&gt;&lt;span class=&quot;hljs-number&quot;&gt;.2&lt;/span&gt;&lt;span class=&quot;hljs-number&quot;&gt;.3&lt;/span&gt;&lt;span class=&quot;hljs-string&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;127.0&lt;/span&gt;&lt;span class=&quot;hljs-number&quot;&gt;.2&lt;/span&gt;&lt;span class=&quot;hljs-number&quot;&gt;.2&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;Port:&lt;/span&gt;               &lt;span class=&quot;hljs-number&quot;&gt;9193&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;Database:&lt;/span&gt;           &lt;span class=&quot;hljs-string&quot;&gt;steampipe&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;User:&lt;/span&gt;               &lt;span class=&quot;hljs-string&quot;&gt;steampipe&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;Password:&lt;/span&gt;           &lt;span class=&quot;hljs-string&quot;&gt;*********&lt;/span&gt; [&lt;span class=&quot;hljs-string&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;--show-password&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;to&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;reveal&lt;/span&gt;]
  &lt;span class=&quot;hljs-attr&quot;&gt;Connection string:&lt;/span&gt;  &lt;span class=&quot;hljs-string&quot;&gt;postgres://steampipe@127.0.0.1:9193/steampipe&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Then inside DuckDB shell, you can connect to the Steampipe PostgreSQL instance using the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;&lt;span class=&quot;hljs-meta prompt_&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;language-javascript&quot;&gt;&lt;span class=&quot;hljs-variable constant_&quot;&gt;ATTACH&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;dbname=steampipe user=steampipe password=23e2_4853_bd96 host=127.0.0.1 port=9193&amp;#x27;&lt;/span&gt; &lt;span class=&quot;hljs-variable constant_&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-title function_&quot;&gt;steampipe&lt;/span&gt; (&lt;span class=&quot;hljs-variable constant_&quot;&gt;TYPE&lt;/span&gt; postgres);&lt;/span&gt;
&lt;span class=&quot;hljs-meta prompt_&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;language-javascript&quot;&gt;use steampipe.&lt;span class=&quot;hljs-property&quot;&gt;aws&lt;/span&gt;;&lt;/span&gt;
&lt;span class=&quot;hljs-meta prompt_&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;language-javascript&quot;&gt;&lt;span class=&quot;hljs-variable constant_&quot;&gt;SHOW&lt;/span&gt; tables;&lt;/span&gt;
show tables;
┌─────────────────────────────────────────────┐
│                    name                     │
│                   varchar                   │
├─────────────────────────────────────────────┤
│ aws_accessanalyzer_analyzer                 │
│ aws_account                                 │
│ aws_account_alternate_contact               │
&lt;span class=&quot;hljs-meta prompt_&quot;&gt;...&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Now we can see all the tables from the Steampipe PostgreSQL instance. For my use case I&amp;#39;ll be using the &lt;code&gt;aws_ec2_network_interface&lt;/code&gt; table, which contains both the network interface ID (ENI) and the EC2 instance ID, so I can &lt;code&gt;JOIN&lt;/code&gt; it with the VPC flow log records to map each record to an EC2 instance ID.&lt;/p&gt;
&lt;h2&gt;JOIN-ing it all together&lt;/h2&gt;
&lt;p&gt;Here&amp;#39;s an example query that will give me the count of all incoming traffic to the instances grouped by the port number:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;select&lt;/span&gt;
    i.&lt;span class=&quot;hljs-built_in&quot;&gt;title&lt;/span&gt;,
    fl.dstport,
    &lt;span class=&quot;hljs-built_in&quot;&gt;count&lt;/span&gt;(fl.dstaddr) traffic
&lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; aws_ec2_network_interface ni
&lt;span class=&quot;hljs-keyword&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;join&lt;/span&gt; memory.flow_logs fl &lt;span class=&quot;hljs-keyword&quot;&gt;on&lt;/span&gt; fl.interface_id = ni.network_interface_id
&lt;span class=&quot;hljs-keyword&quot;&gt;left&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;join&lt;/span&gt; aws_ec2_instance i &lt;span class=&quot;hljs-keyword&quot;&gt;on&lt;/span&gt; i.instance_id = ni.attached_instance_id
&lt;span class=&quot;hljs-keyword&quot;&gt;where&lt;/span&gt;
    fl.dstaddr = i.private_ip_address
&lt;span class=&quot;hljs-keyword&quot;&gt;group&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;by&lt;/span&gt; i.&lt;span class=&quot;hljs-built_in&quot;&gt;title&lt;/span&gt;, fl.dstport
&lt;span class=&quot;hljs-keyword&quot;&gt;order&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;by&lt;/span&gt; traffic &lt;span class=&quot;hljs-keyword&quot;&gt;desc&lt;/span&gt;, dstport &lt;span class=&quot;hljs-keyword&quot;&gt;asc&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;From this information I&amp;#39;ll be able to guess which service is running on those instances and take the next step towards migrating or deprecating the instances.&lt;/p&gt;
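&lt;p&gt;If you want to get a feel for what the join does without an AWS account, here is a small self-contained Python sketch that replays the same idea against an in-memory SQLite database. The table names mirror the ones used above, but the columns are trimmed down, every row is made up, and SQLite stands in for DuckDB and Steampipe purely for illustration.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-python&quot;&gt;import sqlite3

# Toy stand-ins for the real tables; every row here is invented.
con = sqlite3.connect(":memory:")
con.executescript("""
create table flow_logs (interface_id text, dstaddr text, dstport integer);
create table aws_ec2_network_interface (network_interface_id text, attached_instance_id text);
create table aws_ec2_instance (instance_id text, title text, private_ip_address text);

insert into aws_ec2_instance values ('i-1', 'web', '10.0.0.5');
insert into aws_ec2_network_interface values ('eni-1', 'i-1');
insert into flow_logs values
    ('eni-1', '10.0.0.5', 443),   -- inbound HTTPS
    ('eni-1', '10.0.0.5', 443),   -- inbound HTTPS
    ('eni-1', '10.0.0.5', 22),    -- inbound SSH
    ('eni-1', '10.0.9.9', 5432);  -- outbound, dropped by the where clause
""")

# ENI to instance, then keep only records whose destination is the instance.
rows = con.execute("""
select i.title, fl.dstport, count(fl.dstaddr) as traffic
from aws_ec2_network_interface ni
join flow_logs fl on fl.interface_id = ni.network_interface_id
join aws_ec2_instance i on i.instance_id = ni.attached_instance_id
where fl.dstaddr = i.private_ip_address
group by i.title, fl.dstport
order by traffic desc, fl.dstport asc
""").fetchall()

print(rows)  # [('web', 443, 2), ('web', 22, 1)]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The outbound record disappears because its destination address is not the instance&amp;#39;s private IP, which is what restricts the count to incoming traffic.&lt;/p&gt;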
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;It is kinda mindblowing that I can do all this using SQL. Both Steampipe and DuckDB are great products and the flexibility of those tools allows me to
pick and choose the best tool for the job. I first came across Steampipe in one of the podcasts that I listen to but haven&amp;#39;t really used it much. Now, after having the opportunity to use it to solve one of my problems, I&amp;#39;ll definitely pay more attention to it to make my tech janitor life easier in the future ;)&lt;/p&gt;
</description>
    </item>
    <item>
      <title>Terraform modules: be opinionated</title>
      <link>https://pokgak.xyz/articles/terraform-modules-opinionated/</link>
      <guid>https://pokgak.xyz/articles/terraform-modules-opinionated/</guid>
      <pubDate>Fri, 08 Dec 2023 09:21:59 GMT</pubDate>
      <description>&lt;blockquote&gt;
&lt;p&gt;Modules are containers for multiple resources that are used together.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Terraform modules are a way to bundle a bunch of Terraform resources into one group. Although not explicitly mentioned in the definition, they are also a way to provide an &lt;strong&gt;abstraction&lt;/strong&gt; over the resources inside the module and only expose inputs and outputs that are relevant to the users of the module.&lt;/p&gt;
&lt;h2&gt;Terraform Module as an Abstraction Layer&lt;/h2&gt;
&lt;p&gt;A Terraform resource tends to be generic in that it allows you to configure it in multiple ways through the arguments it accepts. For example, the &lt;code&gt;aws_mskconnect_connector&lt;/code&gt; resource has &lt;a href=&quot;https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/mskconnect_connector#worker_log_delivery-configuration-block&quot;&gt;3 options&lt;/a&gt; for log delivery: CloudWatch Logs, Kinesis Data Firehose, or S3. In most cases, your module &lt;strong&gt;shouldn&amp;#39;t&lt;/strong&gt; expose all 3 options to your users. Your organization probably already has a standard place where you send your logs, e.g. S3, from where they get forwarded to a log search service for later use.&lt;/p&gt;
&lt;p&gt;Therefore, your module should only expose the S3 option and leave out CloudWatch Logs and Kinesis Data Firehose from your module. By doing this you eliminate the choice from the user and they don&amp;#39;t have to think which one to use. Your Terraform code can be a lot simpler too.&lt;/p&gt;
&lt;h2&gt;Module Abstraction in Action&lt;/h2&gt;
&lt;p&gt;Here&amp;#39;s what the module code looks like when all 3 options are included:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-terraform&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;variable&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;log_delivery_s3&amp;quot;&lt;/span&gt; {
  type &lt;span class=&quot;hljs-comment&quot;&gt;= object({&lt;/span&gt;
    bucket_arn &lt;span class=&quot;hljs-comment&quot;&gt;= string&lt;/span&gt;
    prefix     &lt;span class=&quot;hljs-comment&quot;&gt;= optional(string,&lt;/span&gt; &lt;span class=&quot;hljs-comment&quot;&gt;&amp;quot;mskconnect-logs&amp;quot;&lt;/span&gt;&lt;span class=&quot;hljs-comment&quot;&gt;)&lt;/span&gt;
  })

  default &lt;span class=&quot;hljs-comment&quot;&gt;= null&lt;/span&gt;
}

&lt;span class=&quot;hljs-keyword&quot;&gt;variable&lt;/span&gt; &lt;span class=&quot;hljs-comment&quot;&gt;&amp;quot;log_delivery_cloudwatch_logs&amp;quot;&lt;/span&gt; &lt;span class=&quot;hljs-comment&quot;&gt;{&lt;/span&gt;
  type &lt;span class=&quot;hljs-comment&quot;&gt;= object({&lt;/span&gt;
    log_group &lt;span class=&quot;hljs-comment&quot;&gt;= string&lt;/span&gt;
  })

  default &lt;span class=&quot;hljs-comment&quot;&gt;= null&lt;/span&gt;
}

&lt;span class=&quot;hljs-keyword&quot;&gt;variable&lt;/span&gt; &lt;span class=&quot;hljs-comment&quot;&gt;&amp;quot;log_delivery_firehose&amp;quot;&lt;/span&gt; &lt;span class=&quot;hljs-comment&quot;&gt;{&lt;/span&gt;
  type &lt;span class=&quot;hljs-comment&quot;&gt;= object({&lt;/span&gt;
    delivery_stream &lt;span class=&quot;hljs-comment&quot;&gt;= string&lt;/span&gt;
  })

  default &lt;span class=&quot;hljs-comment&quot;&gt;= null&lt;/span&gt;
}

resource &lt;span class=&quot;hljs-comment&quot;&gt;&amp;quot;aws_mskconnect_connector&amp;quot;&lt;/span&gt; &lt;span class=&quot;hljs-comment&quot;&gt;&amp;quot;example&amp;quot;&lt;/span&gt; &lt;span class=&quot;hljs-comment&quot;&gt;{&lt;/span&gt;
  name &lt;span class=&quot;hljs-comment&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;hljs-comment&quot;&gt;&amp;quot;example&amp;quot;&lt;/span&gt;

  log_delivery &lt;span class=&quot;hljs-comment&quot;&gt;{&lt;/span&gt;
    worker_log_delivery &lt;span class=&quot;hljs-comment&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;hljs-comment&quot;&gt;# the body of a dynamic block must be wrapped in a content block&lt;/span&gt;
      dynamic &lt;span class=&quot;hljs-comment&quot;&gt;&amp;quot;s3&amp;quot;&lt;/span&gt; &lt;span class=&quot;hljs-comment&quot;&gt;{&lt;/span&gt;
        for_each &lt;span class=&quot;hljs-comment&quot;&gt;= var.log_delivery_s3 != null ? [1] : []&lt;/span&gt;

        content {
          enabled    &lt;span class=&quot;hljs-comment&quot;&gt;= true&lt;/span&gt;
          bucket_arn &lt;span class=&quot;hljs-comment&quot;&gt;= var.log_delivery_s3.bucket_arn&lt;/span&gt;
          prefix     &lt;span class=&quot;hljs-comment&quot;&gt;= var.log_delivery_s3.prefix&lt;/span&gt;
        }
      }

      dynamic &lt;span class=&quot;hljs-comment&quot;&gt;&amp;quot;cloudwatch_logs&amp;quot;&lt;/span&gt; &lt;span class=&quot;hljs-comment&quot;&gt;{&lt;/span&gt;
        for_each &lt;span class=&quot;hljs-comment&quot;&gt;= var.log_delivery_cloudwatch_logs != null ? [1] : []&lt;/span&gt;

        content {
          enabled   &lt;span class=&quot;hljs-comment&quot;&gt;= true&lt;/span&gt;
          log_group &lt;span class=&quot;hljs-comment&quot;&gt;= var.log_delivery_cloudwatch_logs.log_group&lt;/span&gt;
        }
      }

      dynamic &lt;span class=&quot;hljs-comment&quot;&gt;&amp;quot;firehose&amp;quot;&lt;/span&gt; &lt;span class=&quot;hljs-comment&quot;&gt;{&lt;/span&gt;
        for_each &lt;span class=&quot;hljs-comment&quot;&gt;= var.log_delivery_firehose != null ? [1] : []&lt;/span&gt;

        content {
          enabled         &lt;span class=&quot;hljs-comment&quot;&gt;= true&lt;/span&gt;
          delivery_stream &lt;span class=&quot;hljs-comment&quot;&gt;= var.log_delivery_firehose.delivery_stream&lt;/span&gt;
        }
      }
    }
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;And here&amp;#39;s an example where we only offer the S3 option:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-terraform&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;variable&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;log_delivery_s3&amp;quot;&lt;/span&gt; {
  type &lt;span class=&quot;hljs-comment&quot;&gt;= object({&lt;/span&gt;
    bucket_arn &lt;span class=&quot;hljs-comment&quot;&gt;= string&lt;/span&gt;
    prefix     &lt;span class=&quot;hljs-comment&quot;&gt;= optional(string,&lt;/span&gt; &lt;span class=&quot;hljs-comment&quot;&gt;&amp;quot;mskconnect-logs&amp;quot;&lt;/span&gt;&lt;span class=&quot;hljs-comment&quot;&gt;)&lt;/span&gt;
  })

  default &lt;span class=&quot;hljs-comment&quot;&gt;= null&lt;/span&gt;
}

resource &lt;span class=&quot;hljs-comment&quot;&gt;&amp;quot;aws_mskconnect_connector&amp;quot;&lt;/span&gt; &lt;span class=&quot;hljs-comment&quot;&gt;&amp;quot;example&amp;quot;&lt;/span&gt; &lt;span class=&quot;hljs-comment&quot;&gt;{&lt;/span&gt;
  name &lt;span class=&quot;hljs-comment&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;hljs-comment&quot;&gt;&amp;quot;example&amp;quot;&lt;/span&gt;

  log_delivery &lt;span class=&quot;hljs-comment&quot;&gt;{&lt;/span&gt;
    worker_log_delivery &lt;span class=&quot;hljs-comment&quot;&gt;{&lt;/span&gt;
      s3 &lt;span class=&quot;hljs-comment&quot;&gt;{&lt;/span&gt;
        enabled    &lt;span class=&quot;hljs-comment&quot;&gt;= var.log_delivery_s3 != null&lt;/span&gt;
        bucket_arn &lt;span class=&quot;hljs-comment&quot;&gt;= var.log_delivery_s3.bucket_arn&lt;/span&gt;
        prefix     &lt;span class=&quot;hljs-comment&quot;&gt;= var.log_delivery_s3.prefix&lt;/span&gt;
      }
    }
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;See how much simpler and shorter the code becomes? No need to use those &lt;code&gt;dynamic&lt;/code&gt; blocks just to conditionally add or remove the log delivery option block anymore.&lt;/p&gt;
&lt;h2&gt;But I might need to use that other option in the future...&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://martinfowler.com/bliki/Yagni.html&quot;&gt;YAGNI.&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    <item>
      <title>Access internal kubernetes services from anywhere using Tailscale</title>
      <link>https://pokgak.xyz/articles/k8s-cluster-access-tailscale/</link>
      <guid>https://pokgak.xyz/articles/k8s-cluster-access-tailscale/</guid>
      <pubDate>Mon, 11 Sep 2023 18:05:00 GMT</pubDate>
      <description>&lt;p&gt;&lt;img src=&quot;/images/k8s-external-dns-tailscale.png&quot; alt=&quot;Full flow&quot;&gt;&lt;/p&gt;
&lt;p&gt;I recently set up a local Kubernetes cluster in my home network to play with, and one of the issues I faced is that it is hard to access the services inside the cluster from my laptop. I don&amp;#39;t have a load balancer in my setup, so every time I want to access a service from my laptop, I have to run &lt;code&gt;kubectl port-forward&lt;/code&gt; first and then use the localhost address to access it. It works but it&amp;#39;s annoying.&lt;/p&gt;
&lt;p&gt;Usually in cloud environments like AWS, you would set up an &lt;a href=&quot;https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/&quot;&gt;ingress-controller&lt;/a&gt; that provisions a load balancer for you and uses that load balancer to expose the services inside your cluster to the internet through Ingress resources. Incoming traffic from the internet is then routed through the load balancer into your cluster and onto your pods. Unfortunately, you don&amp;#39;t get the same thing when hosting your cluster locally, outside of a cloud environment. You have to manually configure your network to allow access from the internet.&lt;/p&gt;
&lt;h3&gt;So what options do we have?&lt;/h3&gt;
&lt;p&gt;One way to do it is to use a Service of type NodePort, which opens a port on the host so you can reach the pods through the host IP. This makes services inside the cluster reachable from your local network, but still not from the internet. To allow access from the internet, you&amp;#39;ll have to open a port on your router and forward it to the host IP and port of the NodePort service.&lt;/p&gt;
&lt;p&gt;I&amp;#39;m not a fan of opening my home network to the internet. Home routers are infamous for being vulnerable and easily exploitable. I don&amp;#39;t want mine to be part of a new legion of botnets that will &lt;a href=&quot;https://blog.cloudflare.com/cloudflare-mitigates-record-breaking-71-million-request-per-second-ddos-attack/&quot;&gt;break a new record for biggest DDoS attack&lt;/a&gt;. Using the NodePort service type is also not that great. With a NodePort service, you have to specify the node IP along with the assigned port, and your traffic will always enter the cluster through that node. More reasons why NodePort is a bad idea are on &lt;a href=&quot;https://devops.stackexchange.com/a/17084&quot;&gt;DevOps Stack Exchange&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;What other option do we have? &lt;a href=&quot;https://tailscale.com&quot;&gt;Tailscale&lt;/a&gt;!&lt;/p&gt;
&lt;h3&gt;Tailscale Subnet Router&lt;/h3&gt;
&lt;p&gt;Tailscale is a mesh VPN built on top of &lt;a href=&quot;https://www.wireguard.com/&quot;&gt;WireGuard&lt;/a&gt;. I&amp;#39;ve been using it for a long time for accessing my personal servers at home while I&amp;#39;m outside and I love it. It is so simple to set up that you don&amp;#39;t have to know any networking magic to use it. Tailscale will create a peer-to-peer network from your client to your other Tailscale devices and it is also really smart in figuring out a way to punch a hole through your home network (&lt;a href=&quot;#resources&quot;&gt;see the Resources section&lt;/a&gt;) to connect to the internet so you don&amp;#39;t have to open a port on your home router anymore. Bot legion problem solved!&lt;/p&gt;
&lt;p&gt;One of the ways you can use Tailscale is by configuring a Tailscale node as a &lt;strong&gt;subnet router&lt;/strong&gt;. Usually, when you have 10 devices in your network, you&amp;#39;ll have to install Tailscale on each of those devices to connect them to your VPN. With a subnet router, a single Tailscale node in that network is enough, as long as the subnet router node has network access to all the devices in that network. You&amp;#39;ll have to configure your subnet router to advertise the route of the internal cluster network that it sits in as a CIDR range, e.g. &lt;code&gt;10.43.0.0/16&lt;/code&gt;, so that devices outside of that network know to go through the subnet router when they want to reach an IP address in that CIDR range.&lt;/p&gt;
&lt;h4&gt;ELI5: subnet router advertisement&lt;/h4&gt;
&lt;p&gt;You&amp;#39;re a postman trying to deliver a parcel. The parcel&amp;#39;s destination is unit A-1-2-3 in the TRX Exchange 106 building. You&amp;#39;ve never been to TRX before so you don&amp;#39;t know which floor the office is on, but you notice a big signboard at the reception saying &amp;quot;Come here if you have a parcel for unit A-1-0-0 to A-9-9-9&amp;quot;. So, you go to the reception and the nice lady there gives you directions to office unit A-1-2-3 so you can deliver your parcel.&lt;/p&gt;
&lt;p&gt;The reception here is like our subnet router. All the traffic meant for the network has to go through the subnet router first, then it is passed on to the actual packet destination.&lt;/p&gt;
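&lt;p&gt;To make the advertisement idea concrete, here is a minimal Python sketch of the decision a device makes before sending a packet: if the destination falls inside an advertised CIDR range, the packet goes to the subnet router. It only uses the standard &lt;code&gt;ipaddress&lt;/code&gt; module. The function name &lt;code&gt;goes_via_subnet_router&lt;/code&gt; and the test addresses are made up for illustration, and this is not how Tailscale itself implements routing.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-python&quot;&gt;import ipaddress

# Route advertised by the subnet router (the example CIDR from the text).
ADVERTISED_ROUTES = [ipaddress.ip_network('10.43.0.0/16')]


def goes_via_subnet_router(destination):
    # True when the destination falls inside one of the advertised routes.
    address = ipaddress.ip_address(destination)
    return any(address in route for route in ADVERTISED_ROUTES)


print(goes_via_subnet_router('10.43.170.163'))  # True
print(goes_via_subnet_router('192.168.1.20'))   # False&lt;/code&gt;&lt;/pre&gt;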
&lt;h3&gt;I don&amp;#39;t want to remember all these IP addresses&lt;/h3&gt;
&lt;p&gt;With a subnet router, you can now reach any of the services inside the cluster using the ClusterIP of that service. But IP addresses are not human-friendly and you don&amp;#39;t want to (you can&amp;#39;t!) memorize the IP addresses of all the services inside the cluster. So, now we need something that will map our IP addresses to a human-friendly format. Sounds familiar? We can use DNS records.&lt;/p&gt;
&lt;p&gt;You can definitely create DNS records manually and map them to the ClusterIP of each of your services. That&amp;#39;s actually what I did for testing when validating this setup. At scale, that won&amp;#39;t work tho. You don&amp;#39;t want to be the one to manually go to your DNS registrar and create the records one by one. Luckily, in kubernetes there is an application called external-dns.&lt;/p&gt;
&lt;h3&gt;external-dns to the rescue&lt;/h3&gt;
&lt;p&gt;external-dns is an application that runs inside your kubernetes cluster and it periodically queries the kubernetes API for the list of all Service and Ingress resources. From the list, it checks whether it should create a DNS record for the resources based on the resource annotation. It supports a lot of DNS providers like AWS Route53, Cloudflare, Google Cloud DNS and more. For my setup I&amp;#39;m using Cloudflare.&lt;/p&gt;
&lt;p&gt;By default, external-dns will only create DNS records for Ingress resource or Service with type LoadBalancer. For my setup, since I&amp;#39;m self-hosting the cluster inside my home network and don&amp;#39;t have access to a load balancer, I have to &lt;a href=&quot;https://github.com/pokgak/gitops/blob/0a880ec3e08481a7c50e67995fd4092dfb3c92f4/system/external-dns.yaml#L18&quot;&gt;add an extra configuration parameter&lt;/a&gt; to external-dns so that it will create DNS records for ClusterIP Service type. On the Service resource itself, usually external-dns searches for the &lt;a href=&quot;https://github.com/kubernetes-sigs/external-dns/blob/master/docs/annotations/annotations.md#external-dnsalphakubernetesiohostname&quot;&gt;&lt;code&gt;external-dns.alpha.kubernetes.io/hostname&lt;/code&gt; annotation &lt;/a&gt; but since we&amp;#39;re using it with ClusterIP, I have to change it to &lt;a href=&quot;https://github.com/kubernetes-sigs/external-dns/blob/master/docs/annotations/annotations.md#external-dnsalphakubernetesiointernal-hostname&quot;&gt;&lt;code&gt;external-dns.alpha.kubernetes.io/internal-hostname&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Tailscale + external-dns = ❤️&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;➜ dig prometheus.k8s.pokgak.xyz +short
&lt;span class=&quot;hljs-number&quot;&gt;10.43.170.163&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;With those changes applied, all new Service resources in my kubernetes cluster that have the annotation will get a DNS record on Cloudflare. Now, if I try to resolve the name of a service inside the cluster, it returns an internal ClusterIP. Combined with the Tailscale subnet router we configured earlier, you can now access services inside your cluster from any of your Tailscale devices, from any part of the world.&lt;/p&gt;
&lt;p&gt;With Tailscale, you&amp;#39;ll also have an additional layer of authentication. Only users in your Tailscale network can access the exposed services. Others might be able to guess what you have running in your cluster from your DNS records but they won&amp;#39;t be able to access it since all the IPs are private IPs.&lt;/p&gt;
&lt;p&gt;For the next part, I&amp;#39;m looking into exposing some services inside my cluster to the internet &lt;strong&gt;fully&lt;/strong&gt;, without having to be in the Tailscale network. Tailscale Funnel is supposed to do just that but I still haven&amp;#39;t tested whether it works with services inside kubernetes.&lt;/p&gt;
&lt;h3&gt;Resources&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://tailscale.com/blog/how-tailscale-works/&quot;&gt;How Tailscale Works&lt;/a&gt;: explanation on how Tailscale uses Wireguard to create a mesh VPN network architecture.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://tailscale.com/blog/how-nat-traversal-works/&quot;&gt;How NAT traversal works&lt;/a&gt;: recommended read even if you&amp;#39;re not a networking geek. You&amp;#39;ll learn a thing or two about networking for sure.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/pokgak/gitops/blob/0a880ec3e08481a7c50e67995fd4092dfb3c92f4/system/external-dns.yaml&quot;&gt;Full configuration for external-dns helm chart&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/pokgak/gitops/blob/0a880ec3e08481a7c50e67995fd4092dfb3c92f4/system/kube-prometheus-stack.yaml#L20&quot;&gt;external-dns annotation for the services&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://tailscale.com/kb/1185/kubernetes/#subnet-router&quot;&gt;Tailscale subnet router inside kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/pokgak/gitops/blob/0a880ec3e08481a7c50e67995fd4092dfb3c92f4/system/tailscale-subnet-router.yaml&quot;&gt;k8s manifest for my subnet router&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    <item>
      <title>On Public Speaking</title>
      <link>https://pokgak.xyz/articles/on-public-speaking/</link>
      <guid>https://pokgak.xyz/articles/on-public-speaking/</guid>
      <pubDate>Fri, 08 Sep 2023 13:37:00 GMT</pubDate>
      <description>&lt;p&gt;&lt;img src=&quot;/images/devops-meetup.jpg&quot; alt=&quot;Giving my talk at the local DevOps meetup&quot;&gt;&lt;/p&gt;
&lt;p&gt;I had the chance to give a talk at my local DevOps meetup last July and recently, did another one at a Tech Talk event hosted by my current employer. Both talks are related to OPA but this blog post will be more of a personal reflection for me just to document these moments in my career.&lt;/p&gt;
&lt;p&gt;There are two common things that I noticed about myself from giving the talk and I&amp;#39;m gonna describe them here.&lt;/p&gt;
&lt;h3&gt;The First Common Thing&lt;/h3&gt;
&lt;p&gt;Firstly, on the days leading up to the event I would slowly get agitated as the realness of having to do that talk creeps up on me. I tend to procrastinate when faced with a big assignment like this. I will be so productive and do everything from playing a new game, thinking about a new project to do, cleaning my house, cooking, and eating --- except preparing and finishing the slides for the talk.&lt;/p&gt;
&lt;p&gt;On the day of the talk itself, my productivity will be out of the window. I&amp;#39;d be reading, or doing something else but the thought of having to do the talk later is always on the back of my head. A few hours before the talk, I&amp;#39;m still polishing my slides, making last-minute changes, and going through the slides a couple more times in my head but around two hours before, I start distracting myself and doing other stuff. I&amp;#39;ll go talk to my colleagues, read up on unrelated topics, and even get the time to catch up on my novels. This is when things are getting real for me.&lt;/p&gt;
&lt;p&gt;Usually at these events, there will be food served before the event starts. For the first one, they had pizza and for the second event, they served rice with chicken curry and some kuih. They look really good tbh but I definitely won&amp;#39;t be touching those. I&amp;#39;m already struggling enough to keep my nerves together and the thought of having a taste of the food at the back of my throat when talking later will throw me into a full-blown panic mode. It&amp;#39;s like my sense is heightened and I get so sensitive to my surrounding that even keeping a conversation with others made me feel overwhelmed. My best companion at this moment is a bottle of mineral water --- tasteless, and I can fiddle with the bottle cap to distract myself.&lt;/p&gt;
&lt;p&gt;It might feel like I&amp;#39;m exaggerating here but I can tell you that me before and after finishing the talk is so different I feel like we&amp;#39;re a totally different person.&lt;/p&gt;
&lt;h3&gt;The Second Common Thing&lt;/h3&gt;
&lt;p&gt;As the host of the event invited me to the stage, I&amp;#39;ll bring my laptop, set it up, and connect it to the projector. Then, holding the mic, I&amp;#39;ll take a moment to look over the audience and take a deep breath.&lt;/p&gt;
&lt;p&gt;Once I start talking, it&amp;#39;s like a switch flipped and all that nervousness is gone.&lt;/p&gt;
&lt;p&gt;Just like how I&amp;#39;d practiced, I would go through the slides, following the flow that I&amp;#39;ve set when putting the slides together. I like my presentation to have a story. I think it is more engaging to the audience and it is also a lot easier for me to remember what to say. I probably need to work on my tempo still (I blazed through 40+ slides in 20+ minutes 😆) but at least I wasn&amp;#39;t too nervous and blanked out mid-speaking. In the most recent talk that I gave, I even managed to slip in a joke during my introduction. Quite proud of that one. Hah!&lt;/p&gt;
&lt;h3&gt;What&amp;#39;s the magic?&lt;/h3&gt;
&lt;p&gt;I&amp;#39;m not sure if there&amp;#39;s any science behind this but one factor that I think contributed to this flip in the switch is my confidence in the topic. Before giving my first talk, I&amp;#39;d been reading, thinking, and playing with the technology for about a year on and off. I won&amp;#39;t say that I mastered the topic but at least for the topic that I&amp;#39;m sharing, I feel like I have something to give that the audience can benefit from.&lt;/p&gt;
&lt;p&gt;For my second talk, I&amp;#39;ve been involved in the development process from day one --- together with my team members, of course. So, at this point, I can say that I know the topic quite well.&lt;/p&gt;
&lt;h3&gt;It&amp;#39;s a journey&lt;/h3&gt;
&lt;p&gt;My journey with public speaking started a long time ago. I used to be so nervous waiting for my turn to introduce myself to the class on the first day of school. Having all the attention on me even for a brief moment when I barged in the conversation with my friend group during our lepak session at mamak can make my ears turn red.&lt;/p&gt;
&lt;p&gt;My first time speaking in front of a large audience was in a high school public speaking competition when I was 13. I was chosen because my English was pretty good in the class, not because I was good at public speaking, mind you. Being the inexperienced public speaker that I am, I printed my speech text on a big A4 paper and brought it to the stage, holding it full like that, not folding the paper whatsoever.&lt;/p&gt;
&lt;p&gt;I didn&amp;#39;t manage to memorize the whole thing so I kept looking at the text during my speech. After I finished, my friends told me they noticed how nervous I was from how the A4 paper in my hand was shaking so badly during the whole speech. I&amp;#39;ll always remember that one.&lt;/p&gt;
&lt;h3&gt;Final Note&lt;/h3&gt;
&lt;p&gt;I can gladly say that I&amp;#39;ve improved a lot since then. Being exposed to these situations countless times made me realize that it is natural to feel nervous and I know that I&amp;#39;ll come out better after going through it. &lt;/p&gt;
&lt;p&gt;For those who might struggle with the same issue, just know that others are going through the same thing too and most importantly, take the risk and put yourself out there and eventually you&amp;#39;ll get to the point where you&amp;#39;ll overcome that fear.&lt;/p&gt;
&lt;p&gt;Special thanks to the organizers for giving me the chance to put myself out there and my colleagues and friends for supporting me.&lt;/p&gt;
</description>
    </item>
    <item>
      <title>Interview Series: Explain How Kubernetes DNS works</title>
      <link>https://pokgak.xyz/articles/explain-k8s-dns/</link>
      <guid>https://pokgak.xyz/articles/explain-k8s-dns/</guid>
      <pubDate>Wed, 24 May 2023 15:00:00 GMT</pubDate>
      <description>&lt;p&gt;This will be the first in my interview questions series. I&amp;#39;ll compile interesting questions that I got from my experience interviewing for DevOps/SRE role in Malaysia.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/k8s-dns.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Calling a service by its cluster-internal DNS&lt;/h2&gt;
&lt;p&gt;We&amp;#39;ll go from the highest to the lowest level in this journey. So let&amp;#39;s go through the scenario a bit: you have two services, foo and bar. Those two services live in the same namespace &lt;code&gt;app&lt;/code&gt; in your cluster. Now, inside the code of service foo, it makes an HTTP request to service bar. Probably something like so:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;http&lt;/span&gt;.&lt;span class=&quot;hljs-built_in&quot;&gt;get&lt;/span&gt;(&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;https://bar/&amp;quot;&lt;/span&gt;)&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;What happens behind the scenes between the request being made to service bar and the response being received back by service foo?&lt;/p&gt;
&lt;h2&gt;What is that weird DNS format?&lt;/h2&gt;
&lt;p&gt;You might&amp;#39;ve noticed that we&amp;#39;re calling the service bar using a weird name. Instead of the usual something.com domain, we&amp;#39;re just using &lt;code&gt;bar&lt;/code&gt; directly. How is this possible?&lt;/p&gt;
&lt;p&gt;Kubernetes allows you to call other services by using the Service resource name directly. This works because the DNS resolver inside the pod automatically appends the rest of the DNS domain to the given service name. So for example here, when you make a request to &lt;code&gt;bar&lt;/code&gt;, the application asks the resolver to look up the name. The resolver notices that the name it received is not &amp;quot;complete&amp;quot;, so it appends the rest of the domain name based on the configuration that was given to it before sending the query to the cluster DNS server. If the service is running inside the namespace &lt;code&gt;app&lt;/code&gt;, it will turn &lt;code&gt;bar&lt;/code&gt; into &lt;code&gt;bar.app.svc.cluster.local&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The suffix that gets appended to complete the domain name is called a &amp;quot;search domain&amp;quot;. In our example the search domain is configured as &lt;code&gt;app.svc.cluster.local&lt;/code&gt;. So, whenever the service makes a call to &lt;code&gt;bar&lt;/code&gt;, the resolver appends the search domain and tries to resolve the resulting domain name.&lt;/p&gt;
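&lt;p&gt;As a rough illustration, here is a small Python sketch of how a resolver could expand a short name with a list of search domains. The function name &lt;code&gt;candidate_names&lt;/code&gt; is made up, the search list and the &lt;code&gt;ndots&lt;/code&gt; value of 5 are assumed to match a typical pod in the &lt;code&gt;app&lt;/code&gt; namespace, and the ordering rule is a simplified model of the usual resolver behaviour rather than a copy of any real implementation.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-python&quot;&gt;# Assumed search list for a pod in the 'app' namespace.
SEARCH_DOMAINS = ['app.svc.cluster.local', 'svc.cluster.local', 'cluster.local']


def candidate_names(name, search_domains, ndots=5):
    # Return the fully qualified names to try, in order.
    if name.endswith('.'):
        # A trailing dot marks the name as already fully qualified.
        return [name]
    expanded = [f'{name}.{domain}.' for domain in search_domains]
    as_is = [f'{name}.']
    if name.count('.') >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return as_is + expanded
    # Too few dots: walk the search list first, then fall back to the name.
    return expanded + as_is


print(candidate_names('bar', SEARCH_DOMAINS)[0])  # bar.app.svc.cluster.local.&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The first candidate that the DNS server can answer wins, which is why a plain &lt;code&gt;bar&lt;/code&gt; resolves to the service in the same namespace.&lt;/p&gt;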
&lt;h2&gt;How (and where) is this configured?&lt;/h2&gt;
&lt;p&gt;Every pod in kubernetes has a file &lt;code&gt;/etc/resolv.conf&lt;/code&gt; that is configured by the kubelet when starting the pod. This file contains the address of the DNS server inside the cluster and also what to use as the search domains. Here&amp;#39;s an example of the file (&lt;a href=&quot;https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/&quot;&gt;source&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;nameserver &lt;span class=&quot;hljs-number&quot;&gt;10.32&lt;/span&gt;&lt;span class=&quot;hljs-number&quot;&gt;.0&lt;/span&gt;&lt;span class=&quot;hljs-number&quot;&gt;.10&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;search&lt;/span&gt; &amp;lt;namespace&amp;gt;.svc.&lt;span class=&quot;hljs-keyword&quot;&gt;cluster&lt;/span&gt;.&lt;span class=&quot;hljs-keyword&quot;&gt;local&lt;/span&gt; svc.&lt;span class=&quot;hljs-keyword&quot;&gt;cluster&lt;/span&gt;.&lt;span class=&quot;hljs-keyword&quot;&gt;local&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;cluster&lt;/span&gt;.&lt;span class=&quot;hljs-keyword&quot;&gt;local&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;options&lt;/span&gt; ndots:&lt;span class=&quot;hljs-number&quot;&gt;5&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;h2&gt;Which IP will be returned by the DNS query?&lt;/h2&gt;
&lt;p&gt;The DNS query will return us a virtual service IP. Why virtual? It&amp;#39;s because this IP doesn&amp;#39;t actually point to a pod that runs our services.&lt;/p&gt;
&lt;p&gt;In kubernetes, pods can come and go at any time which also means that their IPs will change all the time. How do we know then where to send our requests to? The Service resource is used to abstract away the dynamic nature of pod IPs and provide a consistent IP that your application can send requests to.&lt;/p&gt;
&lt;h2&gt;How does the service IP map to pod IPs?&lt;/h2&gt;
&lt;p&gt;The Service resource always comes with its pair, the Endpoints (or EndpointSlice) resource. This resource tracks the pod IPs and also has information about which pod IPs are ready to receive traffic. This information can be queried using the kubernetes API.&lt;/p&gt;
&lt;p&gt;On each node where pods run, there is a program called kube-proxy that watches this information and updates the routing rules that map a service IP to pod IPs. This routing can be done in multiple ways but currently the default is using iptables.&lt;/p&gt;
&lt;h2&gt;When does this routing happen?&lt;/h2&gt;
&lt;p&gt;When a request is first sent from the application code, its destination is set to the service IP. Before the request is sent out over the network, iptables modifies the destination and changes the service IP to a pod IP. If there are multiple pods behind a service, iptables picks one of the pod IPs at random, which spreads the load across them. Once the destination IP is changed, the packet is then sent out over the network.&lt;/p&gt;
&lt;h2&gt;How do you know which node to send the packet to?&lt;/h2&gt;
&lt;p&gt;A kubernetes cluster can contain a lot of nodes, so sending the packet to the correct node is important. For that, the routers in your network need to know which node hosts the destination pod IP. If you set up your own cluster à la kubernetes-the-hard-way, you might need to &lt;a href=&quot;https://github.com/kelseyhightower/kubernetes-the-hard-way/blob/master/docs/11-pod-network-routes.md&quot;&gt;configure these routes yourself&lt;/a&gt; but if you&amp;#39;re using kubernetes on top of a cloud provider, they will usually do this setup for you and you don&amp;#39;t have to do anything here.&lt;/p&gt;
&lt;p&gt;Once that is sorted, your packet now can reach the correct node and the packet is sent to the correct pod on the node based on the destination pod IP set in the packet header. The response then will be sent to the source pod IP in the request packet header.&lt;/p&gt;
&lt;h2&gt;Response now sent back to the source node. All done?&lt;/h2&gt;
&lt;p&gt;Not yet. There&amp;#39;s one last thing to do. Remember when we sent the request originally, iptables had rewritten the destination from service IP to pod IP? Now for the response packet to be accepted by the pod, the pod IP that we rewrote before needs to be converted back to the service IP.&lt;/p&gt;
&lt;p&gt;This is needed because as far as the application knows, it sent a request to the service IP and not the pod IP. If it suddenly receives a response from a pod IP that it doesn&amp;#39;t know of, it will just drop the response. So, iptables has to remember what it did before (this is what connection tracking is for) and convert the pod IP on the response packet back to the service IP. Finally, our foo service can receive the response that it wants from the bar service.&lt;/p&gt;
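&lt;p&gt;The rewrite on the way out and the reverse rewrite on the way back can be modelled in a few lines of Python. This is a toy simulation, not real iptables: the IP addresses are invented, the &lt;code&gt;conntrack&lt;/code&gt; dictionary stands in for the kernel&amp;#39;s connection tracking table, and the backend is picked with &lt;code&gt;random.choice&lt;/code&gt; purely for brevity.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-python&quot;&gt;import random

# Hypothetical addresses, for illustration only.
SERVICE_IP = '10.32.0.50'
POD_IPS = ['10.200.1.4', '10.200.2.7']

# Stand-in for the connection tracking table:
# (client IP, chosen pod IP) maps back to the service IP the client used.
conntrack = {}


def send_request(packet):
    # Rewrite a service IP destination to a pod IP and remember the choice.
    if packet['dst'] != SERVICE_IP:
        return packet
    pod_ip = random.choice(POD_IPS)
    conntrack[(packet['src'], pod_ip)] = SERVICE_IP
    return {**packet, 'dst': pod_ip}


def receive_response(packet):
    # Restore the service IP as the source of a tracked response.
    original = conntrack.get((packet['dst'], packet['src']))
    if original is None:
        return packet
    return {**packet, 'src': original}


request = send_request({'src': '10.200.3.9', 'dst': SERVICE_IP})
response = receive_response({'src': request['dst'], 'dst': request['src']})
print(request['dst'] in POD_IPS)  # True
print(response['src'])            # 10.32.0.50&lt;/code&gt;&lt;/pre&gt;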
</description>
    </item>
    <item>
      <title>Analysis Ticketing System</title>
      <link>https://pokgak.xyz/articles/analysis-ticketing-system/</link>
      <guid>https://pokgak.xyz/articles/analysis-ticketing-system/</guid>
      <pubDate>Wed, 17 May 2023 16:37:00 GMT</pubDate>
      <description>&lt;p&gt;I kinda have an idea how you can scale the backend service of a ticketing system but I&amp;#39;m still stuck on the seating choice locking.&lt;/p&gt;
&lt;p&gt;As some Twitter users have highlighted in the thread, it&amp;#39;s not as simple as infinitely scaling the backend service using serverless functions or Kubernetes. The bottleneck usually lies on the DB tier.&lt;/p&gt;
&lt;p&gt;Let&amp;#39;s review the scenario:&lt;/p&gt;
&lt;p&gt;An announcement was made that users can start buying tickets at 10AM. Before 10AM, a bunch of users are already waiting with their devices (most use more than one for a higher probability of getting a ticket). Once the clock hits 10AM, they start accessing the link to buy the ticket (some may have started spamming the link before 10AM). It reminds me of a DDoS.&lt;/p&gt;
&lt;p&gt;Requirement: one seat can only be purchased by exactly one user. To simplify my analysis, one user can only buy maximum 1 seat/ticket.&lt;/p&gt;
&lt;h2&gt;Waiting Room Queue&lt;/h2&gt;
&lt;p&gt;Based on the tweets that I saw, the ticketing vendor implemented a waiting room queue before you can enter the seat selection page. User @zulhhandyplast suggested they use Cloudflare&amp;#39;s product, Waiting Room, instead of rolling out their own waiting room implementation. How does a waiting room work? Here&amp;#39;s a snippet from the Cloudflare Waiting Room &lt;a href=&quot;https://www.cloudflare.com/en-gb/waiting-room/&quot;&gt;landing page&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Cloudflare Waiting Room allows organizations to route excess users to a custom-branded waiting room, helping preserve customer experience and protect origin servers from being overwhelmed with requests.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&amp;quot;...protect origin servers from being overwhelmed with requests.&amp;quot; I think this is the most important part, which leads to improved reliability and consequently, user happiness.&lt;/p&gt;
&lt;p&gt;Implementing a waiting room yourself seems quite hard. It would be nice to revisit this in the future and write more about it.&lt;/p&gt;
&lt;h3&gt;How does a waiting room queue work?&lt;/h3&gt;
&lt;p&gt;From the &lt;a href=&quot;https://cf-assets.www.cloudflare.com/slt3lc6tev37/IydVtIa13olmKwJ1Dv8KW/43dc4e3cc26f9a2578750fab360172be/2_Pager___Layout_A_-_Standard_Cloudflare_-_A4.pdf&quot;&gt;whitepaper&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Cloudflare Waiting Room limits the number of users allowed in the application while placing excess traffic into a virtual queue to provide a smoother, more predictable user experience.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The number of users allowed in the room is capped and tracked using session cookies. Once a user leaves the application or their session cookie expires, new users are allowed in until the cap is reached again. So, instead of users accidentally DDoSing the origin servers by continuously refreshing the page in the browser, the excess traffic never hits the origin servers. Given Cloudflare&amp;#39;s experience handling large-scale DDoS attacks, we can trust it to handle the traffic coming to our ticketing platform.&lt;/p&gt;
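&lt;p&gt;To make the cap-and-queue idea concrete, here is a minimal in-memory sketch in Python. It is only an illustration under my own assumptions, not how Cloudflare implements it: the &lt;code&gt;WaitingRoom&lt;/code&gt; class, its method names, and the integer session IDs are all hypothetical. A real waiting room would track sessions with cookies at the edge and would also time out queued users who give up.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-python&quot;&gt;import time


class WaitingRoom:
    # Hypothetical in-memory model of a capped room with a first-come queue
    def __init__(self, max_active, session_ttl):
        self.max_active = max_active    # cap on users inside the application
        self.session_ttl = session_ttl  # seconds before an idle session expires
        self.active = {}                # session id: expiry timestamp
        self.queue = []                 # waiting session ids, in arrival order

    def _expire(self, now):
        # Drop sessions whose lifetime has passed, freeing slots in the room
        self.active = {s: t for s, t in self.active.items() if t > now}

    def request(self, session_id, now=None):
        # Returns True if the request may reach the origin, False if it must wait
        now = time.time() if now is None else now
        self._expire(now)
        if session_id in self.active:
            # Already inside: renew the session, like refreshing a cookie
            self.active[session_id] = now + self.session_ttl
            return True
        if session_id not in self.queue:
            self.queue.append(session_id)
        # Admit only the head of the queue, and only while there is room
        if self.queue[0] == session_id and self.max_active > len(self.active):
            self.queue.pop(0)
            self.active[session_id] = now + self.session_ttl
            return True
        return False


room = WaitingRoom(max_active=2, session_ttl=60)
room.request(1, now=0)   # admitted
room.request(2, now=1)   # admitted, the room is now full
room.request(3, now=2)   # queued, the origin never sees this request
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;In this sketch, refreshing the page while queued only calls &lt;code&gt;request&lt;/code&gt; again, which is cheap and never reaches the origin servers.&lt;/p&gt;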
&lt;h2&gt;I got into the room, now what?&lt;/h2&gt;
&lt;p&gt;Once users manage to get into the room, they now need to &amp;quot;fight&amp;quot; with other users over who gets the best seats first. Now we need to figure out how to lock a seat for a user so that once the seat is selected, no other user can select the same seat. If the user completes the purchase, the seat is no longer available. When should we release the lock on the seat?&lt;/p&gt;
&lt;p&gt;In the frontend, should there be a difference between a seat that is locked but not purchased yet and a seat that&amp;#39;s already taken and paid for?&lt;/p&gt;
&lt;h3&gt;Real-time locking or refresh-based?&lt;/h3&gt;
&lt;p&gt;Should the user have to refresh the page to know whether a seat is still locked or has already been released? As a customer, I would prefer the status of the seats to be updated in real time without me having to refresh the page. Without real-time updates, a user can select a seat because it is shown as available, only to be told afterwards that the seat is no longer available or is temporarily locked. Then it&amp;#39;s a game of refreshing the page as fast as possible and randomly choosing a seat, which will significantly increase the load on the server.&lt;/p&gt;
&lt;p&gt;Making this real-time means we will need to maintain a connection for each user accessing the page. Maintaining these connections is expensive, and web servers usually have a limit on how many connections they can keep track of at a time. Assuming the waiting room from before is working, we have already set the maximum number of users that will be accessing the page. That gives us a hard cap on the number of connections we have to handle, so we can provision our hardware for that load in advance.&lt;/p&gt;
&lt;h3&gt;How do we keep track of the status of the seats?&lt;/h3&gt;
&lt;p&gt;Instead of maintaining the status of seats in an RDBMS like PostgreSQL or MySQL, I think it is better to use an in-memory data store like Redis. The &lt;a href=&quot;https://redis.io/commands/expire/&quot;&gt;EXPIRE&lt;/a&gt; command seems like a perfect choice for locking the seat for a certain period and automatically releasing the lock after the period ends. How can the backend know when a key has expired so that it can push a new message to the frontend to update the seat status? Keyword: &lt;a href=&quot;https://redis.io/docs/manual/keyspace-notifications/&quot;&gt;keyspace notification&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I&amp;#39;m leaning towards having a separate service handle this real-time seat status update. Once a user loads the page, it&amp;#39;ll start a websocket connection to this seat status service. When a seat gets locked, the service sets the seat number as a key in Redis with an expiry time. Once the key expires, the service is notified by Redis and in turn sends a message through the websocket connections it maintains to update the status of the seat on the frontend.&lt;/p&gt;
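&lt;p&gt;Here is a rough sketch of the locking rules I have in mind, written as plain in-memory Python so it runs without a Redis server. It only mimics the behaviour of a key that is set if absent and expires after a TTL. The &lt;code&gt;SeatLocks&lt;/code&gt; class, its method names, and the integer seat and user IDs are hypothetical, and in the real service the expiry would be handled by Redis rather than by comparing timestamps.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-python&quot;&gt;import time

# The three seat states from the frontend point of view
AVAILABLE, LOCKED, TAKEN = 0, 1, 2


class SeatLocks:
    # Hypothetical in-memory stand-in for seat keys with an expiry time
    def __init__(self, ttl):
        self.ttl = ttl      # how long a user may hold a seat, in seconds
        self.locks = {}     # seat number: (user id, expiry timestamp)
        self.taken = set()  # seats that are already paid for

    def status(self, seat, now):
        if seat in self.taken:
            return TAKEN
        holder = self.locks.get(seat)
        if holder is not None and holder[1] > now:
            return LOCKED
        return AVAILABLE    # never locked, or the lock has expired

    def lock(self, seat, user, now=None):
        # Succeeds only if nobody else currently holds or owns the seat
        now = time.time() if now is None else now
        if self.status(seat, now) != AVAILABLE:
            return False
        self.locks[seat] = (user, now + self.ttl)
        return True

    def confirm_payment(self, seat, user, now=None):
        # Turns a still-valid lock held by this user into a permanent booking
        now = time.time() if now is None else now
        holder = self.locks.get(seat)
        if holder is None or holder[0] != user or now >= holder[1]:
            return False
        del self.locks[seat]
        self.taken.add(seat)
        return True


locks = SeatLocks(ttl=600)
locks.lock(12, 1, now=0)    # user 1 locks seat 12 for ten minutes
locks.lock(12, 2, now=10)   # rejected, the seat is still locked
locks.lock(12, 2, now=600)  # accepted, the first lock has expired
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The moment a lock silently becomes available again is exactly the event the seat status service needs to hear about, which is where the keyspace notification comes in.&lt;/p&gt;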
&lt;h3&gt;Payment processor as the bottleneck?&lt;/h3&gt;
&lt;p&gt;What can we do if the third-party payment processor is misbehaving?&lt;/p&gt;
&lt;h3&gt;I managed to lock a seat!&lt;/h3&gt;
&lt;p&gt;Congrats! Now that the seat is locked, the user is given a fixed duration to finish the transaction and pay for the ticket. Once payment is confirmed, we can update the DB and then tell the seat locking service to update the status of the seat lock to &amp;quot;taken&amp;quot;.&lt;/p&gt;
&lt;p&gt;Once we have booked the user&amp;#39;s seat, we can dispatch a background job to send a confirmation email that also includes the ticket details. It&amp;#39;s okay to use a background job here so that the user gets an instant response instead of having to wait longer. In that instant response, we should manage their expectations and inform them that it may take a few minutes for the ticket to reach their email.&lt;/p&gt;
&lt;h2&gt;The Plan&lt;/h2&gt;
&lt;p&gt;TLDR&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;waiting room queue&lt;/li&gt;
&lt;li&gt;websocket for real-time connections to update seat status&lt;/li&gt;
&lt;li&gt;redis for keeping track of seat status (available, locked, taken)&lt;/li&gt;
&lt;li&gt;rdbms for persisting seating record after payment is done&lt;/li&gt;
&lt;li&gt;background job for non-time-sensitive tasks like sending email&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    <item>
      <title>Instrumenting CI Pipelines using otel-cli</title>
      <link>https://pokgak.xyz/articles/instrument-your-ci/</link>
      <guid>https://pokgak.xyz/articles/instrument-your-ci/</guid>
      <pubDate>Sat, 08 Apr 2023 04:18:00 GMT</pubDate>
      <description>&lt;h2&gt;Why?&lt;/h2&gt;
&lt;p&gt;Why not?
Get the whole picture of what is happening in your pipeline. Get notified when something is taking longer than it should.&lt;/p&gt;
&lt;h2&gt;How?&lt;/h2&gt;
&lt;p&gt;Use otel-cli, a standalone Go binary that can create OpenTelemetry traces and send them to a tracing backend using the OTLP protocol.&lt;/p&gt;
&lt;h3&gt;OpenTelemetry?&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://opentelemetry.io/&quot;&gt;https://opentelemetry.io/&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Tracing Backend?&lt;/h3&gt;
&lt;p&gt;You collect traces from your application using the OpenTelemetry SDK. To visualize the relationship between the traces, you&amp;#39;ll have to send the traces to a tracing backend, which will provide a UI for exploring your traces. Examples of tracing backends:&lt;/p&gt;
&lt;p&gt;Self-hosted:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Grafana Tempo&lt;/li&gt;
&lt;li&gt;Jaeger&lt;/li&gt;
&lt;li&gt;ElasticSearch&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Paid:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Honeycomb&lt;/li&gt;
&lt;li&gt;Datadog&lt;/li&gt;
&lt;li&gt;Grafana Cloud&lt;/li&gt;
&lt;li&gt;ElasticSearch Cloud&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;OTLP Protocol&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;The OpenTelemetry Protocol (OTLP) specification describes the encoding, transport, and delivery mechanism of telemetry data between telemetry sources, intermediate nodes such as collectors and telemetry backends.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://opentelemetry.io/docs/reference/specification/protocol/otlp/&quot;&gt;https://opentelemetry.io/docs/reference/specification/protocol/otlp/&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;otel-cli?&lt;/h3&gt;
&lt;p&gt;OpenTelemetry (OTel) provides &lt;a href=&quot;https://opentelemetry.io/docs/instrumentation/&quot;&gt;SDKs&lt;/a&gt; for many languages to create traces from your application, but in CI pipelines you&amp;#39;re usually using a shell scripting language like Bash, which is not currently supported by any OTel SDK. Therefore, we need a tool to create these traces for us.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/equinix-labs/otel-cli&quot;&gt;otel-cli&lt;/a&gt; is a tool that will do that. It will generate a trace ID and span ID, and send the traces in the expected format.&lt;/p&gt;
&lt;h2&gt;How to use otel-cli?&lt;/h2&gt;
&lt;p&gt;The simplest way to start using it is first to set the &lt;code&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/code&gt; value to tell otel-cli which backend to send our traces to.&lt;/p&gt;
&lt;h3&gt;Starting a local tracing backend server&lt;/h3&gt;
&lt;p&gt;otel-cli has a &lt;code&gt;server&lt;/code&gt; subcommand that you can use to run a simple tracing backend on your local machine. You can run the following command in another terminal to start the server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;otel-&lt;span class=&quot;hljs-keyword&quot;&gt;cli&lt;/span&gt; server tui&lt;/code&gt;&lt;/pre&gt;&lt;h3&gt;Setting the tracing backend endpoint&lt;/h3&gt;
&lt;p&gt;Now that we have a server running locally to send our traces to, let&amp;#39;s tell otel-cli to send all the traces that it generates to this local server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;&lt;span class=&quot;hljs-built_in&quot;&gt;export&lt;/span&gt; &lt;span class=&quot;hljs-attribute&quot;&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;=localhost:4317&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Here we send it to &lt;code&gt;localhost&lt;/code&gt; on port 4317. Port 4317 is the default port when sending traces using gRPC.&lt;/p&gt;
&lt;h3&gt;Sending our first trace&lt;/h3&gt;
&lt;p&gt;You can use the &lt;code&gt;exec&lt;/code&gt; subcommand to wrap a command with otel-cli. It will automatically set the start and end time to calculate the run duration for the command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;otel-cli exec &lt;span class=&quot;hljs-params&quot;&gt;--service&lt;/span&gt; my-service &lt;span class=&quot;hljs-params&quot;&gt;--name&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;My First Trace&amp;quot;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;HELLO WORLD&amp;quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Then you should be able to see a new line in the other terminal where we ran &lt;code&gt;otel-cli server tui&lt;/code&gt; just now.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/otel-cli-trace.png&quot; alt=&quot;Result in otel-cli server&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In this article, I showed you the simplest way to use otel-cli. To get more valuable information from your traces, you&amp;#39;ll usually need to add nested spans to your trace. They help break down the execution of your program into smaller units that can be inspected. For more advanced examples, refer to the otel-cli &lt;a href=&quot;https://github.com/equinix-labs/otel-cli#examples&quot;&gt;examples&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    <item>
      <title>How to use OPA policy from an S3 bucket when using Atlantis policy check</title>
      <link>https://pokgak.xyz/articles/conftest-pull-from-s3/</link>
      <guid>https://pokgak.xyz/articles/conftest-pull-from-s3/</guid>
      <pubDate>Wed, 16 Nov 2022 14:00:00 GMT</pubDate>
      <description>&lt;p&gt;&lt;a href=&quot;https://www.runatlantis.io&quot;&gt;Atlantis&lt;/a&gt; is an application used for collaborating on a Terraform code base using pull requests and one of the feature that it has is to run &lt;a href=&quot;conftest.dev&quot;&gt;conftest&lt;/a&gt; and test a set of defined OPA policies. At the moment I&amp;#39;m writing this article, Atlantis only supports using local sources i.e. local filesystem as the source of the policy. In this article, I&amp;#39;ll show an example of how to use an S3 bucket instead as the source for the policies.&lt;/p&gt;
&lt;h2&gt;Custom workflow and &lt;code&gt;run&lt;/code&gt; step&lt;/h2&gt;
&lt;p&gt;Atlantis supports using &lt;a href=&quot;https://www.runatlantis.io/docs/custom-workflows.html&quot;&gt;custom workflows&lt;/a&gt; to override the default commands that it runs and, as part of that feature, it supports defining any &lt;a href=&quot;https://www.runatlantis.io/docs/custom-workflows.html#running-custom-commands&quot;&gt;custom commands&lt;/a&gt; to run as part of the steps for each stage. We will be &lt;del&gt;abusing&lt;/del&gt; using this feature to override the default &lt;code&gt;conftest&lt;/code&gt; command that Atlantis uses and specify our policy through the &lt;code&gt;--update&lt;/code&gt; flag of &lt;code&gt;conftest&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;conftest &lt;code&gt;--update&lt;/code&gt; flag&lt;/h2&gt;
&lt;p&gt;By using &lt;code&gt;--update&lt;/code&gt; you can tell &lt;code&gt;conftest&lt;/code&gt; to pull the policy first every time it runs the tests. We will be using an S3 bucket as our source, but before we can pull from S3, you have to make sure that wherever the Atlantis server is running, it can reach the bucket and has permission to pull objects from it. In my case, Atlantis is running as a StatefulSet inside a Kubernetes cluster, so I have already configured the IAM permission needed for it to access the bucket.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;conftest&lt;/code&gt; uses the &lt;a href=&quot;https://github.com/hashicorp/go-getter&quot;&gt;go-getter package&lt;/a&gt; underneath to pull these policies, so technically it should also be possible to pull from the other sources that &lt;code&gt;go-getter&lt;/code&gt; supports, not just S3.&lt;/p&gt;
&lt;h2&gt;Result&lt;/h2&gt;
&lt;p&gt;Combining both of the features described above, here&amp;#39;s an example of a simplified &lt;a href=&quot;https://www.runatlantis.io/docs/server-side-repo-config.html&quot;&gt;repo config&lt;/a&gt; that I use:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-yaml&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;# minimal config for brevity; you might need to configure more options to make Atlantis work properly&lt;/span&gt;
&lt;span class=&quot;hljs-attr&quot;&gt;repos:&lt;/span&gt;
  &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-attr&quot;&gt;id:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;github.com/$ORG/$REPO&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;workflow:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;custom&lt;/span&gt;

&lt;span class=&quot;hljs-attr&quot;&gt;workflows:&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;custom:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;policy_check:&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;steps:&lt;/span&gt;
        &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;show&lt;/span&gt; &lt;span class=&quot;hljs-comment&quot;&gt;# important don&amp;#x27;t skip this step&lt;/span&gt;
        &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-attr&quot;&gt;run:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;conftest&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;test&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;$SHOWFILE&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;--update&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;s3::https://s3-us-east-1.amazonaws.com/$BUCKET_NAME/policy&lt;/span&gt;

&lt;span class=&quot;hljs-attr&quot;&gt;policies:&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;policy_sets:&lt;/span&gt;
    &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-attr&quot;&gt;name:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;policy-from-s3&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;path:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;/home/atlantis/policy&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;source:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;local&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;In the example above, under the &lt;code&gt;workflows&lt;/code&gt; key, I&amp;#39;m defining a custom workflow named &lt;code&gt;custom&lt;/code&gt; and inside that custom workflow, I&amp;#39;m overriding the default &lt;code&gt;policy_check&lt;/code&gt; steps with my own. My custom &lt;code&gt;policy_check&lt;/code&gt; steps consist of the &lt;code&gt;show&lt;/code&gt; step and the custom &lt;code&gt;run&lt;/code&gt; step. The &lt;code&gt;show&lt;/code&gt; step is crucial since this is when Atlantis runs &lt;code&gt;terraform show&lt;/code&gt; to convert your Terraform planfile to a JSON formatted file.&lt;/p&gt;
&lt;p&gt;When using the custom &lt;code&gt;run&lt;/code&gt; step, Atlantis stores the path to this JSON formatted file in the variable &lt;code&gt;$SHOWFILE&lt;/code&gt;, so in my conftest command you can see that I&amp;#39;m using &lt;code&gt;$SHOWFILE&lt;/code&gt; to run &lt;code&gt;conftest&lt;/code&gt; against the file. Optional: if you want to run &lt;code&gt;conftest&lt;/code&gt; against the Terraform files too, you can add &lt;code&gt;*.tf&lt;/code&gt; after &lt;code&gt;$SHOWFILE&lt;/code&gt; and it will include all the &lt;code&gt;*.tf&lt;/code&gt; files in that project directory.&lt;/p&gt;
&lt;p&gt;Next comes the &lt;code&gt;--update&lt;/code&gt; flag. To specify the S3 bucket, I&amp;#39;m using a URL format that is &lt;a href=&quot;https://github.com/hashicorp/go-getter#s3-bucket-examples&quot;&gt;specified by the &lt;code&gt;go-getter&lt;/code&gt; package&lt;/a&gt;, replacing &lt;code&gt;$BUCKET_NAME&lt;/code&gt; with the name of the bucket that I have configured with the correct permission and network access. Inside the S3 bucket, I structured the files as shown below. I put all the OPA policies inside a folder named &lt;code&gt;policy&lt;/code&gt; since &lt;code&gt;conftest&lt;/code&gt; complains when I put the policies directly at the root level of the bucket. YMMV.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;&lt;span class=&quot;hljs-variable&quot;&gt;$BUCKET_NAME&lt;/span&gt;/
├─ policy/
│  ├─ stop_it&lt;span class=&quot;hljs-selector-class&quot;&gt;.rego&lt;/span&gt;
│  ├─ dont_kill_server&lt;span class=&quot;hljs-selector-class&quot;&gt;.rego&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;After defining our custom workflow, we can specify the custom workflow as the default workflow for a repo. This is done by setting the &lt;code&gt;repos[].workflow&lt;/code&gt; value to the name of our custom workflow; in my case it&amp;#39;s &lt;code&gt;custom&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Next, as part of using the policy check feature in Atlantis, you are required to set the &lt;code&gt;policies&lt;/code&gt; values. You can refer to the &lt;a href=&quot;https://www.runatlantis.io/docs/server-side-repo-config.html#policies&quot;&gt;docs&lt;/a&gt; for the full configuration required. Inside the &lt;code&gt;policies&lt;/code&gt; key, there is a required &lt;code&gt;policy_sets&lt;/code&gt; key whose &lt;code&gt;path&lt;/code&gt; specifies where Atlantis can find the OPA policies to use when running &lt;code&gt;conftest&lt;/code&gt;. Usually, this is a folder on the local filesystem that already contains the policies, but in our case, since we&amp;#39;re using the &lt;code&gt;--update&lt;/code&gt; flag, we just need to specify any folder on the local filesystem that is writable by the Atlantis user. You can see in the example above that I&amp;#39;m using &lt;code&gt;/home/atlantis/policy&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;That&amp;#39;s all you need to configure to make Atlantis pull policies from S3 (and include Terraform source code files in your &lt;code&gt;conftest&lt;/code&gt; run). Shoutout to a &lt;a href=&quot;https://doordash.engineering/2022/09/20/how-doordash-ensures-velocity-and-reliability-through-policy-automation/&quot;&gt;DoorDash engineering blog post&lt;/a&gt; which mentioned briefly that they pulled their policies from S3 and made me curious how to do the same using Atlantis. You can mention me on Twitter (&lt;a href=&quot;https://twitter.com/pokgak73&quot;&gt;@pokgak73&lt;/a&gt;) if this article has helped you. That would most definitely make my day :)&lt;/p&gt;
</description>
    </item>
    <item>
      <title>Getting started with OPA and conftest</title>
      <link>https://pokgak.xyz/articles/opa-conftest/</link>
      <guid>https://pokgak.xyz/articles/opa-conftest/</guid>
      <pubDate>Sat, 12 Nov 2022 05:00:00 GMT</pubDate>
      <description>&lt;p&gt;I started using OPA at my &lt;code&gt;$dayJob&lt;/code&gt; recently and there are some parts that I think is not intuitive to grok for beginners.&lt;/p&gt;
&lt;p&gt;If you want to play around with the rules and input, you can use the &lt;a href=&quot;https://play.openpolicyagent.org/&quot;&gt;Rego Playground&lt;/a&gt;. It&amp;#39;s super useful and I used it to test out policies and test my hypothesis when playing around with Rego language.&lt;/p&gt;
&lt;h2&gt;How does OPA, Rego, and conftest related to each other?&lt;/h2&gt;
&lt;p&gt;Rego is a declarative language used to write OPA policies. Then, OPA is the engine that takes in the policies written in Rego and evaluates it, producing a set of documents called &amp;quot;rules&amp;quot;. You can use OPA and the Rego language directly to write policies for your config files but using conftest will make the DX much better. conftest builds on top of OPA and provide some extra functionality that makes using OPA easier.&lt;/p&gt;
&lt;h2&gt;Rego Basics&lt;/h2&gt;
&lt;p&gt;The Rego language focuses on querying the input to look for a given condition. If the input satisfies the query, then it will produce the document. &lt;/p&gt;
&lt;h3&gt;Variable Assignments&lt;/h3&gt;
&lt;p&gt;Variable assignment in Rego works the same as in other languages. The expression &lt;code&gt;foo := &amp;quot;hello&amp;quot;&lt;/code&gt; will assign the value &lt;code&gt;&amp;quot;hello&amp;quot;&lt;/code&gt; to the variable &lt;code&gt;foo&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;One difference in Rego is that it implicitly assigns the value &lt;code&gt;true&lt;/code&gt; to the document if the given condition evaluates to &lt;code&gt;true&lt;/code&gt;. In the example below, there are two ways to write the same Rego expression. Since Rego implicitly assigns the value &lt;code&gt;true&lt;/code&gt;, we can also remove the explicit &lt;code&gt;:= true&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-opa&quot;&gt;foo :&lt;span class=&quot;hljs-operator&quot;&gt;=&lt;/span&gt; &amp;quot;hello&amp;quot;

# &lt;span class=&quot;hljs-keyword&quot;&gt;first&lt;/span&gt; way: explicitly assigns `&lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;` &lt;span class=&quot;hljs-keyword&quot;&gt;to&lt;/span&gt; `&lt;span class=&quot;hljs-keyword&quot;&gt;result&lt;/span&gt;` &lt;span class=&quot;hljs-keyword&quot;&gt;when&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;condition&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;is&lt;/span&gt; satisfied
&lt;span class=&quot;hljs-keyword&quot;&gt;result&lt;/span&gt; :&lt;span class=&quot;hljs-operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt; if foo &lt;span class=&quot;hljs-operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;hljs-operator&quot;&gt;=&lt;/span&gt; &amp;quot;hello&amp;quot;

# Rego implicitly assigns `&lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;` &lt;span class=&quot;hljs-keyword&quot;&gt;to&lt;/span&gt; `&lt;span class=&quot;hljs-keyword&quot;&gt;result&lt;/span&gt;` &lt;span class=&quot;hljs-keyword&quot;&gt;when&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;condition&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;is&lt;/span&gt; satisfied
&lt;span class=&quot;hljs-keyword&quot;&gt;result&lt;/span&gt; if foo &lt;span class=&quot;hljs-operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;hljs-operator&quot;&gt;=&lt;/span&gt; &amp;quot;hello&amp;quot;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Let&amp;#39;s take this to the next level. Most of the time, the condition you&amp;#39;re checking is not as straightforward as comparing against a static value. You might also need to evaluate expressions in between and save the intermediate values in variables to improve readability. In the previous example we only used a one-liner for the rule body, but you can also write a more complex, multi-line rule body using curly braces, as the later examples show.&lt;/p&gt;
&lt;h3&gt;Declarative Rego Language&lt;/h3&gt;
&lt;p&gt;The Rego language is declarative and useful to query data structures for any value. Consider the following example (&lt;a href=&quot;https://play.openpolicyagent.org/p/gZcwi6zzAA&quot;&gt;Rego playground link&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;Let&amp;#39;s assume our input is an array of objects, each containing the keys &amp;quot;id&amp;quot; and &amp;quot;name&amp;quot;. In this policy we&amp;#39;re checking that the objects don&amp;#39;t have any forbidden value for &amp;quot;name&amp;quot;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-opa&quot;&gt;forbidden_names := &lt;span class=&quot;hljs-selector-attr&quot;&gt;[&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;foobar&amp;quot;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;john&amp;quot;&lt;/span&gt;]&lt;/span&gt;

user_forbidden &lt;span class=&quot;hljs-keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;hljs-selector-tag&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;hljs-selector-class&quot;&gt;.users&lt;/span&gt;&lt;span class=&quot;hljs-selector-attr&quot;&gt;[i]&lt;/span&gt;&lt;span class=&quot;hljs-selector-class&quot;&gt;.name&lt;/span&gt; == forbidden_names&lt;span class=&quot;hljs-selector-attr&quot;&gt;[j]&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This code would look something like this in Python:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-python&quot;&gt;users = [{&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;name&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;foo&amp;quot;&lt;/span&gt;}, {&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;name&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;bar&amp;quot;&lt;/span&gt;}]
forbidden_names = [&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;foobar&amp;quot;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;john&amp;quot;&lt;/span&gt;]

&lt;span class=&quot;hljs-comment&quot;&gt;# compare every user name against every forbidden name&lt;/span&gt;
matches = []
&lt;span class=&quot;hljs-keyword&quot;&gt;for&lt;/span&gt; i &lt;span class=&quot;hljs-keyword&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;range&lt;/span&gt;(&lt;span class=&quot;hljs-built_in&quot;&gt;len&lt;/span&gt;(users)):
  &lt;span class=&quot;hljs-keyword&quot;&gt;for&lt;/span&gt; j &lt;span class=&quot;hljs-keyword&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;range&lt;/span&gt;(&lt;span class=&quot;hljs-built_in&quot;&gt;len&lt;/span&gt;(forbidden_names)):
    matches.append(users[i][&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;name&amp;quot;&lt;/span&gt;] == forbidden_names[j])

user_forbidden = &lt;span class=&quot;hljs-built_in&quot;&gt;any&lt;/span&gt;(matches)&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;In both versions, &lt;code&gt;user_forbidden&lt;/code&gt; will evaluate to &lt;code&gt;true&lt;/code&gt; if one of the user names is included in the &lt;code&gt;forbidden_names&lt;/code&gt; list. In the Python code, we used &lt;code&gt;for&lt;/code&gt; loops with the &lt;code&gt;any()&lt;/code&gt; function to check whether at least one of the comparisons is true. In the Rego code, we don&amp;#39;t have to write any loop or iterate through the user list. &lt;code&gt;forbidden_names[j]&lt;/code&gt; means &amp;quot;for any of the values in &lt;code&gt;forbidden_names&lt;/code&gt;&amp;quot;. So in our Rego code, we essentially tell OPA: if any of the names in &lt;code&gt;input.users&lt;/code&gt; is the same as any of the values in &lt;code&gt;forbidden_names&lt;/code&gt;, then set the value of &lt;code&gt;user_forbidden&lt;/code&gt; to &lt;code&gt;true&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;In this case, since we are not using the indexes &lt;code&gt;i&lt;/code&gt; and &lt;code&gt;j&lt;/code&gt; to reference the values at those indexes anywhere else in the policy, we can simplify it further by using &lt;code&gt;_&lt;/code&gt; (underscore) for the index instead. &lt;code&gt;_&lt;/code&gt; is like a throwaway variable: we don&amp;#39;t care about the index, we just care whether one of the values in &lt;code&gt;input.users&lt;/code&gt; matches one in &lt;code&gt;forbidden_names&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-opa&quot;&gt;user_forbidden &lt;span class=&quot;hljs-keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;hljs-selector-tag&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;hljs-selector-class&quot;&gt;.users&lt;/span&gt;&lt;span class=&quot;hljs-selector-attr&quot;&gt;[_]&lt;/span&gt;&lt;span class=&quot;hljs-selector-class&quot;&gt;.name&lt;/span&gt; == forbidden_names&lt;span class=&quot;hljs-selector-attr&quot;&gt;[_]&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;h3&gt;More complex policies&lt;/h3&gt;
&lt;p&gt;So far our policies have all been simple one-liners, but Rego also supports writing the rule body across multiple lines. In the example below, we are adding an exception so that the previous rule doesn&amp;#39;t apply to the user with &lt;code&gt;id == 5&lt;/code&gt;. So if one of our users has the &lt;code&gt;name&lt;/code&gt; value &lt;code&gt;john&lt;/code&gt; but also has &lt;code&gt;id == 5&lt;/code&gt;, that user won&amp;#39;t cause &lt;code&gt;user_forbidden&lt;/code&gt; to evaluate to &lt;code&gt;true&lt;/code&gt;. Note that we are using the same index &lt;code&gt;i&lt;/code&gt; when accessing the &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;id&lt;/code&gt; properties. This means we are referring to the same user. If we used &lt;code&gt;_&lt;/code&gt; or a different index when accessing &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;id&lt;/code&gt;, the two expressions could match different users, so the rule could evaluate to &lt;code&gt;true&lt;/code&gt; even for the excepted user.&lt;/p&gt;
&lt;p&gt;If any of the expressions inside the rule body evaluates to &lt;code&gt;false&lt;/code&gt; or &lt;code&gt;undefined&lt;/code&gt; then it will stop evaluating the rule body and return &lt;code&gt;undefined&lt;/code&gt; for &lt;code&gt;user_forbidden&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-opa&quot;&gt;forbidden_names := &lt;span class=&quot;hljs-selector-attr&quot;&gt;[&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;foobar&amp;quot;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;john&amp;quot;&lt;/span&gt;]&lt;/span&gt;

user_forbidden &lt;span class=&quot;hljs-keyword&quot;&gt;if&lt;/span&gt; {
    &lt;span class=&quot;hljs-selector-tag&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;hljs-selector-class&quot;&gt;.users&lt;/span&gt;&lt;span class=&quot;hljs-selector-attr&quot;&gt;[i]&lt;/span&gt;&lt;span class=&quot;hljs-selector-class&quot;&gt;.name&lt;/span&gt; == forbidden_names&lt;span class=&quot;hljs-selector-attr&quot;&gt;[_]&lt;/span&gt;
    &lt;span class=&quot;hljs-selector-tag&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;hljs-selector-class&quot;&gt;.users&lt;/span&gt;&lt;span class=&quot;hljs-selector-attr&quot;&gt;[i]&lt;/span&gt;&lt;span class=&quot;hljs-selector-class&quot;&gt;.id&lt;/span&gt; != &lt;span class=&quot;hljs-number&quot;&gt;5&lt;/span&gt;

    false
    &lt;span class=&quot;hljs-built_in&quot;&gt;print&lt;/span&gt;(&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;this will not be printed&amp;quot;&lt;/span&gt;)
}&lt;/code&gt;&lt;/pre&gt;&lt;h2&gt;Using conftest&lt;/h2&gt;
&lt;p&gt;Previously, we used arbitrary names for our rules, but conftest looks for a few specific rule names so that it can detect failed rules and include them in the output. Conftest will pick up any rules named &lt;code&gt;deny&lt;/code&gt;, &lt;code&gt;warn&lt;/code&gt;, or &lt;code&gt;violation&lt;/code&gt; and show a summary in its output.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;➜ tree conftest
conftest
├── &lt;span class=&quot;hljs-keyword&quot;&gt;input&lt;/span&gt;.json
└── &lt;span class=&quot;hljs-keyword&quot;&gt;policy&lt;/span&gt;
    └── names.rego&lt;/code&gt;&lt;/pre&gt;&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;# &lt;span class=&quot;hljs-keyword&quot;&gt;input&lt;/span&gt;.json
{
    &amp;quot;users&amp;quot;: [
        {
            &amp;quot;id&amp;quot;: &lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;,
            &amp;quot;name&amp;quot;: &amp;quot;john&amp;quot;
        },
        {
            &amp;quot;id&amp;quot;: &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;,
            &amp;quot;name&amp;quot;: &amp;quot;bar&amp;quot;
        },
        {
            &amp;quot;id&amp;quot;: &lt;span class=&quot;hljs-number&quot;&gt;3&lt;/span&gt;,
            &amp;quot;name&amp;quot;: &amp;quot;foobar&amp;quot;
        }
    ]
}&lt;/code&gt;&lt;/pre&gt;&lt;pre&gt;&lt;code class=&quot;hljs language-opa&quot;&gt;# &lt;span class=&quot;hljs-keyword&quot;&gt;policy&lt;/span&gt;/names.rego
package main

&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; future.keywords.contains
&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; future.keywords.&lt;span class=&quot;hljs-keyword&quot;&gt;if&lt;/span&gt;

deny contains msg &lt;span class=&quot;hljs-keyword&quot;&gt;if&lt;/span&gt; {
    forbidden_names := [&amp;quot;john&amp;quot;]
    &lt;span class=&quot;hljs-type&quot;&gt;name&lt;/span&gt; := &lt;span class=&quot;hljs-keyword&quot;&gt;input&lt;/span&gt;.users[_].name
    &lt;span class=&quot;hljs-type&quot;&gt;name&lt;/span&gt; == forbidden_names[_]
    
    msg := sprintf(&amp;quot;username %v is not allowed&amp;quot;, [&lt;span class=&quot;hljs-type&quot;&gt;name&lt;/span&gt;])
}

warn contains msg &lt;span class=&quot;hljs-keyword&quot;&gt;if&lt;/span&gt; {    
    id := &lt;span class=&quot;hljs-keyword&quot;&gt;input&lt;/span&gt;.users[_].id
    id == &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;
    
    msg := sprintf(&amp;quot;id %v is not allowed&amp;quot;, [id])
}&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Run conftest against our input file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;➜ conftest test &lt;span class=&quot;hljs-keyword&quot;&gt;input&lt;/span&gt;.json &lt;span class=&quot;hljs-comment&quot;&gt;--policy policy/ &lt;/span&gt;
WARN - &lt;span class=&quot;hljs-keyword&quot;&gt;input&lt;/span&gt;.json - main - id &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;is&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;not&lt;/span&gt; allowed
FAIL - &lt;span class=&quot;hljs-keyword&quot;&gt;input&lt;/span&gt;.json - main - username john &lt;span class=&quot;hljs-keyword&quot;&gt;is&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;not&lt;/span&gt; allowed

&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt; tests, &lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt; passed, &lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt; warnings, &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt; failures, &lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt; exceptions&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Note the labels in the output: a result from the &lt;code&gt;deny&lt;/code&gt; rule is reported as &lt;code&gt;FAIL&lt;/code&gt;, while a result from the &lt;code&gt;warn&lt;/code&gt; rule is reported as &lt;code&gt;WARN&lt;/code&gt;. Conftest takes the output values from the OPA engine and formats them for us, which makes the result easier to interpret or integrate with other tools. You can also change the output format by passing the &lt;code&gt;--output&lt;/code&gt; flag. I like the &lt;code&gt;github&lt;/code&gt; output since it prints the results in a format that GitHub Actions understands and surfaces the errors in the GitHub UI appropriately. You can also output JSON, which is great if you want to process the results with tools like &lt;code&gt;jq&lt;/code&gt;.&lt;/p&gt;
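&lt;p&gt;Besides &lt;code&gt;jq&lt;/code&gt;, the JSON output can be consumed from a small script. The sketch below assumes the report shape from the JSON output in this article (an array of results, each with &lt;code&gt;filename&lt;/code&gt;, &lt;code&gt;warnings&lt;/code&gt;, and &lt;code&gt;failures&lt;/code&gt;); the helper name &lt;code&gt;collect&lt;/code&gt; is made up for illustration, and the report is embedded as a string rather than read from a real conftest run.&lt;/p&gt;

```javascript
// Sample report copied from the JSON output shown in this article.
const report = JSON.parse(`[
  {
    "filename": "input.json",
    "namespace": "main",
    "successes": 0,
    "warnings": [{ "msg": "id 2 is not allowed" }],
    "failures": [{ "msg": "username john is not allowed" }]
  }
]`);

// Flatten every entry of the given kind ("warnings" or "failures")
// into "filename: message" strings. Missing arrays count as empty.
function collect(results, kind) {
  return results.flatMap((result) =>
    (result[kind] || []).map((entry) => `${result.filename}: ${entry.msg}`)
  );
}

const failures = collect(report, "failures");
const warnings = collect(report, "warnings");

console.log(failures); // [ 'input.json: username john is not allowed' ]
console.log(warnings); // [ 'input.json: id 2 is not allowed' ]
```

&lt;p&gt;In a CI job, a script like this could exit with a non-zero status when &lt;code&gt;failures&lt;/code&gt; is not empty.&lt;/p&gt;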
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;➜ conftest test --help
[...]
  -o, --&lt;span class=&quot;hljs-keyword&quot;&gt;output&lt;/span&gt; string         &lt;span class=&quot;hljs-keyword&quot;&gt;Output&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;format&lt;/span&gt; for conftest results - valid &lt;span class=&quot;hljs-keyword&quot;&gt;options&lt;/span&gt; are: [stdout json tap &lt;span class=&quot;hljs-keyword&quot;&gt;table&lt;/span&gt; junit github] (default &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;stdout&amp;quot;&lt;/span&gt;)&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;JSON output:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;➜ conftest test &lt;span class=&quot;hljs-keyword&quot;&gt;input&lt;/span&gt;.json &lt;span class=&quot;hljs-comment&quot;&gt;--output json&lt;/span&gt;
[
  {
    &amp;quot;filename&amp;quot;: &amp;quot;input.json&amp;quot;,
    &amp;quot;namespace&amp;quot;: &amp;quot;main&amp;quot;,
    &amp;quot;successes&amp;quot;: &lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt;,
    &amp;quot;warnings&amp;quot;: [
      {
        &amp;quot;msg&amp;quot;: &amp;quot;id 2 is not allowed&amp;quot;
      }
    ],
    &amp;quot;failures&amp;quot;: [
      {
        &amp;quot;msg&amp;quot;: &amp;quot;username john is not allowed&amp;quot;
      }
    ]
  }
]&lt;/code&gt;&lt;/pre&gt;&lt;h3&gt;parsers: using other format as input files&lt;/h3&gt;
&lt;p&gt;Until now all our input has been in JSON format, but conftest also has built-in parsers that can automatically detect the input format and convert it to JSON for us. At the time of writing, the valid parsers are: [cue dockerfile edn hcl1 hcl2 hocon ignore ini json jsonnet properties spdx toml vcl xml yaml dotenv].&lt;/p&gt;
&lt;p&gt;Here is an example using HCL2 code for Terraform:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-hcl2&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;# input.tf&lt;/span&gt;
&lt;span class=&quot;hljs-attribute&quot;&gt;resource&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;aws_imaginary_resource&amp;quot;&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;this&amp;quot;&lt;/span&gt; {
  &lt;span class=&quot;hljs-attribute&quot;&gt;name&lt;/span&gt; = &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;this&amp;quot;&lt;/span&gt;
  instance_type = &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;r5.4xlarge&amp;quot;&lt;/span&gt;
  security_groups = [&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;12345&amp;quot;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;45678&amp;quot;&lt;/span&gt;]
}

resource &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;aws_imaginary_resource&amp;quot;&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;that&amp;quot;&lt;/span&gt; {
  &lt;span class=&quot;hljs-attribute&quot;&gt;name&lt;/span&gt; = &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;that&amp;quot;&lt;/span&gt;
  instance_type = &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;t3.medium&amp;quot;&lt;/span&gt;
  
  ingress {
    &lt;span class=&quot;hljs-attribute&quot;&gt;port&lt;/span&gt; = &lt;span class=&quot;hljs-number&quot;&gt;1234&lt;/span&gt;
    cidr = [&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;0.0.0.0/0&amp;quot;&lt;/span&gt;]
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;We can use &lt;code&gt;conftest parse&lt;/code&gt; to see how conftest will parse the Terraform file and then write our policy based on the parsed input.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-&quot;&gt;➜ conftest parse input.tf
{
  &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;resource&amp;quot;&lt;/span&gt;: {
    &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;aws_imaginary_resource&amp;quot;&lt;/span&gt;: {
      &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;that&amp;quot;&lt;/span&gt;: {
        &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;ingress&amp;quot;&lt;/span&gt;: {
          &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;cidr&amp;quot;&lt;/span&gt;: [
            &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;0.0.0.0/0&amp;quot;&lt;/span&gt;
          ],
          &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;port&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-number&quot;&gt;1234&lt;/span&gt;
        },
        &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;instance_type&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;t3.medium&amp;quot;&lt;/span&gt;,
        &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;name&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;that&amp;quot;&lt;/span&gt;
      },
      &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;this&amp;quot;&lt;/span&gt;: {
        &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;instance_type&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;r5.4xlarge&amp;quot;&lt;/span&gt;,
        &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;name&amp;quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;this&amp;quot;&lt;/span&gt;,
        &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;security_groups&amp;quot;&lt;/span&gt;: [
          &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;12345&amp;quot;&lt;/span&gt;,
          &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;45678&amp;quot;&lt;/span&gt;
        ]
      }
    }
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I just got started with OPA, but considering how flexible it is when used with conftest, I feel like you can use it for a lot of use cases. The ability to separate the policy logic from your application is powerful, and the declarative nature of the Rego language also helps simplify the policy a lot, as demonstrated in my comparison with the Python code above.&lt;/p&gt;
</description>
    </item>
    <item>
      <title>OpenTelemetry Basics</title>
      <link>https://pokgak.xyz/articles/otel-basics/</link>
      <guid>https://pokgak.xyz/articles/otel-basics/</guid>
      <pubDate>Sat, 13 Aug 2022 09:43:00 GMT</pubDate>
      <description>&lt;p&gt;I got to work on integrating &lt;a href=&quot;https://opentelemetry.io/&quot;&gt;OpenTelemetry&lt;/a&gt; in an application that our team maintains recently so I&amp;#39;m starting a series documenting my learnings throughout this journey.&lt;/p&gt;
&lt;p&gt;A little background on the application I&amp;#39;m working on: it&amp;#39;s a Slack chatbot written in TypeScript using BoltJS. Our goal is to know how many users are using our Slack bot, with a breakdown of the percentage of successful and failed interactions. When an error happens, we also want to know exactly what the user did and what state the application was in when it failed. Based on my reading, that last part is exactly what observability promises, so that&amp;#39;s why we&amp;#39;re giving it a try.&lt;/p&gt;
&lt;p&gt;OpenTelemetry can be divided into three categories: tracing, logging, and metrics; but I&amp;#39;ll be focusing on tracing in this series.&lt;/p&gt;
&lt;h2&gt;Tracing Primers&lt;/h2&gt;
&lt;p&gt;To get started you should know some basic concepts about tracing.&lt;/p&gt;
&lt;h3&gt;Traces, Spans&lt;/h3&gt;
&lt;p&gt;A trace consists of multiple spans, and a span is a unit of work with a start and end time. In a span, you can create events that mark when something happened during the lifetime of the span.&lt;/p&gt;
&lt;p&gt;A span can also have nested spans, and these are called child spans. The parent span usually represents some abstract unit of work, like the lifetime of an HTTP request from when it hits the application until the response is sent. Child spans can be used to get more detail about the operations done during the lifetime of that parent span, e.g. an API call to another service to fetch more information.&lt;/p&gt;
&lt;h3&gt;Span attributes, Status, Errors&lt;/h3&gt;
&lt;p&gt;To add context to the spans, you can set custom attributes. Ideally, you want to send all the information that will help when debugging your application in the future, so that you don&amp;#39;t have to modify the code and add more attributes later when you notice an issue and realize that you don&amp;#39;t have enough information to debug it.&lt;/p&gt;
&lt;p&gt;If your application encounters an error, you can set the span status to ERROR and also add the stack trace to the span for use in debugging. By default, the span status is Unset, meaning no error was recorded.&lt;/p&gt;
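&lt;p&gt;As a minimal sketch of that error handling, the helper below marks a span as failed and always ends it. To keep it runnable on its own, it uses a fake span and a local &lt;code&gt;SpanStatusCode&lt;/code&gt; object as stand-ins for the ones from &lt;code&gt;@opentelemetry/api&lt;/code&gt;; the names &lt;code&gt;createFakeSpan&lt;/code&gt; and &lt;code&gt;runInSpan&lt;/code&gt; and the &lt;code&gt;user.id&lt;/code&gt; attribute are hypothetical.&lt;/p&gt;

```javascript
// Stand-in for SpanStatusCode from @opentelemetry/api (UNSET = 0, OK = 1, ERROR = 2).
const SpanStatusCode = { UNSET: 0, OK: 1, ERROR: 2 };

// Minimal fake span that records calls, used only so the sketch runs on its own.
// It copies the method names of a real OTel span: setAttribute, setStatus,
// recordException, and end.
function createFakeSpan(name) {
  return {
    name,
    attributes: {},
    status: { code: SpanStatusCode.UNSET },
    exceptions: [],
    ended: false,
    setAttribute(key, value) { this.attributes[key] = value; return this; },
    setStatus(status) { this.status = status; return this; },
    recordException(error) { this.exceptions.push(error); },
    end() { this.ended = true; },
  };
}

// Run `work` with `span`: on failure, attach the error and mark the span
// as ERROR, then rethrow. The span is ended in every case.
function runInSpan(span, work) {
  try {
    return work(span);
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    throw error;
  } finally {
    span.end();
  }
}

const span = createFakeSpan("handle-request");
try {
  runInSpan(span, (s) => {
    s.setAttribute("user.id", "U123");
    throw new Error("lookup failed");
  });
} catch (error) {
  // The error still propagates; the span now carries the failure details.
}

console.log(span.status.code === SpanStatusCode.ERROR, span.ended); // true true
```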
&lt;h3&gt;Span Exporter&lt;/h3&gt;
&lt;p&gt;After the span ends, you&amp;#39;ll want to send it to a backend service that will store and process it so that you can use it later. The sending is done by &lt;a href=&quot;https://opentelemetry.io/docs/instrumentation/js/exporters/&quot;&gt;OTel Exporters&lt;/a&gt;. There are multiple backends available that accept OTel traces as input, such as Jaeger and Zipkin, but for my testing I&amp;#39;m using Honeycomb with the OTLP Collector.&lt;/p&gt;
&lt;h3&gt;Debugging&lt;/h3&gt;
&lt;p&gt;For debugging, there&amp;#39;s also the &lt;code&gt;ConsoleSpanExporter&lt;/code&gt;, which prints your spans to the console instead of sending them anywhere. I find this very useful for getting fast feedback on what is being sent, but it&amp;#39;s hard to do analysis with it, so in a production environment you should configure the exporter to use another backend instead.&lt;/p&gt;
&lt;h2&gt;Automatic vs Manual Instrumentation&lt;/h2&gt;
&lt;p&gt;Now that we&amp;#39;ve got the basics out of the way, let&amp;#39;s look at how you can start adding spans to your application to build traces.&lt;/p&gt;
&lt;p&gt;The easiest way to get started is to use auto instrumentation, which automatically injects code into the libraries that you&amp;#39;re using (HTTP, requests, DNS, and others) to create spans and events. In Node.js, this can be done by installing the &lt;a href=&quot;https://www.npmjs.com/package/@opentelemetry/auto-instrumentations-node&quot;&gt;&lt;code&gt;auto-instrumentations-node&lt;/code&gt;&lt;/a&gt; NPM package. This package pulls in several other packages to automatically instrument your application.&lt;/p&gt;
&lt;p&gt;This is a nice onboarding experience, but I got overwhelmed by the amount of data sent by these auto instrumentation packages. Therefore, I recommend that you start with manual instrumentation instead.&lt;/p&gt;
&lt;p&gt;With manual instrumentation, you&amp;#39;re forced to be intentional with the data that you&amp;#39;re sending to the backend. With this I get to decide which information I want to send over and already have in mind what I want to do with it and which information I would like to gain from it.&lt;/p&gt;
&lt;h3&gt;Initialization&lt;/h3&gt;
&lt;p&gt;Whichever approach you end up with for the instrumentation, you&amp;#39;ll want to make sure that you initialize the OTel libraries at the start of your application. This is required because if you start them later, your application might already be handling requests before the OTel libraries are initialized, causing it to miss some requests or, worse, encounter errors.&lt;/p&gt;
&lt;p&gt;The recommended way to do it is to use the &lt;code&gt;-r&lt;/code&gt; flag from the &lt;code&gt;node&lt;/code&gt; command:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;-r, --require module
            Preload the specified module at startup.  Follows &lt;code&gt;require()&lt;/code&gt;&amp;#39;s module resolution rules.  module may be either a path to a file, or a Node.js module name.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So in your &lt;code&gt;package.json&lt;/code&gt; you&amp;#39;ll have to add that to your &lt;code&gt;start&lt;/code&gt; command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-json&quot;&gt;scripts&lt;span class=&quot;hljs-punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;hljs-punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;&amp;quot;start&amp;quot;&lt;/span&gt;&lt;span class=&quot;hljs-punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;node -r ./tracing.js app.js&amp;quot;&lt;/span&gt;&lt;span class=&quot;hljs-punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;hljs-punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;If you&amp;#39;re using TypeScript like me, you&amp;#39;ll want to use the &lt;code&gt;NODE_OPTIONS&lt;/code&gt; environment variable to specify the flag instead:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-json&quot;&gt;scripts&lt;span class=&quot;hljs-punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;hljs-punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;&amp;quot;start&amp;quot;&lt;/span&gt;&lt;span class=&quot;hljs-punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;NODE_OPTIONS=&amp;#x27;-r ./tracing.js&amp;#x27; ts-node app.ts&amp;quot;&lt;/span&gt;&lt;span class=&quot;hljs-punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;hljs-punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;h3&gt;NodeSDK vs NodeTracerProvider Confusion&lt;/h3&gt;
&lt;p&gt;One thing that confused me is how different the code for initializing auto instrumentation is from the code for manual instrumentation.&lt;/p&gt;
&lt;p&gt;This is the code provided by Honeycomb to use auto instrumentation. The key there is the &lt;code&gt;getNodeAutoInstrumentations()&lt;/code&gt; function, which registers all the supported auto instrumentation libraries. Also note that it uses the &lt;code&gt;NodeSDK&lt;/code&gt; class.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-typescript&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;// tracing.js&lt;/span&gt;
(&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;use strict&amp;quot;&lt;/span&gt;);

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; { &lt;span class=&quot;hljs-title class_&quot;&gt;NodeSDK&lt;/span&gt; } = &lt;span class=&quot;hljs-built_in&quot;&gt;require&lt;/span&gt;(&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;@opentelemetry/sdk-node&amp;quot;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; { getNodeAutoInstrumentations } = &lt;span class=&quot;hljs-built_in&quot;&gt;require&lt;/span&gt;(&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;@opentelemetry/auto-instrumentations-node&amp;quot;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; { &lt;span class=&quot;hljs-title class_&quot;&gt;OTLPTraceExporter&lt;/span&gt; } = &lt;span class=&quot;hljs-built_in&quot;&gt;require&lt;/span&gt;(&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;@opentelemetry/exporter-trace-otlp-proto&amp;quot;&lt;/span&gt;);

&lt;span class=&quot;hljs-comment&quot;&gt;// The Trace Exporter exports the data to Honeycomb and uses&lt;/span&gt;
&lt;span class=&quot;hljs-comment&quot;&gt;// the environment variables for endpoint, service name, and API Key.&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; traceExporter = &lt;span class=&quot;hljs-keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;hljs-title class_&quot;&gt;OTLPTraceExporter&lt;/span&gt;();

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; sdk = &lt;span class=&quot;hljs-keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;hljs-title class_&quot;&gt;NodeSDK&lt;/span&gt;({
    traceExporter,
    &lt;span class=&quot;hljs-attr&quot;&gt;instrumentations&lt;/span&gt;: [&lt;span class=&quot;hljs-title function_&quot;&gt;getNodeAutoInstrumentations&lt;/span&gt;()]
});

sdk.&lt;span class=&quot;hljs-title function_&quot;&gt;start&lt;/span&gt;()&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;On the other hand, this is the code example from opentelemetry.io to start manual instrumentation. Notice that it&amp;#39;s not using the &lt;code&gt;NodeSDK&lt;/code&gt; class anymore and you need to create the Resource and &lt;code&gt;NodeTracerProvider&lt;/code&gt; objects and configure it yourself.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-typescript&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; opentelemetry = &lt;span class=&quot;hljs-built_in&quot;&gt;require&lt;/span&gt;(&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;@opentelemetry/api&amp;quot;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; { &lt;span class=&quot;hljs-title class_&quot;&gt;Resource&lt;/span&gt; } = &lt;span class=&quot;hljs-built_in&quot;&gt;require&lt;/span&gt;(&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;@opentelemetry/resources&amp;quot;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; { &lt;span class=&quot;hljs-title class_&quot;&gt;SemanticResourceAttributes&lt;/span&gt; } = &lt;span class=&quot;hljs-built_in&quot;&gt;require&lt;/span&gt;(&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;@opentelemetry/semantic-conventions&amp;quot;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; { &lt;span class=&quot;hljs-title class_&quot;&gt;NodeTracerProvider&lt;/span&gt; } = &lt;span class=&quot;hljs-built_in&quot;&gt;require&lt;/span&gt;(&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;@opentelemetry/sdk-trace-node&amp;quot;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; { registerInstrumentations } = &lt;span class=&quot;hljs-built_in&quot;&gt;require&lt;/span&gt;(&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;@opentelemetry/instrumentation&amp;quot;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; { &lt;span class=&quot;hljs-title class_&quot;&gt;ConsoleSpanExporter&lt;/span&gt;, &lt;span class=&quot;hljs-title class_&quot;&gt;BatchSpanProcessor&lt;/span&gt; } = &lt;span class=&quot;hljs-built_in&quot;&gt;require&lt;/span&gt;(&lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;@opentelemetry/sdk-trace-base&amp;quot;&lt;/span&gt;);

&lt;span class=&quot;hljs-comment&quot;&gt;// Optionally register automatic instrumentation libraries&lt;/span&gt;
&lt;span class=&quot;hljs-title function_&quot;&gt;registerInstrumentations&lt;/span&gt;({
  &lt;span class=&quot;hljs-attr&quot;&gt;instrumentations&lt;/span&gt;: [],
});

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; resource =
  &lt;span class=&quot;hljs-title class_&quot;&gt;Resource&lt;/span&gt;.&lt;span class=&quot;hljs-title function_&quot;&gt;default&lt;/span&gt;().&lt;span class=&quot;hljs-title function_&quot;&gt;merge&lt;/span&gt;(
    &lt;span class=&quot;hljs-keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;hljs-title class_&quot;&gt;Resource&lt;/span&gt;({
      [&lt;span class=&quot;hljs-title class_&quot;&gt;SemanticResourceAttributes&lt;/span&gt;.&lt;span class=&quot;hljs-property&quot;&gt;SERVICE_NAME&lt;/span&gt;]: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;service-name-here&amp;quot;&lt;/span&gt;,
      [&lt;span class=&quot;hljs-title class_&quot;&gt;SemanticResourceAttributes&lt;/span&gt;.&lt;span class=&quot;hljs-property&quot;&gt;SERVICE_VERSION&lt;/span&gt;]: &lt;span class=&quot;hljs-string&quot;&gt;&amp;quot;0.1.0&amp;quot;&lt;/span&gt;,
    })
  );

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; provider = &lt;span class=&quot;hljs-keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;hljs-title class_&quot;&gt;NodeTracerProvider&lt;/span&gt;({
    &lt;span class=&quot;hljs-attr&quot;&gt;resource&lt;/span&gt;: resource,
});
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; exporter = &lt;span class=&quot;hljs-keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;hljs-title class_&quot;&gt;ConsoleSpanExporter&lt;/span&gt;();
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; processor = &lt;span class=&quot;hljs-keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;hljs-title class_&quot;&gt;BatchSpanProcessor&lt;/span&gt;(exporter);
provider.&lt;span class=&quot;hljs-title function_&quot;&gt;addSpanProcessor&lt;/span&gt;(processor);

provider.&lt;span class=&quot;hljs-title function_&quot;&gt;register&lt;/span&gt;();&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;TBH I&amp;#39;m still not clear on what the difference is between using &lt;code&gt;NodeSDK&lt;/code&gt; and manually configuring the &lt;code&gt;NodeTracerProvider&lt;/code&gt;. When using &lt;code&gt;NodeSDK&lt;/code&gt;, does the &lt;code&gt;NodeTracerProvider&lt;/code&gt; get configured automatically?&lt;/p&gt;
&lt;h2&gt;How and when to start tracing?&lt;/h2&gt;
&lt;p&gt;To start manually instrumenting your application, you&amp;#39;ll have to create a root span. A root span is the first span you create once the request enters your application. &lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/request-flow.png&quot; alt=&quot;Request/Response Flow&quot;&gt;&lt;/p&gt;
&lt;p&gt;Now, if you have a normal HTTP request/response-based application, it is easy to figure out where to start and end your root spans. All your incoming requests will most likely be handled by a controller and each endpoint will be handled by a method. In this type of application, your root span can be started once the request hits the application in the method in your controller and ends before you send the response.&lt;/p&gt;
&lt;p&gt;During the lifetime of that request, you can create child spans to track other work done while processing the request. There&amp;#39;s only one entry point for requests, and exiting the entry point means the request is finished. If your application encounters errors during execution, it can set the span status to ERROR and add the stack trace info to the span.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Once you&amp;#39;ve managed to create spans, set attributes, and export them to a backend, you&amp;#39;re pretty much done with the basics of instrumenting your application. Go ahead and add more traces to your application!&lt;/p&gt;
</description>
    </item>
    <item>
      <title>Instrumenting a Slack bot with OpenTelemetry</title>
      <link>https://pokgak.xyz/articles/otel-slack-integration/</link>
      <guid>https://pokgak.xyz/articles/otel-slack-integration/</guid>
      <pubDate>Sat, 13 Aug 2022 09:43:00 GMT</pubDate>
      <description>&lt;p&gt;&lt;em&gt;Note: I&amp;#39;m using pseudocode in the code example in this article to keep the article brief. Please refer to the official Slack and OpenTelemetry documentation for the actual code.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I&amp;#39;ve talked about the basics of OpenTelemetry in my previous article. In this one, I&amp;#39;ll explain more on how we&amp;#39;re integrating OpenTelemetry with our Slack-based application.&lt;/p&gt;
&lt;p&gt;At the end of this article, this is roughly what the span lifetime and the events created will look like:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/slack-span-lifetime.png&quot; alt=&quot;Summary of spans and events created&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Slack BoltJS Socket Mode&lt;/h2&gt;
&lt;p&gt;Instead of standard HTTP request/response, we&amp;#39;re using BoltJS with socket mode. This gives us the advantage of not having to expose the application publicly to accept requests from Slack, but it also means that we cannot just use the auto-instrumentation for HTTP developed by the community.&lt;/p&gt;
&lt;p&gt;Socket mode uses WebSocket to establish a connection to Slack and exchanges messages through that connection. There is no official auto-instrumentation support for the &lt;code&gt;ws&lt;/code&gt; library that is used by BoltJS socket mode, but I found &lt;a href=&quot;https://www.npmjs.com/package/opentelemetry-instrumentation-ws&quot;&gt;opentelemetry-instrumentation-ws&lt;/a&gt;, a third-party library for auto-instrumenting &lt;code&gt;ws&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I spent a few days integrating it into our application and in the end concluded that the auto-instrumentation provided by opentelemetry-instrumentation-ws is too low-level for us. Our goal is to track user interactions with the application: when they use the bot, which option they choose, what they were trying to do, and whether the interaction ends successfully or with an error. The library, however, creates spans when a new connection is established between our application and Slack, but no spans or events for user interactions.&lt;/p&gt;
&lt;p&gt;So, the conclusion? We&amp;#39;ll instrument the application manually.&lt;/p&gt;
&lt;h2&gt;Creating and Ending Spans&lt;/h2&gt;
&lt;p&gt;Since this application is used company-wide, it&amp;#39;s highly likely that multiple users will be using it in parallel. To track user interactions independent from each other, we&amp;#39;ll also need separate spans for each user. &lt;/p&gt;
&lt;p&gt;I decided to go with an object &lt;code&gt;spanStore&lt;/code&gt; storing the user spans. Like a singleton pattern, a new span will be created for that user if it doesn&amp;#39;t exist yet in &lt;code&gt;spanStore&lt;/code&gt;, otherwise it will just return the existing user span.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-javascript&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;// One open span per user, keyed by username&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; spanStore = {}

&lt;span class=&quot;hljs-keyword&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;hljs-title function_&quot;&gt;getUserSpan&lt;/span&gt;(&lt;span class=&quot;hljs-params&quot;&gt;username&lt;/span&gt;) {
    &lt;span class=&quot;hljs-comment&quot;&gt;// Reuse the span if this user already has one&lt;/span&gt;
    &lt;span class=&quot;hljs-keyword&quot;&gt;if&lt;/span&gt; (username &lt;span class=&quot;hljs-keyword&quot;&gt;in&lt;/span&gt; spanStore) {
        &lt;span class=&quot;hljs-keyword&quot;&gt;return&lt;/span&gt; spanStore[username]
    }

    &lt;span class=&quot;hljs-comment&quot;&gt;// Otherwise start a new span and remember it&lt;/span&gt;
    &lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; span = &lt;span class=&quot;hljs-title function_&quot;&gt;startSpan&lt;/span&gt;(username)
    spanStore[username] = span

    &lt;span class=&quot;hljs-keyword&quot;&gt;return&lt;/span&gt; span
}&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Now that we have a function to create the span, when in the lifetime of the incoming event do we create it? Ideally, as early as possible, before anything else, so that we can track everything. BoltJS supports setting a &lt;a href=&quot;https://slack.dev/bolt-js/concepts#global-middleware&quot;&gt;global middleware&lt;/a&gt; that is called before the event handler functions are called. This is where I call the &lt;code&gt;getUserSpan()&lt;/code&gt; function above. For the first event for a user, it creates a new span, and for subsequent events it just returns the existing span for me to use.&lt;/p&gt;
&lt;p&gt;Next, when do you end the span? Due to how the application works, we&amp;#39;re assuming that each user can only have one session at a time and that a finishing event is always triggered when the user finishes their interaction with the application. Based on that assumption, I wrote an event listener that responds to this finishing event by calling the OTel function to end the span and removing the span from the &lt;code&gt;spanStore&lt;/code&gt; object above.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-javascript&quot;&gt;app.&lt;span class=&quot;hljs-title function_&quot;&gt;event&lt;/span&gt;({&lt;span class=&quot;hljs-attr&quot;&gt;id&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;finishing_event&amp;#x27;&lt;/span&gt;}, &lt;span class=&quot;hljs-title function_&quot;&gt;async&lt;/span&gt; ({username}) =&amp;gt; {
    &lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; span = &lt;span class=&quot;hljs-title function_&quot;&gt;getUserSpan&lt;/span&gt;(username)
    span.&lt;span class=&quot;hljs-title function_&quot;&gt;end&lt;/span&gt;()
    &lt;span class=&quot;hljs-title function_&quot;&gt;removeSpanFromSpanStore&lt;/span&gt;(span)
})&lt;/code&gt;&lt;/pre&gt;&lt;h2&gt;Tracking User Actions with Span Events&lt;/h2&gt;
&lt;p&gt;With these, we have a separate span for each user covering the whole duration of their interaction with the application. With only one span, though, we don&amp;#39;t yet have insight into what the user is doing or which actions they take, so we&amp;#39;ll need a way to track user actions.&lt;/p&gt;
&lt;p&gt;With Slack BoltJS, we can trigger a listener function on every user interaction. I wrote a function that creates a new span event using the user input id as the event name. I also pass in the whole payload so that we can see the payload of user actions later when debugging issues. With this added as another global middleware, we&amp;#39;re now creating a new event for every user action.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;hljs language-javascript&quot;&gt;app.&lt;span class=&quot;hljs-title function_&quot;&gt;action&lt;/span&gt;(&lt;span class=&quot;hljs-string&quot;&gt;&amp;#x27;callback_id&amp;#x27;&lt;/span&gt;, &lt;span class=&quot;hljs-title function_&quot;&gt;async&lt;/span&gt; ({username, action_id, payload}) =&amp;gt; {
    &lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; span = &lt;span class=&quot;hljs-title function_&quot;&gt;getUserSpan&lt;/span&gt;(username)
    span.&lt;span class=&quot;hljs-title function_&quot;&gt;addEvent&lt;/span&gt;(action_id, {payload})
})&lt;/code&gt;&lt;/pre&gt;&lt;h2&gt;Confession&lt;/h2&gt;
&lt;p&gt;I&amp;#39;m actually not convinced that my way of doing this is correct. One reason is that since I use one root span for a user&amp;#39;s whole interaction, I&amp;#39;m also tracking the time the user takes to perform the next action. From our perspective, this makes the span duration kind of useless, since it includes factors we can&amp;#39;t control (the time users take to perform the next action).&lt;/p&gt;
&lt;p&gt;Instead of one root span with a new span event for every user interaction, maybe a new span for each interaction, linked to the previous span, would be better, since we would only track the duration that we are in control of, not how long the user takes to click a button.&lt;/p&gt;
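&lt;p&gt;I haven&amp;#39;t implemented this alternative, but a rough sketch could look like the following. The &lt;code&gt;traceInteraction&lt;/code&gt; function and the &lt;code&gt;lastSpanContext&lt;/code&gt; map are hypothetical names, and &lt;code&gt;tracer&lt;/code&gt; is assumed to behave like an OpenTelemetry tracer, where &lt;code&gt;startSpan()&lt;/code&gt; accepts a &lt;code&gt;links&lt;/code&gt; option and spans expose &lt;code&gt;spanContext()&lt;/code&gt;.&lt;/p&gt;

```javascript
// Hypothetical store: username -> span context of that user's previous interaction.
const lastSpanContext = new Map();

// Run one interaction inside its own short-lived span, linked to the previous one.
// `tracer` is assumed to follow the OpenTelemetry Tracer interface.
async function traceInteraction(tracer, username, actionId, handler) {
  const previous = lastSpanContext.get(username);
  const span = tracer.startSpan(actionId, {
    // A link records the relationship to the earlier span without nesting under it.
    links: previous ? [{ context: previous }] : [],
  });
  try {
    return await handler(span);
  } finally {
    // The span covers only our own processing time, not the time the user
    // spends deciding what to click next.
    span.end();
    lastSpanContext.set(username, span.spanContext());
  }
}
```

&lt;p&gt;In a real setup the tracer would come from &lt;code&gt;trace.getTracer()&lt;/code&gt; in &lt;code&gt;@opentelemetry/api&lt;/code&gt;, and the store entry would still need to be removed on the finishing event.&lt;/p&gt;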
&lt;p&gt;Nevertheless, since I&amp;#39;ve already implemented it this way, let&amp;#39;s see how it turns out. As the saying goes, you either die a hero or you live long enough to see yourself become the villain.&lt;/p&gt;
</description>
    </item>
  </channel>
</rss>