SUMO Kubernetes Support Guide¶
K8s commands¶
Most of the examples use sumo-prod as an example namespace. SUMO dev/stage/prod run in the sumo-dev/sumo-stage/sumo-prod namespaces respectively.
General¶
Most examples use the kubectl get ... subcommand. If you'd prefer more readable output, substitute the get subcommand with describe:
kubectl -n sumo-prod describe pod sumo-prod-web-76b74db69-dvxbh
Listing resources is easier with the get subcommand.
To see all SUMO pods currently running:
kubectl -n sumo-prod get pods
To see all pods running and the K8s nodes they are assigned to:
kubectl -n sumo-prod get pods -o wide
To show yaml for a single pod:
kubectl -n sumo-prod get pod sumo-prod-web-76b74db69-dvxbh -o yaml
To show all deployments:
kubectl -n sumo-prod get deployments
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
sumo-prod-celery 3 3 3 3 330d
sumo-prod-cron 0 0 0 0 330d
sumo-prod-web 50 50 50 50 331d
To show yaml for a single deployment:
kubectl -n sumo-prod get deployment sumo-prod-web -o yaml
Run a bash shell on a SUMO pod:
kubectl -n sumo-prod exec -it sumo-prod-web-76b74db69-xbfgj -- bash
Scaling a deployment:
kubectl -n sumo-prod scale --replicas=60 deployment/sumo-prod-web
Check rolling update status:
kubectl -n sumo-prod rollout status deployment/sumo-prod-web
Working with K8s command output¶
Filtering pods based on a label:
kubectl -n sumo-prod get pods -l type=web
Getting a list of pods:
kubectl -n sumo-prod get pods -l type=web | tail -n +2 | cut -d" " -f 1
Structured output:
See the kubectl JSONPath documentation for the full syntax.
kubectl -n sumo-prod get pods -o=jsonpath='{.items[0].metadata.name}'
Processing K8s command json output with jq:
Note: jsonpath may be more portable, since jq must be installed separately.
kubectl -n sumo-prod get pods -o json | jq -r .items[].metadata.name
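The text-processing part of these pipelines can be exercised without a live cluster. A minimal sketch against canned kubectl get pods output (the pod names here are hypothetical):

```shell
# Canned output in the shape `kubectl -n sumo-prod get pods` produces.
output='NAME                          READY   STATUS    RESTARTS   AGE
sumo-prod-web-76b74db69-dvxbh 1/1     Running   0          2d
sumo-prod-web-76b74db69-xbfgj 1/1     Running   0          2d'

# Drop the header row, then keep the first whitespace-delimited column:
# the same `tail -n +2 | cut ...` pipeline as above.
printf '%s\n' "$output" | tail -n +2 | cut -d" " -f 1
```

Against live output, the pipeline behaves the same; only the source of the text changes.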
K8s Services¶
List SUMO services:
kubectl -n sumo-prod get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
sumo-nodeport NodePort 100.71.222.28 <none> 443:30139/TCP 341d
Secrets¶
Secret values are base64-encoded when viewed in K8s output. Once set up as an environment variable or mounted file in a pod, the values are base64-decoded automatically.
Kitsune uses secrets specified as environment variables in a deployment spec.
To list secrets:
kubectl -n sumo-prod get secrets
To view a secret w/ base64-encoded values:
kubectl -n sumo-prod get secret sumo-secrets-prod -o yaml
To view a secret with decoded values (aka “human readable”):
This example uses the ksv utility
kubectl -n sumo-prod get secret sumo-secrets-prod -o yaml | ksv
To encode a secret value:
echo -n "somevalue" | base64
The -n flag prevents echo from appending a trailing newline before base64 encoding; values must be specified without newlines. The base64 command on Linux can take a -w 0 parameter to output the encoded value without line wrapping. The base64 command on macOS Sierra seems to output encoded values without newlines.
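A quick round-trip sketch of the encode step above, which also shows how to verify an encoded value before putting it into a secret:

```shell
# Encode a value without a trailing newline (-n is important: a stray
# newline would become part of the secret).
encoded=$(echo -n "somevalue" | base64)
echo "$encoded"

# Round trip: decoding should return the original value.
# (GNU base64 uses -d/--decode; the macOS base64 also accepts -D.)
echo -n "$encoded" | base64 --decode
```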
Updating secrets:
kubectl -n sumo-prod apply -f ./some-secret.yaml
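The file passed to kubectl apply is a standard Kubernetes Secret manifest. A minimal sketch of its shape (the key name here is hypothetical; the real manifest is sumo-secrets-prod):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: sumo-secrets-prod
  namespace: sumo-prod
type: Opaque
data:
  # Values under `data` are base64-encoded, e.g. `echo -n "somevalue" | base64`.
  SOME_SETTING: c29tZXZhbHVl
```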
Monitoring¶
New Relic¶
Primary region, A + B "rollup view": sumo-prod-oregon
- sumo-prod-oregon-a
- sumo-prod-oregon-b
sumo-prod-frankfurt
Papertrail¶
All pod output is logged to Papertrail.
Operations¶
Cronjobs¶
The sumo-prod-cron deployment is a self-contained Python cron system that runs in only one of the primary clusters.
# Oregon-A
kubectl -n sumo-prod get deployments
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
sumo-prod-celery 3 3 3 3 330d
sumo-prod-cron 1 1 1 1 330d
sumo-prod-web 25 25 25 25 331d
# Oregon-B
kubectl -n sumo-prod get deployments
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
sumo-prod-celery 3 3 3 3 330d
sumo-prod-cron 0 0 0 0 330d
sumo-prod-web 50 50 50 50 331d
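A quick sanity check is to confirm that the DESIRED count for sumo-prod-cron sums to exactly 1 across clusters. A sketch against canned output (a live check would run kubectl against each cluster's context):

```shell
# Canned `kubectl get deployments` rows from the two Oregon clusters
# (only the cron rows matter for this check).
oregon_a='NAME              DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
sumo-prod-cron    1         1         1            1           330d'
oregon_b='NAME              DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
sumo-prod-cron    0         0         0            0           330d'

# Sum the DESIRED column (field 2) for sumo-prod-cron across clusters;
# the result should be exactly 1, i.e. cron runs in one cluster only.
total=$(printf '%s\n%s\n' "$oregon_a" "$oregon_b" |
  awk '/^sumo-prod-cron/ {sum += $2} END {print sum}')
echo "$total"
```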
Manually adding/removing K8s Oregon-A/B/Frankfurt cluster nodes¶
If you are modifying the Frankfurt cluster, replace instances of oregon-* below with frankfurt.
1. Log in to the AWS console.
2. Ensure you are in the Oregon region.
3. Search for and select the EC2 service in the AWS console.
4. Select Auto Scaling Groups from the navigation on the left side of the page.
5. Click on the nodes.k8s.us-west-2a.sumo.mozit.cloud or nodes.k8s.us-west-2b.sumo.mozit.cloud row to select it.
6. From the Actions menu (close to the top of the page), click Edit.
7. The Details tab for the ASG should appear; set the appropriate Min, Desired and Max values. It's probably good to set Min and Desired to the same value in case the cluster autoscaler decides to scale the cluster down smaller than the Min.
8. Click Save.
9. If you click on Instances in the navigation on the left side of the page, you can see the new instances that are starting/stopping. You can see when the nodes join the K8s cluster with the following command:
watch 'kubectl get nodes | tail -n +2 | grep -v master | wc -l'
The number displayed should eventually match your ASG Desired value. Note this value only includes K8s workers.
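The pipeline inside that watch command can be checked against canned kubectl get nodes output (node names here are hypothetical):

```shell
# Canned `kubectl get nodes` output: one master and two workers.
nodes='NAME                                      STATUS   ROLES    AGE    VERSION
ip-10-0-1-10.us-west-2.compute.internal   Ready    master   330d   v1.11.6
ip-10-0-1-11.us-west-2.compute.internal   Ready    node     12m    v1.11.6
ip-10-0-1-12.us-west-2.compute.internal   Ready    node     11m    v1.11.6'

# Drop the header, exclude masters, count the remaining workers.
printf '%s\n' "$nodes" | tail -n +2 | grep -v master | wc -l
```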
Manually Blocking an IP address¶
1. Log in to the AWS console.
2. Ensure you are in the Oregon region.
3. Search for and select the VPC service in the AWS console.
4. Select Network ACLs from the navigation on the left side of the page.
5. Select the row containing the Oregon-a and b VPC.
6. Click on the Inbound Rules tab.
7. Click Edit.
8. Click Add another rule.
9. For Rule#, select a value < 100 and > 0.
10. For Type, select All Traffic.
11. For Source, enter the IP address in CIDR format. To block a single IP, append /32 to the IP address, e.g. 196.52.2.54/32.
12. For Allow / Deny, select DENY.
13. Click Save.
Note that AWS documents limits that apply to VPC network ACLs, including the number of rules per ACL.
Manually Initiating Cluster failover¶
Note: Route 53 will provide automated cluster failover. These docs cover things to consider if there is a catastrophic failure in Oregon-A and B, and Frankfurt must be promoted to primary rather than acting as a read-only failover.
Verify the Frankfurt read replica:
- eu-central-1 (Frankfurt) has a read-replica of the SUMO production database.
- The replica is currently a db.m4.xlarge, while the prod DB is a db.m4.4xlarge. This may be ok in maintenance mode, but if you are going to enable write traffic, the instance type must be scaled up.
- SREs performed a manual instance type change on the Frankfurt read-replica, and it took ~10 minutes to change from a db.t2.medium to a db.m4.xlarge.
- Although we have alerting in place to notify the SRE team in the event of a replication error, it's a good idea to check the replication status on the RDS details page for the sumo MySQL instance. Specifically, check the DB Instance Status, Read Replica Source, Replication State, and Replication Error values.

Decide if promoting the read-replica to a master is appropriate:
- It's preferable to have a multi-AZ RDS instance, as we can take snapshots against the failover instance (RDS does this by default in a multi-AZ setup).
- If data is written to a promoted instance, and failover back to the us-west-2 clusters is desirable, a full DB backup and restore in us-west-2 is required.
- The replica is automatically rebooted before being promoted to a full instance.

Ensure image versions are up to date:
- Most MySQL changes should already be replicated to the read-replica; however, if you're reading this, chances are things are broken. Ensure that the DB schema is correct for the images you're deploying.

Scale cluster and pods:
- The prod deployment A and B yaml contains the correct number of replicas, but here are some safe values to use in an emergency:
# Oregon A - ALSO runs cron pod
kubectl -n sumo-prod scale --replicas=50 deployment/sumo-prod-web
kubectl -n sumo-prod scale --replicas=3 deployment/sumo-prod-celery
kubectl -n sumo-prod scale --replicas=1 deployment/sumo-prod-cron

# Oregon B - Does NOT run cron pod
kubectl -n sumo-prod scale --replicas=50 deployment/sumo-prod-web
kubectl -n sumo-prod scale --replicas=3 deployment/sumo-prod-celery
kubectl -n sumo-prod scale --replicas=0 deployment/sumo-prod-cron
DNS:
- Point the prod-tp.sumo.mozit.cloud traffic policy at the Frankfurt ELB.