AWS open source news and updates #109
April 22nd, 2022 - Instalment #109
Welcome to edition #109 of the AWS open source newsletter. Big news, I have shaken things up and will be changing the publish date to Friday mornings, starting today with this edition. Over the months I have received some feedback about changing the published date to Fridays, so I am hoping this will give everyone plenty of time to check out the projects, read the posts and provide everyone with something to do over the weekend (if they want!).
It has been a couple of weeks since the last newsletter, thanks to the Easter break. That means that this week there is a lot to unpack. We have over 13 great new open source projects, including “ts2asl” that simplifies developing AWS Step Functions using the TypeScript programming language, “route53-cli” a neat command line tool for interacting with Route53, “sageinspector” another cli tool, this time to make it easier to interact with your Amazon SageMaker resources, “automating-pii-data-detection” a nice solution to help you automate the detection of PII data, and lots more great tools, samples and demos.
Featured content in the newsletter this week covers a number of topics, including a number of articles and posts on AWS CDK and Kubernetes. Other topics featured include Istio, Crossplane, Apache Flink, AWS Orbit Workbench, OpenZFS, Open Policy Agent, MySQL, AWS Lambda Powertools, Apache Hudi, Jaeger, Karpenter, cdktf, and more.
This week's videos feature reckoner, Consul Service Mesh and Data Science on AWS taking a look at dbt and Delta Lake. Finally, I have updated the events section to include upcoming events, so make sure you check those out (and please let me know what events you might be attending that I should include).
Big news from last week was the announcement of updates and improvements to the OpenSearch governance model around maintainers, and the first non-AWS maintainer in the OpenSearch project. There were many tweets, too many to list, but here was one of the first I saw.
Whilst I am talking about OpenSearch, the project created a new issue last week asking for feedback around client compatibility as new versions roll out. Check out the issue, [PROPOSAL] Ensure clients are compatible across at least 2 major versions and share your thoughts with the project.
Do you have an interesting open source project you want to share?
As always, if you are working on anything interesting you would like me to include in this weekly round up, please drop me a line at email@example.com.
Celebrating open source contributors
The articles posted in this series are only possible thanks to contributors and project maintainers and so I would like to shout out and thank those folks who really do power open source and enable us all to build on top of what they have created.
So thank you to the following open source heroes: Isan-Rivkin, Emir Özbir, Lorenzo Garuti, Vara Bonthu, Manabu McCloskey, Nima Kaviani, Farooq Ashraf, Jeremy Ber, Olalekan Elesin, Nick Corbett, Ajish Abraham, Paul Hargis, Matt Winkler, Jack G. M. FitzGerald, Kevin Coleman, Apoorva Kulkarni, Mikhail Shapirov, Jerome Van Der Linden, Dariusz Osiennik, Dmitry Kolomiets, Rosh Plaha, Ken Winner, Michael Lin, Saravanan G, Prima Virani, Vu Dao, Viji Sarathy, Michael Hauss, Eric Hsueh, Benjamin Menuet, Anouar Zaaber, Moshir Mikael, and Armando Segnini.
Make sure you find and follow these builders and keep up to date with their open source projects and contributions.
Latest open source projects
ts2asl this project enables developers to define AWS Step Functions using the TypeScript programming language. It allows developers to benefit from a familiar syntax, type safety, and mature ecosystem of tools for linting, editing, and automated testing. Good docs and examples should help you get started quickly with this tool.
route53-cli if you are looking for a cli tool to interact with the Route53 service, then you are in luck. Isan-Rivkin has put together this neat tool with great documentation and plenty of examples to help you stay in the terminal.
Invictus-AWS this tool is a Python script that will help automatically enumerate and acquire relevant data from an AWS environment. The tool doesn’t require any installation: it can be run as a standalone script with minimal configuration. The goal for Invictus-AWS is to allow incident responders or other security personnel to quickly get an insight into an AWS environment and answer the following questions: what services are running in the AWS environment, what the configuration details are for each of those services, and what logging is available for each service that might be relevant in an incident response scenario.
k8s-aws-terraform-cluster Lorenzo Garuti has created this repository that will help you to deploy, in a few minutes, a highly available Kubernetes cluster on AWS using mixed on-demand and spot instances. Great and detailed documentation makes this easy to follow, and some example stacks are provided so you can see how this works (including, of course, WordPress).
kubectl-irsa is a tool from Emir Özbir that provides a kubectl plugin to test the abilities of IAM policies which are assigned to serviceAccount roles, via the AWS IAM Policy Simulator service.
sageinspector this is an open source cli tool to inspect SageMaker resources more easily. I think this is a pretty neat tool, so make sure you check this out if you are a user of Amazon SageMaker.
amazon-sagemaker-training-jobs-benchmarks this repository contains examples and related resources for Amazon SageMaker Training jobs over different instance types, focusing on the aspects of time to train and cost to train. Amazon SageMaker makes it easy to train machine learning models using EC2 instances. There are many instance types to choose from, and this choice affects the speed and cost of training. This repository contains example benchmarks for various deep learning use cases. You can see results directly in the notebooks, reproduce results by re-running the notebooks, and alter the notebooks to create new scenarios to benchmark.
aws-backup-compliance this repo contains code that integrates Backup Audit Manager with Security Hub and AWS CodePipeline. The integration with Security Hub configures an AWS Backup Audit Manager framework with 5 default controls (and you can add additional controls to the template), which generate and trigger AWS Config rules, and the rule evaluations are converted to Security Hub findings. The integration with AWS CodePipeline enables developers to embed automated backup controls for AWS resources in their development workflows and shift left with backup compliance in AWS.
eks-jumphost this repo contains a Terraform module to create an EC2 instance used as a jump host to interact with a private EKS cluster. Its usage is meant for development environments, not production: in the latter case provisioning should be done via a continuous integration and deployment platform.
Demos and Samples
devsecops-quickstart this repo will help development teams to quickly set up a ready to use environment integrated with a multi-account CI/CD pipeline following security and DevOps best practices. It uses a number of open source tools such as Bandit, Snyk, and cfn-nag, and enables you to define and enforce policies using Open Policy Agent (OPA).
automating-pii-data-detection-and-data-masking-tasks-with-aws-glue-databrew-and-aws-step-functions this repository provides an AWS CloudFormation template that deploys a sample solution demonstrating how to leverage AWS Glue DataBrew to automatically detect PII data, and mask the respective PII data with its native transformation functions. In the post, Build a data pipeline to automatically discover and mask PII data with AWS Glue DataBrew, Samson Lee walks you through this project in detail.
amazon-sagemaker-fine-tune-and-deploy-wav2vec2-huggingface this repo contains code that will help you fine-tune and deploy a Wav2Vec2 model for speech recognition with HuggingFace and SageMaker. There is a helpful blog post, Fine-tune and deploy a Wav2Vec2 model for speech recognition with Hugging Face and Amazon SageMaker, that explains how to use SageMaker to easily fine-tune the latest Wav2Vec2 model from Hugging Face, and then deploy the model with a custom-defined inference process to a SageMaker managed inference endpoint.
aws-lambda-domain-model-sample this project contains a Lambda function with domain model objects. By using Hexagonal Architecture (Ports and Adapters pattern), it separates domain model from other layer code. The Hexagonal Architecture, or ports and adapters architecture, is an architectural pattern used in software design.
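If you have not come across Ports and Adapters before, here is a minimal Python sketch of the idea. It is not taken from the sample repo: the Order model, the OrderRepository port, the in-memory adapter and the event fields are all hypothetical names I have made up for illustration. The point it tries to show is that the domain and application code only know about the port, so the Lambda handler can be wired to any adapter.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass
class Order:
    """Domain model object (hypothetical): business state and rules only."""
    order_id: str
    quantity: int
    unit_price: float

    def total(self) -> float:
        return self.quantity * self.unit_price


class OrderRepository(Protocol):
    """Port: what the domain needs from storage, with no storage details."""
    def save(self, order: Order) -> None: ...
    def get(self, order_id: str) -> Order: ...


class InMemoryOrderRepository:
    """Adapter: one concrete implementation of the port, used here as a stand-in."""
    def __init__(self) -> None:
        self._orders: Dict[str, Order] = {}

    def save(self, order: Order) -> None:
        self._orders[order.order_id] = order

    def get(self, order_id: str) -> Order:
        return self._orders[order_id]


def place_order(repo: OrderRepository, order_id: str, quantity: int, unit_price: float) -> float:
    """Application service: depends on the port, never on a specific adapter."""
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    order = Order(order_id, quantity, unit_price)
    repo.save(order)
    return order.total()


# Adapter chosen at the edge of the application (assumed default for this sketch).
_default_repo = InMemoryOrderRepository()


def handler(event: dict, context: object, repo: Optional[OrderRepository] = None) -> dict:
    """Lambda-style entry point: translates the event and delegates to the domain."""
    if repo is None:
        repo = _default_repo
    total = place_order(repo, event["order_id"], event["quantity"], event["unit_price"])
    return {"order_id": event["order_id"], "total": total}
```

Swapping the in-memory adapter for one backed by a real database would only mean writing another class with the same two methods; the domain model and place_order would stay untouched.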
This week we had several posts related to Kubernetes, so they get their own section.
Last week I was excited to read about the introduction of a new open-source project called EKS Blueprints that makes it easier and faster for you to adopt Amazon Elastic Kubernetes Service (Amazon EKS). EKS Blueprints is a collection of Infrastructure as Code (IaC) modules that will help you configure and deploy consistent, batteries-included EKS clusters across accounts and regions. Kevin Coleman, Apoorva Kulkarni, and Mikhail Shapirov have collaborated on a blog post, Bootstrapping clusters with EKS Blueprints, that dives into the details. This is an exciting project that will help make it even easier to use your favourite open source technologies on Amazon EKS.
Crossplane is an open source Kubernetes add-on that enables platform teams to assemble infrastructure from multiple vendors, and expose higher level self-service APIs for application teams to consume, without having to write any code. (check out the GitHub repo for more info). In the post Introducing AWS Blueprints for Crossplane, Vara Bonthu, Manabu McCloskey, and Nima Kaviani share how we have open sourced AWS Blueprints for Crossplane. Crossplane offers a higher abstraction layer called Compositions, and these allow users to build opinionated templates for deploying cloud resources. This new project aims to simplify and accelerate your journey to managing AWS resources with Crossplane example Compositions. [hands on]
When building SaaS solutions in the Cloud, many builders leverage Istio, an open-source service mesh, for deploying their multi-tenant applications. It provides features such as traffic management, security, and observability at the Kubernetes pod level. In the post, SaaS Identity and Routing with Istio Service Mesh and Amazon EKS, Farooq Ashraf explains how to develop an architecture based on Amazon EKS that demonstrates a siloed SaaS deployment model, using Istio Service Mesh to manage request authentication and per-tenant routing. [hands on]
AWS Community Builder (and prolific blogger) Vu Dao has created a two-part blog series looking at Karpenter. Karpenter is an open-source node provisioning project built for Kubernetes. Its goal is to improve the efficiency and cost of running workloads on Kubernetes clusters. Whilst this is currently integrated with AWS, the project has been designed so that other providers could be added. In AWS Karpenter Hands-on, Vu provides a good introduction and gets you up and running, and then in Karpenter with AWS Node Termination Handler he explores how you might use this with Spot instances. [hands on]
osquery is an open source SQL powered operating system instrumentation, monitoring, and analytics framework. Fleet is the most widely used open source osquery manager. Combining the two, and showing you how you can deploy them on AWS is Prima Virani in the blog post, Hosting FleetDM on Amazon EKS [hands on]
AWS Distro for OpenTelemetry
AWS Distro for OpenTelemetry (ADOT) offers AWS customers the ability to reduce the installation footprint of observability tools in their environments. Amazon EKS add-ons are a capability within Amazon EKS, introduced in December 2020, that provides lifecycle management for operational software in your clusters and makes it easy for users to operate production-grade clusters in a stable and secure manner. In the post, Metrics and traces collection using Amazon EKS add-ons for AWS Distro for OpenTelemetry, Viji Sarathy, Michael Hauss, and Eric Hsueh share an overview of the design of Amazon EKS add-ons for ADOT and how the add-on employs an ADOT Operator to manage the lifecycles of one or more instances of an ADOT Collector in an EKS cluster.
AWS and Community blog posts
dbt has established itself as one of the most popular tools in the modern data stack. The dbt tool makes it easy to develop and implement complex data processing pipelines in SQL, and provides developers with a simple interface to create, test, document, evolve, and deploy their workflows. Benjamin Menuet, Anouar Zaaber, Moshir Mikael, and Armando Segnini have put together Build your data pipeline in your AWS modern data platform using AWS Lake Formation, AWS Glue, and dbt Core, which shares how to deploy a data pipeline in your modern data platform using the dbt-glue adapter built by the AWS Professional Services team in collaboration with dbtlabs.
Kyle Weller from Onehouse highlights the key integrations between Apache Hudi and AWS in his post, Apache Hudi Native AWS Integrations where you can learn how you can build an open Lakehouse on AWS with Apache Hudi.
Amazon RDS MySQL
Automate RDS Slow Query Log Analysis With Slack Integration is an interesting post from the folks at ShellKode that helps you automate slow query log analysis using an open source tool, pt-query-digest, sending the results to developers on a daily basis via email or Slack. I think this is interesting as a few weeks ago I shared an open source project called aws-slack-clickoops-watcher which caught the interest of a lot of readers of this newsletter. [hands on]
Jaeger is an open source distributed tracing platform created by Uber Technologies, that is useful for monitoring microservices-based distributed systems. Dmitry Kolomiets has put together this blog post, Introducing Jaeger Quick Start — Deploying on AWS that explores an alternative tracing backend for your AWS originated traces that you might want to know more about. This post will provide you everything you need to know to get started.
A number of great posts last week on AWS CDK. Starting off with Rosh Plaha’s post, We’ve begun to move towards the AWS CDK and here’s why, which provides a nice overview of the key features of CDK and then looks at the trade-offs and pros/cons of moving towards AWS CDK.
Following that we have Ken Winner from those nice folks at Stedi, who wrote a few weeks ago Parallel CDK stack deployments with GitHub Actions, diving into how to accelerate deployments using AWS CDK, and the journey they took in dramatically reducing their deployment times. Great read, and essential if you are using CDK.
The next post is super interesting, and covers an area that is well underserved by content. The post, Deploy Infrastructure using CDK for Terraform with Go, from Michael Lin shows you how you can use cdktf to deploy a Go application in a different Cloud provider. cdktf works in a similar fashion to AWS CDK, except that rather than synthesising to CloudFormation, this generates Terraform code and allows you to leverage the hundreds of providers and thousands of module definitions provided by Terraform and the Terraform ecosystem. This is a great example of the CDK project being used by builders in a much broader context than just AWS. [hands on]
The final post, and still on cdktf, is from AWS Community Builder Saravanan G. His post, Create AWS Infrastructure using CDK for Terraform, provides an introduction to cdktf, using it to deploy a sample Python application. [hands on]
AWS Orbit Workbench
AWS Orbit Workbench is an open source framework for building a data analytics workbench on AWS, which I featured back in #69 of this newsletter. You can build a workbench that gives you access to the right tools for your use cases, either through the out-of-the-box integrations or through the extensible architecture. AWS Hero Olalekan Elesin, Head of Data Platform & Data Architect at HRS Group, wrote a guest post on the open source blog, Scheduling Jupyter Notebooks with AWS Orbit Workbench, where he shares how this project has become an integral part of their data platform, and how this has enabled a simplified experience from data exploration to productionising data workloads within the business.
Open Policy Agent
Open Policy Agent (OPA) is an open source general-purpose policy engine, licensed under the Apache License 2.0, that allows you to decouple policy decision-making from application code. In the post, Easily Running Open Policy Agent Serverless with AWS Lambda and Amazon API Gateway, Ajish Abraham demonstrates how to run OPA as a service within a container in Lambda using just the standard precompiled OPA binary. OPA is commonly used in cloud-native environments and run as a service or container. Because OPA decisions are stateless, OPA is a great candidate to run in a serverless architecture for cost savings, simplicity, and performance. [hands on]
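To make the "decouple policy decisions from application code" idea a little more concrete, here is a small Python sketch of what the calling side can look like. It assumes the OPA Data API (a POST to /v1/data/ followed by the policy path, with the query wrapped in an "input" document) is reachable through your endpoint unchanged; the base URL and the httpapi/authz/allow policy path are placeholders of mine, not values from the post.

```python
import json
import urllib.request
from typing import Any, Dict


def build_opa_request(base_url: str, policy_path: str, input_doc: Dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) a query against OPA's Data API.

    base_url and policy_path are placeholders for this sketch; the policy
    itself lives in OPA, so the application only ships facts in "input".
    """
    url = f"{base_url.rstrip('/')}/v1/data/{policy_path.strip('/')}"
    body = json.dumps({"input": input_doc}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_decision(response_body: bytes) -> bool:
    """OPA wraps the policy value in "result"; a missing result means undefined."""
    return json.loads(response_body).get("result") is True
```

Sending the request is then a matter of passing it to urllib.request.urlopen and handing the response body to parse_decision. Because each request carries everything needed for the decision, no state has to be kept between calls.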
AWS Lambda Powertools
In the post, Handling Lambda functions idempotency with AWS Lambda Powertools, Jerome Van Der Linden and Dariusz Osiennik explore what idempotency is and how to implement it more easily with AWS Lambda Powertools.
Amazon Kinesis Data Analytics Studio makes it easy for customers to analyse streaming data in real time, as well as build stream processing applications powered by Apache Flink. Jeremy Ber shares how to get started querying data interactively from an Amazon Kinesis Data Stream using the Python API for Apache Flink (PyFlink) in his post, Query your data streams interactively using Kinesis Data Analytics Studio and Python. [hands on]
Other posts worth checking out
- Announcing the General Availability of openCypher support for Amazon Neptune looks at the announcement last week of the general availability of openCypher query language support with Amazon Neptune
- Tracing an AWS App Runner service using AWS X-Ray with OpenTelemetry shares how you can instrument applications deployed using AWS App Runner with the AWS Distro for OpenTelemetry (ADOT) [hands on]
- Develop and test AWS Glue version 3.0 jobs locally using a Docker container shows how to develop and test your AWS Glue scripts locally (spark-submit, pyspark, JupyterLab, and pytest) using this solution [hands on]
- Let’s Architect! Using open-source technologies on AWS explores how you can use a number of open source projects from AWS when building your solutions
- Deploy .NET Blazor WebAssembly Application to AWS Amplify shows you how to build a full CI/CD pipeline for a Blazor WebAssembly application using AWS Amplify [hands on]
A couple of interesting case studies this week, featuring the Amazon Genomics CLI and the use of open source big data projects at Uber.
The Amazon Genomics CLI is an open source tool that simplifies genomics workflows in the cloud. The UC Santa Cruz Genomics Institute shared how they are collaborating with AWS and using tools like the Amazon Genomics CLI in their blog post, UCSC and Amazon Web Services work to accelerate genomics research.
Presto® on Apache Kafka® At Uber Scale is a look at how Uber uses Presto on Apache Kafka at scale, and is a really great read. Essential reading this week.
Amazon MQ now provides support for ActiveMQ 5.16.4. This update to ActiveMQ contains several fixes and enhancements compared to the previously supported version, ActiveMQ 5.16.3.
On April 19th, 2022 Amazon announced quarterly security and critical updates for Amazon Corretto Long-Term Supported (LTS) versions of OpenJDK. Corretto 18.0.1, 17.0.3, 11.0.15, and 8u332 are now available for download.
Amazon Keyspaces (for Apache Cassandra), a scalable, highly available, and fully managed Cassandra-compatible database service, now helps you read and write data in Apache Spark more easily by using the open-source Spark Cassandra Connector. Apache Spark is an open-source engine for large-scale data analytics. Customers use Apache Spark to perform analytics on data stored in Amazon Keyspaces more efficiently. Customers also use Amazon Keyspaces to provide applications consistent, single-digit-millisecond read access to analytics data from Spark. Now, you can read and write data between Amazon Keyspaces and Spark more easily by using the open-source Spark Cassandra Connector. Amazon Keyspaces support for the Spark Cassandra Connector helps you run Cassandra workloads in Spark-based analytics pipelines more easily by using a fully managed and serverless database service. With Amazon Keyspaces, you don’t need to worry about Spark competing for the same underlying infrastructure resources as your tables. Amazon Keyspaces tables scale up and down automatically based on your application traffic.
This quarter, we released 13 new or updated datasets including CMIP5, 1950s US Decennial Census, and open genomics data for Galaxy. Read the post Downscaled CMIP5, 1950 US Census, and open genomics data for Galaxy: The latest open data on AWS for some highlights among the new datasets.
On a related note, Jack G. M. FitzGerald wrote Amazon releases 51-language dataset for language understanding in the Amazon Science blog, sharing three announcements including news about the availability of a new dataset called MASSIVE, which is composed of one million labeled utterances spanning 51 languages, along with open-source code, which provides examples of how to perform massively multilingual NLU modelling. Read the post to learn more.
You can now configure your database connections on Amazon Aurora MySQL-Compatible Edition from an allowable list of ciphers. Configurable cipher suites help provide you with more security control over the connection encryption that your database server accepts. The supported ciphers depend on the version of your Aurora MySQL-compatible database.
AWS DataSync now supports transferring files to and from Amazon FSx for OpenZFS, a fully managed service that offers highly reliable, scalable, performant, and feature-rich file storage built on the open-source OpenZFS file system. Using DataSync, you can easily and securely migrate your on-premises file or object storage to FSx for OpenZFS or perform ongoing transfers of your data between FSx for OpenZFS and your on-premises storage or AWS Storage services. You can also use DataSync to move data between FSx for OpenZFS file systems.
Videos of the week
Data Science on AWS
Antje Barth and Chris Fregly introduce a couple of guest speakers, Paul Hargis and Matt Winkler, who share how to use open source Delta Lake and dbt in your ML data pipelines. Essential viewing this week, folks.
Consul Service Mesh
Continuing in their series of shows featuring HashiCorp open source tools running on AWS, colleague and fellow DA Jenna Pederson and J. Cole Morrison from HashiCorp show you how to set up a Consul Service Mesh for their microservices architecture on ECS. They cover the main concepts in Consul and build out the infrastructure components required for the Consul servers.
It has been a while since I have shared a video from the Containers from the Couch team, but last week they (Justin Garrison and Sai Vennam, plus Luke Reed from Fairwinds) put together a show on how to manage your Helm charts with open source tools such as reckoner from Fairwinds, which lets you declaratively manage multiple Helm charts. Watch to see more.
Events for your diary
AWS London Summit April 27th
We have a number of open source sessions (including my very own on Apache Airflow), so if you are about later this week why not register and pop along.
AWS Berlin Summit May 11th/12th
Aside from the AWS open source sessions (including me again, talking about Apache Airflow) we will have our very own Spot and myself manning the open source booth. Really looking forward to this and would love to see you come down and share your open source projects on our booth.
KubeCon May 16th-20th, Valencia Spain
The Cloud Native Computing Foundation’s flagship conference gathers adopters and technologists from leading open source and cloud native communities in Valencia, Spain from 16 – 20 May 2022. I will be there with many of the open source team and other AWS colleagues, so if you are going, make sure you swing by the AWS Booth.
Find out more about the event here.
GitOpsCon Europe May 17th, Valencia Spain
GitOpsCon Europe is designed to foster collaboration, discussion, and knowledge sharing on GitOps. This event is aimed at audiences that are new to GitOps as well as those currently using GitOps within their organisation. Get connected with others that are passionate about GitOps. Learn from practitioners about pitfalls to avoid, hurdles to jump, and how to adopt GitOps in your cloud native environment.
The event is vendor-neutral and is being organised by the CNCF GitOps Working Group. Topics include getting started with GitOps, scaling and managing GitOps, lessons learned from production deployments, technical sessions, and thought leadership.
Read more about this from the official page here.
CDK Day May 26th - Virtual
This is a community organised event about AWS CDK, cdktf, projen and cdk8s. This will be the third year they have run this event, and if the previous two are anything to go by, this will be essential viewing, live streamed via YouTube. Check out and register for the event over at their home page at https://www.cdkday.com/
BOSC 2022 July 13-14, Madison, Wisconsin, USA
The Bioinformatics Open Source Conference (BOSC) has been held annually since 2000, and this year AWS is proud to be a platinum sponsor for this event. BOSC covers all aspects of open source bioinformatics software and open science, including (but not limited to) these topics: Open Science and Reproducible Research, Open Biomedical Data, Citizen/Participatory Science, Standards and Interoperability, Data Science Workflows, Open Approaches to Translational Bioinformatics, Developer Tools and Libraries, Inclusion, and Outreach and Training. This is a hybrid event (in person/virtual) and you can find out more by checking out the event page, BOSC 2022.
OpenSearch Every Tuesday, 3pm GMT
This regular meet-up is for anyone interested in OpenSearch & Open Distro. All skill levels are welcome and they cover and welcome talks on topics including: search, logging, log analytics, and data visualisation.
Sign up to the next session, OpenSearch Community Meeting - Feb2022