In the 21st century, we can build machines that handle routine jobs more efficiently than humans. Arguably, anyone with the necessary programming skills can put together a script that pulls data from a source, executes some business logic, maybe updates data points in the business intelligence system, and finally sends notifications. It does not sound too complicated. The real problems start the moment you decide to productionize your code. In this article, we will review a list of tools, with their pros and cons, that help us get value out of our code. Needless to say, this list is not comprehensive: there are tons of workflow automation tools.
Let’s start with a dinosaur of the automation world: cron, a time-based job scheduler in Unix-like operating systems. This software utility was released almost half a century ago and is still popular among developers. You can easily install it on a Unix-like machine and immediately create a cron job. For commands that need to be executed repeatedly (e.g., hourly, daily, or weekly), you can use the crontab command. The crontab command installs and edits a crontab file containing the commands and schedule instructions for the cron daemon to execute. Though we are talking about classic cron jobs, some services make your life easier by adding monitoring on top of them, like https://healthchecks.io or https://cronitor.io.
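To make this concrete, here is a minimal sketch of a job script that cron could launch. The file paths, the function names, and the sample data are all hypothetical stand-ins, not something the tools above prescribe. The crontab entry in the docstring uses the standard five-field schedule syntax.

```python
#!/usr/bin/env python3
"""Hypothetical job script meant to be launched by cron.

Example crontab entry (edit with `crontab -e`). It runs at minute 0 of every
hour and appends stdout and stderr to a log file (paths are assumptions):

    0 * * * * /usr/bin/python3 /opt/jobs/hourly_job.py >> /var/log/hourly_job.log 2>&1
"""
import logging
import sys

# logging writes to stderr by default; the `2>&1` redirect above captures it.
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("hourly_job")


def pull_data():
    """Stand-in for fetching rows from a real data source."""
    return [{"order_id": 1, "amount": 40.0}, {"order_id": 2, "amount": 60.0}]


def apply_business_logic(rows):
    """Stand-in for business logic: total the order amounts."""
    return sum(row["amount"] for row in rows)


def main():
    """Run the job once and return a process exit code (0 means success)."""
    try:
        rows = pull_data()
        total = apply_business_logic(rows)
        log.info("processed %d rows, total=%.2f", len(rows), total)
        return 0
    except Exception:
        log.exception("job failed")
        return 1


if __name__ == "__main__":
    exit_code = main()
    if exit_code != 0:
        # A non-zero exit status lets wrappers and monitors detect the failure.
        sys.exit(exit_code)
```

Redirecting output to a file and returning a meaningful exit code is one simple way to soften the visibility problem that plain cron jobs have.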
Pros:
- It is free.
- It is easy to install.
- Cron is super flexible in terms of programming languages: you can execute any script.
Cons:
- Infrastructure overhead. You have to maintain a Unix-like machine to be able to run cron jobs.
- Lack of visibility. There is no easy way to find logs of your cron jobs or check the status of the last execution.
- It does not have a UI. Lots of modern solutions have a UI component that makes the user’s life much more comfortable.
Apache Airflow is an open-source platform to programmatically create, schedule, and monitor workflows. It is especially useful for architecting complex data pipelines. Airflow was created to solve the issues that come with long-running cron tasks that execute hefty scripts.
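The core idea behind such workflows is that tasks form a directed acyclic graph (DAG), and each task runs only after the tasks it depends on. The sketch below illustrates that idea with Python's standard library only. It is a toy model, not Airflow's API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}


def run_pipeline(tasks, deps):
    """Run every task after all of its upstream tasks; return the run order."""
    order = []
    for name in TopologicalSorter(deps).static_order():
        tasks[name]()  # a real scheduler would also retry, log, and track state
        order.append(name)
    return order


# Each toy task just records that it ran.
results = []
tasks = {name: (lambda n=name: results.append(n)) for name in dependencies}

run_order = run_pipeline(tasks, dependencies)
print(run_order)  # ['extract', 'transform', 'load', 'notify']
```

A workflow platform adds scheduling, retries, logging, and a UI on top of this ordering logic, which is where most of its value and its operational overhead come from.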
It’s written in Python, so you can integrate external features into its core by simply dropping files in your plugins folder. Airflow offers a generic toolbox for working with data. Different organizations have different stacks and different needs, and plugins give companies a way to customize their Airflow installation to reflect their ecosystem.
Anyone involved in data engineering has probably heard of this tool. Since it started at Airbnb in 2014 and was publicly announced as an open-source project in 2015, Airflow has quickly become the gold standard for data engineering, getting public contributions from folks at major organizations like Bloomberg, Lyft, Robinhood, and many more. It became so popular that managed offerings emerged, such as Google Cloud Composer and Astronomer.
Pros:
- It is an open-source solution.
- Airflow has a UI.
- Many add-on modules are created and supported by the community.
- You can monitor the statuses of your jobs and check logs via the web UI.
Cons:
- There is a significant infrastructure overhead. To set up Airflow on your server, you probably need to spin up multiple containers for the web server, scheduler, and database.
- Airflow uses its own lingo/terminology, which you’ll have to learn and use.
- If you scale beyond a certain point, you have to take care of scaling the database as well, which adds database administration work.
Serverless framework (AWS Lambda, Azure Functions, Google Cloud Functions)
Serverless is a hot technology right now. The first uses of the term seem to have appeared around 2012. It became more popular in 2015, following the AWS Lambda launch in 2014, and grew further in popularity after Amazon’s API Gateway launched in July 2015. You can read more about serverless and its origins in Mike Roberts’s article, “Serverless Architectures.”
Several interpretations of the term and its implications have been proposed. The one that matters here is the model in which an application’s server-side logic runs in stateless containers that are managed entirely by a third-party provider. This model is known as Function as a Service (FaaS).
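As an illustration, here is a minimal sketch of a function in the `(event, context)` shape that AWS Lambda uses for Python handlers. The event layout (a `rows` list with `amount` fields) and the response format are assumptions made for this example, not something a provider prescribes for every trigger.

```python
import json


def handler(event, context):
    """Entry point in the (event, context) shape AWS Lambda uses for Python."""
    # The function keeps no state between invocations: everything it needs
    # arrives in `event` (the "rows" key is a hypothetical payload layout).
    rows = event.get("rows", [])
    total = sum(row["amount"] for row in rows)
    return {
        "statusCode": 200,
        "body": json.dumps({"row_count": len(rows), "total": total}),
    }


if __name__ == "__main__":
    # Local smoke test with a made-up event; no cloud account is needed.
    sample_event = {"rows": [{"amount": 40.0}, {"amount": 60.0}]}
    print(handler(sample_event, None))
```

Because the handler is a plain function, it can be exercised locally before any deployment tooling is involved.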
The Serverless Framework helps you develop and deploy serverless applications. It’s a CLI that offers structure, automation, and best practices for deployment of both code and infrastructure, allowing you to focus on building sophisticated, event-driven, serverless architectures comprised of Functions and Events.
Pros:
- It manages your code as well as your infrastructure.
- It supports multiple languages (Node.js, Python, Java, and more).
- There is a web UI.
Cons:
- It is not free.
- You need basic knowledge of the cloud platform you are going to use to deploy your serverless functions.
- Setting up a serverless configuration is not a trivial task.
- Cloud provider limits apply, such as caps on execution time and memory.
And last but not least: Jenkins. Most developers are probably familiar with Jenkins, the lightweight and venerable Continuous Integration framework. It has been widely used in the software community for years to automate software builds and deployments.
In a nutshell, Jenkins is a task execution platform. It provides many different ways to run tasks, and they don’t always have to be bundled up in a software build process. So whether you want to transfer log files from A to B, poll a repository for some data to download, or allow your non-technical friend to run that report again, Jenkins has you covered.
Pros:
- It is an open-source solution.
Cons:
- Infrastructure overhead. You have to maintain a server with Jenkins.
- You need to wrestle with configuring everything the way you want.
Airflow gives you everything you could wish for in data pipelines, but the complexity of the system sits on the other side of the scales. Jenkins offers almost unlimited extensibility, but you pay with the time it takes to glue everything together and adapt it to your needs. You can always fall back to cron and build for yourself what the other solutions provide out of the box, improving your skills and your knowledge of Unix-like systems along the way. Or, if you would rather not deal with an operating system at all, serverless is your choice, though you will still need to get acquainted with AWS, Azure, or another cloud platform.
Remember that there is no silver bullet. Every solution will have its pros and cons. So, choose the cons you would like to deal with.
You can read more on https://blog.seamlesscloud.io/.