Employee Spotlight

GitHub Actions: Query Your Snowflake Warehouse

Daniel Pollack
|
July 25, 2024
July 15, 2021
Discover with Anecdotes how to use GitHub Actions to query your Snowflake Warehouse
Table of Contents

TL;DR

Anecdotes uses Snowflake DB to store its customer data. Each customer has a corresponding database which contains several tables that hold its compliance-related data, collected for customers.

Our CI/CD runs on GitHub Actions, and it contains workflows that assume an empty customer database. For example, we have a customer onboarding workflow, which needs to start with an empty database.

We created a GitHub action which lets us execute SQL queries in our databases, based on Snowflake’s Python connector package.

Recap

What we’re going to cover:

  • Overview of creating GitHub actions
  • Snowflake’s connector Python library, a short description
  • Asynchronous database querying and using some asyncio features for it
  • And finally, how we put all the above together to create our Snowflake-query GitHub action, combined with some cool tricks to use in our CI/CD

GitHub Actions — How to Create Them

GitHub offers three ways to create your Actions:

  1. Run an app on Docker container
  2. Run a node app directly on the instance
  3. Run cmd’s directly on the runner machine

We chose to write our action’s app in Python, and run it on a Docker container with the first option. This choice has few advantages -

Flexibility — Running in Docker gives full freedom in how to write our action app.

Isolation — Using a containerized app, Python with virtualenv assures isolated, robust and stable action run, independent in the machine running the workflow.

Asynchrony — Using Python’s asyncio library and Snowflake's asynchronous query execution function, we can run our queries asynchronously and independently, fetch results when available without waiting, and save expensive time for our CI/CD.

Snowflake Connector Python Library

connector context-manager implementation

Snowflake supplies a Python library which gives the basic functionality of connecting and querying a Snowflake database. The connection is made by the snowflake.connection object using the database credentials. We wrapped this action with a class which implements the "with semantics", so we don't need to worry about closing the connection.


{{banner-image}}

executing queries

After connecting successfully, you can start querying the database using the Cursor object, obtained from the connection object. There are two ways of executing queries with the library - execute for synchronous execution and execute_async for asynchronous ones.

The synchronous way is simple and good for a batch of quick/dependent queries. But if you have independent queries, some of which may take a long time to run, there’s no reason to block each other and use the synchronous way.

In Anecdotes, our initial case was cleaning our CI/CD databases as a setup step, which runs batches of independent queries, so we chose to implement an async query executor.

Asynchronous Queries

Our implementation of the asynchronous query mechanism is quite simple. It uses two of Snowflake’s methods:

  1. execute_async — Sends the query to our Snowflake DB.
  2. get_query_status — This method returns the state of a query previously executed. We are polling the state of the query, and while it’s running, we put our polling loop to asynchronous sleep, so other query tasks can run concurrently:

When the polling ends, we fetch the available query results with cursor.get_results_from_sqfid() .

We create a polling task for every query, and wait for all of them to finish with await asyncio.wait(..., return_when=asyncio.FIRST_COMPLETED), which returns completed tasks every time, so we can fetch and print available query results while other queries are still executing.

Asynchronous Error Handling

Exceptions handling coroutines in Python is a bit more complex than regular exception handling, due to multithreading, concurrency of asyncio tasks within events loop, and in general, non-linear flow of running.

When an exception is raised within a coroutine which was run as a task, using asyncio.create_task, it should be retrieved rather than caught in a try-except scope, if the task is not awaited. One may retrieve the exception raised with asyncio.Task.result() method, which re-raises the exception, if one occurred.

The Jewel in the Crown

Finally, we came up with a Snowflake-query GitHub action. This action can query any Snowflake instance with just a few simple configuration steps.

Because our action uses sensitive input parameters, we add them as secrets to your repository, or organization, and then use their variables in the workflow.

An example for a simple use of the action:

As it can be seen, we pass a configuration for the database connection under the with statement, which indicates input parameters to the action. Following them, we have the queries parameter which holds sql queries separated by ; to be executed.

The action has one output, called queries_results, which is a JSON object that holds all the result rows for all queries. Its format is as follow:

{

  "<query1_id>":

      ["<result_row1>",

       "<result_row2>",

       ...],

 "<query2_id>": [...]

 ...

}

This output parameter may be accessed with the following GitHub actions expression: ${{steps.run_queries.outputs.queries_results}} (assuming the step uses the action is called run_queries).

A simple use-case for the output param is to check if an expected string appears in it. This may be achieved with the help of the useful GitHub actions function, contains.

The function contains returns a bool, and running true/false on bash sets the exit code to success/failure correspondingly. In the above example, we verify that the database’s version appears in the output JSON , as a result of the query select CURRENT_VERSION() .

In the GitHub console, it looks like this:

How We Use It

We run our CI/CD on development, staging, and production environments. For every environment, we have different databases we want to empty in our workflow. In order to avoid “yaml duplications”, we created one workflow that uses GitHub Action’s dynamic matrix feature:

Then, we configured the next job to run on the matrix selected above:

Notice the usage of fromJSON function, which lets the workflow use a string as a JSON object in its configurations.

Then we use our Snowflake-query action:

${{matrix.customer}} represents a customer from the previously selected deployment environment.

The End

We had a great time putting together a couple of these modern technologies, platforms, and programming methodologies in order to make our automation processes better and robust, as well as letting others use and learn from our results.

Feel free to contact me with questions about the discussed topics above.

Daniel Pollack
Backend developer at anecdotes. Curious about complex systems and problem solving.
Link 1
Link 1
Link 1

Explore Our Compliance Leader Playground

No items found.