Pre-deployment smoke testing with Sentinel

How I’m trying to get advance warning of exactly which workspaces will be affected before I deploy a policy.

This post is in the context of using HashiCorp Sentinel for governance controls in your Terraform pipelines. While Sentinel can also be used with Vault and Nomad, this is focused particularly on Terraform.

TL;DR: Download all the mocks, turn them into your new pass test cases, and see what happens.

Table of Contents

  1. Background
  2. Sentinel Mocks
  3. Making the Tests
  4. Pre-Deployment Awareness

Background

I really enjoy working with Sentinel, but there’s currently not a great way to get a picture of what would happen if I deployed a given policy. Sure, there’s Advisory mode, but I don’t want to train our developers to ignore Advisory warnings. So, with no way to put in a silent policy and no way to run a sort of “what-if” scenario, I built my own and am working on scaling it up.

Sentinel Mocks

Sentinel mocks are of course central to writing Sentinel policies, and they’re a key part of unit testing them. As such, they make an attractive starting place for a sort of unit testing at scale. Essentially, the idea at play here is to use real deployed infrastructure code as the basis for a “unit test,” which can thus provide advance warning. High-level steps:

  1. Start with a list of URLs of your TFE/TFC instance(s)
  2. For each one of those, list out the Organizations
  3. For each one of those, list out the Workspaces
  4. For each one of those, list out the latest applied plan which was not a destroy plan
  5. Queue the mock exports, then download & extract the mocks
  6. For each workspace’s mocks, create a “pass” unit test using these mocks
  7. Run sentinel test against each of these new sets of unit tests, collecting errors
  8. Review any error output as your early warning!
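Steps 4 and 5 can be sketched as two small helpers. This is a sketch only, assuming run objects shaped like the TFE/TFC runs API responses; field names such as "is-destroy" and the "sentinel-mock-bundle-v0" export type should be verified against your instance’s API documentation.

```python
def latest_applied_plan(runs):
    """Step 4: return the plan ID of the most recent applied, non-destroy
    run, given a workspace's run list (the API returns runs newest-first).
    Returns None if no qualifying run exists."""
    for run in runs:
        attrs = run["attributes"]
        if attrs["status"] == "applied" and not attrs["is-destroy"]:
            return run["relationships"]["plan"]["data"]["id"]
    return None


def plan_export_payload(plan_id):
    """Step 5 (queueing): JSON:API request body for creating a Sentinel
    mock export against a given plan."""
    return {
        "data": {
            "type": "plan-exports",
            "attributes": {"data-type": "sentinel-mock-bundle-v0"},
            "relationships": {
                "plan": {"data": {"id": plan_id, "type": "plans"}}
            },
        }
    }
```

You would POST that payload to the plan-exports endpoint and later GET the export’s download link, with pagination and auth handled by whatever HTTP client your crawler already uses.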

You can probably guess that step 1 is done with a small config file which provides the URLs and information required for performing authentication. Steps 2-5 are relatively straightforward utilization of the Terraform Enterprise/Terraform Cloud API, and of course steps 7-8 are pretty boilerplate Python/human review. That leaves us with the novel bit of step 6.

Making the Tests

Fortunately, you don’t really need to read through every unit test (e.g.: pass.hcl, pass-delete-bad-resources.hcl, fail.hcl, etc.) to determine which mocks you need to import; you can simply import each type. Since they follow a standard naming convention, here’s what you can do:

  1. Clone a copy of the policies/unit tests repo to your working directory (we do this in a pipeline, so it’s in that container)
  2. Create pass.hcl with all the mocks declared in each policy’s test folder, just as you’d do with a regular unit test
  3. For each mocks download folder, copy the boilerplate pass.hcl to something like instance-org-workspace-pass.hcl and update the paths to point to the folder where the mocks are
  4. At this point, you may have something like 50+ policies, each with hundreds or thousands of unit tests. That’s great!
  5. Something like sentinel test --verbose policies/*.sentinel will get your unit tests kicked off
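Step 3 above is mostly string surgery on the boilerplate pass.hcl. A minimal sketch, assuming mock files follow the standard mock-*.sentinel naming convention (retarget_mocks is a hypothetical helper, not part of any Sentinel tooling):

```python
import re


def retarget_mocks(pass_hcl, mocks_dir):
    """Rewrite each mock's source path in a boilerplate pass.hcl so it
    points into a specific workspace's downloaded-mocks folder. Any
    existing directory prefix on the mock filename is dropped."""
    return re.sub(
        r'source\s*=\s*"([^"]*?)(mock-[^"/]+\.sentinel)"',
        lambda m: f'source = "{mocks_dir}/{m.group(2)}"',
        pass_hcl,
    )
```

From there, writing the result out as instance-org-workspace-pass.hcl in each policy’s test folder is a simple loop over your extracted mock directories.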

Of course, this is easy to read and watch when there are few policies & workspaces, but there comes a tipping point at which one may wish to let all this run in a pipeline and have the automation report only what needs a closer look: the count of policies tested, the count of workspaces tested, the count of passing policies, and the count (plus details) of the failing ones. You could even get fancy with some sort of roll-up metric combining orgs/instances for a higher-level view.
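Once the results of each sentinel test invocation are collected, the roll-up itself is simple. A sketch, assuming the results have already been flattened into (policy, workspace, passed) tuples (that shape is my own convention, not anything Sentinel emits):

```python
def summarize(results):
    """Roll up per-(policy, workspace) pass/fail records into the counts
    described above. `results` is an iterable of (policy, workspace, passed)."""
    policies, workspaces, failures = set(), set(), []
    for policy, workspace, passed in results:
        policies.add(policy)
        workspaces.add(workspace)
        if not passed:
            failures.append((policy, workspace))
    failing_policies = {policy for policy, _ in failures}
    return {
        "policies_tested": len(policies),
        "workspaces_tested": len(workspaces),
        "policies_passing": len(policies) - len(failing_policies),
        "failures": failures,  # the details worth a closer look
    }
```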

Pre-Deployment Awareness

So there you have it: you can get an idea of what will break BEFORE you deploy a new policy. There are some risks, of course, including:

  • Mocks unavailable for some reason (see the FAQ below)
  • Old runs may not represent future runs. Tangentially, I’ll note that enforcing periodic re-paving and re-running plans to detect drift is a good idea anyway, and may help mitigate this risk
  • Missed orgs due to token permissions, etc.

I believe each of these is solvable and generally avoidable, but more importantly I’d like to note that without a solution like this in place, you get NONE of this feedback, so this is already a drastic improvement.

FAQ

Where’s the GitHub repo for this?

I would LOVE to share this code; however, that process is still ongoing and I’m presently unable to share it externally. I hope I’ve conveyed enough information for you to get a kickstart in implementing this sort of solution, though! If the time comes that I’m able to open source this code, I’ll be stoked to share it ASAP.

Why not just kick off speculative runs?

Great question! In many orgs – my current one included – the team making the Sentinel policies doesn’t have the authorization to run other teams' workspaces. This may sound off-putting to some, but at this scale it’s really better this way, and more secure to keep our permissions as minimal as possible. Further, none of our runs can be kicked off directly from the console; they must be created by a pipeline. Not only do I not have permissions to that pipeline, but there’s additional hardening in place beyond that. Long story short: I can’t kick off runs in workspaces I don’t own, so this method wasn’t viable for me, but it may well be for you, and if so you should absolutely use it!

HTTP Imports and making POST requests

This would be one method to go about this, yes. I could write a policy, set it to pass no matter what, and POST more details to another server. A few things made this look like a good second option if needed:

  • I would need another server, which is both more compute and more setup/management overhead for my team and me (e.g.: certs, requests, etc.)
  • The policies would need to authenticate to the server… more on this below

Authentication for the Sentinel HTTP import has a few implications beyond the standard tasks of getting a pair of service accounts with vaulted credentials and maintaining everything around them – it also means managing things like:

  • How will the policy authenticate to Vault? Probably Sentinel Parameters, but:
  • Will I use different params/tokens for each TFC/TFE organization? What about each TFE instance?
  • If I will use different ones, how will I build the logic to appropriately and reliably use the correct token(s)?
  • How will I rotate these Vault access tokens? Who will maintain them?

And that’s all just how Sentinel would get the token to access Vault, which holds the token for making a POST request to a second server. Yes, this sounds pretty tedious, though it’s potentially something to look into.

Why not use some sort of logging infrastructure (e.g.: Splunk, Elastic, etc.)?

I’d LOVE to. However, this runs into the first concern above about advisory enforcement, and secondarily, the integrations between centralized logging and Sentinel policy evaluation aren’t quite where I’d like to see them. I expect that in the future this might be part of a solution, though.

Why might mocks be unavailable?

There are a few reasons. Downloading mocks is a workspace-level permission, so if you’re not the instance/org/workspace owner, you’ll need one of those folks to add the “download mocks” permission to your team (“team” in the TFE/TFC sense, that is). Without this, you’ll get a 401/404 when downloading.

Another reason may be the age of the mocks. Historically, mocks became unavailable 7 days after run completion, but via the API I’ve found them available much longer (6+ months). That said, some mocks may be unavailable for one reason or another, and your automation should handle this.
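One way to make the automation tolerant of this is to collect download failures instead of aborting the whole crawl. A sketch, with fetch standing in for whatever download function you use (hypothetical names throughout):

```python
def download_all_mocks(workspaces, fetch):
    """Try to fetch mocks for every workspace, recording failures
    (e.g. 401/404 for missing permissions or expired mocks) instead of
    letting one bad workspace stop the run. `fetch` raises on error."""
    downloaded, skipped = {}, []
    for ws in workspaces:
        try:
            downloaded[ws] = fetch(ws)
        except Exception as exc:  # narrow this to your HTTP client's errors
            skipped.append((ws, str(exc)))
    return downloaded, skipped
```

The skipped list then becomes part of the pipeline report, so missing mocks are a visible line item rather than a silent gap in coverage.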