Transcript for "Replacing Terraform Module Forks with Automatic Policy Transformation Rules":
Hello. I hope everybody's enjoying IaC Conf. This session is "Replacing Terraform Module Forks with Automatic Policy Transformation Rules," and it's going to be run by Anton Babenko. Just a couple of housekeeping things. If you have any questions that you want to put in the chat, please put them in the Q&A box. That way, when it comes time for questions at the end, we can run through them. And if there are questions you really want answered, use the upvotes for that. All right, I'll leave it to you, Anton.

Thank you very much, Gabriela. It's my pleasure to attend this conference. I have known about this conference for quite some time, but this is the first time I have had the opportunity to present. And I'm really excited to present a subject that has been in my head for a very long time, approximately since I first discovered Terraform and wondered: why can't I use variables inside lifecycle blocks?

You probably know about me already. My name is Anton Babenko, and since 2015 I've been doing a bunch of different Terraform projects. I'm also an AWS Community Hero. You may have used terraform-aws-modules, and maybe pre-commit-terraform, a collection of Git hooks that helps you with Terraform quality standards.

Recently, I worked on the Terraform skill because, as we all know, there is AI. Even after listening to Corey's talk, I was thinking: yes, it was a problem that AI generates bad Terraform code or doesn't know anything about it. It can hallucinate. That's why I spent quite a lot of time making this Terraform skill. Combined with the official MCP server by HashiCorp, it does miracles. I have almost no problems with it. Of course, I use a bunch of other tools as well, but it helps.
So if you want to improve the way your AI generates Terraform code, and not just generates it but works with Terraform in general, please take a look. I also work on compliance.tf, which is partially related to this talk, but I want this talk to be not about compliance.tf but about the problem you can solve. And I will show how you can solve it the do-it-yourself way. There are also Terraform Best Practices, serverless.tf, and a few other resources you may have seen before.

This talk is about a few things. First, the fork graveyard, which many of us have experienced in the past. Then I will explain the difference between validation and transformation. I will explain what operational rules actually are, I will share the provider, and I will explain how to implement all of this yourself using open source tools that you have probably already used if you use Terraform. I will also explain the migration process.

So, the fork graveyard. We can all imagine the situation: we start using Terraform, and we find a module that doesn't have everything we want. So we fork it, we add our functionality there, and now we carry a maintenance burden for an unknown amount of time. We will miss upstream patches, the fork will drift, and that drift is something we have to handle. One company I talked to said that they forked a few terraform-aws-modules modules, hardcoded security settings, and now they maintain them forever. That's pretty much what I have shown here, and that's pretty much what many companies that use open source Terraform modules went through. Of course, you may think: that's a lot of accounts, that's a lot of modules to maintain.
Overall, DevOps engineers or SREs, so to speak, spend roughly 40 hours a month just implementing missing functionality that already exists in the upstream version of the module, or debugging edge cases that, again, have already been handled in the central terraform-aws-modules repositories.

Let's look into validation versus transformation. The key point here is that policy as code can reject your plan, but it's pretty hard to make a policy-as-code mechanism apply changes to your module. Everything is possible, but it's not a question of whether you should use OPA or Sentinel or Conftest to add a lifecycle block. If it's missing, then the plan will fail. That's it. And someone will still have to fork the module and implement this functionality.

So where can this potential misconfiguration happen? Obviously, if you are typing HCL code, anything can end up there: misconfigurations, wrong defaults, wrong parameters, copying from an ancient example. Everything can happen. If you are specifying variables, you may specify insecure or unsafe default values there. If you are writing a module source, most terraform-aws-modules modules do not limit what you can put in them, and this is intentional. Also, if you run a plan, cross-resource verification issues may not always be discovered at the right time. And if you run a policy check, as I said earlier, it can only say that something is bad, and you will have to fix it somehow later.

So look at these two enforcement models. Policy validation runs at plan time: it evaluates the plan and decides pass or fail. Policy transformation, in my case, happens at load time: when you run terraform init and hit certain registry endpoints, the code you receive is already updated and already written for certain requirements.
So all the different workarounds for things that historically were not possible to implement, because a lifecycle block cannot be set dynamically, or because a provisioner was placed there even though you know that using a provisioner is a bad thing to do, are no longer needed. That kind of functionality is removed when you use policy transformation at load time.

So there are two different types of checks. One is, so to speak, compliance controls, where things are strictly required. For example, encryption is required, or a public access block is required, and so on. They are strictly related to your audit process, so if you don't do this, you will most likely have issues during an audit. Operational rules are the new thing I'm talking about here: something you should do. For example, lifecycle protection is unlikely to affect your audit, but it will significantly improve the life of a DevOps engineer, who no longer has to maintain a whole module for a single-line change. Or provisioner removal is a great addition: you reduce the number of potential threats, and people will not be able to execute malicious shell scripts when an instance is created. Also, things like instance type restrictions and naming conventions just help you keep a good operational scope and are not related to audit in most cases.

Well, I can't run code here because, well, you can see my slides. I think they're awesome. But you can imagine that on the left side I show the before state, where I'm using a terraform-aws-modules module from the official HashiCorp Terraform Registry. When I run terraform init, it downloads the module. Then when I run make before, or OPA before, the only thing that happens is that it checks the Terraform plan, and inside the plan it checks whether this lifecycle block is present. And, obviously, it is not, because it's the official public terraform-aws-modules module.
On the right side is the after state, where you use the rules registry, which is sourced from registry.compliance.tf. Inside of these three dots there is exactly the same module source. You can specify the rules you want to apply, for example, in this case, a rule called prevent destroy data. Then, when you run it through OPA, it will now be able to identify the lifecycle prevent_destroy, because it was added as you downloaded this module during terraform init. That is the only difference you see here. So the overall difference for the user is just one line of code, where you switch from loading from one official endpoint to another and append the rules you want to use. You don't need to fork. Obviously, you don't need to make any modifications. You receive code that is already in the expected condition.

There are four categories of what people typically change inside modules. There are a few more, to be honest; I realize that when I talk to customers. But overall, there are four. The first is when you want to work around limitations of HashiCorp Terraform that we have known about since 2016. There are open issues for them. OpenTofu helps with some of them, but not all. For example, you may want to add or modify lifecycle blocks, or strip entire blocks, for example, provisioners. Or you may want to specify which things are allowed, for example, instance families, AMI IDs, or regions where infrastructure can be deployed. Most of these things are pretty hard to do inside reusable modules, especially if they span multiple teams, because different teams may have slightly different requirements, and it is hard to arrive at one canonical example that satisfies all of them. And there are also regex-based cleanups. For example, you may want to make sure that some credentials can never, in any situation, be exposed.
Or you want to make sure that duplicated attributes or debug statements are never placed there, and so on.

So overall, there are rules like prevent destroy data, which is typically applicable to S3, RDS, and DynamoDB, because you simply don't want to destroy data in a production environment, but you want to be able to do so in a dev environment, for example. Or ignore tag changes, which is quite a common request. Or ignore auto scaling changes inside DynamoDB tables, for example. And no provisioner is one of my favorites: I never allow provisioners to be used. If you want to use them, please find any other solution, but no provisioners.

Prevent destroy data is pretty much just adding three lines of code to S3 buckets and KMS keys, while no provisioner is just removing the provisioner block from null resources, or from any other resources where a provisioner, local-exec specifically, was used. Restricting instance types is another way to make sure that GPU and high-memory families, which can cost up to, I think, $25 per hour, are not an option, so we have to find another way.

Now think about the different types of rules. The most common one we have heard about is detective. OPA has told us for many years (not months, sorry) that it can evaluate a Terraform plan and then decide whether something is okay to be applied or should be denied. To my mind, this is already extremely late. First of all, you need to provide credentials to be able to run and get this plan, and by then it's already too late. With AI specifically, I don't want to find out five, ten, or twenty minutes later. When I write some code, I want to know immediately whether my code is good and has the potential to be applied. So detective is good, but it's more like legacy, I would say. Preventive is exactly the place where a solution should happen.
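A minimal sketch of the no-provisioner rule described above, written in Python rather than the tools named in the talk. It assumes the module source is plain HCL text with one block opening per line and balanced braces inside the provisioner body; the function name and the sample resource are hypothetical, and a real implementation would use a proper HCL parser.

```python
import re

# Matches the opening line of any provisioner block, e.g.  provisioner "local-exec" {
PROVISIONER_START = re.compile(r'^\s*provisioner\s+"[^"]+"\s*\{')


def strip_provisioners(hcl: str) -> str:
    """Return the HCL text with every provisioner block removed.

    Naive brace counting: assumes braces inside the block are balanced.
    """
    kept = []
    depth = 0  # greater than 0 while we are inside a block being dropped
    for line in hcl.splitlines():
        if depth == 0 and PROVISIONER_START.match(line):
            depth = line.count("{") - line.count("}")
            continue  # drop the opening line
        if depth > 0:
            depth += line.count("{") - line.count("}")
            continue  # drop everything up to and including the closing brace
        kept.append(line)
    return "\n".join(kept) + "\n"


# Hypothetical input, for illustration only.
SAMPLE = '''resource "null_resource" "example" {
  triggers = {
    always = "1"
  }

  provisioner "local-exec" {
    command = "echo ${self.id}"
  }
}
'''

cleaned = strip_provisioners(SAMPLE)
print(cleaned)
```

Because the rewrite happens before Terraform ever sees the module, nothing downstream needs to know that the provisioner existed.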
If something should be denied, well, then I don't even have to know about it. Just remove it from the code, give me a solution where this misconfiguration can never happen, and I will be blessed. So preventive is the future, to my mind. Residual is what happens pretty much outside of Terraform: ClickOps and different pipelines, where things are pretty much not in our control. Console ClickOps is a typical example.

I still have some time, so let me talk about one of my favorite things that I have developed, and I really want to hear feedback on it from the community. Raw AWS resources can still slip past the module rules. Think about it this way: you write Terraform code and you don't want to have any extra tool that controls whether you are allowed to apply certain code or not. You already run, let's say, terraform apply, and it uses the AWS provider. Terraform core and the AWS provider communicate over gRPC, and the provider speaks to AWS endpoints. The provider I have developed lives under the compliance.tf namespace and sits in front of the official HashiCorp AWS provider. It has an extremely small surface in terms of what it can do. The only thing it does is verify whether something is okay or not.

So, for example, you run terraform plan with code that consists of handwritten resources, but it can also be any Terraform resources, any nested resources, anything that a tool like Checkov or Trivy cannot properly handle. You run any Terraform code, Terraform calls the compliance.tf AWS provider, and the compliance.tf AWS provider figures out whether this should even be passed through to the HashiCorp AWS provider or not. This is the first gate, and it can decide whether this is a hard fail, a soft fail, or just advisory.
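The gate decision can be sketched roughly as follows. This is an illustration of the three modes only, not the compliance.tf provider's actual code: the `Mode` and `Decision` types, the `check_s3_acl` function, and the message format are all hypothetical, and the example rule is the public S3 ACL case from the slide.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    HARD_FAIL = "hard_fail"  # plan fails, nothing reaches the downstream provider
    SOFT_FAIL = "soft_fail"  # warn, but still pass the request through
    ADVISORY = "advisory"    # only record an audit entry, then pass through


@dataclass
class Decision:
    forward: bool  # should the request reach the downstream provider?
    message: str


PUBLIC_ACLS = ("public-read", "public-read-write")


def check_s3_acl(attrs: dict, mode: Mode) -> Decision:
    """Decide what to do with an S3 bucket configuration under the given mode."""
    acl = attrs.get("acl", "private")
    if acl not in PUBLIC_ACLS:
        return Decision(True, "ok")
    finding = f'acl = "{acl}" allows public access; suggested fix: acl = "private"'
    if mode is Mode.HARD_FAIL:
        return Decision(False, "error: " + finding)
    if mode is Mode.SOFT_FAIL:
        return Decision(True, "warning: " + finding)
    return Decision(True, "audit: " + finding)


for mode in Mode:
    print(mode.value, check_s3_acl({"acl": "public-read"}, mode))
```

Only the hard-fail mode stops the request; the other two differ in how loudly the finding is reported.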
In this example, ACL public-read is a violation of CIS 2.1.5, because that control says that S3 must block public access, and you are trying to allow public access. So it can tell you what kind of fix should be applied, and you apply this fix. The Terraform plan output shown on the right side is pretty much what you would see from a normal execution of terraform plan. I think it's pretty cool because you can actually see the fix inline: which control, which attribute was affected, and which resources, in addition to the traditional information about which line was changed or affected, in which file, and so on.

There are different levels, or modes, in which it can fail. Hard fail means the plan fails and nothing goes to the downstream provider. Soft fail just warns you; it helps if you are doing a migration, because it can tell you that, okay, it's time to fix it. Advisory simply writes an audit log entry and passes the request through.

I really want to know your feedback on this. Think about it this way: operational rules can rewrite the source code of the modules. Currently, I'm doing this for terraform-aws-modules, but I will explain how to do this for anything else, I think on the next slide. Then there is OPA, which can check whether the generated plan is okay or bad. And then there is the compliance.tf AWS provider, which catches raw resources. It can catch any type of resource, whether it comes from a module or is a raw resource. It doesn't really matter. So feedback is welcome.

Let me explain how things are built. I will explain, using different tools, how you can build it yourself. The first tool, which I have used heavily for many years, is hcledit. For example, to add a lifecycle block inside a resource, you first append the block to the resource, and then you append attributes inside that block.
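The same two-step idea (add the block, then add the attribute inside it) can be sketched in plain Python for readers who want to see the effect without hcledit. This is an illustration under simplifying assumptions: resources start at column zero, span several lines, and do not already contain a lifecycle block. The function name and sample input are hypothetical.

```python
import re

# Lines inserted just before the closing brace of each matching resource.
LIFECYCLE = ["", "  lifecycle {", "    prevent_destroy = true", "  }"]


def add_prevent_destroy(hcl: str, resource_type: str) -> str:
    """Append a lifecycle block with prevent_destroy to resources of one type."""
    start = re.compile(r'^resource\s+"%s"\s+"[^"]+"\s*\{' % re.escape(resource_type))
    out = []
    depth = 0  # brace depth inside a matching resource, 0 when outside
    for line in hcl.splitlines():
        if depth == 0 and start.match(line):
            depth = line.count("{") - line.count("}")
            out.append(line)
            continue
        if depth > 0:
            depth += line.count("{") - line.count("}")
            if depth == 0:
                out.extend(LIFECYCLE)  # this line is the resource's closing brace
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical input, for illustration only.
SAMPLE = '''resource "aws_s3_bucket" "this" {
  bucket = var.bucket
}

resource "aws_instance" "web" {
  instance_type = "t3.micro"
}
'''

protected = add_prevent_destroy(SAMPLE, "aws_s3_bucket")
print(protected)
```

Only the bucket gains the lifecycle block; the instance is left untouched because its type does not match.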
Very straightforward, and it works perfectly because it's supported by HCL. hcledit has a few limitations, but not because hcledit is bad; it's because of the way the HCL specification is implemented. Another tool is mapotf. I will put a link. Well, actually, all of the links will be on the last slide.

If you want to do block removal, it's a little bit trickier. You can use something like awk and write this magic script. You can also use different HCL libraries, such as Python HCL libraries, and convert it like this: you can always do the manipulation by converting into PyHCL, removing the block there, and then dumping it back into HCL. But I think awk is okay here, and we have used it in some terraform-aws-modules modules for quite some time.

Okay, one more, about adding validation. For example, first we add a block into a variable, like variable instance type, and then inside this block we add a condition. That is the first thing. Then, if we want to validate inside the plan that this condition was actually there, we check it using this slightly magic jq expression.

Content sanitization is not really related to Terraform, I would say, but I still find it quite helpful. For example, you may want to rewrite bucket names for certain things regardless of what the user has specified, or you may want to do some refactoring as well. So sed, combined with hcledit, is helpful. But in most cases, we don't want to commit secrets into Git, so gitleaks together with pre-commit hooks does the job.

So imagine that you have created these small scripts. What you do now is roll your own module registry. The simplest one is an HTTPS endpoint that returns a fixed module zip, where you apply all of your rules inside that zip, for example, in Lambda, and you point Terraform to this registry. It can be plain HTTPS or it can be a Terraform registry.
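A rough sketch of the routing such a registry needs, following Terraform's documented module registry protocol: a service discovery document, a versions listing, and a download endpoint that answers 204 with an `X-Terraform-Get` header. The catalogue contents, the `acme/s3-bucket/aws` address, and the archive location are hypothetical, and the step that applies the rules to the zip is deliberately left out.

```python
import json
import re

# Hypothetical catalogue: module address -> versions this registry serves.
CATALOGUE = {"acme/s3-bucket/aws": ["1.0.0"]}
# Hypothetical location of archives that already had the rules applied.
ARCHIVE_BASE = "https://reg.example.com/archives"

JSON_HEADERS = {"Content-Type": "application/json"}


def handle(path):
    """Map a request path to (status, headers, body)."""
    if path == "/.well-known/terraform.json":
        # Service discovery: tells Terraform where the modules API lives.
        return 200, JSON_HEADERS, json.dumps({"modules.v1": "/v1/modules/"})

    match = re.fullmatch(r"/v1/modules/([^/]+/[^/]+/[^/]+)/versions", path)
    if match and match.group(1) in CATALOGUE:
        versions = [{"version": v} for v in CATALOGUE[match.group(1)]]
        return 200, JSON_HEADERS, json.dumps({"modules": [{"versions": versions}]})

    match = re.fullmatch(r"/v1/modules/([^/]+/[^/]+/[^/]+)/([^/]+)/download", path)
    if match and match.group(2) in CATALOGUE.get(match.group(1), []):
        name = match.group(1).replace("/", "-")
        url = f"{ARCHIVE_BASE}/{name}-{match.group(2)}.zip"
        # No body: Terraform fetches the archive from the URL in this header.
        return 204, {"X-Terraform-Get": url}, ""

    return 404, {}, ""


print(handle("/v1/modules/acme/s3-bucket/aws/1.0.0/download"))
```

With something like this behind an HTTPS endpoint, a module block would use a source such as `reg.example.com/acme/s3-bucket/aws` together with a version constraint, and the archive it receives would already be transformed.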
Like in this case, it's reg.example.com. It's very well documented, and it has very few edge cases, I would say. And, well, this is how compliance.tf is built internally. Of course, it's not that simple for real, but to get started and to have your own way of applying operational rules, that's pretty much everything you need.

I want to highlight once more that there are two things available. Validation is pretty well known, and we have heard about it for many years. It's all you need to run policies. OPA, Sentinel, and Conftest have been present for many years. But we are very seldom given options for how to actually transform code without forking modules. So I want you to pick on purpose and know that each has different benefits.

Speaking of tools, if we look at the tools we normally use when we write Terraform code, there are TFLint, pre-commit-terraform hooks, and different Language Server Protocol implementations. When it comes to input variable validation, then, of course, the validation block is essential. It's a good idea to reject bad values at plan time, so that they are discovered during the plan and not during the apply phase, although that is not always possible. If you are doing it during plan, remember that there are preconditions and postconditions, which can help quite a lot with cross-resource invariants, and the provider proxy, which I showed earlier. Proxying AWS requests through the compliance.tf AWS provider can give you a hard fail or a soft fail, for example, if you want to know immediately and decide early on.

Detective is still there, too. Sentinel, Conftest, and OPA are good representatives, and there are quite a lot of materials, libraries, code, and utilities around them. So it's a pretty good idea to have it there, but it's already a later, last phase, I would say. Well, it's not quite the last. AWS Config, CloudTrail, and Prowler are all detective things.
They will only discover things when they have already been executed. Most likely, it's already too late.

So, these are the resources. Some of them I mentioned: hcledit, and mapotf, which is by Azure. awesome-terraform-compliance is my list of resources that I collect and discover while I work on compliance.tf. And compliance.tf itself is a solution that enforces different compliance controls and operational rules for Terraform modules. It has pretty extensive documentation on different controls, frameworks, mappings, and so on. And, yeah, that's pretty much it. Thank you very much. Tell me if you have questions, and I can read them.

Thank you so much, Anton. That was a wonderful presentation. Let's go over to the Q&A. If you have questions, please submit them. And if you see good ones, upvote them, because we'll start with the most upvoted one. How do you make sure Terraform Cloud modules are updated automatically without breaking existing deployments?

Can you please repeat?

Yeah. How do you make sure Terraform Cloud modules are updated automatically without breaking existing deployments? I see a lot of environments where we're pinning to a specific version, and that seems wrong because then you miss out on updates, or not pinning, which seems wrong because then things break without you knowing when. What's the right answer?

Yeah, that's a very good question, and I think it's one of the oldest questions people keep wondering about. One thing that helps is that, of course, you pin the version. As soon as you pin the version, you need to run some sort of Renovate bot, which checks for new versions and runs a sequence of tests. For example, it can automatically open a pull request with the new version, test the new version, and try to plan the dev environment. For terraform-aws-modules, we strictly follow a semantic release process.
Well, I don't want to say strictly, because I immediately think of cases when we didn't follow it and people were disappointed, to put it mildly. But, yeah, we try to do our best. So in general, if you have multiple environments and you can run terraform plan, then there should be a Dependabot or Renovate bot that checks your versions and tries a plan for them. It should work with every module, including public and private modules.

That's perfect. I think all of your examples were AWS. Will Dewar asks, do you have plans to support Azure?

Well, interesting. I'm kind of watching what Azure people are doing. Funnily enough, AWS and Azure people live in parallel universes. Azure has quite a lot of good solutions already. They have the Azure Verified Modules concept. They even have a book about Azure Verified Modules, which I shared, I think, last week on the weekly.tf website. So I'm interested in not just sitting and making modules one by one, because that's kind of boring. I have been doing that for almost the last ten years. I'm looking into actually using AI. Have you heard about AI? Right. So, using AI to propose, to write, to verify, and to do as much as possible of what I've been doing. One of my projects, which I share as the Terraform skill, is where I share how I write Terraform code. This is the foundation of any module, whether for Azure or for AWS. And probably around September or October, I will share another project where I explain how to write pretty much any module. Not just how to write it, but how to write a module of the same quality as terraform-aws-modules, which is a little bit different from writing an internal module that no one except your team will be using.

I think you're the singular person in the world to write something like that, so we'll keep an eye out for that. Ricardo asks, are the transformations behind a paywall?
I think Ricardo is referring to transformations for the operational rules. Well, that's a good question. I don't remember. No, actually, I think they're not, because, as I said on this slide, the free tier with CIS for AWS is free forever, as are the operational rules. So if you open an account and try these operational rules, they will keep working forever. The only thing you need is to open a free account, which is just a click on log in with Google. So it's not behind a paywall.

Perfect. All right, we have forty-five seconds for one more question, and it's mine. Can these transformation and validation rules be run in a Git commit hook on a dev machine, to push this out to your fleet of developers?

Yeah, absolutely. The way you can think about this is that these hooks are a transformation of any piece of code you want to consume. So if you put this piece of code inside your Terraform module source block, you can proxy the module source through this transformation engine. So it can be Git, it can be S3, it can be, I don't know, anything else that go-getter accepts as part of Terraform.

That's perfect. Okay, that's it for our time today. Thank you so much, Anton. We'll get switched over to the next talk.

Yeah. Thank you very much. I'm pleased to be here. Thanks.