
On Responsibly Integrating AI Tools into Software Development

Or, how to unsloppify your AI-generated code. Why guidance and oversight are the missing ingredients in AI-assisted development.

Itamar Zwi
14 min read

AI coding tools have quickly become an integral part of software development ever since ChatGPT's debut in late 2022. Nowadays, it is difficult to be a software engineer without using these tools in at least some capacity; and yet there is significant pushback from some developers, who often dismiss AI-generated code as "vibe-coded slop".

These tools are powered by large language models trained on trillions of lines of code, discussions, and Q&A threads from thousands of different sources. The models are trained on this data and then fine-tuned with various forms of human oversight. The end result is a powerful tool that can produce code at blazing speed. With proper guidance, this code can be robust, maintainable, and near production-ready within minutes or hours, instead of the usual days or weeks. So why the pushback? There's one word that makes all the difference: guidance.

Where AI shines

When it comes to producing small batches of self-contained code and proofs of concept, AI tools can do a fantastic job. They can do it much faster than any single developer possibly could, and high quality can be achieved with little human interaction.

No matter how complex a system is, it will always contain small batches of self-contained code, and there will always be a first draft for any new development. This is where AI truly shines and can boost your productivity tenfold, or even more. For example:

  • Boilerplate code generation
  • Documentation and usage examples
  • Test scaffolding and test case generation

But real-world systems are not only made of small batches of self-contained code and proofs of concept. They are complex, with ever-shifting design philosophies that are influenced by the product team, the target audience, and whatever the current hot trend is. Without proper guidance, there is only a slim chance that the AI will output something that truly fits your system's needs.

That is not to say that AI can't be guided. AI can learn from your existing codebase and apply the same architecture and design to new features, and it can learn these things faster than any human could. If you give an AI coding tool a properly defined set of skills and rules, it will follow them every time. Instead of maintaining dozens of pages about the team's practices and philosophies, spread across Confluence pages and scattered Markdown files, these ideas get maintained as skills in your repository. Now not only can the AI use them, but human developers get a much faster way to access that knowledge. Instead of interrupting your teammates or rummaging through pages of documentation, you can simply ask the AI whether a guideline exists, how a pattern should be implemented, or why a certain architectural decision was made – and get an immediate answer.
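As an illustration, such a convention can live in the repository as a small skill file. Everything below is hypothetical – the path, frontmatter, and identifiers are invented for the example, loosely following the Claude-style skills format:

```markdown
<!-- .claude/skills/api-error-handling/SKILL.md (hypothetical path and format) -->
---
name: api-error-handling
description: How and why this codebase wraps API errors.
---

# API error handling

- All API calls go through `apiClient`, which wraps failures in `AppError`.
- Why: downstream logging and user-facing messages rely on `AppError.code`.
- Never catch raw `fetch` errors in feature code.
```

The same file answers both audiences: the agent loads it when relevant, and a new hire can read it – or ask the AI to summarize it – instead of digging through a wiki.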

Where AI fails and humans shine

Despite all of this, a lot of experienced developers still push back on AI-generated code. You'll hear the same criticisms repeated: it doesn't produce good code, it lacks a real understanding of the system, it generates code that just feels off. Those claims aren't wrong per se; they describe a real phenomenon, and one that is a real pain in the ass for open source maintainers nowadays, where unguided AI tools are introduced to existing systems. But again, one word makes all the difference: unguided.

These bad results are not a reflection of AI coding tools in general; they're a reflection of what AI will create when it is left unguided. AI behaves a lot like a junior developer. If you expect high-quality, production-ready code without providing guidance, rules, or feedback, the problem lies with you. The difference is that, unlike a junior developer, AI can learn and apply what you teach it almost instantly.

Lack of oversight

The two most common problems I see in how programmers use AI are lack of guidance and lack of oversight. The latter, lack of oversight, tends to show up more among less experienced developers. They feel more comfortable blindly trusting AI-generated code, as they lack the intuition to doubt it in the first place. But AI makes confident-sounding mistakes all the time; it is in the nature of an LLM-powered tool. While AI coding tools implement many features to mitigate this, it still happens, and it happens often. AI coding assistants are powerful but untrustworthy; if you can't verify whether the AI's response is correct, you should not be using it for that task. It's a velocity booster, but if you lack the understanding to produce quality output by yourself, the AI is not going to change that.

But while lack of oversight alone could get a whole article dedicated to it, it is just not that interesting a problem. It's the natural outcome of inexperience mixed with a desire to deliver and access to a powerful tool that can do just that. It is the "copy and paste the wrong answer from Stack Overflow" of the 2020s. Lack of guidance is a much more interesting problem, and it's one that a lot of developers, even experienced ones, run into. The reason is that unlike lack of oversight, which correlates with lack of experience, lack of guidance is not a problem that existed before AI coding tools came along.

Lack of guidance

Before an AI tool can be good, it must be taught what "good" even is. I already mentioned how the underlying LLMs are trained and fine-tuned to produce astoundingly good results, and it's true that AI coding tools can produce good code right off the bat, but programming is an opinionated field. There is no single best way to do anything, and different teams, even different developers from the same team, can produce equally solid and yet vastly different solutions to the same problem.

The senior developers of today have spent years refining their skills through a process of learning to use the tools that they use (shocking, I know). The problem with AI is that learning how to use it is not enough; you must teach it as well before it can produce good results. This unfamiliar process of having to teach the tool before it becomes useful can trigger an initial aversion to it. I had that exact reaction when I first caved and tried an AI coding assistant a year ago.

I had heard from some friends, a CTO and a principal engineer I had previously worked with, that they were letting AI write full features for them by simply copying and pasting the work item description into the chat. I was skeptical, but it piqued my interest enough to make me try it for myself. I wanted to see how the AI would deal with a simple end-to-end task: a work item we had for implementing a notification settings page.

So I installed Cursor, paid for their pro plan, and pasted in the work item description. The work item was well worded and included a detailed architecture, and I carefully reviewed the output as it was being generated to make sure the code was of good overall quality – and it was! Unfortunately, when I looked at the final product, there were many problems with it. That code would never have passed a code review. At least, not at Gridline.

The core of the issue was that the AI didn't know to ask the right questions. Is there a save button for the whole page, or is every change atomic? Are you using optimistic updates, or disabling the form while there's a pending mutation?

It was operating off of its base configuration which was good, but not good enough, and not aligned with the system. You could tell it to look for similar forms and do as they do, but then it might copy the design from the account security settings page, which is intentionally different for security reasons, or the "Contact us" form, which still suffers from tech debt from before the redesign, and should not be looked at for inspiration.

For AI-generated code to be good, it must be properly guided. Some of this guidance can come from skills, but almost every task has domain-specific requirements that call for manual guidance. In our example, optimistic updates might not be in use elsewhere in the system, and there may even be an AI skill in the codebase that guides it to disable forms during pending mutations. But the notification settings page should feel immediate and responsive, and should use optimistic updates anyway.
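The two form behaviors in question can be sketched without any framework. Everything here is illustrative: the `Settings` shape and the `persist` callback are invented for the example, standing in for whatever state and API layer a real page would use.

```typescript
// A minimal, framework-free sketch of the two save strategies discussed above.
// `Settings` and `persist` are hypothetical stand-ins for real app types.

type Settings = { emailAlerts: boolean };

// Pessimistic: keep the form disabled until the server confirms the write.
async function savePessimistic(
  next: Settings,
  persist: (s: Settings) => Promise<void>
): Promise<Settings> {
  await persist(next); // the UI stays disabled while this is pending
  return next;         // commit only after the server confirms
}

// Optimistic: show the new value immediately, roll back if the write fails.
async function saveOptimistic(
  current: Settings,
  next: Settings,
  persist: (s: Settings) => Promise<void>
): Promise<Settings> {
  let shown = next;    // the UI reflects the change right away
  try {
    await persist(next);
  } catch {
    shown = current;   // roll back to the last confirmed state
  }
  return shown;
}
```

Neither strategy is universally correct – which is exactly the kind of decision an agent can't infer from its base configuration.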

Teach your agents

An agent behaves very much like an overconfident junior developer. It won't follow your team's conventions unless you teach them to it. And as I mentioned before, AI is much faster at learning than a junior developer is; it simply needs to be taught.

The quality of your AI-generated code is directly correlated with how much it has been taught about the system. It can't load your entire codebase into its context window at once, so it uses skills to dynamically load relevant information and apply complex logic only when it needs to. The more skills you have, and the better you write them, the better your AI-generated code will be.

I will not elaborate on how to write good agent skills – there are plenty of resources for that online already – but I will try to give some advice on how and when to create them.

Community skills

There is an abundance of open source and community-maintained skills out there that can help unsloppify your AI-generated code, but the skills marketplace is quite overwhelming. The claude-plugins-official marketplace, a curated directory of high-quality plugins for Claude, has 126 plugins alone. Looking at the most popular entries, we can see some awesome-sounding plugins:

  • frontend-design – Generates distinctive, production-grade frontend interfaces that avoid generic AI aesthetics.
  • superpowers – Superpowers teaches Claude brainstorming, subagent driven development with built in code review, systematic debugging, and red/green TDD. Additionally, it teaches Claude how to author and test new skills.
  • code-simplifier – Agent that simplifies and refines code for clarity, consistency, and maintainability while preserving functionality. Focuses on recently modified code.

This all sounds amazing! But if you install all of these right away, you will soon find that you are no longer in control of your own generated code. Take frontend-design, for instance, the most popular plugin on the official marketplace, which guides Claude to "generate distinctive, production-grade frontend interfaces that avoid generic AI aesthetics." What the hell does that even mean? What guidance does this plugin provide to reach that supposed end goal? And does it truly align with your team's needs?

While it's possible to just install the plugin and see if the output is better, that would be a mistake. Unlike libraries in code, which either fit your needs or don't, an agent skill (or plugin, which for the purposes of this article will be treated as a collection of skills) is a lot more nuanced. It can be mostly good but include some decisions that you don't want to incorporate. It can seem good, but in reality cause difficulties down the road stemming from inflexible rules.

The unfortunately boring answer is that every skill you add to your codebase has to be thoroughly reviewed and vetted – by you, by the team, by the tech lead, or by whoever is chosen to hold that responsibility. This vetting process is not something that can be handed off to yet another AI tool, because the missing piece that needs to be filled in is the human element. For community-managed and marketplace skills, this can be quite a large task. It can take days to fully go over a marketplace plugin's definitions and refine them to your team's needs, but the end result is a productivity booster that can't be argued against, and it absolutely justifies the effort.

In-house skills

Aside from community plugins, you should also maintain a set of skills that are built and maintained in-house. These skills encompass the specific conventions and guidelines employed by your team. You can think of them like customizing linter rules after adopting a popular community configuration as a base. These skills serve two purposes. The first, obviously, is guiding your AI tools into generating better code. The second is documenting the team's conventions in a way that is both human- and AI-readable – and, as a neat bonus, AI-summarizable. Now onboarding can become a lot simpler, as new hires can ask the AI for the relevant guidelines whenever they need to, instead of rummaging through dozens of pages of guideline documentation.

One important thing to remember when maintaining these skills is to teach the AI whenever it makes a mistake. In the past, if you knew that your team prefers arrow functions over traditional function expressions, you would just write them that way. Nowadays, if your AI-generated code suddenly uses traditional function expressions, you are faced with four choices:

  1. Just accept it. It's a minor change and not worth the effort.
  2. Correct it yourself.
  3. Tell the AI to correct itself in the current session.
  4. Create a skill to teach the AI not to repeat this mistake again.

I hope I don't need to say it, but the only correct option here is #4. There is no reason to deviate from existing or desired standards just because the AI didn't originally adhere to them; correcting it yourself is a waste of time, and telling the AI to correct itself in the current session is a temporary fix that will lead to the same problem again in the future. This fits our "junior developer" analogy perfectly – if a new hire made a PR with some slight (or big) deviations from the standard, the only correct response would be to teach them that it's wrong, why it's wrong, and guide them not to make that mistake again.
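Option #4 can be as small as a few lines. The path, frontmatter, and format below are hypothetical, modeled loosely on how Claude-style skills are typically written:

```markdown
<!-- .claude/skills/function-style/SKILL.md (hypothetical path and format) -->
---
name: function-style
description: Preferred function syntax for all TypeScript code in this repo.
---

# Function style

- Always use arrow functions; never use traditional `function` expressions.
- Bad:  `const add = function (a: number, b: number) { return a + b; };`
- Good: `const add = (a: number, b: number) => a + b;`
```

Once a file like this exists in the repository, the mistake is fixed for every future session, not just the current one.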

A good way to test if your skills are properly defined is to let the AI try to make a full end-to-end feature, and see where the imperfections are. It's a quick and easy method of seeing what rules are missing, what edge cases were potentially missed, and what can be improved even further. Aside from that, you must maintain a constant cycle of updating your agent skills and creating new ones whenever necessary.

Conclusion

If there is one takeaway I want you to have from this article, it's this: your AI is capable of learning, and you should teach it. Maintaining agent skills is the most important thing I see teams NOT doing nowadays; doing it will make your life easier and your velocity higher.

AI is changing the way software engineering works; there is no denying that. From AI coding assistants to automated reviews, streamlined DevOps, and increased velocity, its effects are felt all across the tech world. But none of that replaces engineering discipline – if anything, it amplifies the need for it, as mistakes are much easier to make now.

At Gridline, we make sure to stay at the forefront of AI-assisted development and agentic engineering, but we stand firm that it never comes at the expense of code quality – and the results cannot be argued with. Our testimonials speak for themselves: high-quality code delivered at unparalleled velocity, while developers maintain a deep understanding of the systems they build.

Disclaimer – No generative AI tools were used in the creation of this article