
Principles for Building One-Shot AI Agents for Automated Code Maintenance

EdgeBit is a security platform that helps application engineering teams find and fix security vulnerabilities. The Dependency Autofix feature contains an extremely accurate reachability engine to identify the impact of each update on your app. Most updates have no impact, so engineers using EdgeBit can focus their effort on the impactful upgrades. This translates to more time spent “on-mission” instead of managing dependencies.

This post covers how to identify areas that can 1) use focused tools, 2) handle errors intelligently and 3) harness the persistence of an AI agent to unlock massive efficiency gains, as we have done for Dependency Autofix, with some data to back it up.

What is a One-Shot Agent

What is a “one-shot” AI Agent?
A one-shot AI agent enables automated execution of a complex task without a human in the loop.

EdgeBit’s Dependency Autofix is built around one-shot code maintenance workflows: no human input is required. Unlike the typical AI-in-the-IDE experience, where a developer manually accepts or rejects changes, our AI agents handle updates autonomously with a high degree of confidence.

Our confidence comes from three sources:

  1. static analysis that deeply understands how your app uses its dependencies
  2. calculation and execution of dependency updates
  3. an agentic workflow that is consistent and correct (what this post is about)

Confidence is extremely important because a one-shot agent must either do the job correctly, bail out before it does damage, or ask a human for review. Since we want to keep engineers focused “on-mission”, we don’t want to call for review often.

Agent vs Pipeline

Prior to introducing an agentic workflow for updating dependencies, our workflow was pipeline-based. That made it fairly deterministic: it was given a concrete list of inputs and proceeded linearly.

Experiments with an unrestricted, fully agentic workflow for automated dependency updates yielded a gain in “fuzziness” around inputs (what to update) and outputs (adapting code to API changes), but this was offset by a lot of chaos in the middle.

Our goal was to keep as much of the pipeline’s determinism as possible while benefiting from the desired fuzziness. Below we outline the three key principles that helped us achieve this, supported by data to validate our approach.

Our agent runs from a framework that produces a single binary with all of the logic, system prompts and tools embedded in it; certain tools execute Docker containers for isolation and repeatability. This is ideal for running within customer infrastructure or within the ephemeral VMs in our SaaS.

Here’s an example task prompt:

Update javascript dependency react-redux to 9.2.0. Update first, then if there is potential impact to the app, adapt code to changes in the most minimal way, in a separate commit.

Read on to learn how we achieved parity with our pipeline-based approach with a one-shot agent. But first, some evidence that a focused agent works.

Agent Consistency & Correctness

We ran 10 iterations of this prompt, upgrading a package in three codebases of different sizes, to push the boundaries of consistency and correctness.

The Basic App is very simple, but the upgrade brought in a complete rewrite of the library, albeit one that remained API-compatible. This tests whether the agent will get distracted by lots of churn.

The Web App required call site mutation due to a CommonJS to ESM migration between the major versions.

The Complex Webapp was a very large codebase that required a library update but no call site mutation, although 16 call sites called into the updated library. This allowed us to test the model’s tendency to stay in its lane, making only the required changes.

One-Shot vs Claude Code

The Web App with Redis test case was also benchmarked against Claude Code by simulating a one-shot workflow: manually accepting every change it wanted to make.

As you can see, Claude Code was very inconsistent and had issues in almost every run. Incomplete runs typically either didn’t update the lockfile along with package.json or included Yarn metadata in our NPM project’s package.json.

In a few code mutation runs, Claude modified details about how our Redis sessions were namespaced, which would break expectations in the app.

When results matter, the correct context costs more. Running the best tools for the job costs more. When working on a one-shot workflow, we prefer a correct result over almost anything else.

Principle 1: Focused Tools Work Better Than Generic Ones

Focused tools naturally have boundaries to their problem space. This allows them to exit early before wasting tons of tokens or, if conditions won’t allow them to succeed, to return a meaningful error. That is really important feedback for the agent, versus having it interpret generic failures and spiral off into empty space.

Without this, a frequent failure mode is for two very generic tools like exec and fs_put_file to cause a loop of chaos that, when simplified, looks like this:

Agent  Info   Lets test out our changes
Tool   Info   exec `node index.js`
Tool   Error  exit(1)
Agent  Info   Hmm, let's try a different way
Tool   Info   Wrote changes to index.js
Tool   Info   exec `node index.js`
Tool   Error  exit(1)
Agent  Info   Hmm, looks like the app won't run, let's write a test script
Tool   Info   Wrote changes to test-update.js
Tool   Info   exec `node test-update.js`
Tool   Error  exit(1)
Agent  Info   Lets try something else

The solution is to provide a tool with a helpful description that wraps your common tasks, like building, testing, updating, downloading, etc.
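
As a sketch of what such a wrapper can look like, here is a hypothetical focused test tool. The type name, input fields, container image and docker invocation are illustrative, not our exact implementation, but the shape is the same: exit early with a meaningful error, then run the real work in a pinned environment.

package tools

import (
  "context"
  "fmt"
  "os"
  "os/exec"
  "path/filepath"
)

// Hypothetical sketch of a focused "test" tool.
type TestTool struct{}

func (t *TestTool) Description() string {
  return "Runs the project's test suite in a pinned container. " +
    "Use this instead of invoking build or test commands with exec."
}

type TestInput struct {
  Directory string `json:"directory" jsonschema_description:"Directory containing the project to test"`
}

type TestOutput struct {
  Passed bool   `json:"passed"`
  Log    string `json:"log"`
}

func (t *TestTool) Execute(ctx context.Context, input TestInput) (TestOutput, error) {
  // Exit early with a meaningful error instead of letting the agent discover
  // missing prerequisites one failed exec at a time.
  if _, err := os.Stat(filepath.Join(input.Directory, "package.json")); err != nil {
    return TestOutput{}, fmt.Errorf(
      "no package.json found in %s; run the inventory tool first to locate the project",
      input.Directory)
  }

  // Run the suite in a pinned container for repeatability (image is illustrative).
  cmd := exec.CommandContext(ctx, "docker", "run", "--rm",
    "-v", input.Directory+":/src", "-w", "/src",
    "node:20", "npm", "test")
  out, err := cmd.CombinedOutput()
  return TestOutput{Passed: err == nil, Log: string(out)}, nil
}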

Tools Reinforcing Workflows

Focused tools also enable the use of well-proven libraries in places where models frequently fail, like semantic version comparisons. Left to its own devices, the LLM will confidently declare that 9.1.1 is already greater than 9.2.2, so the update must already be installed.
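
Delegating the comparison to a semver library turns “is 9.2.2 newer than 9.1.1?” into a computed fact rather than something the model asserts. A minimal sketch, assuming the Masterminds semver package (the same semver.NewVersion style used by the update tool below):

package main

import (
  "fmt"

  "github.com/Masterminds/semver/v3"
)

func main() {
  installed := semver.MustParse("9.1.1")
  requested := semver.MustParse("9.2.2")

  // The library gets this right every time; the model often does not.
  if installed.LessThan(requested) {
    fmt.Println("update needed: 9.1.1 -> 9.2.2")
  } else {
    fmt.Println("already up to date")
  }
}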

Providing just two tools can nudge the agent to anchor its logic in ground truth:

  1. executing an update: js-update
  2. understanding our dependencies: inventory

Let’s explore a brief example of how these tools work together.

The inventory tool’s description hints that it is useful both pre- and post-update, which guides when the agent reaches for it. The tool treats lock files and similar metadata as the source of truth. A nice side effect is that file reads now take the expected nanoseconds, and you’re not wasting tokens having the model interpret files.

func (t *InventoryTool) Description() string {
  return "Provides a list of all dependencies in the codebase, optionally filtered " +
    "by package name or ecosystem. Useful for researching versions before an upgrade " +
    "or verifying that an upgrade has been applied."
}
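
Under the hood, the versions themselves can come straight from the lockfile. A minimal sketch of that lookup, assuming an npm package-lock.json with lockfileVersion 2 or later (the helper and type names are hypothetical):

package tools

import (
  "encoding/json"
  "fmt"
  "os"
  "path/filepath"
  "strings"
)

// Hypothetical sketch of the lockfile-backed half of the inventory tool:
// versions come straight from metadata, not from the model's reading of the file.
type npmLockfile struct {
  Packages map[string]struct {
    Version string `json:"version"`
  } `json:"packages"`
}

func readNpmInventory(dir string) (map[string]string, error) {
  raw, err := os.ReadFile(filepath.Join(dir, "package-lock.json"))
  if err != nil {
    return nil, fmt.Errorf("no package-lock.json found in %s", dir)
  }

  var lf npmLockfile
  if err := json.Unmarshal(raw, &lf); err != nil {
    return nil, err
  }

  deps := make(map[string]string)
  for path, pkg := range lf.Packages {
    if path == "" {
      continue // the root project entry has an empty key
    }
    deps[strings.TrimPrefix(path, "node_modules/")] = pkg.Version
  }
  return deps, nil
}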

The input/output descriptions also focus the agent on how to use the tool:

type InventoryInput struct {
  PackageNameFilter      []string `json:"packageNameFilter" jsonschema_description:"Optional
                                  list of complete package names to look up in our list of
                                  dependencies. e.g. 'go.opentelemetry.io/otel/trace'"`
  PackageEcosystemFilter []string `json:"packageEcosystemFilter" jsonschema_description:
                                  "Optional list of ecosystems to filter packages by in our
                                  list of dependencies, eg 'gomod' or 'npm'"`
  Directory              string   `json:"directory" jsonschema_description:"Optional
                                  directory to run Inventory in"`
}

This output can be fed into the js-update tool and sanity checked, producing a soft or hard failure as discussed below.

func (t *UpdateTool) Execute(ctx context.Context, input Input) (Output, error) {
  requestedPkgVersion, err := semver.NewVersion(input.PackageVersion)
  if err != nil {
    // soft failure: the agent can retry with a real version or omit it entirely
    return Output{}, fmt.Errorf("Provide a valid semantic version or omit to find " +
      "the ideal version automatically")
  }
  // ...
}

Principle 2: Fail Hard or Soft, Instead of Being Wrong

The opportunity to establish hard and soft failures is what allows EdgeBit to be consistently successful with one-shot dependency updates and the code maintenance needed to adapt to API changes. The last thing we want is to be wrong about fixing a security vulnerability or to cause impact to your app – we’d rather fail and exit.

Hard Failures: Exiting the Bounding Box

Our mental model for failures is to establish a bounding box that represents our problem space and fail when we’d exit this area. Note that our problem space is not just “dependency updates” but “dependency updates for this specific app”.

If we can’t calculate an update graph that fulfills all of the version ranges for your dependencies, we must fail and communicate that to the user or make more drastic changes to the constraints.

A bad scenario is letting the agent brute force its way into the update by guessing that version 1.2.3 is ok because lots of other apps in its training data have updated to it.

Agent  Info   Let's update the lockfile to 1.2.3
Tool   Info   Wrote package-lock.json

// whoa, this is a bad practice. no thank you.

Going from bad to worse, mutating a lockfile directly is 1) a bad practice and 2) can impact the app when 1.2.3 can’t actually be installed at build or deploy time. Now we’ve broken the app, ensured a human has to scramble to fix our mess and eradicated all the trust we’d built up.

Instead, we can simply log a fatal error as we exit the bounding box of valid updates for this app. This sets us up to try again or flag the update for review.
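
A sketch of what that check can look like, again leaning on the Masterminds semver package; the function name, constraint handling and the choice to flag for review are illustrative rather than our exact implementation:

package tools

import (
  "log"

  "github.com/Masterminds/semver/v3"
)

// Hypothetical sketch: hard-fail when a proposed version falls outside the
// ranges declared in the app's own manifests, rather than letting the agent
// hand-edit a lockfile into a state the package manager will reject.
func assertInsideBoundingBox(proposed string, declaredRanges []string) {
  v, err := semver.NewVersion(proposed)
  if err != nil {
    log.Fatalf("proposed version %q is not valid semver", proposed)
  }
  for _, r := range declaredRanges {
    c, err := semver.NewConstraint(r)
    if err != nil {
      log.Fatalf("could not parse version range %q", r)
    }
    if !c.Check(v) {
      // Exiting the bounding box of valid updates for this app: stop here
      // and flag the update for review instead of guessing.
      log.Fatalf("version %s does not satisfy declared range %q", proposed, r)
    }
  }
}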

Soft Failures: A Gentle Nudge

Specialized tools can send back soft failures for the agent to try again or find a different path. A good example of this is version numbers. Our tools typically take in a desired package name or name + version.

Many times the LLM wants to pass the version latest when we expect an actual version. Passing back a soft failure hints that a real version is required and that, if the agent doesn’t have one in its context, it should use another tool to find it:

Agent  Info   Let's use the js-update tool to update foo
Tool   Info   Update to foo@latest
Tool   Error  Provide a valid version or omit to find the ideal version automatically
Agent  Info   Hmm, looks like we need to omit the version
Tool   Info   Update to foo

This gentle nudge means the agent will try to use our tool again and stay on the happy path.

We’ll Never Reach Our Goal

A different type of hard failure is when we detect that we’ll never reach our overall goal. This requires a goal that can be programmatically verified outside of the LLM.

For a dependency update, the update tool checks our metadata to verify that len(update.Changes) > 0, i.e. that we actually updated something. If it is 0, no amount of running research tools, analyzing symbols, etc. will allow us to succeed, so we exit. This is primarily a time and cost saving, and it lets us try again.

if len(update.Changes) == 0 {
	log.Fatal("No changes found in proposed update")
}

Principle 3: Persistence, or When Being Too Determined Backfires

LLMs are extremely persistent, especially when you instruct them to reason and research. We’ve seen them get caught in loops, churning through tokens, trying in vain to accomplish a goal. Here are two examples of how our focused tools and hard/soft failures mostly prevent these loops.

Installing Node

EdgeBit works best when we control how npm is invoked and whether node is run, so we don’t provide them in the container where the exec tool runs. The agent is extremely persistent in attempting to install them:

Tool   Info   exec `npm run`
Tool   Error  "npm" executable file not found in $PATH: unknown.
Agent  Info   Hmm, looks like we need to try a different way
Tool   Info   exec `node index.js`
Tool   Error  "node" executable file not found in $PATH: unknown.
Agent  Info   Hmm, let's install node and npm
Tool   Info   exec `apt-get install`
Tool   Error  "apt-get" executable file not found in $PATH: unknown.
Agent  Info   Hmm, let's try another way
Tool   Info   exec `curl -L https://deb.nodesource.com/setup_18.x -o nodesource_setup.sh`
Tool   Info   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Tool   Info                                     Dload  Upload   Total   Spent    Left  Speed
Tool   Info   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
Tool   Info   exec `bash nodesource_setup.sh`
Tool   Error  exit(1):
Agent  Info   Hmm, something went wrong, let's move on

// blind run of script from the internet
// disk polluted
// good chance that a later `git commit` will pick up nodesource_setup.sh
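
Our exec tool sidesteps this detour largely by construction: commands run inside a minimal, pinned container that simply doesn’t ship node, npm or apt-get. A sketch of that wrapper (the image name and mount layout are hypothetical, not our exact setup):

package tools

import (
  "context"
  "os/exec"
)

// Hypothetical sketch of the exec tool's sandbox: the image deliberately
// omits node, npm and apt-get, so "let me just install it" detours fail fast
// instead of polluting the workspace.
func runInSandbox(ctx context.Context, workdir, command string) (string, error) {
  cmd := exec.CommandContext(ctx, "docker", "run", "--rm",
    "-v", workdir+":/src",
    "-w", "/src",
    "exec-sandbox:pinned", // hypothetical minimal image
    "sh", "-c", command)
  out, err := cmd.CombinedOutput()
  return string(out), err
}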

Researching GitHub Projects

Another common loop was our agent researching various locations on GitHub for info about breaking changes. In most cases it paged through tons of results, churning through $1+ of tokens. If it guessed or hallucinated the project name incorrectly, it would then fall back to scouring NPM and other sources.

We exposed a project research tool that simply wraps our SaaS platform’s source intelligence service. This was a huge win: we already had this data, focusing the agent was easy with a good tool description, and no tokens were burned paging through search results. If we can’t find intel on the requested package, that is a soft failure point that guides the agent instead of letting it misdiagnose a 404.

Here’s how minimal a tool can be while providing the agent a ton of direction:

func (t *ResearchTool) Description() string {
	return "Use as the only source to research a package's release notes to understand " +
		"upgrade instructions and breaking changes. Compare contents with the impacted " +
		"call sites to verify missing breaking changes or find unmentioned breaking changes."
}

func (t *ResearchTool) Execute(ctx context.Context, input Input) (Output, error) {
  //…
  err := utils.PostJSON(ctx, apiURL, headers, payload, &intelResp)
  //…
  releaseURL := parseResponse(&intelResp)
  err = utils.FetchJSON(ctx, releaseURL, nil, &notesResp)
  //…
  return Output{
    ReleaseNotes: notesResp["body"].(string),
  }, nil
}

Building One-Shot Agents

With these principles, automated code maintenance becomes possible, safe and highly effective for engineering teams managing app dependencies and fixing security vulnerabilities. Developers can eliminate manual effort, minimize errors and ship patches faster.

Other code maintenance tasks are ripe for one-shot automation. Identifying areas that can use focused tools, handle errors intelligently and harness the persistence of the agent will unlock massive efficiency gains for software engineering teams, especially by removing the tasks they hate so they can focus on their core mission and the work that is fun.
