EdgeBit is a security platform that helps application engineering teams find and fix security vulnerabilities. The Dependency Autofix feature contains an extremely accurate reachability engine to identify impact to your app. Most updates have no impact, so engineers using EdgeBit can focus their effort on the impactful upgrades. This translates to more time spent “on-mission” instead of managing dependencies.
This post covers how to identify areas that can 1) use focused tools, 2) handle errors smartly and 3) harness the persistence of an AI agent to unlock massive efficiency gains, as we have done for Dependency Autofix, with some data to back it up.
What is a One-Shot Agent
A “one-shot” AI agent executes a complex task automatically, without a human in the loop.
EdgeBit’s Dependency Autofix is built around one-shot code maintenance workflows - no human input is required. Unlike the typical AI-in-the-IDE experience, where a developer manually accepts or rejects changes, our AI agents handle updates autonomously with a high degree of confidence.
Our confidence comes from three sources:
- static analysis that deeply understands how your app uses its dependencies
- calculating and executing dependency updates
- agentic workflow that is consistent and correct (what this post is about)
Confidence is extremely important because a one-shot agent must either do the task correctly or bail out before it does damage and ask a human for review. Since we want to keep engineers focused “on-mission”, we don’t want to call for review often.
Agent vs Pipeline
Prior to introducing an agentic workflow for updating dependencies, our workflow was pipeline-based. This meant it was fairly deterministic since it’s given a concrete list of inputs and proceeds linearly.
Experiments with an unrestricted and fully agentic workflow for automated dependency updates yielded a gain in “fuzziness” around inputs (what to update) and outputs (adapt code to API changes), but this was offset by a lot of chaos in the middle.
Our goal was to keep as much of the determinism from the pipeline as possible, while benefiting from the desired fuzziness. We outline the three key principles that helped us achieve this, supported by data to validate our approach.
Our agent is built with a framework that produces a single binary, with all of the logic, system prompting and tools embedded within it; certain tools execute Docker containers for isolation and repeatability. This is ideal for running within customer infrastructure or within the ephemeral VMs of our SaaS.
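The exact interfaces are internal to our framework, but conceptually each tool is a self-describing unit compiled into that single binary. Here is a minimal sketch of that shape, with illustrative names (concrete tools use typed inputs and outputs, as shown later in this post):

package agent

import "context"

// Tool is a sketch of a self-describing unit embedded in the agent binary.
// Name and Description are surfaced to the model; Execute does the real work,
// often by running a Docker container for isolation and repeatability.
type Tool interface {
	Name() string
	Description() string
	Execute(ctx context.Context, rawInput []byte) ([]byte, error)
}

// registry holds every tool compiled into the binary (illustrative).
var registry = map[string]Tool{}

func register(t Tool) { registry[t.Name()] = t }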
Here’s an example task prompt:
Update javascript dependency react-redux to 9.2.0. Update first, then if there is potential impact to the app, adapt code to changes in the most minimal way, in a separate commit.
Read on to learn how we achieved parity with our pipeline-based approach with a one-shot agent. But first, some evidence that a focused agent works.
Agent Consistency & Correctness
We ran 10 iterations of this prompt, upgrading a package in 3 differently sized codebases, to push the boundaries of consistency and correctness.
The Basic App is very simple, but the upgrade brought in a complete rewrite of the library, although it remained API-compatible. This tests whether the agent will get distracted by lots of churn.
The Web App required call site mutation due to a CommonJS to ESM migration between the major versions.
The Complex Webapp was a very large codebase that required a library update but no call site mutation, although 16 call sites called into the updated library. This allowed us to test the model’s tendency to stay in its lane, making only the required changes.
One-Shot vs Claude Code
The web app with Redis test case was also benchmarked against Claude Code by simulating a one-shot workflow: manually agreeing to every change it wanted to make.
As you can see, Claude Code was very inconsistent and had issues with almost every run. Incomplete runs typically didn’t update the lockfile along with the package.json, or included Yarn metadata in our NPM project’s package.json.
In a few code mutation runs, Claude modified details about how our Redis sessions were namespaced, which would break expectations in the app.
When results matter, the correct context costs more. Running the best tools for the job costs more. When working on a one-shot workflow, we always prefer a correct result over almost anything else.
Principle 1: Focused Tools Work Better Than Generic Ones
Focused tools naturally have boundaries to their problem space. This allows them to exit early before wasting tons of tokens or, if conditions won’t allow them to be successful, return a meaningful error. This is really important feedback to the agent, vs having it interpret generic failures and spiral off into empty space.
Without this, a frequent failure mode is for two very generic tools like `exec` and `fs_put_file` to cause a loop of chaos that, when simplified, looks like this:
Agent Info → Tool Info → Tool Error → Agent Info → Tool Info → Tool Info → Tool Error → Agent Info → Tool Info → Tool Info → Tool Error → Agent Info → …
The solution is to provide tools with helpful descriptions that wrap your common tasks: building, testing, updating, downloading, etc.
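For instance, a build tool can wrap the project's own build script and translate failures into feedback the agent can act on. A hypothetical sketch, where the js-build name, types and error text are all illustrative (and in practice the command would run inside a pinned container):

package agent

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// BuildInput and BuildOutput are illustrative shapes for a focused build tool.
type BuildInput struct {
	Directory string `json:"directory" jsonschema_description:"Directory containing the project to build"`
}

type BuildOutput struct {
	Log string `json:"log"`
}

// JSBuildTool wraps the project's own build command instead of handing the
// agent a generic shell.
type JSBuildTool struct{}

func (t *JSBuildTool) Description() string {
	return "Builds the JavaScript project with its existing build script. " +
		"Use after an update to verify the project still compiles."
}

func (t *JSBuildTool) Execute(ctx context.Context, input BuildInput) (BuildOutput, error) {
	if _, err := os.Stat(filepath.Join(input.Directory, "package.json")); err != nil {
		// Soft failure: a meaningful message the agent can act on, instead of
		// a generic exec error for it to misinterpret.
		return BuildOutput{}, fmt.Errorf("no package.json found in %q; run the inventory tool to locate the project root", input.Directory)
	}
	cmd := exec.CommandContext(ctx, "npm", "run", "build")
	cmd.Dir = input.Directory
	out, err := cmd.CombinedOutput()
	if err != nil {
		return BuildOutput{}, fmt.Errorf("build failed: %s", out)
	}
	return BuildOutput{Log: string(out)}, nil
}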
Tools Reinforcing Workflows
Focused tools also enable the use of well-proven libraries in places where models frequently fail, like semantic version comparisons. The LLM will reliably declare that 9.1.1 is already greater than 9.2.2, so the update is already installed.
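Delegating that comparison to a proven library removes the failure mode in one line. A small sketch, assuming the Masterminds semver package (the same `semver.NewVersion` call shows up in the update tool below):

package agent

import (
	"fmt"

	"github.com/Masterminds/semver/v3"
)

// alreadySatisfied reports whether the installed version is at or above the
// requested one, using a real semver library instead of the model's guess.
func alreadySatisfied(installed, requested string) (bool, error) {
	iv, err := semver.NewVersion(installed)
	if err != nil {
		return false, fmt.Errorf("installed version %q is not valid semver", installed)
	}
	rv, err := semver.NewVersion(requested)
	if err != nil {
		return false, fmt.Errorf("requested version %q is not valid semver", requested)
	}
	// 9.1.1 is less than 9.2.2, so the update is NOT already installed.
	return !iv.LessThan(rv), nil
}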
Providing just two tools can nudge the agent to anchor its logic in ground truth:
- executing an update: `js-update`
- understanding our dependencies: `inventory`
Let’s explore a brief example of how these tools work together.
The inventory tool hints in its description that it’s useful both pre- and post-update, which guides its use. It understands that lock files and similar metadata should be used as the source of truth. A nice side effect is that these file reads now take the expected nanoseconds, and you’re not wasting tokens interpreting files via the model.
func (t *InventoryTool) Description() string {
	return "Provides a list of all dependencies in the codebase, optionally filtered " +
		"by package name or ecosystem. Useful to research versions before an upgrade " +
		"or to verify an upgrade has been applied."
}
The input/output descriptions also focus the agent on how to use the tool:
type InventoryInput struct {
	PackageNameFilter      []string `json:"packageNameFilter" jsonschema_description:"Optional list of complete package names to look up in our list of dependencies, e.g. 'go.opentelemetry.io/otel/trace'"`
	PackageEcosystemFilter []string `json:"packageEcosystemFilter" jsonschema_description:"Optional list of ecosystems to filter packages by in our list of dependencies, e.g. 'gomod' or 'npm'"`
	Directory              string   `json:"directory" jsonschema_description:"Optional directory to run Inventory in"`
}
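These descriptions only help if they actually reach the model. One way to surface them, assuming a reflector like github.com/invopop/jsonschema is used to read these tags, is to render the input struct as the tool's JSON schema:

package agent

import (
	"encoding/json"

	"github.com/invopop/jsonschema"
)

// inventorySchema renders InventoryInput as a JSON schema so the field
// descriptions above become part of the tool definition the model sees.
// (Sketch; assumes the jsonschema_description tags are honored by the
// invopop/jsonschema reflector.)
func inventorySchema() ([]byte, error) {
	schema := jsonschema.Reflect(&InventoryInput{})
	return json.MarshalIndent(schema, "", "  ")
}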
This can be fed into the `js-update` tool and sanity checked with a soft or hard failure, as discussed below.
func (t *UpdateTool) Execute(ctx context.Context, input Input) (Output, error) {
	requestedPkgVersion, err := semver.NewVersion(input.PackageVersion)
	if err != nil {
		// hard failure
		return Output{}, fmt.Errorf("Provide a valid semantic version or omit to find the ideal version automatically")
	}
	// ...
}
Principle 2: Fail Hard or Soft, Instead of Being Wrong
The opportunity to establish hard and soft failures is what allows EdgeBit to be consistently successful with one-shot dependency updates and code maintenance to adapt to new API changes. The last thing we want is to be wrong about fixing a security vulnerability or causing impact to your app – we’d rather fail and exit.
Hard Failures: Exiting the Bounding Box
Our mental model for failures is to establish a bounding box that represents our problem space and fail when we’d exit this area. Note that our problem space is not just “dependency updates” but “dependency updates for this specific app”.
If we can’t calculate an update graph that fulfills all of the version ranges for your dependencies, we must fail and communicate that to the user or make more drastic changes to the constraints.
A bad scenario is letting the agent brute force its way into the update by guessing that version 1.2.3 is ok because lots of other apps in its training data have updated to it.
Agent Info → Tool Info // whoa, this is a bad practice. no thank you.
Going from bad to worse, mutating a lockfile directly is 1) a bad practice and 2) can impact the app when 1.2.3 can’t actually be installed at build or deploy time. Now we’ve broken the app, ensured a human has to scramble to fix our mess and eradicated all the trust we’d built up.
Instead, we can simply log a fatal error as we exit the bounding box: valid updates for this app. This sets us up to try again or flag this update for review.
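A sketch of that exit, with a hypothetical resolver and error type standing in for our real update-graph calculation:

package agent

import (
	"errors"
	"log"
)

// ErrUnsatisfiable is an illustrative sentinel returned when no update graph
// satisfies every declared version range in the app.
var ErrUnsatisfiable = errors.New("no update graph satisfies all version ranges")

// UpdatePlan is an illustrative stand-in for a fully resolved set of changes.
type UpdatePlan struct {
	Changes []string
}

// resolveGraph stands in for the real constraint solver.
func resolveGraph(pkg, version string) (*UpdatePlan, error) {
	// ... real resolution happens here ...
	return nil, ErrUnsatisfiable
}

// planUpdate exits the bounding box the moment the constraints can't be met,
// rather than letting the agent guess its way past the resolver.
func planUpdate(pkg, version string) (*UpdatePlan, error) {
	plan, err := resolveGraph(pkg, version)
	if errors.Is(err, ErrUnsatisfiable) {
		// hard failure: exit fatally so the update is flagged for review
		log.Fatalf("cannot update %s to %s: %v", pkg, version, err)
	}
	return plan, err
}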
Soft Failures: A Gentle Nudge
Specialized tools can send back soft failures for the agent to try again or find a different path. A good example of this is version numbers. Our tools typically take in a desired package name or name + version.
Many times the LLM wants to pass the version `latest` when we expect an actual version. Passing back a soft failure hints that a real version is required and, if it doesn’t have one in its context, that it should use another tool to find it:
Agent Info → Tool Info → Tool Error → Agent Info → Tool Info
This gentle nudge means the agent will try to use our tool again and stay on the happy path.
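The check behind that nudge is tiny. A hypothetical sketch, where the function name and hint text are illustrative and how the hint reaches the model (as the tool's response) is left to the framework:

package agent

// latestHint is a sketch of the soft-failure path: if the model asks for
// "latest", send back a hint that a real version is required and where to
// find one, rather than aborting the run.
func latestHint(requested string) (hint string, ok bool) {
	if requested == "" || requested == "latest" {
		return "a concrete semantic version is required; use the inventory tool to find " +
			"the installed version, or omit the version to find the ideal one automatically", false
	}
	return "", true
}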
We’ll Never Reach Our Goal
A different type of hard failure is when we detect that we’ll never reach our overall goal. This requires a goal that can be programmatically verified outside of the LLM.
For a dependency update, the update tool checks our metadata for `len(update.Changes) > 0`, which verifies that we actually updated something. If that count is 0, no amount of running research tools, analyzing symbols, etc. will allow us to be successful, so we exit. This is primarily a time and cost savings, so we can try again.
if len(update.Changes) == 0 {
	log.Fatal("No changes found in proposed update")
}
Principle 3: Persistence, or When Being Too Determined Backfires
LLMs are extremely persistent, especially when you instruct them to reason and research. We’ve seen them get caught in loops, churning through tokens, trying in vain to accomplish a goal. Here are two examples of how our focused tools and hard/soft failures mostly stop these loops from happening.
Installing Node
EdgeBit works best when we control how `npm` is invoked and whether `node` is run, so we don’t provide them in the container where the `exec` tool runs. The agent is extremely persistent in attempting to install them anyway:
Tool Info → Tool Error → Agent Info → Tool Info → Tool Error → Agent Info → Tool Info → Tool Error → Agent Info → Tool Info → Tool Info → Tool Info → Tool Info → Tool Info → Tool Error → Agent Info
// blind run of a script from the internet
// disk polluted
// good chance that a later `git commit` will pick up nodesource_setup.sh
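One way to cut a loop like this short is to have the `exec` tool itself recognize the attempt and answer with a soft failure. A hypothetical sketch, where the blocked commands and message are illustrative:

package agent

import (
	"fmt"
	"strings"
)

// guardExec is a pre-flight check for the exec tool: if the agent tries to
// install or run node/npm directly, return a soft failure that points it back
// at the focused tools instead of letting it run install scripts pulled from
// the internet.
func guardExec(command string) error {
	lowered := strings.ToLower(command)
	for _, blocked := range []string{"nodesource", "apt-get install nodejs", "npm ", "node "} {
		if strings.Contains(lowered, blocked) {
			return fmt.Errorf("node and npm are not available in this container; " +
				"use the js-update tool to change dependencies instead")
		}
	}
	return nil
}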
Researching GitHub Projects
Another common loop was our agent researching various locations on GitHub for info about breaking changes. In most cases it was paging through tons of results and churning through $1+ of tokens. If it guessed or hallucinated the project name incorrectly, it would then fall back to scouring NPM and other sources.
We exposed a project research tool that simply wraps our SaaS platform’s source intelligence service. This was a huge win: we already had this data, focusing the agent was easy with a good tool description, and no tokens were burned in the process. If we can’t find intel on the requested package, this is a soft failure point that can guide the agent instead of letting it misdiagnose a 404.
Here’s how minimal a tool can be while providing the agent a ton of direction:
func (t *ResearchTool) Description() string {
	return "Use as the only source to research a package's release notes to understand " +
		"upgrade instructions and breaking changes. Compare contents with the impacted " +
		"call sites to verify missing breaking changes or find unmentioned breaking changes."
}

func (t *ResearchTool) Execute(ctx context.Context, input Input) (Output, error) {
	// ...
	err := utils.PostJSON(ctx, apiURL, headers, payload, &intelResp)
	// ...
	releaseURL := parseResponse(&intelResp)
	err = utils.FetchJSON(ctx, releaseURL, nil, &notesResp)
	// ...
	return Output{
		ReleaseNotes: notesResp["body"].(string),
	}, nil
}
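The soft failure for missing intel stays just as small. A sketch of the check, with illustrative names:

package agent

import "fmt"

// requireIntel turns a missing source-intelligence record into a soft failure
// the agent can act on, instead of a bare 404 it is likely to misdiagnose.
func requireIntel(pkg, releaseURL string) error {
	if releaseURL == "" {
		return fmt.Errorf("no release intel found for %q; verify the exact package name "+
			"with the inventory tool before researching again", pkg)
	}
	return nil
}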
Building One-Shot Agents
With these principles, automated code maintenance becomes possible, safe and highly effective for engineering teams managing app dependencies and fixing security vulnerabilities. Developers can eliminate manual effort, minimize errors and apply updates faster.
Other code maintenance tasks are ripe for one-shot automation. Identifying areas that can use focused tools, handle errors smartly and harness the persistence of the agent will unlock massive efficiency gains for software engineering teams, especially by removing tasks they hate so they can focus on their core mission and the things that are fun.