AI made one thing very cheap, very fast: answers.
That sounds like pure upside. In practice, it shifts the bottleneck. If almost anyone can generate an answer in seconds, the advantage moves to the person who frames the problem better than everyone else.
I keep seeing this in agentic projects. Teams obsess over model choice, but ship weak outcomes because they asked weak questions.
For builders, the day-to-day reality is already obvious: generating answers is cheap, fast, and easy to scale. “Generate text” is no longer scarce. It is becoming infrastructure.
So if answers are abundant, what stays scarce?
Useful questions.
Prompt engineering grew into context engineering
For a while, “prompt engineering” mostly meant wording tricks.
That still matters, but it is now a small slice of the real work. In production, what you are really doing is context engineering: designing the full information and constraint environment the model operates in.
I kept bumping into this in my own tests: the same intent, phrased a bit differently or run with slightly different context, can produce noticeably different outputs.
The practical implication is bigger than wording:
- Old bottleneck: “Can we get an answer at all?”
- New bottleneck: “Can we shape context so answers are reliable enough to act on?”
That is mostly a systems-thinking problem, not a model-access problem.
In agentic workflows, bad questions compound
With a chatbot, a vague question gives you a vague answer. Annoying, but manageable.
With agents, a vague question becomes a vague plan, then a vague execution, then a confident-looking output log. The error compounds over steps.
In my own experiments on messy codebase tasks, I tested the same kind of ticket in two setups:
- Setup A: short prompt, broad goal, minimal constraints.
- Setup B: same model, but with a context pack (repo map, constraints, output format) plus a small eval checklist.
I quantify this with one blunt metric: manual corrections per 10 runs.
As a practical benchmark, if Setup A needs around 7 manual corrections per 10 runs and Setup B around 2, Setup B is the better production choice even if its initial setup takes longer.
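To make the metric concrete, here is a minimal sketch. The run logs and numbers below are hypothetical illustrations of the benchmark, not data from the experiment:

```python
def corrections_per_10_runs(needed_fix: list[int]) -> float:
    """Normalize manual-correction logs to a per-10-runs rate.

    needed_fix: one entry per run, 1 if the run needed a manual fix.
    """
    if not needed_fix:
        raise ValueError("need at least one logged run")
    return 10 * sum(needed_fix) / len(needed_fix)

# Hypothetical run logs for the two setups described above.
setup_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]  # short prompt, broad goal
setup_b = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]  # context pack + eval checklist

print(corrections_per_10_runs(setup_a))  # 7.0
print(corrections_per_10_runs(setup_b))  # 2.0
```

The point of a blunt metric like this is that it is cheap to log and hard to argue with: either a human touched the output or they did not.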
What failed in Setup A wasn’t model intelligence. It was context quality.
Replace prompt tips with context + eval loops
This is where many articles stay too generic. “Write better prompts” is directionally true, but weak operational advice.
What actually helps is pairing context engineering with evals.
What changed my results was simple: define success criteria first, then test context variants against the same eval set.
Here is the loop I find most useful:
1) Define the decision, not just the task
Specify the outcome you need and what “wrong” would cost.
2) Build a small eval set from real work
Use 20-30 representative examples from your own tasks, including edge cases.
3) Design a context pack
Provide allowed tools, trusted data sources, hard constraints, and required output schema.
4) Score runs on reliability, not style
Track task success, constraint compliance, grounding, and schema-valid outputs.
5) Compare context variants before model upgrades
Test context changes first. Upgrade models only when eval ceilings flatten.
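The loop above can be sketched in a few dozen lines. Everything here is an illustrative shape, not a standard API: `ContextPack`, `schema_valid`, and `eval_pass_rate` are names I made up, and the model call is a stub you would replace with your own:

```python
from dataclasses import dataclass

@dataclass
class ContextPack:
    goal: str                # step 1: the decision, not just the task
    constraints: list[str]   # step 3: hard rules the output must respect
    sources: list[str]       # step 3: trusted data the model may use
    output_schema: dict      # step 3: required key -> expected Python type

def schema_valid(pack: ContextPack, output: dict) -> bool:
    """Step 4: check schema compliance, not style."""
    return all(k in output and isinstance(output[k], t)
               for k, t in pack.output_schema.items())

def eval_pass_rate(pack: ContextPack, model_fn, eval_set: list[dict]) -> float:
    """Step 4: fraction of eval cases (step 2) that are schema-valid
    AND pass the case's own success check (task success, not fluency)."""
    passed = 0
    for case in eval_set:
        output = model_fn(pack, case["input"])
        if schema_valid(pack, output) and case["check"](output):
            passed += 1
    return passed / len(eval_set)
```

For step 5, run the same `eval_set` with the same `model_fn` against two different packs and compare pass rates; only reach for a bigger model when the better pack's score stops improving.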
Why good context engineering reduces model dependence
This is the part many teams miss: model capability absolutely matters, but it tends to matter less once context quality is high.
When context is clear, grounded, and constrained:
- weaker models often become “good enough” for many workflow steps,
- output variance drops,
- and upgrades between top-tier models produce smaller gains than expected.
When context is vague, even strong models drift.
So yes, model selection still matters. But for most business workflows, I would prioritize context quality first, then choose the cheapest model that meets the reliability bar.
A practical shift teams can make this week
If you’re leading AI automation work, try this one operational change:
- Pick one recurring workflow and gather a small eval set from recent tasks.
- Create two context variants: current vs tightened (constraints, sources, output schema).
- Run both on the same eval set with the same model.
- Review where failures happen: wrong decision, policy miss, or output-format break.
- Keep the cheaper setup that clears your reliability bar.
Most teams find quality gains before touching the model stack.
That does not mean models do not matter. They do. But the market is making model access cheaper every quarter. Your context architecture is much harder for competitors to copy.
Final thought
We may be entering a phase where “having AI” is normal and “engineering context better than others” is the real moat.
If that holds, the highest-leverage AI skill is not writing longer prompts. It is defining the right objective, boundaries, evidence, and evals so the model can actually do useful work.
And honestly, that is still very human work.