Testing the limits of generative AI

As part of the learning curve with AI and LLMs, experiment all you want, but take the results with some skepticism, especially if you’re using them to write your code.

Contributor, InfoWorld

To get the most from generative AI (GenAI) tools like GitHub Copilot or ChatGPT, Datasette founder Simon Willison argues you have to be prepared to accept four conflicting opinions. Namely, that AI is neither our salvation nor our doom, neither empty hype nor the solution to everything. It’s none of these and all of them at the same time. As Willison puts it, “The wild thing is that all of these [options] are right!”

Maybe. It all depends on how you use GenAI tools and what you expect. If you’re expecting a code generation assistant like AWS CodeWhisperer to produce perfect code that you can accept and use wholesale without change, you’re going to be disappointed. But if you’re using these tools to complement developer skills, well, you just might be in for a very positive surprise.

“Paranoid Android”

The problem is that too many businesses have bought into the hype and expect GenAI to be a magical cure for their problems. As Gartner analyst Stan Aronow highlights, a recent Gartner survey found that “nearly 70% of business leaders believe the benefits [of GenAI] outweigh the risks, despite limited understanding of precise generative AI applicability and risks.” If your business strategy boils down to, “It sounded cool on Twitter,” you deserve the hurt coming your way.

Speaking of large language models (LLMs), Willison says, “It feels like three years ago, aliens showed up on Earth, handed us a USB stick with this thing on, and then departed. And we’ve been poking at it ever since and trying to figure out what it can do.” We know it’s important and we can sense some of the boundaries of what AI, generally, and LLMs, specifically, can do, but we’re still very much in trial-and-error mode.

The problem (and opportunity) of LLMs, Willison continues, is that “you very rarely get what you actually asked for.” Hence the advent of prompt engineering as we fiddle with ways to get the LLMs to yield more of what we do want and less of what we don’t. “Occasionally, someone will find that if you use this one little trick, suddenly this whole new avenue of abilities opens up,” he notes.

We’re all currently searching for that “one little trick,” which brings me to programming.

Everything in the right place

Some suggest that coding assistants will be a huge asset for unskilled developers. That could eventually be true, but it’s not true today. Why? There’s no way to adequately trust the output of an LLM without having sufficient experience to gauge its results. Willison says, “Getting the best results out of them actually takes a whole bunch of knowledge and experience. A lot of it comes down to intuition.”

There are coding hacks that developers will figure out through experimentation, but other areas simply aren’t a good fit for GenAI at the moment. O’Reilly Media’s Mike Loukides writes, “We can’t get so tied up in automatic code generation that we forget about controlling complexity.” Humans, while imperfect at limiting complexity in their code, are better positioned to do it than machines. For example, a developer can’t really prompt an LLM to reduce the complexity of their code because it’s not clear what that would mean. Reducing lines of code? “Minimizing lines of code sometimes leads to simplicity, but just as often leads to complex incantations that pack multiple ideas onto the same line, often relying on undocumented side effects,” Loukides says. Computers don’t care about complexity of code, but the humans who will need to debug it and understand it years later do.
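To make Loukides’ point concrete, here is a minimal sketch in Python. Both functions are hypothetical illustrations written for this article, not output from any particular tool. Each removes duplicates from a list while preserving order. The first is shorter, but it only works because `set.add()` returns `None`, a side effect buried in the middle of a boolean expression. The second takes more lines and says what it does.

```python
def dedupe_dense(items):
    """Fewer lines, but leans on set.add() returning None inside a condition."""
    seen = set()
    # `x in seen or seen.add(x)` is falsy only the first time x appears,
    # because seen.add(x) mutates the set and evaluates to None.
    return [x for x in items if not (x in seen or seen.add(x))]


def dedupe_clear(items):
    """More lines, one idea per line, no hidden side effects."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


if __name__ == "__main__":
    sample = [3, 1, 3, 2, 1]
    print(dedupe_dense(sample))
    print(dedupe_clear(sample))
```

The two versions return the same result, so a prompt to “reduce lines of code” would favor the first. Whoever has to debug it years later would probably prefer the second.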

This is all OK. We’re incredibly early in the evolution of AI, despite the fact that it has been around for decades. We in tech like to get ahead of ourselves. We act like cloud is an established norm when it’s still just 10% or so of all IT spending. AI, despite the flood of investment, doesn’t yet account for even 0.01%.

To Willison’s point, now is the time to test the different LLMs and their associated coding tools. Our goal shouldn’t be to see which one will do all of our work for us but rather to discover their strengths and weaknesses and to probe and prompt them until we know how to use their strengths and their failures to our advantage.

Matt Asay runs developer relations at MongoDB. The views expressed herein are Matt’s and do not reflect those of his employer.