How well can OpenAI’s o1-preview code? It aced my 4 tests – and showed its work in surprising detail

[Image: AI coding concept. sankai/Getty Images]

Usually, when a software company pushes out a major new release in May, they don’t try to top it with another major new release four months later. But there’s nothing usual about the pace of innovation in the AI business.

Although OpenAI dropped its new omni-powerful GPT-4o model in mid-May, the company has been busy. As far back as last November, Reuters published a rumor that OpenAI was working on a next-generation language model, then known as Q*. Reuters doubled down on that report in May, stating that Q* was being developed under the code name Strawberry.

Also: Why natural language AI scripting in Microsoft Excel could be a game changer

Strawberry, as it turns out, is actually a model called o1-preview, which is available now as an option to ChatGPT Plus subscribers. You can choose the model from the selection dropdown:

[Screenshot: the model selection dropdown. Screenshot by David Gewirtz/ZDNET]

As you might imagine, if there’s a new ChatGPT model available, I’m going to put it through its paces. And that’s what I’m doing here.

Also: How ChatGPT scanned 170k lines of code in seconds and saved me hours of work

The new Strawberry model focuses on reasoning, breaking down prompts and problems into steps. OpenAI showcases this approach through a reasoning summary that can be displayed before each answer.

When o1-preview is asked a question, it does some thinking and then displays how long it took to do that thinking. If you toggle the dropdown, you’ll see some reasoning. Here’s an example from one of my coding tests:

[Screenshot: o1-preview's reasoning dropdown from one of my coding tests. Screenshot by David Gewirtz/ZDNET]

It’s good that the AI knew enough to add error handling, but I find it interesting that o1-preview categorizes that step under “Regulatory compliance”.

I also discovered the o1-preview model provides more exposition after the code. In my first test, which created a WordPress plugin, the model provided explanations of the header, class structure, admin menu, admin page, logic, security measures, compatibility, installation instructions, operating instructions, and even test data. That’s a lot more information than was provided by previous models.

Also: The best AI for coding in 2024 (and what not to use)

But really, the proof is in the pudding. Let’s put this new model through our standard tests and see how well it works.

1. Writing a WordPress plugin

This straightforward coding test requires knowledge of the PHP programming language and the WordPress framework. The challenge asks the AI to write both interface code and functional logic, with a twist: instead of removing duplicate entries, it has to separate them so they're not next to each other.

The o1-preview model excelled. It presented the UI first as just the entry field:

[Screenshot: the plugin's entry field. Screenshot by David Gewirtz/ZDNET]

Once the data was entered and Randomize Lines was clicked, the plugin generated an output field with properly randomized data. You can see that Abigail Williams is duplicated and, in compliance with the test instructions, the two entries are not listed side by side:

[Screenshot: the randomized output data. Screenshot by David Gewirtz/ZDNET]

In my tests of other LLMs, only four of the 10 models passed this test. The o1-preview model completed this test perfectly.
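The article doesn't reproduce the plugin's actual PHP, but the core twist of the test, shuffling lines while keeping duplicates from landing next to each other, is easy to sketch. Here's a minimal illustration in Python (my own reconstruction of the logic, not the model's output; the function name and retry cap are assumptions):

```python
import random

def randomize_lines(lines):
    """Shuffle lines, re-shuffling until no duplicate sits next to its twin.

    A retry loop is the simplest approach and is fine for short lists like a
    plugin's textarea input; a greedy rearrangement would scale better.
    """
    lines = list(lines)
    for _ in range(1000):  # retry cap so an all-duplicate input can't loop forever
        random.shuffle(lines)
        if all(a != b for a, b in zip(lines, lines[1:])):
            return lines
    return lines  # give up: no valid arrangement found within the cap

names = ["Abigail Williams", "John Proctor", "Abigail Williams", "Mary Warren"]
print(randomize_lines(names))
```

A production plugin would also need the sanitization and error handling o1-preview added on its own; this sketch covers only the randomization step.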

2. Rewriting a string function

Our second test asks the AI to fix a buggy regular expression reported by a user. The original code was designed to check whether an entered number was a valid dollars-and-cents amount. Unfortunately, the code only allowed integers (so 5 was accepted, but not 5.25).
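The article doesn't show the actual code, and its language isn't stated, but the shape of the bug and the fix can be sketched in Python (the pattern names are mine, and the "buggy" pattern is a plausible reconstruction, not the user's real code):

```python
import re

# Hypothetical reconstruction of the bug: an integer-only pattern.
BUGGY = re.compile(r"^\d+$")              # accepts "5" but rejects "5.25"

# A fix that accepts dollars with optional cents (one or two decimal digits).
FIXED = re.compile(r"^\d+(\.\d{1,2})?$")  # "5", "5.2", and "5.25" all pass

for amount in ["5", "5.25", "5.2", "abc", "5.255"]:
    print(amount, bool(BUGGY.match(amount)), bool(FIXED.match(amount)))
```

The key change is making the fractional part optional rather than forbidden, while still capping it at two digits so "5.255" is rejected.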

Also: The most popular programming languages in 2024

The o1-preview model rewrote the code successfully, joining the four LLMs from my earlier tests in the winners' circle.

3. Finding an annoying bug

This test was created from a real-world bug I had difficulty resolving. Identifying the root cause requires knowledge of the programming language (in this case PHP) and the nuances of the WordPress API.

The error messages provided were not technically accurate: they referenced the beginning and the end of the calling sequence I was running, but the bug was in the middle of the code.

Also: 10 features Apple Intelligence needs to actually compete with OpenAI and Google

I wasn’t alone in struggling to solve the problem. Three of the other LLMs I tested couldn’t identify the root cause of the problem and recommended the more obvious (but wrong) solution of changing the beginning and ending of the calling sequence.

The o1-preview model provided the correct solution. In its explanation, the model also pointed to the WordPress API documentation for the functions I used incorrectly, providing an added resource to learn why it had made its recommendation. Very helpful.

4. Writing a script

This challenge requires the AI to integrate knowledge of three separate coding spheres: the AppleScript language, the Chrome DOM (how a web page is structured internally), and Keyboard Maestro (a specialty automation tool from a single programmer).

Also: 6 ways to write better ChatGPT prompts – and get the results you want faster

Answering this question requires an understanding of all three technologies, as well as how they have to work together.

Once again, o1-preview succeeded, joining the three other LLMs (out of 10 tested) that have solved this problem.

A very chatty chatbot

The new reasoning approach for o1-preview certainly doesn't diminish ChatGPT's ability to ace our programming tests. The output from my initial WordPress plugin test, in particular, functioned as a more sophisticated piece of software than what previous models produced.

Also: I’ve tested dozens of AI chatbots since ChatGPT’s debut. Here’s my new top pick

It's great that ChatGPT provides reasoning steps at the beginning of its work and some explanatory data at the end. However, the explanations can be chatty. I asked both models to write "Hello world" in C#, the canonical test program. This is how GPT-4o responded:

[Screenshot: GPT-4o's C# response. Screenshot by David Gewirtz/ZDNET]

And this is how o1-preview responded to the same test:

[Screenshot: o1-preview's C# response. Screenshot by David Gewirtz/ZDNET]

I mean, wow, right? That’s a lot of chat from ChatGPT. You can also flip the reasoning dropdown and get even more information:

[Screenshot: o1-preview's reasoning dropdown for the C# test. Screenshot by David Gewirtz/ZDNET]

All of this information is great, but it's a lot of text to filter through. I'd prefer a concise explanation, with the additional information tucked into dropdowns away from the main answer.

Yet ChatGPT's o1-preview model performed excellently. I look forward to seeing how well it works once it's integrated more fully with GPT-4o features such as file analysis and web access.

Have you tried coding with o1-preview? What were your experiences? Let us know in the comments below.


You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.




