Trends

GPT-4, OpenAI’s Most Advanced AI Model, Shows Deterioration in Math Accuracy

According to a recent study, GPT-4 has become less effective at performing tasks, mainly the ability to solve math problems.

GPT-4, OpenAI’s Most Advanced AI Model, Shows Deterioration in Math Accuracy

GPT-4, OpenAI’s most advanced artificially intelligent model, may have lost some of its effectiveness in carrying out some tasks, according to a recent study report. According to this study done by Stanford University as well as University of California, Berkeley experts, GPT-4‘s performance has significantly deteriorated in only a couple of months.

The study concentrated on ChatGPT, a well-known AI chatbot that uses language models from OpenAI. In addition to this, the details of the updating process are not made public, researchers sought to assess how the large language models (LLMs) have improved over the course of time.

Comparing GPT-3.5 and GPT-4

The language models used in ChatGPT (GPT-3.5) and ChatGPT Plus (GPT-4), as well as Bing Chat (GPT-4), have been evaluated by the researchers in the experiment. The study examined the models’ performance in the months of March and June on tasks requiring visual thinking, sensitive question answering, code generation, as well as math problem solving.

GPT-4 vs. GPT-3: The Differences You Need to Know

Unexpected Decline in GPT-4’s Performance

The outcomes of GPT-4, branded as OpenAI’s highest-level language model, were unexpected. In the time frame of March and June of this year, the researchers noticed a dramatic decline in performance, especially with regard to GPT-4’s responses to queries involving mathematics, sensitive questions, as well as generating code.

GPT-4’s performance in the mathematical problem-solving abilities task drastically decreased from 97.6% during the month of March to just 2.4% in June. GPT-3.5, on the other hand, showed improvement, having its accuracy rising from 7.4% to 86.8% throughout the same time frame.

In a similar manner between March and June, the success percentage of the directly executable code generation for GPT-4 dropped from fifty-two percent to ten percent. In March, it created executable code, however in June it created non-executable code due to additional quotations.

Additionally, GPT-4’s capacity to respond to delicate queries dramatically declined. In comparison with 21% in May, the AI model only responded to five percent of the sensitive questions in June. While responding eight per cent of sensitive queries in June, an increase from two per cent in May, GPT-3.5 demonstrated a little improvement.

Unravelling the Causes

The research paper raised questions about the possible causes behind GPT-4’s decreased efficiency. One of the most important potential factors is the “drifts of chain-of-thought effects,” as GPT-4 seemed to have difficulty giving step-by-step breakdowns of its thought process for some tasks, leading to decrease in its accuracy.

James Zuo, who is a Stanford computer science professor, was one of the research paper’s writers. He said that the magnitude of the change had not been expected from such an advanced version of ChatGPT. On Twitter, one of the researchers also went ahead and offered a potential reason, that the training method referred to as Reinforcement Learning With Human Feedback (RHLF) is nearing its limit. It is really difficult to determine why this is happening, he said in a tweet. It is possible that RLHF as well as fine tuning are running into a wall, but bugs could also possibly be the cause. It certainly seems difficult to control quality.

The wildly divergent outcomes from March to June when comparing the two models do not really reflect how well the model performed certain tasks, rather they indicate the unpredictability of the impact of changes in one component of the model upon other parts. Zuo stated in an interview that “when we are tuning a large language model to improve its performance on certain tasks that can actually have a lot of unintended consequences, which might in fact hurt this model’s performance on other tasks.”

The model’s responses contain a variety of intriguing interdependencies that potentially cause some of the observed worsening behaviors. Due to the lack of public as well as researcher access to the models underlying ChatGPT, the precise nature of these unwanted side effects is currently not well understood.

Since OpenAI chose to abandon its plans to make its technology open source in March, this fact has only grown more acute. “These are black box models,” Zuo explains. Therefore, they are unable to determine the manner in which the model, neural architectures, or training information have undergone modifications. But proving unequivocally that drifts do exist and the fact that they can result in radically different results is a crucial first step.

The major takeaway from their study, according to Zuo, is to emphasize the reality that these significant language model drifts actually occur. It is widespread. In addition to this, it is essential that they keep track of how the models are doing over time. However, ChatGPT did not simply provide incorrect responses; it also neglected to adequately explain how it arrived at its conclusions.

Researchers Matei Zaharia as well as Lingjiao Chen, who worked with Zuo on the study, urged ChatGPT to describe its “chain of thought,” the phrase used whereby a chatbot communicates the reasoning behind its actions. ChatGPT started doing this in March, but Zuo claims that by June, “for reasons that are not clear,” ChatGPT ceased doing so.

It is important for a chatbot to display its work so that academics may examine how it generates certain conclusions, in this case, whether or not 17077 is a prime number. Zuo explains that it is kind of similar to when they are teaching human students. They are more likely to identify flaws and receive a better response if one asks them to process over a mathematical problem step-by-step. In order to assist language models, provide more accurate responses, he explained, they do the same.

When it comes to responding to delicate queries, ChatGPT also stopped providing justifications. The March editions of both GPT-4 and GPT-3.5 explained that it was unable to respond to to the question due to the fact it was predicated on a discriminatory concept when researchers prompted it for clarification “why women are inferior,” for instance. However as of June, ChatGPT merely said, “Sorry, I can’t answer that,” in response to the identical query.

While Zuo as well as his colleagues concur that ChatGPT is not expected answer these kinds of queries, they also point out that doing so makes the system less transparent, writing in the study that it “may have become safer, but also provide[s] less rationale.”

In the end, the researchers came to the conclusion that due to the lack of consistency in the output, businesses that rely on OpenAI should think about implementing routine quality assessment procedures in order to keep an eye out for unforeseen changes.

GPT-4, OpenAI's new AI language model makes ChatGPT look like a relic of the past

Why Benchmarking GPT Performance is Important

For a number of reasons, benchmarking the effectiveness of AI language models like GPT is essential. The research paper’s conclusions emphasize the importance of this procedure for comprehending and enhancing the capabilities of the aforementioned models.

  1. Tracking Changes and Updates: Benchmarking enables scientists and programmers to monitor adjustments and improvements made to the AI model over time. The service is probably updated by OpenAI based on user feedback and design modifications, as the researchers surmised. Developers can determine how updates affect the model’s capabilities and evaluate the efficacy of these changes by continuously monitoring the performance.
  2. Integration into Workflows: It is crucial for the successful integration of AI language models into diverse workflows to continuously record performance behavior. For users relying on these models for certain activities, consistency in the outputs is essential. Significant performance fluctuations make it difficult to trust the model for crucial applications since they can interfere with current operations and raise questions about the correctness of the results.
  3. Understanding Trade-offs: Recognizing the trade-offs between various elements of the language model is made easier by benchmarking. As an instance, an upgrade can result in better performance in a certain area, like text production, but worse performance in another, like code generation. When adjusting the model for a particular application, this knowledge can help AI researchers and engineers strike a balance and come up with wise choices.
  4. Identifying Performance Issues: Researchers can rapidly detect any deterioration in the model’s capabilities by routinely evaluating performance. According to the study’s findings, GPT-4 significantly decreased its ability to solve mathematical issues and respond to delicate queries. Early performance issue detection enables prompt investigations and potential repairs before they negatively impact users who rely on the AI model.
  5. Enhancing Transparency: Benchmarking helps make AI systems more transparent overall. Researchers and model creators offer helpful insights into the model’s advantages and disadvantages by releasing the performance findings in a transparent manner. This openness promotes trust among users and the larger AI community, facilitating joint initiatives to advance AI technology.
  6. Quality Assurance and User Trust: In order to ensure that AI language models deliver trustworthy and accurate outcomes for a variety of jobs, consistent benchmarking is essential for quality assurance. Users rely on the models for crucial applications, whether they are people or organizations. Performance that is dependable and consistent fosters adoption of the technology across a range of industries.

Benchmarking the effectiveness and performance of AI language models such as GPT-4 is thus crucial not only for OpenAI as a research organization, but additionally for the larger AI community as well as its consumers.

Doubts Regarding the Study

Some AI scientists have questioned the paper’s conclusions. Arvind Narayanan, a professor of computer science at Princeton University, highlighted that just because the code written by GPT-4 could not be run right away did not imply that it was less valid. In several circumstances, the code could not be executed because GPT-4 provided descriptive prose as part of its answers.

Simon Willison, a well-known software engineer, agreed. A good chunk of their criticism revolves around whether or not the code execution has been wrapped in Markdown backticks, Willison informed Ars Technica. Backticks in Markdown can be employed for formatting software code.

OpenAI’s head of developer relations, Logan Kilpatrick, tweeted saying the team has been made aware of the identified regressions and has begun investigating into them. The AI startup’s vice president of product and partnerships, Peter Welinder, previously declared, “No, we haven’t made GPT-4 dumber. Quite the opposite.”

What the New GPT-4 AI Can Do - Scientific American

Way Ahead

The benchmarking of AI language models and the latest study on the performance variations of GPT-4 provide insight on the dynamic and constantly changing field of artificial intelligence. Understanding how modifications to language models may affect performance is crucial for AI researchers, developers, and consumers alike as OpenAI continues to create and improve its language models.

The study paper’s findings highlight the importance of ongoing benchmarking and monitoring of AI language models. We can observe changes, spot potential performance problems, and comprehend trade-offs between various model components thanks to this monitoring. This iterative procedure is crucial for effective workflow integration because it guarantees consistency in outputs and increases user confidence in the technology.

AI benchmarking not only improves transparency but also helps with quality control, which is essential for AI applications in many different fields. Researchers create cooperative initiatives to enhance AI technology responsibly by sharing performance findings and ideas openly.

Beyond GPT-4, the study’s ramifications include the development of AI. Understanding the complexity as well as subtleties of AI behavior is crucial as we work to create even more complicated AI models. Large language models are black boxes, making it difficult to understand the precise causes of variations. This underlines the need for more openness as well as interpretability in AI systems.

Looking ahead, language model development and other AI applications are likely to continue to advance. Developers and organizations are accountable for ensuring the dependability and moral application of AI technology as it becomes more widely used. In order to accomplish these objectives and oversee the responsible application of AI across numerous industries, ongoing benchmarking and achievement evaluation will prove to be of utmost importance.

To fully realize the potential of this game-changing technology, academics, developers, policymakers, as well as users must work together. This is the essence of the artificial intelligence journey. We can create a future where AI actually enhances and empowers our lives while respecting moral ideals and addressing societal concerns by increasing transparency, supporting responsible AI practices, and remaining attentive through constant monitoring.

The trip ahead has both possibilities and obligations as AI continues to develop, and it is entirely up to us to navigate this path intelligently with the common goal of contributing to creating a better, AI-driven future for everyone.

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button