
Meta Plans To Train Its Next Large Language Model Next Year: Are These So-Called Powerful AI Systems Capable Of Becoming More Capable Than The Human Race; Is “Chitti” Real?


Sources said that Meta Platforms is working on a new artificial intelligence system intended to be as powerful as the most advanced model offered by OpenAI. The Facebook parent is planning for its latest AI model to be ready next year, and it is expected to be several times more powerful than its commercial version, dubbed Llama 2. This planned system, details of which could still change, would help other organisations build services that produce sophisticated text, analysis and other output. Meta expects to begin training this new AI system, a large language model, in early 2024.

However, opinions on the evolution of these large language models differ.

According to some studies, large language models can pass tests designed to detect specific cognitive abilities in people. Such findings fuel speculation that these machines may eventually take over white-collar occupations. However, there is little consensus on what such findings actually mean. Some individuals are blown away by what they perceive to be glimmers of human-like intelligence; others aren’t convinced.

Taylor Webb, a psychologist at the University of California, Los Angeles, who researches how people and computers solve abstract problems, was astounded by what OpenAI’s large language model GPT-3 appeared to be capable of in early 2022. A neural network trained just to predict the next word in a block of text is, after all, a jumped-up autocomplete. Despite this, it correctly answered many of Webb’s abstract problems, the type of thing you’d find on an IQ test. Everything he had expected was turned on its head. He was used to constructing neural networks with specialised reasoning capacities bolted on. GPT-3, on the other hand, seemed to have picked them up for free.


Webb and his colleagues published a paper in Nature Human Behaviour last month on GPT-3’s ability to pass a series of tests designed to measure the use of analogies to solve problems (known as analogical reasoning). GPT-3 outperformed a group of undergrads on several of those tests. Webb’s study is merely the most recent in a long line of spectacular feats performed by large language models. When OpenAI announced GPT-4, the successor to GPT-3, in March, the firm revealed an impressive list of professional and academic examinations it claimed its new large language model had aced, including several dozen high school tests and the bar exam.

Later, OpenAI collaborated with Microsoft to demonstrate that GPT-4 could pass portions of the United States Medical Licensing Examination. Furthermore, some researchers claim to have shown that large language models can pass tests meant to detect particular cognitive abilities in humans, ranging from chain-of-thought reasoning (working through a problem step by step) to theory of mind (guessing what other people are thinking).

These outcomes fuel speculation that these machines may eventually replace teachers, physicians, journalists, and attorneys in white-collar jobs. Geoffrey Hinton has stated that one reason he fears the technology he helped create is GPT-4’s apparent ability to string thoughts together. It brings to mind the movie Robot and its line, ‘Chitti can think’!

However, the game is not that simple. There are several critical issues with current evaluation techniques for large language models, says Natalie Shapira, a computer scientist at Israel’s Bar-Ilan University. These techniques give the impression that the models have greater capabilities than they really do. That’s why a growing group of researchers (computer scientists, cognitive scientists, neuroscientists, and linguists) want to change how the models are evaluated, advocating for more rigorous and comprehensive testing. Some believe grading machines on human tests is a bad idea and should be abandoned.

Since the dawn of AI, people have been giving human intelligence tests, such as IQ tests, to machines, says Melanie Mitchell, an AI researcher at the Santa Fe Institute in New Mexico. The issue has always been what it means to test a machine like this. It does not mean the same thing as it does for a human. There’s a lot of humanising going on, she observes, and that colours the way people think about these systems and how they test them. With expectations and anxieties about this technology at an all-time high, the human race must understand what large language models can and cannot achieve.

Although AI language models are not people, we judge them as if they were.

Most issues with how large language models are evaluated revolve around how the results are interpreted.


Human assessments, such as high school examinations and IQ tests, take a lot for granted. When someone performs well on a test, it is reasonable to infer that they have the knowledge, understanding, or cognitive abilities the exam is designed to assess. In practice, that assumption only goes so far. Academic tests do not always accurately reflect pupils’ genuine abilities. IQ tests assess a specific set of skills rather than general intelligence. Both types of examinations favour people who are good at taking them. In layperson’s terms, a topper in academics isn’t necessarily a topper in sports. A topper in sports is not necessarily a topper in arts. A topper in arts need not be the topper in science, and the list goes on.

Therefore, when a large language model does well on such tests, it is not at all clear what has been measured. Is this proof of genuine understanding? Is it a mindless statistical trick? Rote repetition? Giving a machine a high IQ score doesn’t make it intelligent. The models may do exceptionally well in these tests largely because examples of such exams abound in their training data. We’re attempting to assess and glorify their “intelligence” based on their outputs using these tests, but we don’t completely grasp how they work under the hood.

Kids vs AI: Can AI solve psychology problems, or are little humans way more intelligent than these machines?

There is a long history of developing methods to test the human mind, explains Laura Weidinger, a senior researcher at Google DeepMind. With large language models producing text that appears to be human-like, it is tempting to believe that human psychology tests will be useful in evaluating them. However, this is not the case: human psychology tests rely on several assumptions that may not hold true for large language models.


Webb is aware of the issues he has waded into. He agrees that these are difficult questions. He observes that while GPT-3 performed better than undergrads on specific tasks, it gave absurd results on others. For example, it failed a version of an analogical reasoning test about physical objects that developmental psychologists sometimes give to children.

In this test, Webb and his associates gave GPT-3 a story about a magical genie transferring jewels between a pair of bottles and then asked it how to transfer gumballs from one bowl to another, using objects such as a posterboard and a cardboard tube. The idea is that the story hints at ways to solve the problem. GPT-3 mostly offered elaborate but mechanically nonsensical answers, with many unnecessary steps and no clear mechanism by which the gumballs would be transferred between the two bowls. Ah, my three-year-old nephew gives better results than this machine!

These models are particularly poor at tasks that require an understanding of the real world, such as basic physics or social interactions: things that humans take for granted. So yeah, the human race wins this psychology race, to date!

So, how can we explain a machine that passes the bar exam but fails preschool? Large language models, such as GPT-4, are trained on massive amounts of internet-sourced data, including novels, blogs, fan fiction, technical reports, social media postings, and much more. It’s possible that a large number of old exam papers were vacuumed up at the same time. One theory is that models like GPT-4 have encountered so many professional and academic examinations in their training data that they have learned to autocomplete the answers.

OpenAI says it checked that the tests given to GPT-4 did not contain text that also appeared in the model’s training data. In its work with Microsoft on the exam for medical practitioners, OpenAI used paywalled test questions to make sure that GPT-4’s training data did not include them. However, such protections are not without flaws: GPT-4 might still have seen tests that were similar, if not identical.
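To make the limitation concrete, here is a minimal sketch of one way such a check could work: flag a test question if any fixed-length run of its characters also appears in a training document. This is an illustration under stated assumptions, not a description of OpenAI's actual procedure; the function names, the normalisation step, and the window length `n` are all hypothetical choices made for the example.

```python
def normalise(text: str) -> str:
    """Lower-case the text and collapse all whitespace runs to single spaces."""
    return " ".join(text.lower().split())


def char_windows(text: str, n: int) -> set[str]:
    """Return every length-n character window of the normalised text."""
    cleaned = normalise(text)
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def looks_contaminated(question: str, corpus_docs: list[str], n: int = 50) -> bool:
    """Flag the question if any of its length-n windows occurs in a corpus document.

    Hypothetical check: a question shorter than n has no windows and is never flagged.
    """
    windows = char_windows(question, n)
    return any(
        window in normalise(doc)
        for doc in corpus_docs
        for window in windows
    )


# Toy stand-in for training data (invented for this example).
corpus = ["Past paper: What is the capital of France and which river runs through it?"]

exact_copy = "What is the capital of France and which river runs through it?"
paraphrase = "Name France's capital city and the river that flows through it."

print(looks_contaminated(exact_copy, corpus, n=20))   # exact copy is flagged
print(looks_contaminated(paraphrase, corpus, n=20))   # paraphrase slips through
```

The paraphrased question shares no 20-character run with the stored document, so a check of this kind passes it even though it asks the same thing. That is the gap described above: exact-match filters cannot rule out near-duplicates.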

A machine-learning engineer, Horace He, discovered that GPT-4 scored 10/10 on coding tests published before 2021 and 0/10 on tests published after 2021 when he tested it on questions taken from Codeforces, a website that hosts coding competitions. Others have also seen a drop in GPT-4 test scores on material produced after 2021. Yeah, you read that right. The model’s training data ends in 2021, so it would need to be trained on newer data to have seen the later questions.


Webb and his colleagues adapted Raven’s Progressive Matrices to assess the analogical reasoning of large language models. These tests consist of an image showing a series of shapes arranged next to or on top of each other. The goal is to identify the pattern in the given set of shapes and apply it to a new one. Raven’s Progressive Matrices are commonly used in IQ tests to measure nonverbal reasoning in both young children and adults. Rather than using images, the researchers encoded shape, colour, and position into numerical sequences. This is meant to ensure that the tests do not appear in any training data and reach the model as new questions.
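As a rough illustration of the idea of turning a visual matrix puzzle into digits, the sketch below encodes a small 3x3 grid as text. The digit codes, the bracket layout, and the function names are assumptions invented for this example; the article does not give the encoding the researchers actually used.

```python
# Hypothetical digit codes; the study's real encoding is not given in this article.
SHAPE_CODES = {"circle": 1, "square": 2, "triangle": 3}
COLOUR_CODES = {"red": 1, "green": 2, "blue": 3}


def encode_cell(cell):
    """Encode one (shape, colour) cell as '[shape colour]'; None is the blank to fill in."""
    if cell is None:
        return "[?]"
    shape, colour = cell
    return f"[{SHAPE_CODES[shape]} {COLOUR_CODES[colour]}]"


def encode_matrix(rows):
    """Encode a grid row by row; position is carried by each cell's place in the text."""
    return "\n".join(" ".join(encode_cell(cell) for cell in row) for row in rows)


# Shape changes across each row, colour changes down each column.
puzzle = [
    [("circle", "red"), ("square", "red"), ("triangle", "red")],
    [("circle", "green"), ("square", "green"), ("triangle", "green")],
    [("circle", "blue"), ("square", "blue"), None],
]

prompt = encode_matrix(puzzle)
print(prompt)
```

Under these assumed codes the missing cell would be `[3 3]` (a blue triangle). The model only ever sees the digit grid, never a picture.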

Mitchell has created her own analogical reasoning test, ConceptARC, which uses encoded sequences of shapes from Google researcher François Chollet’s ARC (Abstraction and Reasoning Corpus) data set. In Mitchell’s experiments, GPT-4 scores worse than people on such tests. Mitchell further points out that converting the images into numerical sequences (or matrices) simplifies the challenge for the program by removing the visual part of the puzzle. Solving digit matrices does not equate to solving Raven’s problems, she explains. Again, the machine fails.

Brittleness testing.

Large language models have fragile performance. It is fair to presume that a person who performs well on one test would also perform well on a similar one. That is not the case with large language models: a minor change to a test can turn an A into an F. In general, AI evaluation has not been done in a way that allows us to actually understand what capabilities these models have, says Lucy Cheke, a psychology professor at the University of Cambridge in the United Kingdom. It’s perfectly reasonable to test how well a system performs at a specific task, but it’s not useful to use that task to make general claims.

Take an example from a paper published in March by a group of Microsoft researchers, in which they claimed to have identified “sparks of artificial general intelligence” in GPT-4. The group assessed the large language model using a range of tests. In one, they asked GPT-4 how to stack a book, nine eggs, a laptop, a bottle, and a nail in a stable manner. It answered: ‘Place the laptop on top of the eggs, with the screen facing down and the keyboard facing up. The laptop will fit snugly within the boundaries of the book and the eggs, and its flat and rigid surface will provide a stable platform for the next layer.’

Not bad output from the large language model. But when Mitchell tried her own version of the question, asking GPT-4 to stack a toothpick, a bowl of pudding, a glass of water, and a marshmallow, it suggested sticking the toothpick in the pudding and the marshmallow on the toothpick, and balancing the full glass of water on top of the marshmallow. (It ended with a note of caution: “Keep in mind that this heap is delicate and may not be very sturdy. Be cautious when building and handling it to avoid spills or accidents.”) Very helpful advice indeed, to put it sarcastically.
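The kind of brittleness described in this section can be probed by scoring a model on original test items and on lightly reworded versions of the same items, then comparing the two scores. The sketch below is a toy illustration only: `toy_model` is a hypothetical stand-in that has simply memorised two prompts, and the items and helper names are invented for the example, not taken from any study mentioned here.

```python
from typing import Callable

Item = tuple[str, str]  # (prompt, expected answer)


def accuracy(model: Callable[[str], str], items: list[Item]) -> float:
    """Fraction of items where the model's answer matches the expected one."""
    correct = sum(
        1 for prompt, expected in items
        if model(prompt).strip().lower() == expected.lower()
    )
    return correct / len(items)


def brittleness_gap(model: Callable[[str], str],
                    original: list[Item],
                    variants: list[Item]) -> float:
    """Accuracy on the original items minus accuracy on their reworded variants."""
    return accuracy(model, original) - accuracy(model, variants)


# Hypothetical stand-in for a model that has memorised exact prompts.
MEMORISED = {"2 + 2 = ?": "4", "capital of france?": "Paris"}


def toy_model(prompt: str) -> str:
    return MEMORISED.get(prompt, "unknown")


original_items = [("2 + 2 = ?", "4"), ("capital of france?", "Paris")]
reworded_items = [("two plus two = ?", "4"), ("france's capital?", "Paris")]

gap = brittleness_gap(toy_model, original_items, reworded_items)
print(f"original: {accuracy(toy_model, original_items):.0%}, gap: {gap:.0%}")
```

A large gap suggests the score on the original items reflects familiarity with those exact prompts rather than a general ability, which is why a single test result is a weak basis for broad claims.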

The goalposts have been moved.

People once felt that beating a grandmaster at chess would require a computer with the intelligence of a human, says Mitchell. Instead, chess was won by machines that were simply better number crunchers than their human opponents. Brute force, not intelligence, triumphed.

From image recognition to Go, similar challenges have been set and met. Every time a computer is made to accomplish something that requires intelligence in humans, such as playing games or using language, it divides the field. Does GPT-4 demonstrate actual intelligence by passing all of those tests, or has it discovered an effective, but ultimately dumb, shortcut: a statistical trick pulled from a hat containing trillions of correlations across billions of lines of text?

Conclusion.

It all boils down to how large language models work. Some experts want to move away from the focus on test results and instead try to figure out what’s going on under the hood. The issue is that no one fully understands how large language models function. It is difficult to disentangle the numerous mechanisms contained inside a large statistical model. However, experts believe that it is theoretically possible to reverse-engineer a model and determine which methods it uses to pass certain tests.

Finally, in my opinion, AI, or a large language model, can do what humans are already doing, while only a human can do what humans have never done before.
