Can’t finish what you don’t start.
About a week ago I stumbled onto a site called Metaculus, which has two “forecasting” posts related to Artificial General Intelligence. I looked at the suggested “arrival” dates, read through the comments, and had planned to post a short entry of my own. However, multiple sign-up attempts failed, so I settled for just trying to capture my thoughts, resolving to try again at a later time.
As added context, there has been a minor row on Twitter about Large Language Models, scaling, and why Machine Learning experts are embracing a term (AGI) that was created 20 years ago in opposition to basically everything the ML community is doing today (something about ML as the wolf in sheep’s clothing, shedding “AI” because they have given it a stink and need something new). Then this morning I saw this tweet and figured my initial comment had to be expanded into a standalone post.
Gato, the new “generalist” model created by DeepMind, has added fuel to the “General” AI fire. While Yann LeCun continues with his “there is no such thing as general intelligence” creed, the wider audience is holding their “AGI is Nigh” signs in the air. There is no indication that the “sub-units” activated for each pre-trained task interact with one another in any way, which makes Gato little more than a container for models that share neither pathways nor weights. What is missing, and what matters most, is the transfer learning aspect. Gato is far from being “General Intelligence,” or even a true “generalist” model.
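To make that distinction concrete, here is a minimal, hypothetical sketch in Python (my own construction, not DeepMind’s code): a “container” that routes each task to its own separately trained model, versus a single shared policy whose weights see every task and so can, at least in principle, transfer between them.

```python
# Illustrative assumption only: neither class describes Gato's real architecture.

class ModelContainer:
    """Per-task models with disjoint parameters: what one model learns never helps another."""
    def __init__(self, models):
        self.models = models                       # e.g. {"pong": f1, "caption": f2}

    def act(self, task, observation):
        return self.models[task](observation)      # no shared weights, so no transfer


class SharedPolicy:
    """One parameter set serves every task, which at least makes transfer possible."""
    def __init__(self, shared_fn):
        self.shared_fn = shared_fn                 # single function used for all tasks

    def act(self, task, observation):
        return self.shared_fn(task, observation)


if __name__ == "__main__":
    container = ModelContainer({"double": lambda x: 2 * x, "negate": lambda x: -x})
    shared = SharedPolicy(lambda task, x: 2 * x if task == "double" else -x)
    print(container.act("double", 3), shared.act("negate", 3))   # 6 -3
```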
While there are two forecasts regarding AGI, I don’t think the operational differences have much of an impact beyond the technical hurdle of being fully embodied. Both efforts are going to encounter the same initial development cycle and require the same level of “public” testing, but the non-embodied version is a precursor to the embodied one.
The four criteria set forth represent different milestones (for a “weak” AGI), and are not likely to be completed back-to-back, or even in the same year. They also don’t fully represent the development path.
Classic Atari 2600 games represent a non-stochastic training environment for Deep Learning models, one that allows for brute-force iteration and doesn’t prove anything on its own. An AGI, even at this early stage, will exhibit strong transfer learning ability across all game environments, including, perhaps ironically, 3D driving environments (making a 2-hour livestream of an AI driving around in GTA V, without incident, a better high-water mark).
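As a toy illustration of that point (my own construction, not an actual Atari benchmark): in a fully deterministic environment, an action sequence found by blind enumeration replays to the same score every time, so the score alone says nothing about transfer or understanding.

```python
# Toy sketch: brute force against a deterministic "environment" (invented for illustration).
from itertools import product

def deterministic_env(actions):
    """Reward 1 for each action that matches a fixed hidden pattern."""
    pattern = (0, 1, 1, 0, 1)
    return sum(1 for a, p in zip(actions, pattern) if a == p)

def brute_force(length=5):
    """Enumerate every action sequence and keep the highest-scoring one."""
    best = max(product((0, 1), repeat=length), key=deterministic_env)
    return best, deterministic_env(best)

if __name__ == "__main__":
    seq, score = brute_force()
    print(seq, score)   # the memorized sequence scores 5/5 on every replay
```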
Winograd Schema challenges only highlight the failings of current language models, and are not actually a “high barrier” for a system that has been correctly designed to “learn” languages. At issue is the mindset that says “language” is a good starting point. Between games and written text, there is an operational gap that has to be bridged. Robotics will play an important role in connecting “arcade actions” with physical objects that are defined using multiple modalities (3D space, having weight, physical properties, and of course the many language terms associated with them, including actions, such as “push soft ball onto floor”).
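For readers unfamiliar with the Winograd point above, here is one of the standard published schema pairs, shown as data for illustration: changing a single word flips which noun the pronoun refers to, which is exactly what trips up purely statistical models.

```python
# The classic trophy/suitcase schema, encoded as data for illustration.
schema = {
    "sentence": "The trophy doesn't fit in the suitcase because it is too {word}.",
    "variants": {
        "big": "the trophy",      # "it" refers to the trophy
        "small": "the suitcase",  # "it" refers to the suitcase
    },
}

for word, referent in schema["variants"].items():
    print(schema["sentence"].format(word=word), '-> "it" =', referent)
```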
Standardized admission testing (SAT) for college entry may be a fair way to test an AGI’s ability to understand the material presented and apply the relevant knowledge, but no current AI/ML method functions in the manner required (success rates depend on sample size and the “statistical alignment” of the content with the multiple-choice answers). There is no direct path from blank slate to reading comprehension, so before attempting to pass such a test (to satisfy the conditions of the prediction), there are going to be at least two other milestones. The first is likely to be examples of summarization of stories, or an ability to read a book written for young adults and give short, satisfactory answers to any questions asked. The natural follow-up in development is multilingual ability, such as the complete translation of a book, or reading a book written in one language and answering questions posed in many others (answering in whichever language the question was asked).
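As a rough sketch of what I mean by “statistical alignment” (a deliberate oversimplification of my own, not a description of any specific model): pick whichever multiple-choice option overlaps most with the passage’s wording, and you will score well whenever the test happens to reward that alignment, with no comprehension involved.

```python
# Invented example: answer multiple-choice questions by lexical overlap alone.
def overlap_score(passage, option):
    """Fraction of the option's words that also appear in the passage."""
    p, o = set(passage.lower().split()), set(option.lower().split())
    return len(p & o) / len(o)

def pick_answer(passage, options):
    """Choose the option that is most 'statistically aligned' with the passage."""
    return max(options, key=lambda opt: overlap_score(passage, opt))

passage = "The rapid expansion of rail lines lowered shipping costs across the region."
options = [
    "Shipping costs fell as rail lines expanded.",   # high lexical overlap
    "Farmers abandoned river transport entirely.",   # no overlap with the passage
]
print(pick_answer(passage, options))
```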
While the no-longer-running Loebner Silver Prize (a version of the Turing Test) only represents text-based interactions, the public testing of an AGI (by a wide range of experts, including critics, rather than just being a “friends and family only” invite for good public relations) is going to favor a version of the Gold Prize. This will likely be a browser-based interface that allows for the dragging and dropping of text, images, audio files, video clips, or links to content that will be a basis for discussion. Text and/or audio replies are likely, but despite the trend in generative images, I would expect the demonstrations to be limited to the understanding of content, not “creative reconstruction.”
I want to expand on this last point by saying a non-public demonstration may exhibit all of the required abilities, but be limited to an 8th grade level, rather than a college level, and with greatly limited exposure to the kind of content relevant to open-ended Q&A sessions. The research and internal testing should determine the pace of expansion (or signal that more understanding is required before attempting to launch a product or service that only mostly works). By the time all four criteria have become publicly demonstrated, assuming they exceed the listed success rates, such an AGI will have already moved away from being “weakly general” in nature.
This brings us to the four criteria that have been set to measure the existence of “strong” AGI.
There is confusion about what “general” actually signifies, so I find it unusual to see a series of competency tests in more specialized fields used as a marker for a system that shouldn’t be exposed to them yet. To say that another way, general intelligence is a reflection of the common knowledge required to be a functioning adult, not of specific cognitive abilities that form the basis of “knowing all the things.” Regardless of the view taken, understanding criminal law, or the history of the Roman Empire, is more of an application of AGI. This is the growth of the system, presumably “on its own,” and should really fall under some kind of post-Singularity framing. As the referenced paper unironically states, there is a gap between the performance of unspecialized humans from Amazon Mechanical Turk and expert-level performance.
Along the same line of requiring “expert” knowledge, having a series of coding challenges seems to fall under a specialization. In fact, there is nothing preventing this isolated milestone from being included with the above set (though there is an argument to be made for “code that works” being a less fallible measure than the subjective interpretation of answers to questions about politically charged topics). The ability to code may be the only challenge that fully represents human cognition, but it feels like a final exam rather than an early indicator. We should first expect to see high success rates with online coding challenges that are geared toward programming-language basics.
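Here is the kind of “language basics” challenge I have in mind, together with the objective pass/fail checks that make “code that works” a less fallible measure than graded free-form answers; the task and its tests are my own invention, not drawn from any particular challenge site.

```python
# Hypothetical basics-level challenge with automated grading (invented example).
def most_frequent_word(text):
    """Challenge: return the word that appears most often (ties broken arbitrarily)."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return max(counts, key=counts.get)

# Automated grading: the submission either passes these checks or it does not.
assert most_frequent_word("the cat sat on the mat") == "the"
assert most_frequent_word("a b b c") == "b"
print("all checks passed")
```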
As an overlap with testing a “weak” AGI system, there is a suggested protocol for an Adversarial Turing Test that includes the passing of media content in a manner similar to the one I described above. The notion of testing a system against human opponents, or even having the AI/AGI pretend to be human, has long fallen out of favor. It is interesting to note, though, that acting human requires a much deeper understanding of other minds and human psychology. It’s an example of how “faking it” bypasses two challenges at the same time, making the problem seem far easier.
While the first two metrics really should represent post-AGI, and the third is a strong indicator of a “weak” AGI, the fourth criterion expands on the operational gap I mentioned earlier. In terms of cognitive ability, the construction of a plastic model (cutting, sanding, gluing, painting, and applying decals) from a kit should already be a “doable” task. At issue are the fine motor skills and the engineering required to build hands and fingers that mimic human-level dexterity. To use the film Ex Machina as an example, the “mind tech” feels decades ahead of the physical body, which is glossed over as if it weren’t the real star of the show. We have many examples of “rehearsed” gross motor skills, and should be watching for non-AGI milestones. An example would be a proving ground (DARPA Challenge) for high-mobility bipeds, such as “chicken walkers,” or an obstacle course for quadrupeds where speed is a factor. (Should we just drop the “search and rescue” pretext?)
Here is my 85 cents’ worth of prediction:
Pre-language and robotic ability are going to require the most effort to “get right,” and there is no indication that a worthy demonstration will happen in the next 5 years. Progress in the next 8 years could result in accelerated returns that place a safe date for AGI by 2035 (short development window once problem space is fully realized), but reaching a “strong” AGI by the end of this decade is a long shot, even if “the correct efforts” are already under way. Of course, by the same logic, any effort that has been ongoing for more than a few years, and hasn’t at least hit one of the milestones suggested, is unlikely to be on the right path.
Of more importance than public achievement (earliest milestone by 2028?) is the credibility of the design. Current trends favor splashy headlines, but the lack of operational progress means zero real-world advancement toward machines that think in a meaningful way. The better timetable would be based on two demonstrations, by the same system, that can be counted as actual experience, rather than the gluing together of disparate models. That date, plus 3 years (pwning all the white-collar jobs within 5).