Safe Cracking

Steven D Marlow
10 min read · Feb 14, 2022


Breaking into the black box, or trying to break out of it?

This post is based on my thoughts after reading through a 2022 AGI Safety Fundamentals alignment curriculum.


Are issues of safety really just issues of current trends in Machine Learning? If an understanding of ML terms is a prerequisite, or if “AGI” is described using the same processes, then there is an inherent disconnect between what is popular today and what some future (actually intelligent) system could be based on. Incremental progress in ML is great for leaderboards or media headlines, but the entire field is operating with blinders on. Any system that works, even if it’s a decade or more away, is still seen as being an extension of what they already know and understand (or don’t understand).

For some in the ML community, AGI might just be a deep learning method that not only works across multiple domains, but also retains that functionality as new domains are added. DeepMind had Agent57, which could play 57 different classic Atari games, but only when trained to do so, one at a time, with no ability to play a previous game where it had achieved “better than human performance.” Something that works has become less of an issue than trying to build something that doesn’t stop working.

While it’s hard to draw an exact line given how AGI is still not so different from other xNN/DL efforts, there is a historical divide between trying to simulate the workings of the brain vs an effort to implement behaviors in a non-brain-like system. Personally, I can’t bring myself to use AGI as a goal given its fall into the same over-commercialization that is marketing, not science. But from a functional perspective, it tends to represent the same goals as GOFAI, placing it at odds with ML methods.

In the context of safety, it is the dominant (ML) ideology framing the issue which leads to drafting of legislation that could be seen as specific to current implementations, and by extension, incremental developments. Any leap or out-of-the-blue method based on non-DL research that results in AGI or near AGI ability would be immune to any such coverage. It’s a rather ironic misalignment where what mostly works is actually far from the understanding needed to reach the advertised abilities.

Current levels of ML technology should be regulated based on what they fail to achieve rather than on what they promise they can do in the future.

Intelligence on Demand

A lot of safety concerns are actually post-AGI in nature. It’s a bit like trying to establish speed limits for cars on Mars right now. We are very far from living there and current city planning doesn’t just extend itself into an alien environment. There is no incremental highway design. That the AI community can’t agree on the meaning of fundamental terms is also reflected in its inability to rationalize the form of a superintelligence. It goes back to the idea of having blinders on, where a digital mind can have accelerated thinking, therefore, all science is accelerated. Except, the laws of physics don’t change, and experiments still take regular time.

Drug discovery is big in the news because of the ability to visualize protein folding, but that isn’t “new science” or an example of AI that can solve problems 100 times faster than humans. It’s ONLY a visualization tool. Human intuition and careful testing under controlled conditions of each candidate are still required.

A superintelligence that represents 1,000 human minds working together as a single mind, 24/7, doesn’t translate into that being 1,000 different people able to physically act as human “manipulators” in order to subjugate humanity. Thinking faster has nothing to do with hyper-connectivity with all levels of a functioning society. There is no leap to be made from just thinking about new forms of energy production to “BAM,” new forms of energy production popping into existence.

I don’t think we need to worry about safety issues when talking about cartoon-levels of impossibility. If I create a superintelligence that creates time travel, then sure, I’ll assume full responsibility for that outcome. Can there be any law that would prevent such a thing from happening (that is, one that hasn’t already been written into the fabric of reality)? At what age do people tell their children not to build a time machine? Have ethics and safety already failed because I wasn’t told, at a young age, in a legally binding way, not to build a digital mind that could invent time travel? Do we extend the argument to say Babbage and Turing are at fault for anything bad that happens in a post-AGI world because they were early contributors, setting us on that path?


We can’t talk about an AGI having goals without going back to the idea that the outcome of GOFAI would be a digital mind with some internal world representation that resulted in it having agency, vs simply extending current ML methodology where an artifact’s “goal” is just a reflection of whatever goal it was trained on. It can’t be described as active thinking vs passive thinking because any system with human level comprehension is demonstrating active thinking. Searle, with his Chinese Room, perfectly describes everything that is wrong with ML systems. It doesn’t, on its own, rule out the creation of a digital consciousness. A fair argument would be that “true AI” is the ability to collate the unified set of rules placed inside the room, which is a reflection of human minds creating the networks, datasets, weight adjustments, and algorithms that define ML.

I’ve said for years that anyone who associates reward functions with AGI just doesn’t get it. And just like trying to train an AI to play a video game based on how many points it scores, the idea of a “reward” for the correct form of behavior in a complex and dynamic environment is equally disconnected. It’s one thing to nudge an existing behavior with a shock or food pellet, but to assume you can create a behavior that way is like… placing someone in front of a piano, setting out some sheet music, and whipping them with a stick every 5 minutes until you get a better than average rendition of Beethoven’s Moonlight Sonata. Or, in strict ML terms, whipping them every time they play the wrong note, forcing them to trial and error at each step of the way (a method that works well in environments where a. no human is being harmed; b. you can iterate millions of times a day; and c. with each iteration there is no change in the steps required to attain the correct outcome).
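For anyone who wants to see the “wrong note” version spelled out, here is a toy Python sketch. It is not a real RL algorithm, and everything in it is made up for illustration: the `learn_sequence` function, the eight-symbol `NOTES` alphabet, and the short melody. It only shows what blind trial and error against a fixed target looks like.

```python
import random

# Hypothetical note alphabet, invented for this illustration.
NOTES = ["C", "D", "E", "F", "G", "A", "B", "rest"]


def learn_sequence(target, rng):
    """Learn `target` one position at a time by guessing until no penalty arrives.

    Returns the learned sequence and the total number of penalties.
    """
    learned = []
    penalties = 0
    for correct_note in target:
        ruled_out = set()
        while True:
            # Guess among the notes that have not been punished yet.
            guess = rng.choice([n for n in NOTES if n not in ruled_out])
            if guess == correct_note:
                learned.append(guess)
                break
            ruled_out.add(guess)  # wrong note: take the penalty, try again
            penalties += 1
    return learned, penalties


if __name__ == "__main__":
    melody = ["E", "G", "C", "E", "G", "C"]  # made-up target, not the actual sonata
    learned, penalties = learn_sequence(melody, random.Random(0))
    print(learned == melody, penalties)
```

The learner gets every note right only because the target never moves between attempts (condition c above). Nothing in the loop represents what a piece of music is.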

The phrase “inner alignment” implies agency, when most of the people using the term are strictly on team artifact. It seems like a way to spin a negative into something less negative? Bad if some RL algorithm is only ever trying to get the highest score in a game, but not so bad when it develops a secondary goal on its own based on actions that happen in the game world, unrelated to keeping score. This would be doing well on a muffin or dog challenge, not because a network learned to make a distinction between eyes and chocolate chips, but because it’s playing an internal game of tablecloth or not tablecloth. These things are notorious for finding statistical anomalies in images that humans don’t even notice because they are unrelated to the object of focus.
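The tablecloth game is easy to reproduce in miniature. The Python sketch below is a toy under loud assumptions: the two binary “features,” the six training examples, and the one-feature rule learner are all invented for illustration and stand in for a real image classifier only loosely.

```python
# Each "image" is reduced to two hand-made binary features:
# (dark_round_spots, tablecloth). All data here is invented.
FEATURES = ["dark_round_spots", "tablecloth"]
TRAIN = [
    ((1, 1), "muffin"),
    ((1, 1), "muffin"),
    ((0, 1), "muffin"),
    ((1, 0), "dog"),
    ((1, 0), "dog"),
    ((0, 0), "dog"),
]


def fit_one_feature_rule(data):
    """Pick the single feature (and polarity) that best matches the training labels."""
    best = None
    for idx in range(len(FEATURES)):
        for muffin_value in (0, 1):
            correct = sum(
                ("muffin" if x[idx] == muffin_value else "dog") == y
                for x, y in data
            )
            if best is None or correct > best[0]:
                best = (correct, idx, muffin_value)
    _, idx, muffin_value = best
    return idx, muffin_value


def predict(rule, x):
    """Apply the learned one-feature rule to a feature tuple."""
    idx, muffin_value = rule
    return "muffin" if x[idx] == muffin_value else "dog"


if __name__ == "__main__":
    rule = fit_one_feature_rule(TRAIN)
    print("learned to look at:", FEATURES[rule[0]])
    # A dog photographed on a tablecloth.
    print("dog on tablecloth ->", predict(rule, (1, 1)))
```

In this made-up training set, dark round spots (eyes or chocolate chips) show up in both classes, while the tablecloth lines up perfectly with “muffin,” so the rule that scores best on the training data never looks at the object at all.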

And if it isn’t an issue of having a hidden goal, the implementation of actions to reach a pre-approved outcome is just as likely to have unintended consequences. This is true for humans who are supposed to be aware of all the factors involved, but as “AGI” is just uber-powerful ML, still in a narrow operating domain, all other factors might as well not exist. Telling the domestic service robot you want Chinese for dinner shouldn’t be the start of a Black Mirror episode where the news is reporting a recent string of abductions of Asian people. General intelligence doesn’t just imply knowledge of other domains, but an ability to navigate human society and many of its idiosyncrasies.

*The drive for common sense, under the lens of ML, is more about the capture of things that are so obvious to us we are no longer consciously aware of them. This is wrong. System competency needs to be the actual focus. That sentiment “aligns” with many definitions of intelligence that claim it’s about actions which lead to correct outcomes.

Compound Interest

I don’t think a wider discussion of potential harms by some omnipotent AI is going to offer any “first steps” toward prevention. Even the safety curriculum spends time on the issue of not getting things perfectly right on the first try. It also makes mention of ML safety, or at least makes references to current trends in ML as being an important focal point regardless of how some AGI is created or how long that takes.

As with self-driving cars and the need to “align” them on morally correct and socially acceptable accidents, the current problem with autonomy is just getting the damn things to use the brakes. The divide between ML and a general intelligence is this narrow operation of keeping to the road vs a complete understanding of roadways, pedestrians, defensive driving, and a whole host of other driving-related activities you wouldn’t find coded into a driving game. It’s being worried about airport security 20 years before the Wright brothers’ first flight. That flight didn’t even hint at the infrastructure required to build and operate a modern international airport, nor would anyone at the time have been writing about the sound barrier or jet engines producing enough thrust to lift a steam engine.

An interesting diversion is that people routinely imagine completely impractical things that don’t generate any sort of safety concerns, and we could imagine the reasons being 1. Obviously never going to happen like that or 2. Obviously going to work exactly as described and there was no depiction of danger or risk. If Elon Musk announces a project to cannon shot a group of bikini models to the moon, where they will walk around, pose, and toss moon dust on each other as if it’s beach sand, no one is going to create a Bikini Models on the Moon safety council (The BMMC) to study the safety issues and risks to society by having them do so. Someone makes a horror film where the monster is actually a time-traveling robot wrapped in human flesh and OMG it becomes an existential risk to humanity.

As for divergent outcomes from tiny or unforeseen misalignment of goals, it also doesn’t matter if that is a stated goal or one a machine learning system found “hidden” in the data. With a real general intelligence, the fate of the world isn’t going to hinge on some exact phrasing of a statement given at the last minute. In its construction there would be a lot of “error correction” built into the process. If there is content in a book that a group finds “questionable” then there are a dozen other books that would expose it as such. In cases of unresolved human conflicts, I dare anyone to present the alignment material for ML systems that solves the issue once and for all.

*I’m only going to address the “because capitalism” argument as being one of implementation and “ownership of the tools” rather than a serious discussion of shaping a non-biological mind with only the whole of human history as a guide.

The takeaway should be less about threading an imaginary needle and more about how people with just or unjust intentions are using technology (ML), and how their goals are the ones at the center of any ethics or safety issue. Once we reach the level of AI everyone can agree on calling a general intelligence, we will be having discussions of AI and robots with a right to object to top level directives for social, political, or economic domination by the person or persons who have exclusive access to such technology.

Well there’s your Problem

Sections 4, 5 and 6 are explicitly about Machine Learning and are of zero relevance to a general intelligence. Their effectiveness within ML space is dubious given the one-note nature of ML systems. It really comes down to how they are constructed, rather than to anything working as an internal modulator during training. The bulk of these efforts focus on interpretability. That is, more humans twisting dials and rolling chicken bones on the floor in an effort to shape how they want systems to work (but why you can’t just swallow your pride and do it the good ol’ fashioned way with hand-written code is beyond me).

Rules are Hard to Make, but Leadership is Easy

There is a line from this section (week 7) that rubs me the wrong way: “focusing on the framing of AI governance as field-building.” It reminds me of a trope from ’80s sitcoms where the characters go to some event, but there is a “rug pull” to reveal it’s actually a nudist camp, a long pitch meeting for a time-share condo, or a gay cruise (which makes me wonder if there have ever been gay characters that went on a cruise that turned out to be for Christian singles).

Is the danger of misaligned AGI not so terrifying, so ominous on its own, that the idea of regulation has to be sold as helping grow the AI industry? That doesn’t strike anyone as an ethical conflict?

Then there is this: “In the taxonomy of AI governance given by Clarke (2022) in the optional readings (diagram below) this week’s governance readings focus on strategy research, tactics research and field-building, not on developing, advocating or implementing specific policies.”

Not on developing specific policies. Wow. Exactly the reason people like me feel years of taking AI ethics seriously was wasted. Safety, ethics, human rights protections: it doesn’t matter how people want to frame it, none of it has ever been about something actionable. Always a problem that has to be addressed. Always the same examples of ML systems failing. But if you try and point out that the largest “offenders” are the ones funding this new cottage industry, you’re shown the way outside the big tent while someone takes the champagne and canapé out of your hands.


The curriculum should have been called “Intro to ML Policy and Oversight” because it seems more concerned with teaching people to sound concerned, and gives them just the right talking points to sound knowledgeable on the subject, while the whole thing shows a complete lack of knowledge on the subject. Worse, the people that don’t know how off the mark this stuff is are going to sell themselves as experts on AI/AGI safety and could have a hand in writing policy!


