AI Is a Black Box. Anthropic Figured Out a Way to Look Inside



Last year, the team began experimenting with a tiny model that uses only a single layer of neurons. (Sophisticated LLMs have dozens of layers.) The hope was that in the simplest possible setting they could discover patterns that denote features. They ran countless experiments with no success. “We tried a whole bunch of stuff, and nothing was working. It looked like a bunch of random garbage,” says Tom Henighan, a member of Anthropic’s technical staff. Then a run dubbed “Johnny” (each experiment was assigned a random name) began associating neural patterns with concepts that appeared in its outputs.

“Chris looked at it, and he was like, ‘Holy crap. This looks great,’” says Henighan, who was stunned as well. “I looked at it, and was like, ‘Oh, wow, wait, is this working?’”

Suddenly the researchers could identify the features a group of neurons were encoding. They could peer into the black box. Henighan says he identified the first five features he looked at. One group of neurons signified Russian texts. Another was associated with mathematical functions in the Python computer language. And so on.
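The article doesn’t spell out the mechanics, but Anthropic has described this kind of feature-finding in its published interpretability research as dictionary learning: a small sparse autoencoder is trained to rewrite a layer’s activations as a sparse mix of learned directions, each of which ideally lines up with one human-readable concept. The sketch below is a minimal, illustrative version; the dimensions, data, and hyperparameters are invented for the example, not taken from Anthropic’s code.

```python
# Minimal sketch of a sparse autoencoder ("dictionary learning") over a model's
# activations. All sizes and data here are toy placeholders, not Anthropic's setup.
import torch
import torch.nn as nn

D_MODEL = 64      # width of the (single) layer being studied -- assumed
N_FEATURES = 512  # number of candidate features to learn -- assumed
L1_COEFF = 1e-3   # sparsity penalty on feature activity -- assumed

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts: torch.Tensor):
        # Each activation vector is rewritten as a sparse, non-negative mix of features.
        feats = torch.relu(self.encoder(acts))
        recon = self.decoder(feats)
        return recon, feats

# Toy "activations" standing in for what a tiny one-layer model would produce.
acts = torch.randn(2_048, D_MODEL)

sae = SparseAutoencoder(D_MODEL, N_FEATURES)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

for step in range(500):
    recon, feats = sae(acts)
    # Reconstruct the activations while keeping only a few features active at once.
    loss = ((recon - acts) ** 2).mean() + L1_COEFF * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```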

Once they confirmed they could identify features in the tiny model, the researchers set about the hairier task of decoding a full-size LLM in the wild. They used Claude Sonnet, the medium-strength version of Anthropic’s three current models. That worked, too. One feature that stuck out to them was associated with the Golden Gate Bridge. They mapped out the set of neurons that, when fired together, indicated that Claude was “thinking” about the massive structure that links San Francisco to Marin County. What’s more, when similar sets of neurons fired, they evoked subjects that were Golden Gate Bridge-adjacent: Alcatraz, California Governor Gavin Newsom, and the Hitchcock film Vertigo, which is set in San Francisco. All told, the team identified millions of features, a sort of Rosetta Stone to decode Claude’s neural net. Many of the features were safety-related, including “getting close to someone for some ulterior motive,” “discussion of biological warfare,” and “villainous plots to take over the world.”
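One way to picture how such a dictionary of features can be queried, both which feature “lights up” for a given activation and which features sit nearby (the bridge’s Alcatraz- and Vertigo-adjacent neighbors), is sketched below. The feature labels, directions, and activation data are all placeholders invented for illustration, not Claude’s real features.

```python
# Sketch of querying a learned feature dictionary: which features fire for a given
# activation, and which features point in nearby directions. Illustrative data only.
import torch
import torch.nn.functional as F

labels = {0: "Golden Gate Bridge", 1: "Alcatraz", 2: "Python math functions"}

# Pretend these came from a trained sparse autoencoder (see the earlier sketch).
decoder_dirs = F.normalize(torch.randn(512, 64), dim=-1)  # one unit direction per feature
act = torch.randn(64)                                     # activation for some prompt

# A feature "lights up" when the activation has a large component along its direction.
scores = decoder_dirs @ act
top = torch.topk(scores, k=3).indices.tolist()
print("active features:", [labels.get(i, f"feature {i}") for i in top])

# "Bridge-adjacent" features are those whose directions are most similar
# to the Golden Gate Bridge feature's direction.
sims = decoder_dirs @ decoder_dirs[0]
neighbors = torch.topk(sims, k=4).indices.tolist()[1:]  # skip the feature itself
print("nearby features:", [labels.get(i, f"feature {i}") for i in neighbors])
```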

The Anthropic team then took the next step, to see if they could use that information to change Claude’s behavior. They began manipulating the neural net to augment or diminish certain concepts, a kind of AI brain surgery with the potential to make LLMs safer and boost their power in selected areas. “Let’s say we have this board of features. We turn on the model, one of them lights up, and we see, ‘Oh, it’s thinking about the Golden Gate Bridge,’” says Shan Carter, an Anthropic scientist on the team. “So now, we’re thinking, what if we put a little dial on all these? And what if we turn that dial?”
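Anthropic hasn’t published the exact steering code, but the “dial” can be pictured as clamping a feature’s strength: measure how strongly the activation points along the feature’s direction, then shift it toward whatever value the dial is set to. The function and numbers below are a sketch under that assumption, not the team’s implementation.

```python
# Sketch of the "dial": nudge activations along a feature's direction to amplify a
# concept, or against it to suppress one. Direction, sizes, and dial values are assumed.
import torch

def apply_dial(acts: torch.Tensor, feature_dir: torch.Tensor, dial: float) -> torch.Tensor:
    """Return activations with the chosen feature amplified (large dial),
    suppressed (dial near zero), or left alone (dial equal to its current value)."""
    direction = feature_dir / feature_dir.norm()
    current = acts @ direction                                  # how strongly the feature fires now
    return acts + (dial - current).unsqueeze(-1) * direction    # clamp it to the dial value

acts = torch.randn(4, 64)        # activations for a few tokens (toy data)
golden_gate = torch.randn(64)    # stand-in for a learned feature direction

boosted = apply_dial(acts, golden_gate, dial=10.0)   # concept turned way up
muted = apply_dial(acts, golden_gate, dial=0.0)      # concept dialed to zero
```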

So far, the answer to that question seems to be that it is very important to turn the dial the right amount. By suppressing those features, Anthropic says, the model can produce safer computer programs and reduce bias. For instance, the team found several features that represented dangerous practices, like unsafe computer code, scam emails, and instructions for making dangerous products.


