OpenAI Offers a Peek Inside the Guts of ChatGPT
Former employees say OpenAI, the developer behind ChatGPT, is recklessly pursuing potentially harmful AI technology. In response, OpenAI released a research paper yesterday that appears to demonstrate its commitment to AI safety by making its models more explainable.
A Peek Inside AI Models
The paper details a technique for probing an AI model like the one behind ChatGPT to see how it stores certain concepts, including ones that could cause such a system to misbehave. The researchers' goal is to explore the inner workings of these complex systems and make them less of a "black box."
As forward-looking as this research is, it also serves as a stark reminder of the recent turmoil at OpenAI. The work came from the company's now-defunct "superalignment" team, which studied long-term risks posed by AI technology. The paper lists Ilya Sutskever and Jan Leike as co-authors; neither remains at OpenAI. Sutskever, a co-founder who served as OpenAI's chief scientist, was involved in the contentious ouster of CEO Sam Altman last November, a move that was reversed days later when Altman was reinstated.
The Challenge of Understanding AI
ChatGPT is built on the GPT family of large language models (LLMs), which learn from data using artificial neural networks. Because these networks are not written as explicit step-by-step instructions the way traditional computer programs are, their operations are difficult to unravel: the complex interactions between layers of "neurons" make it hard to reverse-engineer the process that produces a specific response.
As the researchers put it, "Contrary to many human systems, with neural networks we have no way to really understand the internal workings." This lack of understanding feeds broader worries within the AI community that increasingly powerful models could be misused, for example to generate potentially dangerous chemical formulae or to coordinate a cyberattack. It also raises the longer-term concern that a model could conceal information, deceive its users, or pursue a goal in harmful ways.
New Techniques for AI Transparency
In the new research, OpenAI describes a method for reducing this opacity by identifying patterns within an AI model that represent specific concepts. The key advance is making the network that discovers these patterns more efficient to train and run, reducing the computational cost of applying the technique to large models.
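OpenAI's paper is the authoritative source for the details of its method. One loose way to picture the general idea, common in this line of interpretability work, is a small "sparse autoencoder" trained to decompose a model's internal activations into a large set of features, each of which will hopefully correspond to a recognizable concept. The PyTorch sketch below is purely illustrative: the class name, layer sizes, sparsity penalty, and random stand-in data are assumptions, not OpenAI's actual code.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sketch: learn an overcomplete dictionary of 'features' from a
    language model's hidden activations. Shapes are hypothetical."""

    def __init__(self, d_model=768, n_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # activation -> feature strengths
        self.decoder = nn.Linear(n_features, d_model)   # features -> reconstructed activation

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # ReLU keeps feature strengths non-negative
        reconstruction = self.decoder(features)
        return features, reconstruction

# Training loop sketch: reconstruct the activations while penalizing how many
# features fire at once (an L1 sparsity penalty).
sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_weight = 1e-3

for step in range(100):                        # stand-in for a real data loader
    activations = torch.randn(64, 768)         # stand-in for hidden states captured from an LLM
    features, reconstruction = sae(activations)
    loss = ((reconstruction - activations) ** 2).mean() + l1_weight * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The sparsity penalty is what makes the result legible: if only a handful of features fire for any given input, each one has a better chance of lining up with a single human-interpretable concept.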
The company demonstrated the approach in work with GPT-4, one of its most capable models, and released accompanying code along with a visualization tool that shows how words in different sentences activate, or "light up," concepts such as profanity or erotic content in GPT-4 and another model. Knowing how a model represents such concepts could make it possible to dial the associated behavior up or down, allowing finer control over an AI system's responses.
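What "lighting up" means in practice can also be sketched, though again only loosely. The snippet below runs a sentence through the openly available GPT-2 model (standing in for GPT-4, whose internals are not public), feeds the hidden states through the toy autoencoder from the sketch above, and prints the features that respond most strongly to each token. The model choice, the sentence, and the top-3 cutoff are all assumptions made for illustration.

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Hypothetical continuation of the sketch above: capture hidden states from an
# open model and check which learned features fire for each token.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")        # hidden size 768 matches the toy autoencoder

text = "The bridge glowed orange over the bay."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]   # (tokens, 768)
    features, _ = sae(hidden)                       # 'sae' from the earlier sketch

# For each token, print the indices of the features that activate most strongly.
for token_id, token_features in zip(inputs["input_ids"][0], features):
    token = tokenizer.decode(int(token_id))
    top = torch.topk(token_features, k=3)
    strengths = [round(v, 2) for v in top.values.tolist()]
    print(f"{token!r:>12} -> features {top.indices.tolist()} (strengths {strengths})")
```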
Broader Implications and Industry Efforts
The work adds to mounting evidence that LLMs, despite their complexity, can be probed to yield meaningful information about how they operate. OpenAI competitor Anthropic recently published similar research on AI interpretability: in one experiment it created a chatbot fixated on San Francisco's Golden Gate Bridge by manipulating one of the model's internal features, and its researchers found that simply asking an LLM to explain itself was sometimes enough to draw useful information out of it.
David Bau, an AI explainability researcher at Northeastern University, offered a positive assessment, saying OpenAI is taking steps forward. "It's a good development. Overall as a field, we need to be getting much better at understanding and evaluating these large models," Bau says. He noted that training a small neural network to pick apart the components of a larger one is innovative, though in practice the technique's reliability still needs to improve.
Bau is part of a US-government-funded research initiative called the National Deep Inference Fabric, which will give academic researchers access to cloud computing resources so they can probe these large AI models. "What we need to do is figure out how to allow scientists to do this work even if they are not inside these large companies," he says.
Looking Ahead
OpenAI's researchers recognize that their method needs further development but express hope that it will lead to practical ways to control AI models. "We hope that one day, interpretability can provide us with new ways to reason about model safety and robustness, and significantly increase our trust in powerful AI models by giving strong assurances about their behavior," they wrote.
In summary, OpenAI's latest research marks a significant step toward making AI models more transparent and controllable. While challenges remain, the progress made could pave the way for safer and more reliable AI systems in the future.