Subbarao Kambhampati’s Post


Prof at ASU (Former President of AAAI)

My (pure) speculation about what OpenAI o1 might be doing [Caveat: I don't know anything more about the internal workings of o1 than the handful of lines about what they are actually doing in that blog post--and on the face of it, it is not more informative than "It uses Python, er.. RL".. But here is what I told my students as one possible way it might be working]

There are two things--RL and "private CoT"--that are mentioned in the writeup. So imagine you are trying to transplant a "generalized AlphaGo"--let's call it GPTGo--onto the underlying LLM token-prediction substrate. To do this, you need to know: (1) What are the GPTGo moves? (For AlphaGo, we had Go moves.) What would be the right moves when the task is just "expand the prompt"? (2) Where is it getting its external success/failure signal from? For AlphaGo, we had simulators/verifiers giving the success/failure signal. The most interesting question in adapting the self-play idea for a general AI agent is where it gets this signal. (See e.g. https://lnkd.in/g7ujAHBe )

My guess is that the moves are auto-generated CoTs (thus the moves have a very high branching factor). Let's assume--for simplification--that we have a CoT-generating LLM that generates these CoTs conditioned on the prompt. The success signal comes from training data with correct answers: when the expanded prompt seems to contain the correct answer (presumably LLM-judged?), it is a success; if not, a failure.

The RL task is: given the original problem prompt, generate and select a CoT, and use it to continue extending the prompt (possibly generating subgoal CoTs after every few stages). Get the final success/failure signal for the example (for which you do have the answer). Loop over a gazillion training examples with answers, and multiple times per example. [The training examples with answers can come either from benchmarks or from synthetic data with problems and their solutions--using external solvers; see https://lnkd.in/g7ujAHBe ] Let RL do its thing to figure out credit/blame assignment for the CoTs that were used in that example. Incorporate this RL backup signal into the CoT generator's weights (?). <At this point, you now have a CoT generator that is better than before the RL stage.>

During the inference stage, you can basically do rollouts (a la the original AlphaGo) to further improve the effectiveness of the moves ("internal CoTs"). The more and deeper the rollouts, the longer the inference time. My guess is that what o1 prints as a summary is just a summary of the "winning path" (according to it)--rather than the full rollout tree.

===

Some corollaries

1. This can at least be better than just fine-tuning on the synthetic data (again see https://lnkd.in/g_GZYtGh ); we are getting more leverage out of the data by learning move (auto-CoT) generators. [Think behavior cloning vs. RL..]

Contd. in the comment below 👇🏻
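[Purely as an illustration of the speculation above--not anything OpenAI has confirmed--here is a toy sketch of the described loop: auto-generated CoTs as moves, an answer-matching success signal during training, and AlphaGo-style rollouts at inference. Every function name here (generate_cot, answer_reached, rl_update, rollout) is a hypothetical stand-in, not a real API.]

```python
# Toy sketch of the speculated "GPTGo" loop. All functions are hypothetical placeholders.
import random

def generate_cot(prompt, n_candidates=4):
    """Hypothetical CoT-generating LLM: propose candidate reasoning 'moves'."""
    return [f"{prompt} [CoT-{i}]" for i in range(n_candidates)]

def answer_reached(expanded_prompt, gold_answer):
    """Success signal: does the expanded prompt contain the known answer?
    (Presumably LLM-judged in practice; plain string match here for illustration.)"""
    return gold_answer in expanded_prompt

def rl_update(trajectory, reward):
    """Placeholder for credit/blame assignment over the CoT 'moves' in this episode,
    folded back into the CoT generator's weights."""
    pass

def train(examples, episodes_per_example=8, max_depth=5):
    for prompt, gold in examples:
        for _ in range(episodes_per_example):
            state, trajectory = prompt, []
            for _ in range(max_depth):
                move = random.choice(generate_cot(state))  # very high branching factor
                trajectory.append(move)
                state = move                               # "expand the prompt"
                if answer_reached(state, gold):
                    break
            rl_update(trajectory, reward=1.0 if answer_reached(state, gold) else 0.0)

def rollout(prompt, max_depth=5):
    """One inference-time rollout; the score is a stand-in for an internal value estimate."""
    state, score = prompt, 0.0
    for _ in range(max_depth):
        state = random.choice(generate_cot(state))
        score += 0.1
    return state, score

def infer(prompt, n_rollouts=16):
    """More/deeper rollouts = more wall-clock time. What the user sees would be a
    summary of the best path, not the whole rollout tree."""
    best = max((rollout(prompt) for _ in range(n_rollouts)), key=lambda ps: ps[1])
    return best[0]
```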

Subbarao Kambhampati

Prof at ASU (Former President of AAAI)

6d

(contd. from main post..)

2. There still will not be any guarantees that the answers provided are "correct"--they may be probabilistically a little more correct (subject to the training data). If you want guarantees, you will still need some sort of LLM-Modulo approach even on top of this (c.f. https://arxiv.org/abs/2402.01817).

3. It is certainly not clear that anyone will be willing to really wait for long periods of time during inference. See https://x.com/rao2z/status/1834314950413877496… The kind of people who will wait for longer periods would certainly want guarantees--and there are deep and narrow System 2's aplenty that can be used for many such cases.

4. There is a bit of a Ship of Theseus feel to calling o1 an LLM--considering how far it is from the other LLM models (all of which essentially have teacher-forced training and sub-real-time next-token prediction). That said, this is certainly an interesting way to build a generalized System 2'ish component on top of LLM substrates--but without guarantees. I think we will need to understand how this would combine with other efforts to get System 2 behavior--including LLM-Modulo (https://arxiv.org/abs/2402.01817), which gives guarantees for specific classes.
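[For reference, a minimal sketch of the generate-test LLM-Modulo loop referred to in points 2 and 4, assuming a hypothetical llm_propose call and an external sound verifier; the critique-driven back-prompting against that verifier is what supplies the guarantees the LLM alone cannot.]

```python
# Minimal sketch of a generate-test LLM-Modulo loop. llm_propose and verify are
# hypothetical placeholders (in practice, verify might be a sound plan validator).

def llm_propose(problem, feedback=None):
    """Hypothetical LLM call: draft a candidate solution, optionally conditioned
    on critiques from earlier failed attempts."""
    return f"candidate solution for {problem!r} (feedback: {feedback})"

def verify(problem, candidate):
    """External sound verifier: returns (is_correct, critique). Stubbed out here."""
    return False, "precondition of step 2 not satisfied"

def llm_modulo(problem, max_rounds=10):
    feedback = None
    for _ in range(max_rounds):
        candidate = llm_propose(problem, feedback)
        ok, critique = verify(problem, candidate)
        if ok:
            return candidate   # correct with respect to the external verifier
        feedback = critique    # back-prompt the LLM with the critique
    return None                # no verified solution within the budget
```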

Kris Holland

Futurist. Realist. Innovator. Engineering Grade Artist.

6d

How much do we need to speculate? OAI has posted more or less what the secret sauce is--a chain of thought on an unmodified backend with a filtered middle. The only real question is how they're executing it. I would say they have a mini instance that checks it for compliance and produces a summary. From what I've seen on YouTube, the biggest thing is giving the LLM enough time and space to dribble on through messy thinking. Which, arguably, isn't sustainable. But it would certainly explain why they're thinking of $2k/mo. At that price point, you're a timeshare owner on an H100.
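[If this "filtered middle" reading is right, the serving pipeline might look roughly like the sketch below. This is speculation; main_model_think and mini_model_filter are made-up placeholders, not OpenAI APIs.]

```python
# Speculative sketch: a long hidden CoT from the main model, plus a small "mini"
# model that checks it for policy compliance and emits only a summary to the user.

def main_model_think(prompt, thinking_budget_tokens=20_000):
    """Hypothetical: let the big model 'dribble on' at length, unfiltered and hidden."""
    return f"<long messy reasoning about {prompt!r}>"

def mini_model_filter(raw_cot):
    """Hypothetical mini instance: flag policy issues and summarize the CoT."""
    compliant = "forbidden" not in raw_cot          # stand-in for a compliance check
    summary = raw_cot[:200] + "..."                 # stand-in for an actual summary
    return compliant, summary

def answer(prompt):
    raw_cot = main_model_think(prompt)              # never shown to the user
    compliant, summary = mini_model_filter(raw_cot)
    return summary if compliant else "[reasoning withheld]"
```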

L. Thorne McCarty

Professor of Computer Science and Law, Emeritus, at Rutgers University

9h

I appreciate your efforts to decipher how o1 works. But as an academic, don't you think it is offensive that this is even necessary? The normal standard for scientific research puts the burden on the researchers to explain their work to the scientific community. And these explanations must be specific enough that the work can be replicated. (This is also the standard under patent law.) In this AI hype cycle, the burden of proof has been reversed. This is not scientific, and we should not pretend that it is.

Brody Kutt

Sr. Principal MLE, AI Research at Palo Alto Networks

6d

Speculating: the o1 model appears to be an application of this paper published last year (https://openai.com/index/improving-mathematical-reasoning-with-process-supervision/) wherein you supervise every step of CoT, not just the outcome.
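[For readers unfamiliar with that work ("Let's Verify Step by Step"), the core idea is a process reward model (PRM) that scores every intermediate CoT step rather than only the final answer. A rough sketch, with score_step as a hypothetical stand-in for the trained PRM:]

```python
# Sketch of process supervision: a reward model scores each intermediate CoT step,
# and candidate solutions are ranked by their aggregated step scores.

def score_step(problem, steps_so_far, new_step):
    """Hypothetical PRM: probability that new_step is a correct continuation."""
    return 0.9

def solution_score(problem, steps):
    """Aggregate per-step scores (the paper uses the product of step-correctness probabilities)."""
    score, prefix = 1.0, []
    for step in steps:
        score *= score_step(problem, prefix, step)
        prefix.append(step)
    return score

def best_of_n(problem, candidate_solutions):
    """Best-of-N selection: keep the candidate CoT whose step-by-step score is highest."""
    return max(candidate_solutions, key=lambda steps: solution_score(problem, steps))
```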

James Bentley

AI and Strategy Director @ Awin (Axel Springer)

6d

In the technical paper OpenAI stated that they will keep the CoT 'hidden'. But in the cipher example they show the 'expanded' CoT self-monologue. At times the monologue switches from first person to first person plural, as though it is dialoguing with multiple verification agents at each stage, with a central orchestrating agent / task leader. What I found so curious was how the model repeatedly says 'hmm' during the CoT. I understand that people do that, almost unintentionally, but what benefit or purpose would 'hmm' serve to an LLM agent when solving a problem? Unless somehow 'hmm', as a precursor token, pushes the model toward the problem-solving region / pathway in the latent space?

Vishnu Prateek Kakaraparthi, Ph.D.

Ph.D. Student | Gen AI | Data Science | Computer Vision | Machine Learning | Wearables | Smart Cities | HCI | Inventor | Patent Holder | Educator | ASU | CEO and Founder KonnectMe

1d

OpenAI has reportedly issued warnings to users probing the reasoning processes of its o1 model, citing policy violations. The company avoids exposing the model's raw "chain of thought" to users to maintain a competitive advantage and prevent misuse, frustrating some researchers who seek transparency for red-teaming and interpretability. https://arstechnica.com/information-technology/2024/09/openai-threatens-bans-for-probing-new-ai-models-reasoning-process/

Miquel Ramírez

Research Fellow at The University of Melbourne

1d

> (1) What are the GPTGo moves? (For AlphaGo, we had Go moves.) What would be the right moves when the task is just "expand the prompt"? This is, for me, the key question, Subbarao Kambhampati. Over a decade ago, I remember discussing the potential use of an MCTS planner with Gerard Casamayor, PhD, for natural language synthesis tasks. We did not make much progress back then because all we could see was the "branching factor." Nearly 15 years on, and being much more knowledgeable about numerical optimization than I was then, my guess is that they define several search directions based on the gradients of suitably defined "potential functions." These functions are all defined on the real n-space where word embeddings reside. These approximate value functions propose, at each stage of the rollout, which regions of that feature space may be promising to sample from. The truly heuristic part of the OpenAI method, if it does resemble anything like the above, is how to choose the right "direction" from which to sample new words and determine how many you need, besides any other "modulo" procedures they have to ensure the chatbot stays on the rails.
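[A toy rendering of that guess, with a made-up potential function over a made-up embedding space--nothing here reflects what OpenAI actually does: candidate continuations are ranked by how well their displacement in embedding space aligns with the gradient-ascent direction of the potential.]

```python
# Toy sketch: rank candidate continuations (as embedding-space points) by alignment
# with the gradient of a "potential function". Potential and embeddings are invented.
import numpy as np

rng = np.random.default_rng(0)
dim = 16

def potential(x):
    """Hypothetical value/potential function over the embedding space."""
    target = np.ones(dim)
    return -np.linalg.norm(x - target) ** 2

def numerical_gradient(f, x, eps=1e-4):
    """Central-difference gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

def propose_next(current_embedding, candidate_embeddings):
    """Prefer the candidate whose displacement best aligns with the ascent direction."""
    direction = numerical_gradient(potential, current_embedding)
    scores = [(c - current_embedding) @ direction for c in candidate_embeddings]
    return candidate_embeddings[int(np.argmax(scores))]

state = rng.normal(size=dim)
candidates = [rng.normal(size=dim) for _ in range(8)]
state = propose_next(state, candidates)  # pick the most "promising" region to sample from
```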

Pete Dietert

Software, Systems, Simulations and Society (My Opinions Merely Mine)

6d

How close is it to what this paper talked about? Kojima in early 2022 talked about better performance in reasoning tasks by adding to the prompt. Sutskever and others then elaborated. Isn't this all just a more fleshed out version of the same thing, breaking up the input string a bit further, assessing the components, sending it to a bit of a multi-agent architecture for processing, then reassembling a result? https://cdn.openai.com/improving-mathematical-reasoning-with-process-supervision/Lets_Verify_Step_by_Step.pdf
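[For context, the Kojima et al. (2022) result mentioned here is zero-shot chain-of-thought, which really is just an addition to the prompt: a trigger phrase elicits the reasoning, and a second call extracts the answer. A minimal sketch, with call_llm as a placeholder for whatever model API is in use:]

```python
# Zero-shot CoT (Kojima et al., 2022): append a trigger phrase, then extract the answer.
# call_llm is a hypothetical placeholder, not a specific vendor API.

def call_llm(prompt):
    return "<model completion>"

def zero_shot_cot(question):
    # Stage 1: elicit the reasoning by appending the trigger phrase.
    reasoning = call_llm(f"Q: {question}\nA: Let's think step by step.")
    # Stage 2: extract the final answer, conditioned on the generated reasoning.
    answer = call_llm(f"Q: {question}\nA: Let's think step by step. {reasoning}\n"
                      "Therefore, the answer is")
    return answer
```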


But can it solve mystery blockworld?
