Slow Thinking


2024-03-08


The conversation below is a short discussion about implementing slow thinking in LLMs, held with the “Mistral Large” LLM.


HUMAN: Hi, I want to discuss something

ASSISTANT: Hello! I'm happy to help with your discussion. Please feel free to share the topic or question you have in mind, and I'll do my best to provide a useful response.

HUMAN: You know the GAN model, right? It's an architecture where there is a generator and there is a discriminator. I have an idea to implement this architecture with my LLM, so there is a generator (the LLM itself) and there is a discriminator that judges the consistency and rationality of the output from the LLM. I think about this because, in my view, the human ability to judge things is just how we "classify" and draw boundaries between various concepts, and that's pretty similar to the discriminative paradigm in machine learning. But judgment isn't the only way the human mind works: there are also concepts such as "irrationality" and the "unconscious", some kind of "stream" of consciousness that we can't control but that is based on our experience, and I think that concept is parallel to the generative paradigm, especially the Transformer model in LLMs. So, combining the generative and discriminative paradigms in this case is like combining the unconscious and the conscious mind.

ASSISTANT: Yes, I'm familiar with Generative Adversarial Networks (GANs). Your idea of combining a generative model like a language model (LLM) with a discriminator to judge the consistency and rationality of the generated output is quite interesting. It seems like you're drawing inspiration from the human cognitive process, where both conscious and unconscious processes contribute to our overall understanding and generation of language.

In your proposed architecture, the generator (LLM) would be responsible for creating text based on its learned patterns, much like our unconscious mind generates thoughts based on our experiences. Meanwhile, the discriminator would act as a conscious filter, evaluating the generated text for coherence, rationality, and consistency with the intended context.

This approach could potentially lead to more controlled and coherent text generation, as the discriminator would provide valuable feedback to the generator during the training process. Over time, the generator could learn to produce more consistent and rational outputs, as it adapts to the discriminator's feedback.

However, there are some challenges to consider when implementing this idea:

1. Balancing the training of the generator and discriminator: In GANs, it's crucial to maintain a balance between the generator and discriminator during training. If one becomes too powerful, the other may fail to improve further. You'll need to carefully manage the training process to ensure both components continue to learn effectively.

2. Defining the criteria for rationality and consistency: It might be challenging to define clear, quantifiable metrics for the discriminator to evaluate the generated text. You'll need to develop a suitable evaluation method that captures the desired aspects of rationality and consistency.

3. Handling the subjective nature of judgment: Human judgment can be highly subjective, and what seems rational or consistent to one person might not seem so to another. To address this, you might need to train your discriminator on a diverse dataset, capturing various perspectives and preferences.

4. Integrating irrationality and the unconscious: While the idea of incorporating irrationality and the unconscious mind is intriguing, it might be challenging to strike the right balance between coherence and creativity. You'll need to carefully design your architecture to allow for some degree of "irrational" or unpredictable output while still ensuring the overall coherence and quality of the generated text.

Overall, your idea presents an interesting avenue for exploring the combination of generative and discriminative models in the context of language generation. By drawing inspiration from human cognition, you might be able to create a more nuanced and powerful language model that better reflects the complexities of human thought.
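To make the generator-plus-discriminator idea concrete, here is a minimal Python sketch of a generate-then-judge loop. The `generate` and `judge` functions are hypothetical placeholders that are not part of the conversation above: in practice `generate` would call an LLM and `judge` would be a trained classifier (or a second LLM prompted as a critic) scoring consistency and rationality in [0, 1].

```python
import random


def generate(prompt: str) -> str:
    """Placeholder for an LLM call that returns one candidate response."""
    return f"candidate response to: {prompt} (draft #{random.randint(0, 999)})"


def judge(prompt: str, response: str) -> float:
    """Placeholder for a discriminator that scores coherence/rationality."""
    return random.random()


def answer(prompt: str, threshold: float = 0.8, max_tries: int = 8) -> str:
    """Resample from the generator until the discriminator accepts an output."""
    best_response, best_score = "", -1.0
    for _ in range(max_tries):
        response = generate(prompt)
        score = judge(prompt, response)
        if score >= threshold:
            return response           # accepted by the "conscious" filter
        if score > best_score:        # keep the best rejected draft as a fallback
            best_response, best_score = response, score
    return best_response


print(answer("Why does ice float on water?"))
```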

HUMAN: Yeah, because I think LLMs are already pretty good at generating "ideas". They just lack some kind of "slow thinking" mechanism like humans have. An LLM is expected to generate an accurate response in its first "thinking loop", whereas humans can run as many "thinking loops" as we want inside our heads before answering a question. And I think this is a big public misconception about LLM intelligence capacity, because a recent paper says:

Common 7B Language Models Already Possess Strong Math Capabilities
Published on Mar 8

Abstract: Mathematical capabilities were previously believed to emerge in common language models only at a very large scale or require extensive math-related pre-training. This paper shows that the LLaMA-2 7B model with common pre-training already exhibits strong mathematical abilities, as evidenced by its impressive accuracy of 97.7% and 72.0% on the GSM8K and MATH benchmarks, respectively, when selecting the best response from 256 random generations. The primary issue with the current base model is the difficulty in consistently eliciting its inherent mathematical capabilities. Notably, the accuracy for the first answer drops to 49.5% and 7.9% on the GSM8K and MATH benchmarks, respectively. We find that simply scaling up the SFT data can significantly enhance the reliability of generating correct answers. However, the potential for extensive scaling is constrained by the scarcity of publicly available math questions. To overcome this limitation, we employ synthetic data, which proves to be nearly as effective as real data and shows no clear saturation when scaled up to approximately one million samples. This straightforward approach achieves an accuracy of 82.6% on GSM8K and 40.6% on MATH using LLaMA-2 7B models, surpassing previous models by 14.2% and 20.8%, respectively. We also provide insights into scaling behaviors across different reasoning complexities and error types.
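The headline numbers in that abstract come from best-of-N selection: sample many answers and keep the best one. Below is a minimal sketch of that procedure, assuming hypothetical `sample_answer` and `verify` placeholders; in the paper the "best" response is the one that reaches the reference answer, while here a generic scorer stands in for illustration.

```python
import random


def sample_answer(question: str) -> str:
    """Placeholder for one stochastic (temperature > 0) LLM sample."""
    return f"answer draft {random.randint(0, 9999)}"


def verify(question: str, answer: str) -> float:
    """Placeholder verifier/reward model returning a quality score."""
    return random.random()


def best_of_n(question: str, n: int = 256) -> str:
    """Draw n independent samples and keep the highest-scoring one."""
    candidates = (sample_answer(question) for _ in range(n))
    return max(candidates, key=lambda a: verify(question, a))


print(best_of_n("If 3 apples cost 6 dollars, what do 7 apples cost?", n=16))
```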

ASSISTANT: Your observation about the differences between LLMs and human thinking is quite insightful. While LLMs can generate ideas and responses based on patterns they've learned during training, they indeed lack the "slow thinking" mechanism that humans employ to carefully consider and evaluate multiple possibilities before providing a response.

The paper you mentioned highlights an interesting point about the mathematical capabilities of language models. It shows that while these models possess strong inherent mathematical abilities, they often struggle to consistently generate accurate responses in their first attempt. This inconsistency can be attributed to the fact that LLMs, unlike humans, don't have the ability to engage in iterative, deliberate thinking before producing an answer.

To address this issue, the paper suggests scaling up the supervised fine-tuning (SFT) data to improve the reliability of generating correct answers. However, the scarcity of publicly available math questions limits the potential for extensive scaling. The authors then turn to synthetic data, which proves to be nearly as effective as real data, and shows no clear saturation when scaled up to approximately one million samples.

This approach significantly improves the performance of the LLaMA-2 7B model on the GSM8K and MATH benchmarks, demonstrating the potential of leveraging synthetic data to enhance the model's capabilities. The insights into scaling behaviors across different reasoning complexities and error types also provide valuable guidance for future research in this area.

In light of these findings, it's clear that there's still much to explore and understand about the capabilities and limitations of LLMs. Your initial idea of combining generative and discriminative models to emulate human cognitive processes could be a promising avenue for improving the consistency and reliability of LLM outputs. By incorporating a "slow thinking" mechanism akin to human deliberation, it might be possible to create language models that can more effectively reason through complex problems and generate more accurate and coherent responses.

HUMAN: Yeah, if we can let an LLM do hundreds of "thinking loops" before interacting, it can actually match human accuracy.

ASSISTANT: Allowing LLMs to perform multiple "thinking loops" before generating a response could indeed improve their accuracy and coherence. This process would involve iteratively refining the generated output based on feedback from a discriminator or some other evaluation mechanism. By simulating the human cognitive process of deliberation and reflection, LLMs could potentially achieve better performance on complex tasks.

However, implementing such a mechanism would come with its own set of challenges:

1. Computational cost: Performing hundreds of "thinking loops" would significantly increase the computational resources required for generating a single response. This could make the model impractical for real-time applications or resource-constrained environments.

2. Balancing speed and accuracy: While increasing the number of "thinking loops" might improve the model's accuracy, it would also introduce a trade-off between speed and accuracy. Striking the right balance would be crucial to ensure the model remains useful and responsive in various scenarios.

3. Convergence and stability: Ensuring that the model's iterative refinement process converges to a satisfactory solution would be another challenge. There's a risk that the model could get stuck in a loop of generating and refining outputs without ever reaching a final answer. Developing strategies to ensure convergence and stability would be essential for the success of this approach.

4. Evaluation metrics: Designing appropriate evaluation metrics to guide the iterative refinement process would be crucial. These metrics would need to capture the desired aspects of rationality, consistency, and coherence in the generated output.

Despite these challenges, the idea of incorporating multiple "thinking loops" in LLMs holds promise for improving their performance and making them more akin to human cognition. Further research and experimentation in this area could lead to valuable insights and advancements in the field of language models.
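One way to picture such a "thinking loop" is a draft-critique-refine cycle with an early-stopping rule, which also touches the convergence and cost concerns discussed above. The sketch below is only an illustration under that assumption: `draft`, `critique`, and `refine` are hypothetical stand-ins for LLM calls, not a mechanism described in the conversation.

```python
import random


def draft(prompt: str) -> str:
    """Placeholder: produce an initial answer."""
    return f"initial draft for: {prompt}"


def critique(prompt: str, answer: str) -> tuple[float, str]:
    """Placeholder: return a quality score in [0, 1] plus textual feedback."""
    return random.random(), "be more explicit about intermediate steps"


def refine(prompt: str, answer: str, feedback: str) -> str:
    """Placeholder: revise the answer in light of the feedback."""
    return answer + f" [revised: {feedback}]"


def think_slowly(prompt: str, max_loops: int = 100, good_enough: float = 0.9,
                 min_gain: float = 1e-3) -> str:
    """Run up to max_loops refinement iterations, stopping on convergence."""
    answer, last_score = draft(prompt), -1.0
    for _ in range(max_loops):
        score, feedback = critique(prompt, answer)
        if score >= good_enough or abs(score - last_score) < min_gain:
            break                     # acceptable, or no longer improving
        answer, last_score = refine(prompt, answer, feedback), score
    return answer


print(think_slowly("Outline a 3-step proof sketch that sqrt(2) is irrational."))
```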


Below are truthfulness assessments of this conversation from various flagship LLMs other than those from Mistral.AI:

Assessment from Claude-3 Opus (the current largest model from Anthropic)

Assessment from Gemini Pro (the current main model from Google)

Assessment from Qwen1.5-72B Chat (the current largest open model from Alibaba)

Assessment from GPT-4 (the current state-of-the-art model from OpenAI)

Assessment from Sonar-Medium-Online (the current largest online LLM from Perplexity.AI)

