Why I Don't Often Use Large Language Models

This post was automatically translated from Chinese by an LLM. If you find any translation errors, please leave a comment to help me improve the translation. Thanks!

Since ChatGPT demonstrated its astonishing text generation capabilities, large models have sprung up like mushrooms after rain, and developers have built many features on top of them to assist work and daily life. At its core, however, a large language model is just a text probability prediction model, not true general artificial intelligence. Although it shows strong intelligence in many areas and greatly improves efficiency on some tasks, blindly trusting its output can also have side effects on work and life. There is already plenty of research on the hallucination and factuality of large language model responses; this article does not take a research perspective, but only discusses what to consider when using large models in daily life.

When Do I Use Large Language Models

First, let's talk about the most common scenarios where I currently use large language models, along with the corresponding tools (ordered from most to least frequent):

  1. Chinese-English translation (Pot)
  2. Code completion, comments, and test case generation (VSCode)
  3. Summarizing papers (Zotero)
  4. Writing boring materials (ChatBox)

When Do I Not Use Large Language Models

Next, let's discuss when I choose not to use large language models. "Not using" here can be understood from two perspectives:

  1. Things in my work and life that I do not handle with large models
  2. Things that others use large models for but I choose not to

This can be formalized as:

Let the set of all things in my daily work and life be \(U_{me}\), and the subset of those I handle with large models be \(U_{me,LLM}\); similarly, let the set of all things in other people's daily work and life be \(U_{other}\), and the subset they handle with large models be \(U_{other,LLM}\). The two perspectives above can then be written as set differences:

  1. \(U_{me}-U_{me,LLM}\)
  2. \(U_{other,LLM}-U_{me,LLM}\)
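For readers who prefer code to set notation, the same two differences can be written as Python set operations. This is just an illustrative sketch; the task names are placeholders drawn from the examples in this post:

```python
# Toy illustration of the two set differences; the task names are made up for the example.
U_me = {"translation", "code completion", "summarizing papers", "running experiments"}
U_me_LLM = {"translation", "code completion", "summarizing papers"}
U_other_LLM = {"translation", "asking what squeeze does", "writing task-specific code"}

# Perspective 1: things I handle without a large model
print(U_me - U_me_LLM)         # {'running experiments'}

# Perspective 2: things others delegate to a large model but I do not
print(U_other_LLM - U_me_LLM)  # {'asking what squeeze does', 'writing task-specific code'}
```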

Here I mainly discuss the second perspective: things that others use large models for but I choose not to. Based on my observations in the lab, these mainly include:

  1. Asking what a function in a deep learning package does (such as what squeeze does in PyTorch)
  2. Having the large model write code for a specific task (such as code for a robot control command)
  3. Asking the large model to explain what a piece of code does
  4. Asking the large model to answer specific scientific questions (such as how to improve the generalization of reinforcement learning models)

What Are the Boundaries for Choosing to Use Large Models?

Comparing the tasks where I do use large models with those where I don't, the tasks I do use them for typically share the following attributes:

  • Easy to verify: After the large model produces a result, I can check its correctness at low cost. For example, the translation tool Pot cross-checks results across multiple translation APIs (Cambridge Dictionary, Bing Dictionary, Google Translate, OpenAI, SmartAI).
  • Well-defined: Whether it is translation or code completion, the output is largely determined by the input and does not require much creativity.
  • Not important: This applies specifically to writing boring materials; if you know, you know.

When I use large language models for tasks with the above attributes, they boost my productivity while the cost of verifying the correctness or factuality of the generated results stays close to zero. Given that the correctness of large model output cannot currently be guaranteed, I believe the practical approach is to find the parts of one's workflow that have these attributes and let the large model speed them up.

Conversely, the tasks I choose not to use large models for share some characteristics:

  • Factual: The question has a definite, factual answer, for example asking what a function does or other questions with fixed answers.
  • Ambiguous: The question itself admits multiple interpretations, which leads to inaccurate answers, for example asking what certain functions do without enough context.
  • Creative: The question has too large a solution space, so there are too many possible outputs and it takes repeated prompting to get a good answer, for example having the large model write code for a specific task.

At the current stage, large language models are essentially text probability predictors, so whether they can answer a factual question correctly depends on whether they saw relevant corpus during training. Moreover, after fine-tuning on certain datasets, a model's hallucination problem can become even more severe (Gekhman et al. 2024). For practical purposes, I therefore do not recommend using large models to answer factual questions.

Moreover, for tasks with the above characteristics, it is hard to verify the results, or the cost of verification is no lower than doing the task without a large model. For example, to confirm whether a function really behaves as the large model claims, you may need to consult the documentation or write a small test and check the output, and that verification effort is already enough to finish the task itself. If, on the other hand, you skip verification, the cost of an error can be high: suppose you take the model's answer about improving the generalization of reinforcement learning at face value and build your work on the approach it proposes; any error in that answer may lead to a great deal of wasted work.
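To make the "write a small test" point concrete with the earlier squeeze example: a few lines in a Python session settle what the function does faster than verifying a model's description of it. This is only a minimal sketch and assumes PyTorch is installed:

```python
# Minimal sketch: check torch.squeeze directly instead of trusting a description of it.
import torch

x = torch.zeros(1, 3, 1, 2)

print(x.squeeze().shape)   # torch.Size([3, 2])       -- every size-1 dim removed
print(x.squeeze(0).shape)  # torch.Size([3, 1, 2])    -- only dim 0 removed
print(x.squeeze(1).shape)  # torch.Size([1, 3, 1, 2]) -- dim 1 has size 3, left unchanged
```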

Finally, from the perspective of personal learning and growth, I strongly recommend against using large models for tasks like these, because doing so brings the following harms:

  1. Lack of growth: Tackling these tasks without a large model is an effective way to build personal ability, for example the ability to find and read documentation, read code, and find and read the literature. These skills compound, so the same tasks get faster and faster to finish over time. Constantly delegating such tasks to a large model does nothing for this growth in ability or for the resulting speed-up.
  2. Blind belief: Although large models make factual errors, their language is organized so well that it can make people believe the "lies" they tell. In a field one is familiar with, it is easy to spot the errors in a large model's answers. But if one keeps relying on large models to answer questions in an unfamiliar field, the convenience and the single source of information will gradually erode the ability to pick the correct results out of the many hits a search engine returns, until one can only blindly believe whatever the large model says.

Conclusion

In conclusion, I believe it is important to hold a clear-eyed view of large models. At the current stage, a large language model is just a tool for text generation, not a sage who can answer every question. We should treat large models as tools, abandoning both the deification of and blind faith in them: let large models work for us, rather than become their slaves.