GITM: A Meeting of LLMs and RL

Published 2023-06-07 · Updated 2023-11-10 · Category: Note

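Recently, Tsinghua University and SenseTime published a paper titled "Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory", or GITM for short. It is an interesting read, and I recommend the original paper to anyone curious.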

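In the paper, the authors use large language models to guide the agent's behavior and train the agent with language as the medium of interaction. With two days of training on a single 32-core CPU, the approach achieves very strong results in the Minecraft environment, and during training it unlocks the game's entire technology tree.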

Figure: GITM compared with other models on the diamond-mining challenge.

Figure: GITM compared with other models on technology-tree unlock rate.

Background

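First, some background on the problem GITM tackles. The task is to start from a freshly initialized Minecraft world and keep learning until the agent obtains a diamond. In Minecraft, diamonds are usually buried at a certain depth underground, and mining diamond ore requires at least an iron pickaxe. An iron pickaxe is made from iron, so the agent must first mine iron ore and learn to smelt it. Mining iron ore requires a stone pickaxe, which is crafted from stone; mining stone requires a wooden pickaxe; and crafting a wooden pickaxe requires wood and a crafting table.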

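And that is only the technology tree for mining tools. In Minecraft, the agent must also keep eating to restore health and stave off hunger, and craft weapons and armor to protect itself. All of this places heavy demands on the agent's abilities across the board.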

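The idea behind reinforcement learning is that the agent observes the environment, produces an action based on the observation, and executes it; the environment then changes accordingly and returns a reward signal to the agent. In Minecraft, however, the enormous freedom of action and the deep technology tree have kept the performance of reinforcement learning unsatisfactory.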
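The observe, act, reward loop described above can be sketched in a few lines. This is a toy illustration only: `ToyEnv`, its sparse reward, and the policies are assumptions made up for this example and are unrelated to the Minecraft setup used in the paper.

```python
import random


class ToyEnv:
    """Hypothetical 1-D world: the agent starts at 0 and is rewarded only on reaching `goal`."""

    def __init__(self, goal=3):
        self.goal = goal
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos  # the observation is simply the position

    def step(self, action):
        # action is -1 (left) or +1 (right)
        self.pos += action
        done = self.pos == self.goal
        reward = 1.0 if done else 0.0  # sparse reward: nothing until the goal is reached
        return self.pos, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return (total_reward, steps_taken)."""
    obs = env.reset()
    total = 0.0
    for t in range(max_steps):
        action = policy(obs)                   # agent picks an action from its observation
        obs, reward, done = env.step(action)   # environment changes and returns a reward
        total += reward
        if done:
            return total, t + 1
    return total, max_steps


def make_random_policy(seed=0):
    """A policy that ignores the observation and moves at random."""
    rng = random.Random(seed)
    return lambda obs: rng.choice([-1, 1])


if __name__ == "__main__":
    print(run_episode(ToyEnv(goal=3), lambda obs: 1))
    print(run_episode(ToyEnv(goal=50), make_random_policy(), max_steps=20))
```

With a reward this sparse, a policy that has not yet learned anything gets no signal until it happens to reach the goal, which gives some intuition for why a deep technology tree is hard for reward-driven exploration.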

GITM

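To tackle this problem, GITM does not follow the reinforcement learning approach of letting rewards guide actions. Instead, it uses large language models for task decomposition and planning, and finally interacts with the environment through an interface. Concretely, GITM consists of three components: the LLM Decomposer, the LLM Planner, and the LLM Interface. Each is introduced below.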

LLM Decomposer

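The LLM Decomposer breaks a goal down into sub-goals. First, a goal must be defined. In Minecraft, every goal can be written as the following five-tuple:

$$(\mathit{Object}, \mathit{Count}, \mathit{Material}, \mathit{Tool}, \mathit{Info}) \tag{1}$$

Here $\mathit{Object}$ is the target item, $\mathit{Count}$ is the required quantity, $\mathit{Material}$ and $\mathit{Tool}$ state the prerequisites for obtaining the item, and $\mathit{Info}$ holds text knowledge related to the target item. After receiving a goal, the LLM Decomposer decomposes it according to its prerequisites, that is, it generates sub-goals whose $\mathit{Object}$ is each entry in $\mathit{Material}$ and $\mathit{Tool}$. The decomposition can proceed recursively until the resulting sub-goals have no prerequisites.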
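The recursive decomposition can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Goal` class mirrors the five-tuple, while the hand-written `RECIPES` table and the `decompose` helper are hypothetical stand-ins for what the LLM produces from text knowledge. Quantities are ignored and the recipe graph is assumed to be acyclic.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Goal:
    object: str
    count: int
    material: Optional[Dict[str, int]]  # None when no material is required
    tool: Optional[str]                 # None when no tool is required
    info: str = ""


# Hypothetical knowledge table standing in for the LLM plus wiki text.
RECIPES = {
    "log": Goal("log", 1, None, None),
    "planks": Goal("planks", 4, {"log": 1}, None),
    "stick": Goal("stick", 4, {"planks": 2}, None),
    "crafting_table": Goal("crafting_table", 1, {"planks": 4}, None),
    "wooden_pickaxe": Goal("wooden_pickaxe", 1, {"planks": 3, "stick": 2}, "crafting_table"),
}


def decompose(name: str, recipes: Dict[str, Goal], ordered: Optional[List[str]] = None) -> List[str]:
    """Return goal names so that every prerequisite appears before the goal that needs it."""
    if ordered is None:
        ordered = []
    if name in ordered:
        return ordered  # already scheduled as a sub-goal
    goal = recipes[name]
    prerequisites = list(goal.material or {})
    if goal.tool is not None:
        prerequisites.append(goal.tool)
    for prerequisite in prerequisites:
        decompose(prerequisite, recipes, ordered)  # recurse until nothing is required
    ordered.append(name)
    return ordered


if __name__ == "__main__":
    print(decompose("wooden_pickaxe", RECIPES))
```

The recursion bottoms out at `log`, the only entry with neither material nor tool, which matches the stopping condition described above.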

This step can be implemented with a large language model, and the paper provides the exact prompt. If you are curious, try it with ChatGPT; it works quite well:

SYSTEM: 
You are an assistant for the game Minecraft. 
I will give you some target object and some knowledge related to the object. 
Please write the obtaining of the object as a goal in the standard form. 
The standard form of the goal is as follows: 
{ 
    "object": "the name of the target object", 
    "count": "the target quantity", 
    "material": "the materials required for this goal, a dictionary in the form {material_name: material_quantity}. If no material is required, set it to None", 
    "tool": "the tool used for this goal. If multiple tools can be used for this goal, only write the most basic one. If no tool is required, set it to None", 
    "info": "the knowledge related to this goal" 
} 

The information I will give you: 
Target object: the name and the quantity of the target object 
Knowledge: some knowledge related to the object. 

Requirements: 
1. You must generate the goal based on the provided knowledge instead of purely depending on your own knowledge. 
2. The "info" should be as compact as possible, at most 3 sentences. The knowledge I give you may be raw texts from Wiki documents. Please extract and summarize important information instead of directly copying all the texts. 

Goal Example: 
{
    "object": "iron_ore",
    "count": 1,
    "material": None,
    "tool": "stone_pickaxe",
    "info": "iron ore is obtained by mining iron ore. iron ore is most found in level 53. iron ore can only be mined with a stone pickaxe or better; using a wooden or gold pickaxe will yield nothing." 
} 
{ 
    "object": "wooden_pickaxe",
    "count": 1,
    "material": {"planks": 3, "stick": 2},
    "tool": "crafting_table",
    "info": "wooden pickaxe can be crafted with 3 planks and 2 stick as the material and crafting table as the tool." 
} 

USER: 
Target object: {object quantity} {object name} 
Knowledge: {related knowledge}
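One practical detail: the goal examples in this prompt use `None`, which is a Python literal rather than valid JSON, so a reply that imitates them cannot be parsed by `json.loads`. The sketch below fills the USER template and parses a reply with a fallback to `ast.literal_eval`. The helper names and the role/content message layout are assumptions for illustration; no particular client library is implied, and the paper does not describe its parsing code.

```python
import ast
import json

# Placeholder names are simplified from the template quoted above.
USER_TEMPLATE = "Target object: {quantity} {name}\nKnowledge: {knowledge}"
REQUIRED_KEYS = {"object", "count", "material", "tool", "info"}


def build_messages(system_prompt, name, quantity, knowledge):
    """Assemble a chat-style message list (a common convention, assumed here)."""
    user = USER_TEMPLATE.format(quantity=quantity, name=name, knowledge=knowledge)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user},
    ]


def parse_goal(reply):
    """Extract the outermost {...} from a reply and parse it into a dict."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no goal dictionary found in the reply")
    text = reply[start:end + 1]
    try:
        goal = json.loads(text)  # works when the model emits strict JSON (null)
    except json.JSONDecodeError:
        try:
            goal = ast.literal_eval(text)  # works for Python-style literals (None)
        except (ValueError, SyntaxError) as exc:
            raise ValueError("reply is neither JSON nor a Python literal") from exc
    if not isinstance(goal, dict):
        raise ValueError("parsed goal is not a dictionary")
    missing = REQUIRED_KEYS - set(goal)
    if missing:
        raise ValueError(f"goal is missing keys: {sorted(missing)}")
    return goal


if __name__ == "__main__":
    reply = '{"object": "iron_ore", "count": 1, "material": None, "tool": "stone_pickaxe", "info": "..."}'
    print(parse_goal(reply)["tool"])
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text that comes back from a model.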

LLM Planner

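Given a goal, the LLM Planner breaks it down into a sequence of structured actions. Structured actions are well-defined, relatively basic Minecraft operations that are easy to implement with scripts. Each action consists of three parts:

$$(\mathit{Name}, \mathit{Arguments}, \mathit{Description}) \tag{2}$$

The structured actions are listed in the planner prompt below: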

SYSTEM: 
You serve as an assistant that helps me play the game Minecraft. 
I will give you a goal in the game. Please think of a plan to achieve the goal, and then write a sequence of actions to realize the plan. The requirements and instructions are as follows: 

1. You can only use the following functions. Don’t make plans purely based on your experience, think about how to use these functions. 

explore(object, strategy) 
Move around to find the object with the strategy: used to find objects including block items and entities. This action is finished once the object is visible (maybe at the distance). 
Arguments: 
- object: a string, the object to explore. 
- strategy: a string, the strategy for exploration. 

approach(object) 
Move close to a visible object: used to approach the object you want to attack or mine. It may fail if the target object is not accessible. 
Arguments:
- object: a string, the object to approach. 

craft(object, materials, tool) 
Craft the object with the materials and tool: used for crafting new object that is not in the inventory or is not enough. The required materials must be in the inventory and will be consumed, and the newly crafted objects will be added to the inventory. The tools like the crafting table and furnace should be in the inventory and this action will directly use them. Don’t try to place or approach the crafting table or furnace, you will get failed since this action does not support using tools placed on the ground. You don’t need to collect the items after crafting. If the quantity you require is more than a unit, this action will craft the objects one unit by one unit. If the materials run out halfway through, this action will stop, and you will only get part of the objects you want that have been crafted. 
Arguments: 
- object: a dict, whose key is the name of the object and value is the object quantity. 
- materials: a dict, whose keys are the names of the materials and values are the quantities. 
- tool: a string, the tool used for crafting. Set to null if no tool is required. 

mine(object, tool) 
Mine the object with the tool: can only mine the object within reach, cannot mine object from a distance. If there are enough objects within reach, this action will mine as many as you specify. The obtained objects will be added to the inventory. 
Arguments: 
- object: a string, the object to mine. 
- tool: a string, the tool used for mining. Set to null if no tool is required.

attack(object, tool) 
Attack the object with the tool: used to attack the object within reach. This action will keep track of and attack the object until it is killed. 
Arguments: 
- object: a string, the object to attack. 
- tool: a string, the tool used for attacking. Set to null if no tool is required.

equip(object) 
Equip the object from the inventory: used to equip equipment, including tools, weapons, and armor. The object must be in the inventory and belong to the items for equipping. 
Arguments: 
- object: a string, the object to equip. 

digdown(object, tool) 
Dig down to the y-level with the tool: the only action you can take if you want to go underground for mining some ore. 
Arguments: 
- object: an int, the y-level (absolute y coordinate) to dig to. 
- tool: a string, the tool used for digging. Set to null if no tool is required.

go_back_to_ground(tool) 
Go back to the ground from underground: the only action you can take for going back to the ground if you are underground. 
Arguments: 
- tool: a string, the tool used for digging. Set to null if no tool is required. 

apply(object, tool) 
Apply the tool on the object: used for fetching water, milk, lava with the tool bucket, pouring water or lava to the object with the tool water bucket or lava bucket, shearing sheep with the tool shears, blocking attacks with the tool shield. 
Arguments: 
- object: a string, the object to apply to. 
- tool: a string, the tool used to apply. 

2. You cannot define any new function. Note that the "Generated structures" world creation option is turned off. 

3. There is an inventory that stores all the objects I have. It is not an entity, but objects can be added to it or retrieved from it anytime at anywhere without specific actions. The mined or crafted objects will be added to this inventory, and the materials and tools to use are also from this inventory. Objects in the inventory can be directly used. Don’t write the code to obtain them. If you plan to use some object not in the inventory, you should first plan to obtain it. You can view the inventory as one of my states, and it is written in form of a dictionary whose keys are the name of the objects I have and the values are their quantities. 

4. You will get the following information about my current state:
- inventory: a dict representing the inventory mentioned above, whose keys are the name of the objects and the values are their quantities 
- environment: a string including my surrounding biome, the y-level of my current location, and whether I am on the ground or underground 
Pay attention to this information. Choose the easiest way to achieve the goal conditioned on my current state. Do not provide options, always make the final decision. 

5. You must describe your thoughts on the plan in natural language at the beginning. After that, you should write all the actions together. The response should follow the format: 
{ 
    "explanation": "explain why the last action failed, set to null for the first planning", 
    "thoughts": "Your thoughts on the plan in natural language", 
    "action_list": [ 
        {"name": "action name", "args": {"arg name": value}, "expectation": "describe the expected results of this action"}, 
        {"name": "action name", "args": {"arg name": value}, "expectation": "describe the expected results of this action"}, 
        {"name": "action name", "args": {"arg name": value}, "expectation": "describe the expected results of this action"} 
    ] 
} 
The action_list can contain an arbitrary number of actions. The args of each action should correspond to the type mentioned in the Arguments part. Remember to add “‘dict“‘ at the beginning and the end of the dict. Ensure that your response can be parsed by Python json.loads. 

6. I will execute your code step by step and give you feedback. If some action fails, I will stop at that action and will not execute its following actions. The feedback will include error messages about the failed action. At that time, you should replan and write the new code just starting from that failed action.
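Point 6 of the prompt describes a feedback loop: actions run one at a time, execution stops at the first failure, and the error message goes back to the planner for replanning. A minimal sketch of that loop is below. The `stub_planner` and `make_stub_executor` functions are hypothetical stand-ins for the LLM and the game scripts, and `plan_until_done` is an assumed structure for illustration, not code from the paper.

```python
import json


def execute_plan(action_list, execute):
    """Run actions in order; return (index, error) of the first failure, or (None, None)."""
    for index, action in enumerate(action_list):
        ok, message = execute(action["name"], action["args"])
        if not ok:
            return index, message  # later actions are not executed
    return None, None


def plan_until_done(planner, execute, goal, max_rounds=3):
    """Ask for a plan, execute it, and feed any failure back to the planner."""
    feedback = None
    for _ in range(max_rounds):
        reply = json.loads(planner(goal, feedback))
        failed_at, error = execute_plan(reply["action_list"], execute)
        if failed_at is None:
            return True
        feedback = {"failed_action": reply["action_list"][failed_at], "error": error}
    return False


# --- Hypothetical stand-ins for the LLM and the environment scripts ---

def make_stub_executor(inventory):
    def execute(name, args):
        if name == "craft":
            for item, qty in args["materials"].items():
                if inventory.get(item, 0) < qty:
                    return False, f"not enough {item}"
            for item, qty in args["materials"].items():
                inventory[item] -= qty  # materials are consumed
            for item, qty in args["object"].items():
                inventory[item] = inventory.get(item, 0) + qty
            return True, ""
        if name == "mine":
            tool = args["tool"]
            if tool is not None and inventory.get(tool, 0) < 1:
                return False, f"{tool} is not in the inventory"
            inventory[args["object"]] = inventory.get(args["object"], 0) + 1
            return True, ""
        return False, f"unknown action {name}"
    return execute


def stub_planner(goal, feedback):
    mine = {"name": "mine", "args": {"object": "stone", "tool": "wooden_pickaxe"},
            "expectation": "stone is added to the inventory"}
    craft = {"name": "craft",
             "args": {"object": {"wooden_pickaxe": 1},
                      "materials": {"planks": 3, "stick": 2},
                      "tool": "crafting_table"},
             "expectation": "a wooden pickaxe is added to the inventory"}
    # First round: a naive plan. After a failure: craft the missing tool first.
    actions = [mine] if feedback is None else [craft, mine]
    explanation = None if feedback is None else feedback["error"]
    return json.dumps({"explanation": explanation, "thoughts": "stub", "action_list": actions})


if __name__ == "__main__":
    inventory = {"planks": 3, "stick": 2, "crafting_table": 1}
    print(plan_until_done(stub_planner, make_stub_executor(inventory), "obtain stone"), inventory)
```

In the demo, the first plan fails because no pickaxe is in the inventory; the error message is returned to the planner, and the second plan crafts the tool before mining.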

LLM Interface

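The main job of the LLM Interface is to turn the text-based actions issued by the Planner into actions that can actually interact with the environment. There are two main ways to implement this: hand-written scripts and models learned with RL. Because structured actions in Minecraft are all well defined, the authors chose hand-written scripts for control.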

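At this level, a large language model is again used for action decomposition. First, given a goal (for example, "find material and craft a iron pickaxe"), the language model breaks it down into a tree-structured plan with two levels.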

SYSTEM: 
You serve as an assistant that helps me play Minecraft. 
I will give you my goal in the game, please break it down as a tree-structure plan to achieve this goal. 
The requirements of the tree-structure plan are: 
1. The plan tree should be exactly of depth 2. 
2. Describe each step in one line. 
3. You should index the two levels like ’1.’, ’1.1.’, ’1.2.’, ’2.’, ’2.1.’, etc. 
4. The sub-goals at the bottom level should be basic actions so that I can easily execute them in the game. 

USER: 
The goal is to {goal description}. Generate the plan according to the requirements.
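A reply that follows the requested indexing ('1.', '1.1.', '1.2.', '2.', ...) is easy to turn into a data structure. The sketch below is an assumed post-processing step, not something the paper specifies; `parse_plan` and `leaves` are hypothetical helper names, and lines that do not match the indexing scheme are silently skipped.

```python
import re

# Matches "1. text" (top level) or "1.2. text" (second level), with optional indentation.
LINE_RE = re.compile(r"^\s*(\d+)\.(?:(\d+)\.)?\s*(.+?)\s*$")


def parse_plan(text):
    """Parse an indexed two-level plan into [(step, [sub_step, ...]), ...]."""
    plan = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip blank lines and free-form commentary
        _top, sub, description = match.groups()
        if sub is None:
            plan.append((description, []))      # a new top-level step
        elif plan:
            plan[-1][1].append(description)     # a sub-step of the latest top-level step
    return plan


def leaves(plan):
    """Return the bottom-level steps in order; these are the basic actions."""
    return [sub for _step, subs in plan for sub in subs]


if __name__ == "__main__":
    reply = """1. Gather materials
    1.1. Mine 3 iron ore
    1.2. Collect 2 sticks
2. Craft the pickaxe
    2.1. Smelt iron ore into ingots
    2.2. Craft iron pickaxe"""
    for step, subs in parse_plan(reply):
        print(step, subs)
```

The parser relies on line order rather than on the index numbers themselves, so a sub-step is attached to whichever top-level step came most recently.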

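After that, each leaf node of the tree-structured plan is converted by the language model into a specific action description of the form ('verb', 'object', 'tools', 'materials'):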

SYSTEM: 
You serve as an assistant that helps me play Minecraft. 
I will give you a sentence. 
Please convert this sentence into one or several actions according to the following instructions. 
Each action should be a tuple of four items, written in the form (’verb’, ’object’, ’tools’, ’materials’) 
’verb’ is the verb of this action. 
’object’ refers to the target object of the action. 
’tools’ specifies the tools required for the action. 
’materials’ specifies the materials required for the action. 
If some of the items are not required, set them to be ’None’. 

USER: 
The sentence is {sentence}. Generate the action tuple according to the requirements.
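The four-item tuples requested by this prompt can be recovered from a reply with a small parser. This is a sketch under assumptions: it expects the model to write tuples with ordinary straight quotes, maps the string 'None' to Python's `None`, and ignores anything that does not evaluate to a four-item tuple. `Action` and `parse_actions` are hypothetical names, not part of the paper.

```python
import ast
import re
from typing import List, NamedTuple, Optional


class Action(NamedTuple):
    verb: str
    object: Optional[str]
    tools: Optional[str]
    materials: Optional[str]


# A parenthesized group with no nested parentheses.
TUPLE_RE = re.compile(r"\([^()]*\)")


def parse_actions(reply: str) -> List[Action]:
    """Collect every ('verb', 'object', 'tools', 'materials') tuple found in the reply."""
    actions = []
    for chunk in TUPLE_RE.findall(reply):
        try:
            items = ast.literal_eval(chunk)
        except (ValueError, SyntaxError):
            continue  # ordinary parenthesized prose, not a tuple literal
        if not (isinstance(items, tuple) and len(items) == 4 and isinstance(items[0], str)):
            continue
        # The prompt asks for the string 'None' when an item is not required.
        cleaned = [None if item in (None, "None") else item for item in items]
        actions.append(Action(*cleaned))
    return actions


if __name__ == "__main__":
    reply = (
        "Here are the actions (in order):\n"
        "('mine', 'iron_ore', 'stone_pickaxe', 'None')\n"
        "('craft', 'iron_pickaxe', 'crafting_table', 'iron_ingot, stick')"
    )
    for action in parse_actions(reply):
        print(action)
```

Normalizing 'None' at this point means a downstream script can test `action.tools is None` instead of comparing strings.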

A Discussion on Intelligence

In the Minecraft environment, the method proposed in the paper achieves very strong results.

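Compared with current state-of-the-art reinforcement learning algorithms, GITM arrives in the world of Minecraft with an almost crushing advantage. Its whole pipeline is so sensible that even such a dramatic performance gap does not raise much suspicion. GITM's results make one point: for tasks that humans have already worked out, GPT has the potential to reach a very high level of competence by "looking up the walkthrough". And a great deal of human knowledge and technology is recorded as text, so the potential of this path is enormous.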

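Even so, I still think we are far from true intelligence. GPT is simply too large: relative to its abilities, the training data and compute it requires are enormous. Admittedly, ChatGPT is a big step forward on the road to AGI, but compared with true intelligence it is more like an unusual kind of search engine or strategy guide, and to this day it cannot guarantee reliability across a broad range of tasks. Still, it does give us a glimpse of what artificial general intelligence might look like. Finally, I hope that we will see true intelligence emerge within our lifetimes.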