使用 Llama 3.2-Vision 多模態(tài) LLM 和圖像“聊天”

作者：二旺 2024-12-16 07:00:00

本文專注于了解如何在類似聊天的模式下本地構(gòu)建 Llama 3.2-Vision，并在 Colab 筆記本上探索其多模態(tài)技能。

一、引言

將視覺能力與大型語言模型（LLMs）結(jié)合，正在通過多模態(tài) LLM（MLLM）徹底改變計(jì)算機(jī)視覺領(lǐng)域。這些模型結(jié)合了文本和視覺輸入，展示了在圖像理解和推理方面的卓越能力。雖然這些模型以前只能通過 API 訪問，但最近的開放源代碼選項(xiàng)現(xiàn)在允許本地執(zhí)行，使其在生產(chǎn)環(huán)境中更具吸引力。

在本教程中，我們將學(xué)習(xí)如何使用開源的 Llama 3.2-Vision 模型與圖像進(jìn)行對(duì)話，您將對(duì)其 OCR、圖像理解和推理能力感到驚嘆。所有代碼都方便地提供在一個(gè) Colab 筆記本中。

二、背景

Llama 是 “Large Language Model Meta AI” 的縮寫，是由 Meta 開發(fā)的一系列先進(jìn) LLM。其最新版本 Llama 3.2 引入了先進(jìn)的視覺能力。視覺變體有兩種尺寸：11B 和 90B 參數(shù)，能夠在邊緣設(shè)備上進(jìn)行推理。憑借高達(dá) 128k 的上下文窗口和對(duì)高達(dá) 1120x1120 像素的高分辨率圖像的支持，Llama 3.2 可以處理復(fù)雜的視覺和文本信息。

三、架構(gòu)

Llama 系列模型是僅解碼器的 Transformer。Llama 3.2-Vision 基于預(yù)訓(xùn)練的 Llama 3.1 純文本模型構(gòu)建。它采用了標(biāo)準(zhǔn)的密集自回歸 Transformer 架構(gòu)，與前代 Llama 和 Llama 2 沒有顯著偏離。

為了支持視覺任務(wù)，Llama 3.2 使用預(yù)訓(xùn)練的視覺編碼器（ViT-H/14）提取圖像表示向量，并通過視覺適配器將這些表示集成到凍結(jié)的語言模型中。適配器由一系列交叉注意力層組成，允許模型專注于與正在處理的文本相對(duì)應(yīng)的圖像部分 [1]。

適配器在文本-圖像對(duì)上進(jìn)行訓(xùn)練，以將圖像表示與語言表示對(duì)齊。在適配器訓(xùn)練期間，圖像編碼器的參數(shù)會(huì)更新，而語言模型的參數(shù)保持凍結(jié)，以保留現(xiàn)有的語言能力。

Llama 3.2-Vision 架構(gòu)。視覺模塊（綠色）集成到固定的語言模型（粉色）中

這種設(shè)計(jì)使 Llama 3.2 在多模態(tài)任務(wù)中表現(xiàn)出色，同時(shí)保持了強(qiáng)大的純文本性能。生成的模型在需要圖像和語言理解的任務(wù)中展示了令人印象深刻的能力，并允許用戶與其視覺輸入進(jìn)行交互式通信。在了解了 Llama 3.2 的架構(gòu)后，我們可以深入實(shí)際實(shí)現(xiàn)。但首先，我們需要做一些準(zhǔn)備工作。

四、準(zhǔn)備工作

在 Google Colab 上運(yùn)行 Llama 3.2 — Vision 11B 之前，我們需要進(jìn)行以下準(zhǔn)備工作：

(1) GPU 設(shè)置：

推薦使用至少 22GB VRAM 的高端 GPU 以實(shí)現(xiàn)高效推理 [2]。
對(duì)于 Google Colab 用戶：導(dǎo)航到“運(yùn)行時(shí)” > “更改運(yùn)行時(shí)類型” > 選擇“A100 GPU”。請(qǐng)注意，高端 GPU 可能不適用于免費(fèi) Colab 用戶。

(2) 模型權(quán)限：在此處申請(qǐng) Llama 3.2 模型的訪問權(quán)限。

(3) Hugging Face 設(shè)置：

如果您還沒有 Hugging Face 賬戶，請(qǐng)?jiān)诖颂巹?chuàng)建一個(gè)。
如果您還沒有訪問令牌，請(qǐng)從您的 Hugging Face 賬戶生成一個(gè)。
對(duì)于 Google Colab 用戶，在 Google Colab Secrets 中將 Hugging Face 令牌設(shè)置為名為“HF_TOKEN”的秘密環(huán)境變量。

(4) 安裝所需庫。

五、加載模型

在設(shè)置好環(huán)境和獲取必要權(quán)限后，我們將使用 Hugging Face Transformers 庫實(shí)例化模型及其關(guān)聯(lián)的處理器。處理器負(fù)責(zé)為模型準(zhǔn)備輸入并格式化其輸出。

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto")

processor = AutoProcessor.from_pretrained(model_id)

1.期望的聊天模板

聊天模板通過存儲(chǔ)“用戶”（我們）和“助手”（AI 模型）之間的對(duì)話歷史來保持上下文。對(duì)話歷史被結(jié)構(gòu)化為一個(gè)名為 messages 的列表，其中每個(gè)字典代表一個(gè)對(duì)話輪次，包括用戶和模型的響應(yīng)。用戶輪次可以包括圖像-文本或純文本輸入，{"type": "image"} 表示圖像輸入。例如，經(jīng)過幾次聊天迭代后，messages 列表可能如下所示：

messages = [
    {"role": "user",      "content": [{"type": "image"}, {"type": "text", "text": prompt1}]},
    {"role": "assistant", "content": [{"type": "text", "text": generated_texts1}]},
    {"role": "user",      "content": [{"type": "text", "text": prompt2}]},
    {"role": "assistant", "content": [{"type": "text", "text": generated_texts2}]},
    {"role": "user",      "content": [{"type": "text", "text": prompt3}]},
    {"role": "assistant", "content": [{"type": "text", "text": generated_texts3}]}
]

這個(gè) messages 列表稍后會(huì)傳遞給 apply_chat_template() 方法，以將對(duì)話轉(zhuǎn)換為模型期望格式的單個(gè)可標(biāo)記化字符串。

2.主函數(shù)

在本教程中，我提供了一個(gè) chat_with_mllm 函數(shù)，該函數(shù)支持與 Llama 3.2 MLLM 進(jìn)行動(dòng)態(tài)對(duì)話。此函數(shù)處理圖像加載、預(yù)處理圖像和文本輸入、生成模型響應(yīng)，并管理對(duì)話歷史以啟用聊天模式交互。

def chat_with_mllm (model, processor, prompt, images_path=[],do_sample=False, temperature=0.1, show_image=False, max_new_tokens=512, messages=[], images=[]):

    # Ensure list:
    if not isinstance(images_path, list):
        images_path =  [images_path]

    # Load images 
    if len (images)==0 and len (images_path)>0:
            for image_path in tqdm (images_path):
                image = load_image(image_path)
                images.append (image)
                if show_image:
                    display ( image )

    # If starting a new conversation about an image
    if len (messages)==0:
        messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]

    # If continuing conversation on the image
    else:
        messages.append ({"role": "user", "content": [{"type": "text", "text": prompt}]})

    # process input data
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(images=images, text=text, return_tensors="pt", ).to(model.device)

    # Generate response
    generation_args = {"max_new_tokens": max_new_tokens, "do_sample": True}
    if do_sample:
        generation_args["temperature"] = temperature
    generate_ids = model.generate(**inputs,**generation_args)
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:-1]
    generated_texts = processor.decode(generate_ids[0], clean_up_tokenization_spaces=False)

    # Append the model's response to the conversation history
    messages.append ({"role": "assistant", "content": [  {"type": "text", "text": generated_texts}]})

    return generated_texts, messages, images

六、與 Llama 對(duì)話

1. 蝴蝶圖像示例

在我們的第一個(gè)示例中，我們將與 Llama 3.2 討論一張孵化中的蝴蝶圖像。由于 Llama 3.2-Vision 在使用圖像時(shí)不支持系統(tǒng)提示，我們將直接在用戶提示中附加指令以指導(dǎo)模型的響應(yīng)。通過設(shè)置 do_sample=True 和 temperature=0.2，我們?cè)试S輕微的隨機(jī)性，同時(shí)保持響應(yīng)的一致性。對(duì)于固定答案，可以設(shè)置 do_sample=False。messages 參數(shù)（保存聊天歷史）最初為空，images 參數(shù)也是如此。

instructions = "Respond concisely in one sentence."
prompt = instructions + "Describe the image."

response, messages,images= chat_with_mllm ( model, processor, prompt,
                                             images_path=[img_path],
                                             do_sample=True,
                                             temperature=0.2,
                                             show_image=True,
                                             messages=[],
                                             images=[])

# Output:  "The image depicts a butterfly emerging from its chrysalis, 
#           with a row of chrysalises hanging from a branch above it."

正如我們所見，輸出準(zhǔn)確且簡(jiǎn)潔，表明模型有效地理解了圖像。在下一個(gè)聊天迭代中，我們將傳遞一個(gè)新的提示以及聊天歷史（messages）和圖像文件（images）。新提示旨在評(píng)估 Llama 3.2 的推理能力：

prompt = instructions + "What would happen to the chrysalis in the near future?"
response, messages, images= chat_with_mllm ( model, processor, prompt,
                                             images_path=[img_path,],
                                             do_sample=True,
                                             temperature=0.2,
                                             show_image=False,
                                             messages=messages,
                                             images=images)

# Output: "The chrysalis will eventually hatch into a butterfly."

我們?cè)谔峁┑?Colab 筆記本中繼續(xù)了這次對(duì)話，并獲得了以下對(duì)話內(nèi)容：

對(duì)話突出了模型通過準(zhǔn)確描述場(chǎng)景來理解圖像的能力。它還展示了其推理能力，通過邏輯連接信息，正確推斷出蛹會(huì)發(fā)生什么，并解釋了為什么有些是棕色的而有些是綠色的。

2. 表情包圖像示例

在這個(gè)示例中，我將向模型展示我自己創(chuàng)建的一個(gè)表情包，以評(píng)估 Llama 的 OCR 能力，并確定它是否理解我的幽默感。

instructions = "You are a computer vision engineer with sense of humor."
prompt = instructions + "Can you explain this meme to me?"


response, messages,images= chat_with_mllm ( model, processor, prompt,
                                             images_path=[img_path,],
                                             do_sample=True,
                                             temperature=0.5,
                                             show_image=True,
                                             messages=[],
                                             images=[])
instructions = "You are a computer vision engineer with sense of humor."
prompt = instructions + "Can you explain this meme to me?"


response, messages,images= chat_with_mllm ( model, processor, prompt,
                                             images_path=[img_path,],
                                             do_sample=True,
                                             temperature=0.5,
                                             show_image=True,
                                             messages=[],
                                             images=[])

這是輸入的表情包：