Is Traditional Chunking Dead? Agentic Chunking Rescues Broken Semantics, RAG Accuracy Up 40% in Testing, a Must-Read for LLM Developers! (Original)

Published 2025-2-24 09:40

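Recently a colleague working on an LLM project asked me about a problem: a document mentions the same proper noun many times, yet RAG keeps missing key information. After some digging, the cause turned out to be the traditional chunking method: sentences that sit several pages apart but are closely related were being split up mercilessly. I offered some general advice, such as using hybrid retrieval instead of semantic retrieval alone, or generating QA pairs from each chunk. He then asked a follow-up question: is there a chunking technique that makes this kind of problem less likely? I told him he could try a recently proposed chunking strategy: Agentic Chunking.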

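Why does chunking matter so much?

In a RAG pipeline, text chunking is the first step and also the most critical one. Traditional chunking methods such as recursive character splitting are simple and easy to use, but they have an obvious drawback: they split on a fixed length, so a single topic can be cut across different chunks, which breaks the continuity of the context.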
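To make the drawback concrete, here is a minimal sketch of fixed-size splitting. It is not the Langchain splitter: `split_fixed` is a hypothetical helper, and characters stand in for tokens. The sample text and the chunk size of 60 are made up for illustration.

```python
from typing import List


def split_fixed(text: str, chunk_size: int) -> List[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Two sentences about self-attention, separated by an unrelated one.
text = (
    "Transformer models rely on self-attention. "
    "Tickets cost 19 dollars. "
    "Self-attention lets every token see every other token."
)

pieces = split_fixed(text, 60)
for piece in pieces:
    print(repr(piece))
```

The two self-attention sentences land in different pieces, and the ticket sentence is cut in the middle of a word, because the splitter only looks at length.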

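Another common chunking method is semantic splitting, which splits wherever it detects a semantic shift between sentences. This is smarter than recursive character splitting, but it has its own limitation: when the topic of a document switches back and forth, semantic splitting can still place related content in different chunks, leaving the information disconnected.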
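The back-and-forth problem can be shown with a toy version of semantic splitting. This sketch assumes word overlap (Jaccard similarity) as a stand-in for embedding similarity. Real semantic splitters compare embeddings, and the function names, the threshold of 0.2, and the sample sentences here are all made up.

```python
import re
from typing import List, Set


def words(sentence: str) -> Set[str]:
    """Lowercased alphabetic words of a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard word overlap; a toy stand-in for embedding similarity."""
    wa, wb = words(a), words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def semantic_split(sentences: List[str], threshold: float) -> List[List[str]]:
    """Start a new chunk whenever adjacent sentences are not similar enough."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if similarity(prev, cur) < threshold:
            chunks.append([cur])
        else:
            chunks[-1].append(cur)
    return chunks


sentences = [
    "The transformer uses attention layers.",
    "The transformer stacks attention layers deeply.",
    "Tickets are sold online.",
    "Tickets are cheaper online.",
    "Attention layers make the transformer parallel.",
]
result = semantic_split(sentences, 0.2)
for chunk in result:
    print(chunk)
```

Because only adjacent sentences are compared, the last transformer sentence ends up in a chunk of its own instead of rejoining the first two.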

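For example, in a scenario like the following, both methods fail:

"Xiao Ming introduced the Transformer architecture... (5 paragraphs of other content inserted in between) ... Finally, he stressed that the core of the Transformer is the self-attention mechanism."

Traditional methods either put these two sentences in different chunks, or are thrown off by the intervening content so that the semantics break. A human chunking by hand would naturally group them under "model principles". This kind of association across textual distance is exactly what Agentic Chunking sets out to solve.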

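How Agentic Chunking works

The core idea of Agentic Chunking is to have a large language model (LLM) actively evaluate each sentence and assign it to the most suitable chunk. Unlike traditional chunking methods, Agentic Chunking does not rely on a fixed token length or on semantic shifts; instead it relies on the LLM's judgment to group sentences that are far apart in the document but related in topic.

As an example, suppose we have the following text: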

On July 20, 1969, astronaut Neil Armstrong walked on the moon. He was leading the NASA’s Apollo 11 mission. Armstrong famously said, “That’s one small step for man, one giant leap for mankind” as he stepped onto the lunar surface.

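In Agentic Chunking, the LLM first applies propositioning to these sentences, that is, it makes each sentence standalone so that every sentence has its own explicit subject. The processed text looks like this: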

On July 20, 1969, astronaut Neil Armstrong walked on the moon.
Neil Armstrong was leading the NASA’s Apollo 11 mission.
Neil Armstrong famously said, “That’s one small step for man, one giant leap for mankind” as he stepped onto the lunar surface.

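This way, the LLM can examine each sentence on its own and assign it to the most suitable chunk.

Propositioning can be thought of as "sentence-level cosmetic surgery" on the document, ensuring every sentence is independent and complete.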
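The assignment step can be sketched without any LLM by making the decision function pluggable. In the sketch below, `route` and `choose_by_keyword` are hypothetical names. The keyword-overlap rule is only a toy stand-in for the LLM's judgment, and the three propositions are made up from the Xiao Ming example.

```python
import re
from typing import Callable, Dict, List, Optional, Set

Chunks = Dict[int, List[str]]
Chooser = Callable[[str, Chunks], Optional[int]]


def route(propositions: List[str], choose: Chooser) -> Chunks:
    """Assign each proposition to an existing chunk, or open a new one."""
    chunks: Chunks = {}
    for proposition in propositions:
        chunk_id = choose(proposition, chunks)
        if chunk_id is None:  # no suitable chunk: create one
            chunk_id = len(chunks)
            chunks[chunk_id] = []
        chunks[chunk_id].append(proposition)
    return chunks


def keywords(text: str) -> Set[str]:
    """Lowercased words longer than five letters."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 5}


def choose_by_keyword(proposition: str, chunks: Chunks) -> Optional[int]:
    """Toy stand-in for the LLM: pick the first chunk sharing a keyword."""
    target = keywords(proposition)
    for chunk_id, members in chunks.items():
        if any(target & keywords(member) for member in members):
            return chunk_id
    return None


propositions = [
    "Xiao Ming introduced the Transformer architecture.",
    "Tickets cost 19 dollars.",
    "Xiao Ming stressed that the core of the Transformer is self-attention.",
]
grouped = route(propositions, choose_by_keyword)
print(grouped)
```

Each proposition is compared against all existing chunks rather than only its neighbor, so the two Transformer sentences end up together even though an unrelated sentence sits between them. In the real method, an LLM call would replace `choose_by_keyword`.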

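How do we implement Agentic Chunking?

The key to implementing Agentic Chunking lies in propositioning and in dynamically creating and updating chunks. We can use tools such as Langchain and Pydantic to implement this process. The flowchart is as follows: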

[Figure: Agentic Chunking flowchart]


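1. Propositioning the text

First, we need to apply propositioning to every sentence in the text. We can use a prompt template provided by Langchain and let the LLM do this automatically. Here is a simple code example: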

from typing import List

from langchain import hub
from langchain_core.prompts import ChatPromptTemplate  # used in the later steps
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field  # Field is used in the later steps

# Pull the propositioning prompt from the LangChain Hub.
obj = hub.pull("wfh/proposal-indexing")
llm = ChatOpenAI(model="gpt-4o")

# Structured-output schema: the list of standalone sentences (propositions).
class Sentences(BaseModel):
    sentences: List[str]

extraction_llm = llm.with_structured_output(Sentences)
extraction_chain = obj | extraction_llm

# The chain returns a Sentences object; the propositions are in .sentences.
sentences = extraction_chain.invoke(
    """
    On July 20, 1969, astronaut Neil Armstrong walked on the moon.
    He was leading the NASA's Apollo 11 mission.
    Armstrong famously said, "That's one small step for man, one giant leap for mankind" as he stepped onto the lunar surface.
    """
)

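2. Creating and updating chunks

Next, we need a function that dynamically creates and updates chunks. Each chunk holds propositions on a similar topic, and as new propositions are added, the chunk's title and summary are updated as well.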

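# ChunkMeta and chunks are not shown in the original snippet; these
# definitions are assumed from how they are used below.
class ChunkMeta(BaseModel):
    title: str = Field(description="The title of the chunk.")
    summary: str = Field(description="The summary of the chunk.")

# Maps chunk id -> {"summary", "title", "propositions"}.
chunks = {}

def create_new_chunk(chunk_id, proposition):
    # Ask the LLM for a title and summary of the new chunk.
    summary_llm = llm.with_structured_output(ChunkMeta)
    summary_prompt_template = ChatPromptTemplate.from_messages([
        ("system", "Generate a new summary and a title based on the propositions."),
        ("user", "propositions:{propositions}"),
    ])
    summary_chain = summary_prompt_template | summary_llm
    chunk_meta = summary_chain.invoke({"propositions": [proposition]})
    chunks[chunk_id] = {
        "summary": chunk_meta.summary,
        "title": chunk_meta.title,
        "propositions": [proposition],
    }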

3. Pushing a proposition to the right chunk

Finally, we need an AI agent to decide which chunk a new proposition should be added to. If no suitable chunk exists, the agent creates a new one.

def find_chunk_and_push_proposition(proposition):
    # Structured-output schema for the agent's decision.
    class ChunkID(BaseModel):
        chunk_id: int = Field(description="The chunk id.")
    allocation_llm = llm.with_structured_output(ChunkID)
    allocation_prompt = ChatPromptTemplate.from_messages([
        ("system", "Find the chunk that best matches the proposition. If no chunk matches, return a new chunk id."),
        ("user", "proposition:{proposition} chunks_summaries:{chunks_summaries}"),
    ])
    allocation_chain = allocation_prompt | allocation_llm
    # Only the summaries are sent to the LLM, not the full proposition lists.
    chunks_summaries = {chunk_id: chunk["summary"] for chunk_id, chunk in chunks.items()}
    best_chunk_id = allocation_chain.invoke({"proposition": proposition, "chunks_summaries": chunks_summaries}).chunk_id
    if best_chunk_id not in chunks:
        create_new_chunk(best_chunk_id, proposition)
    else:
        # add_proposition is not shown in this article.
        add_proposition(best_chunk_id, proposition)
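The function above calls `add_proposition`, which this article does not show. The following is a minimal sketch of what it could look like, assuming it appends the proposition and then refreshes the chunk's title and summary. The `summarize` parameter is my own addition so the sketch runs without an LLM. In practice it would wrap a summary chain like the one in `create_new_chunk`. The local `chunks` dict stands in for the one used above.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Stand-in for the chunks dict: chunk id -> title, summary, propositions.
chunks: Dict[int, dict] = {}

# A summarizer takes all propositions of a chunk and returns (title, summary).
Summarizer = Callable[[List[str]], Tuple[str, str]]


def add_proposition(
    chunk_id: int,
    proposition: str,
    summarize: Optional[Summarizer] = None,
) -> None:
    """Append a proposition to an existing chunk and refresh its metadata."""
    chunk = chunks[chunk_id]
    chunk["propositions"].append(proposition)
    if summarize is not None:
        chunk["title"], chunk["summary"] = summarize(chunk["propositions"])


# Usage with a stub summarizer in place of an LLM call.
chunks[0] = {
    "title": "Show",
    "summary": "About the show.",
    "propositions": ["Wings of Time is a night show."],
}


def stub_summarize(props: List[str]) -> Tuple[str, str]:
    return "Show", f"{len(props)} propositions about the show."


add_proposition(0, "Wings of Time is staged nightly.", stub_summarize)
print(chunks[0]["summary"])
```

Refreshing the summary on every insert keeps the allocation step accurate, since that step only sees summaries, but it also adds one LLM call per proposition.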

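How does it perform in practice?

I chose the introduction text of Wings of Time, a well-known attraction on Sentosa, Singapore, as the test subject, and processed it with a GPT-4 model. The text covers the attraction's description, ticketing information, opening hours and more, which makes it a good test sample.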

Product Name: Wings of Time

Product Description: Wings of Time is one of Sentosa's most breathtaking attractions, combining water, laser, fire, and music to create a mesmerizing night show about friendship and courage. Situated on the scenic  (https://www.sentosa.com.sg/en/things-to-do/attractions/siloso-beach/) Siloso Beach , this award-winning spectacle is staged nightly, promising an unforgettable experience for visitors of all ages. Be wowed by spellbinding laser, fire, and water effects set to a majestic soundtrack, complete with a jaw-dropping fireworks display. A fitting end to your day out at Sentosa, it’s possibly the only place in Singapore where you can witness such an awe-inspiring performance.  Get ready for an even better experience starting 1 February 2025 ! Wings of Time Fireworks Symphony, Singapore’s only daily fireworks show, now features a fireworks display that is four times longer!   Important Note: Please visit  (https://www.sentosa.com.sg/sentosa-reservation) here if you need to change your visit date. All changes must be made at least 1 day prior to the visit date.

Product Category: Shows

Product Type: Attraction

Keywords: Wings of Time, Sentosa night show, Sentosa attractions, laser show Sentosa, water show Singapore, Sentosa events, family activities Sentosa, Singapore night shows, outdoor night show Sentosa, book Wings of Time tickets

Meta Description: Experience Wings of Time at Sentosa! A breathtaking night show featuring water, laser, and fire effects. Perfect for a memorable evening.


Product Tags: Family Fun,Popular experiences,Frequently Bought

Locations: Beach Station

[Tickets]

Name: Wings of Time (Std)
Terms: • All Wings of Time (WOT) Open-Dated tickets require prior redemption at Singapore Cable Car Ticketing counters and are subjected to seats availability on a first come first serve basis. • This is a rain or shine event. Tickets are non-exchangeable or nonrefundable under any circumstances. • Once timeslot is confirmed, no further amendments are allowed. Please proceed to WOT admission gates to scan your issued QR code via mobile or physical printout for admission. • Gates will open 15 minutes prior to the start of the show. • Show Duration: 20 minutes per show. • Please be punctual for your booked time slot. • Admission will be on a first come first serve basis within the allocated timeslot or at the discretion of the attraction host. • Standard seats are applicable to guest aged 4 years and above. • No outside Food & Drinks are allowed. • Refer to  (https://www.mountfaberleisure.com/attraction/wings-of-time/) https://www.mountfaberleisure.com/attraction/wings-of-time/ for more information on Wings of Time.
Pax Type: Standard
Promotion A: Enjoy $1.90 off when you purchase online! Discount will automatically be applied upon checkout.
Price: 19





Opening Hours: Daily  Show 1: 7.40pm  Show 2: 8.40pm




Accessibilities: Wheelchair



[Information]

Title: Terms & Conditions
Description: For more information, click  (https://www.sentosa.com.sg/en/promotional-general-store-terms-and-conditions) here for Terms & Conditions


Title: Getting Here
Description: By Sentosa Express: Alight at Beach Station  By Public Bus: Board Bus 123 and alight at Beach Station  By Intra-Island Bus: Board Sentosa Bus A or B and alight at Beach Station     Nearest Car Park   Beach Station Car Park


Title: Contact Us
Description: Beach Station  +65 6361 0088   (mailto:guestrelations@mflg.com.sg) guestrelations@mflg.com.sg

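The system first converts the original text into more than 50 standalone statements (propositions). Interestingly, in the process the system automatically made "Wings of Time" the explicit subject or reference of most sentences, which shows that the AI grasped the topic of the text accurately.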

[
    "Wings of Time is one of Sentosa's most breathtaking attractions.",
    'Wings of Time combines water, laser, fire, and music to create a mesmerizing night show.',
    'The night show of Wings of Time is about friendship and courage.',
    'Wings of Time is situated on the scenic Siloso Beach.',
    'Wings of Time is an award-winning spectacle staged nightly.',
    'Wings of Time promises an unforgettable experience for visitors of all ages.',
    'Wings of Time features spellbinding laser, fire, and water effects set to a majestic soundtrack.',
    'Wings of Time includes a jaw-dropping fireworks display.',
    'Wings of Time is a fitting end to a day out at Sentosa.',
    'Wings of Time is possibly the only place in Singapore where such an awe-inspiring performance can be witnessed.',
    'Wings of Time will offer an even better experience starting 1 February 2025.',
    'Wings of Time Fireworks Symphony is Singapore’s only daily fireworks show.',
    'Wings of Time Fireworks Symphony now features a fireworks display that is four times longer.',
    'Visitors should visit the provided link if they need to change their visit date to Wings of Time.',
    'All changes to the visit date must be made at least 1 day prior to the visit date.',
    'Wings of Time is categorized as a show.',
    'Wings of Time is a type of attraction.',
    'Keywords for Wings of Time include: Wings of Time, Sentosa night show, Sentosa attractions, laser show Sentosa, water show Singapore, Sentosa events, family activities Sentosa, Singapore night shows, outdoor night show Sentosa, book Wings of Time tickets.',
    'The meta description for Wings of Time is: Experience Wings of Time at Sentosa! A breathtaking night show featuring water, laser, and fire effects. Perfect for a memorable evening.',
    'Product tags for Wings of Time include: Family Fun, Popular experiences, Frequently Bought.',
    'Wings of Time is located at Beach Station.',
    'Wings of Time (Std) tickets require prior redemption at Singapore Cable Car Ticketing counters.',
    'Wings of Time (Std) tickets are subjected to seats availability on a first come first serve basis.',
    'Wings of Time is a rain or shine event.',
    'Tickets for Wings of Time are non-exchangeable or nonrefundable under any circumstances.',
    'Once the timeslot for Wings of Time is confirmed, no further amendments are allowed.',
    'Visitors should proceed to Wings of Time admission gates to scan their issued QR code via mobile or physical printout for admission.',
    'Gates for Wings of Time will open 15 minutes prior to the start of the show.',
    'The show duration for Wings of Time is 20 minutes per show.',
    'Visitors should be punctual for their booked time slot for Wings of Time.',
    'Admission to Wings of Time will be on a first come first serve basis within the allocated timeslot or at the discretion of the attraction host.',
    'Standard seats for Wings of Time are applicable to guests aged 4 years and above.',
    'No outside food and drinks are allowed at Wings of Time.',
    'More information on Wings of Time can be found at the provided link.',
    'The pax type for Wings of Time is Standard.',
    'Promotion A for Wings of Time offers $1.90 off when purchased online.',
    'The discount for Promotion A will automatically be applied upon checkout.',
    'The price for Wings of Time is 19.',
    'Wings of Time has opening hours daily with Show 1 at 7.40pm and Show 2 at 8.40pm.',
    'Wings of Time is accessible by wheelchair.',
    "The title for terms and conditions is 'Terms & Conditions'.",
    'More information on terms and conditions can be found at the provided link.',
    "The title for getting to Wings of Time is 'Getting Here'.",
    'Visitors can get to Wings of Time by Sentosa Express by alighting at Beach Station.',
    'Visitors can get to Wings of Time by Public Bus by boarding Bus 123 and alighting at Beach Station.',
    'Visitors can get to Wings of Time by Intra-Island Bus by boarding Sentosa Bus A or B and alighting at Beach Station.',
    'The nearest car park to Wings of Time is Beach Station Car Park.',
    "The title for contacting Wings of Time is 'Contact Us'.",
    'The contact location for Wings of Time is Beach Station.',
    'The contact phone number for Wings of Time is +65 6361 0088.',
    'The contact email for Wings of Time is guestrelations@mflg.com.sg.']

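After the AI's agentic chunking, the whole text is naturally divided into four main parts:

  1. Main information chunk: the core introduction, features, location and other general information about Wings of Time
  2. Scheduling policy chunk: dedicated to information about changing a booking
  3. Price and discount chunk: focused on discounts and payment
  4. Legal terms chunk: collects the various terms and conditions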

Chunk (a641f): Sentosa's Wings of Time Show & Visitor Information
Summary: This chunk contains comprehensive details about the Wings of Time attraction in Sentosa, including its features, themes, location, visitor experience, ticketing and admission procedures, future enhancements, promotions, classification as a show and attraction, unique fireworks display, daily show schedule, accessibility options, importance of punctuality and ticket redemption, extended fireworks display in the Fireworks Symphony, transportation options to reach the venue, and the necessity of adhering to non-exchangeable ticket policies, with a focus on the standard ticketing process and visitor guidelines, and the recent update on the extended fireworks display, as well as the contact information and accessibility details, and the new experience starting February 2025.

Chunk (ae2b8): Scheduling Policies
Summary: This chunk contains information about policies regarding changes to scheduled dates and times.

Chunk (dadbb): Retail & Discounts
Summary: This chunk contains information about the application of discounts during the checkout process.

Chunk (3347c): Legal Terms & Conditions
Summary: This chunk contains information about terms and conditions, including their titles and where to find more information.

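After chunking like this, each chunk has a clear topic with no overlap; the important information comes first and the supporting information is filed by category. Grouping information this way also helps improve recall from the vector store, and therefore RAG accuracy.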

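Summary

Agentic Chunking is a very powerful text chunking technique: it can group sentences that are far apart in a document but related in topic, which improves RAG results. However, its cost and latency are relatively high. After trying Agentic Chunking, my colleague reported that accuracy improved by 40%, but cost also increased threefold. So when should we use Agentic Chunking?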

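Based on my project experience, the following scenarios are especially suitable:

  • Unstructured text (such as customer service chat logs)
  • Content whose topic keeps jumping back and forth (such as transcripts of tech meetups)
  • QA systems that need to link information across paragraphs

For clearly structured papers, manuals and the like, traditional chunking and semantic chunking remain the cost-effective choice.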
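To get a rough feel for where the extra cost comes from, one can count LLM calls. The sketch below assumes one extraction call per passage and one allocation call per proposition, as in `find_chunk_and_push_proposition`. It also assumes one title and summary call per proposition, either in `create_new_chunk` or as a summary refresh when adding to an existing chunk. The function name and this accounting are my own assumptions, not measurements from the test.

```python
def estimate_llm_calls(num_passages: int, num_propositions: int) -> int:
    """Rough LLM call count for agentic chunking under the stated assumptions."""
    extraction_calls = num_passages      # propositioning, one call per passage
    allocation_calls = num_propositions  # choose a chunk for each proposition
    summary_calls = num_propositions     # create or refresh title and summary
    return extraction_calls + allocation_calls + summary_calls


# The Wings of Time example: one passage, 51 propositions in the list above.
print(estimate_llm_calls(1, 51))  # 103
```

Recursive or semantic splitting needs no LLM calls at chunking time, which is why the trade-off only pays off when the documents really do jump between topics.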


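This article is reposted from the WeChat public account AI 博物院. Author: longyunfeigu

Original link: https://mp.weixin.qq.com/s/NyDnQCvq_cpCz_SwWivewQ

© Copyright belongs to the author. Please credit the source when reposting; otherwise legal action will be pursued.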