LanceDB: An Efficient Embedded Vector Database Built for AI Applications

Published 2024-12-24 11:41

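Vector databases have become a crowded, fiercely competitive market, with both new entrants and established database vendors building products in this space. Embedded, on-device vector databases, however, remain relatively rare. ChromaDB is one of them, but it is not a purely native, deeply optimized embedded vector database: it still stores data in the Parquet format (reading a single row requires reading and decompressing an entire block, which is slow, and replicas take up extra space), and its feature set is fairly limited. So is there a better option? Many people naturally think of SQLite, the king of embedded relational databases, but its vector extension, sqlite-vec, is still under development. Is there an alternative with good documentation and solid performance? LanceDB is one such choice.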

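LanceDB is an open-source vector database designed for building AI applications. It uses an embedded architecture, so there is no separate server to deploy, and it can be integrated easily into a wide range of applications.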

Its core features and advantages are:

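Embedded architecture. Unlike products such as Qdrant that require a deployed server, LanceDB uses an embedded design and runs as part of the application, so it is easy to integrate and requires no extra infrastructure management.
The Lance data format, designed for AI (the biggest highlight). LanceDB uses the purpose-built Lance columnar storage format, which scans faster than the traditional Parquet format. It supports data sharding and loads only the data fragments that are needed, which greatly reduces I/O overhead. It also provides the automatic data versioning that machine learning workflows need: each version is associated with metadata for that version's files, schema, and blobs, so updating data does not require a full rewrite (zero-copy).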

Compared with other common formats, its advantages in machine learning scenarios are clear:

Data CAP theorem

	Lance	Parquet & ORC	JSON & XML	TFRecord	Database	Warehouse
Analytics	Fast	Fast	Slow	Slow	Decent	Fast
Feature Engineering	Fast	Fast	Decent	Slow	Decent	Good
Training	Fast	Decent	Slow	Fast	N/A	N/A
Exploration	Fast	Slow	Fast	Slow	Fast	Decent
Infra Support	Rich	Rich	Decent	Limited	Rich	Rich

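High-performance vector search. LanceDB is written in Rust and performs well. According to the official benchmarks, on equivalent hardware, query latency on a dataset of one billion 128-dimensional vectors can be kept under 100 ms. GPU acceleration is also supported.
Rich ecosystem integration. LanceDB natively supports Python and JavaScript/TypeScript and integrates seamlessly with mainstream AI frameworks such as LangChain and LlamaIndex. It also works with data processing tools such as Apache Arrow, Pandas, Polars, and DuckDB.
Multimodal data support. Besides vectors, LanceDB can efficiently store and retrieve unstructured data such as text, images, and audio, with no additional storage solution required.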

Using LanceDB is very simple. Here are some usage examples:

Python version:

import lancedb

# Connect to the database (a local directory; it is created if it does not exist)
db = lancedb.connect("data/sample-lancedb")

# Create a table and insert data
table = db.create_table("my_table",
    data=[{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
          {"vector": [5.9, 26.5], "item": "bar", "price": 20.0}])

# Run a vector search and return the 2 nearest rows as a pandas DataFrame
result = table.search([100, 100]).limit(2).to_pandas()
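
To show what the search above computes, here is a brute-force sketch in plain Python that needs no LanceDB installation. It assumes the default L2 metric and reports squared L2 distance in a `_distance` field, mirroring what the LanceDB result is understood to contain; `squared_l2` and `brute_force_search` are hypothetical helper names introduced only for this illustration.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(rows, query, limit):
    """Score every row against the query and return the closest `limit` rows."""
    scored = [dict(row, _distance=squared_l2(row["vector"], query)) for row in rows]
    return sorted(scored, key=lambda r: r["_distance"])[:limit]


# Same rows and query as the LanceDB example.
rows = [
    {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
    {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
]
query = [100, 100]

ranked = brute_force_search(rows, query, limit=2)
print([r["item"] for r in ranked])  # prints ['bar', 'foo']
```

"bar" ranks first because its vector lies closer to the query point. A real index avoids scoring every row, but it approximates this same ordering.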

JavaScript version, used together with transformers (the @xenova/transformers package):

async function example() {

    const lancedb = require('vectordb')

    // Import transformers and the all-MiniLM-L6-v2 model (https://huggingface.co/Xenova/all-MiniLM-L6-v2)
    const { pipeline } = await import('@xenova/transformers')
    const pipe = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');


    // Create embedding function from pipeline which returns a list of vectors from batch
    // sourceColumn is the name of the column in the data to be embedded
    //
    // Output of pipe is a Tensor { data: Float32Array(384) }, so filter for the vector
    const embed_fun = {}
    embed_fun.sourceColumn = 'text'
    embed_fun.embed = async function (batch) {
        let result = []
        for (let text of batch) {
            const res = await pipe(text, { pooling: 'mean', normalize: true })
            result.push(Array.from(res['data']))
        }
        return (result)
    }

    // Link a folder and create a table with data
    const db = await lancedb.connect('data/sample-lancedb')

    const data = [
        { id: 1, text: 'Cherry', type: 'fruit' },
        { id: 2, text: 'Carrot', type: 'vegetable' },
        { id: 3, text: 'Potato', type: 'vegetable' },
        { id: 4, text: 'Apple', type: 'fruit' },
        { id: 5, text: 'Banana', type: 'fruit' }
    ]

    const table = await db.createTable('food_table', data, embed_fun)


    // Query the table
    const results = await table
        .search("a sweet fruit to eat")
        .metricType("cosine")
        .limit(2)
        .execute()
    console.log(results.map(r => r.text))

}

example().then(_ => { console.log("Done!") })
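
The JavaScript example relies on two ideas that are easy to miss: the pipeline options `pooling: 'mean'` and `normalize: true`, and the `cosine` metric used at query time. The plain-Python sketch below illustrates both on toy two-dimensional vectors. The helper names and the toy token vectors are assumptions made for illustration; this is not how transformers.js or LanceDB implement them internally.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into a single sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy "token embeddings" for one sentence (real models emit 384 dimensions here).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sentence_vec = l2_normalize(mean_pool(tokens))

print(sentence_vec)
print(cosine_distance(sentence_vec, [2.0, 2.0]))  # close to 0: same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))    # 1.0: orthogonal
```

Because cosine distance ignores vector length, normalizing embeddings up front does not change the ranking; it only makes the stored vectors comparable in scale.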

More resources: https://github.com/lancedb/vectordb-recipes

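Compared with vector databases that require a deployed server, LanceDB's embedded architecture is especially well suited to:

Desktop applications that need to run locally
Resource-constrained edge computing environments
Scenarios with strict data privacy requirements
Rapid prototyping and testing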

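Although LanceDB shows clear performance advantages when handling massive datasets, for most small and medium-sized AI applications development efficiency and ease of use are probably the more important considerations. LanceDB's simple, intuitive API and mature ecosystem support make it an ideal choice for building all kinds of AI applications.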

Summary

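In fact, many applications already use LanceDB as part of their implementation, including Microsoft's GraphRAG, Character AI, and MidJourney, and the YC-backed company behind LanceDB has raised an $8 million seed round. In 2025 we expect an explosion of multimodal LLM applications, which will bring a new wave of interest in vector databases. As a leading example of an embedded vector database, LanceDB is worth considering whether you are building a prototype or deploying to production, and it may even be the obvious choice.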

References:

https://blog.lancedb.com/new-funding-and-a-new-foundation-for-multimodal-ai-data/

https://lancedb.github.io/

https://github.com/lancedb/lancedb

This article is reposted from AI工程化. Author: ully
