A Comparison of Four Methods for Building Knowledge Graphs from Text with LLMs

Author: Peter Lawrence 2024-01-19 12:11:31

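Earlier articles covered using large language models to convert unstructured text into a knowledge graph. Building a knowledge graph is nevertheless a complex process: properties may need constraints, the graph may have to conform to a specific topic or schema, and some documents are too large to process in a single prompt, so the graphs created from the separate chunks must be consistent with one another.

This article introduces and compares four methods for converting unstructured text with an LLM, each of which may be useful in a different scenario.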

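Using the LLM's pre-trained ontologies

LLMs appear to have been pre-trained on a range of standard ontologies, such as SCHEMA.ORG, FOAF, SKOS, RDF, RDFS, and OWL. A suitable system prompt that directs the model to use one of these pre-trained ontologies, combined with a user prompt containing the unstructured text, is therefore enough to obtain the graph structure we want.

The system prompt is as follows: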

Translate the following user text to an RDF graph using the RDF, RDFS, and OWL ontologies formatted as TTL.
 Use the prefix ex: with IRI <http://example.com/> for any created entities.

The LLM's response is as follows:

@prefix ex: <http://example.com/> .
 @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
 @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
 @prefix owl: <http://www.w3.org/2002/07/owl#> .
 ex:PeterJones rdf:type ex:Person ;
  rdf:label "Peter Jones" .
 ex:BethHampton rdf:type ex:Person ;
  rdf:label "Beth Hampton" .
 ex:BrunelUniversity rdf:type ex:University ;
  rdf:label "Brunel University" .
 ex:George rdf:type ex:Person ;
  rdf:label "George" .
 ex:James rdf:type ex:Person ;
  rdf:label "James" .
 ex:Harold rdf:type ex:Person ;
  rdf:label "Harold" .
 ex:PeterJones ex:studiedAt ex:BrunelUniversity ;
  ex:marriedTo ex:BethHampton ;
  ex:hasChild ex:George, ex:James, ex:Harold .

gpt-3.5-turbo has no trouble creating new properties. If we change the system prompt to use only SCHEMA.ORG:

Translate the following user text to an RDF graph using the SCHEMA.ORG ontologies formatted as TTL.
 Use the prefix ex: with IRI <http://example.com/> for any created entities.

This also produces a corresponding output:

@prefix ex: <http://example.com/> .
 @prefix schema: <https://schema.org/> .
 
 ex:PeterJones a schema:Person ;
  schema:spouse ex:BethHampton ;
  schema:alumniOf ex:BrunelUniversity .
 
 ex:BethHampton a schema:Person ;
  schema:spouse ex:PeterJones .
  ex:BrunelUniversity a schema:EducationalOrganization .
 
 ex:George a schema:Person .
 ex:James a schema:Person .
 ex:Harold a schema:Person .
 
 ex:PeterJones schema:children ex:George, ex:James, ex:Harold .
 ex:BethHampton schema:children ex:George, ex:James, ex:Harold .

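This is the simplest method: it requires no extra work, relies entirely on what the LLM has already learned, and still produces good output. The prompt is also very short (about 41 tokens), so it takes up little of the context.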

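However, the conversion is limited to the "standard" ontologies the LLM was pre-trained on. If you ask ChatGPT which standard ontologies it was trained on, it will not give a useful answer, so this remains a black box. In addition, the entities generated in text-to-graph conversion still need to be aligned across graphs.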
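As a rough illustration of how the two prompts in this section fit together, the sketch below assembles a chat-style message list from a system prompt and a user text. The prompt strings are the ones quoted in this article; the dictionary keys and the `build_messages` helper are hypothetical names, and no particular client library or API call is assumed.

```python
import json

# System prompts quoted from this article; the dictionary keys are illustrative.
SYSTEM_PROMPTS = {
    "rdf_rdfs_owl": (
        "Translate the following user text to an RDF graph using the RDF, RDFS, "
        "and OWL ontologies formatted as TTL.\n"
        "Use the prefix ex: with IRI <http://example.com/> for any created entities."
    ),
    "schema_org": (
        "Translate the following user text to an RDF graph using the SCHEMA.ORG "
        "ontologies formatted as TTL.\n"
        "Use the prefix ex: with IRI <http://example.com/> for any created entities."
    ),
}


def build_messages(ontology_key, user_text):
    """Return a chat-style message list: one system prompt, one user prompt."""
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[ontology_key]},
        {"role": "user", "content": user_text},
    ]


# The user text is the example sentence used throughout the article.
text = (
    "Peter Jones and Beth Hampton both studied at Brunel University, then got "
    "married, and have three children: George, James, and Harold."
)
print(json.dumps(build_messages("schema_org", text), indent=2))
```

Switching between the two variants only changes the system message, which is why this approach stays so cheap in tokens.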

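Adding the ontology to the LLM prompt

In most cases we want to use a non-standard or custom ontology. The LLM is unlikely to have been pre-trained on such an ontology, so the complete ontology has to be included in the prompt.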

Translate the following user text to an RDF graph using the following schema1: <http://inova8.com/schema/1/> ontologies formatted as TTL.
 Use the prefix ex: with IRI <http://example.com/> for any created entities.
 Only use pre-defined classes and properties from the schema1: <http://inova8.com/schema/1/> ontology.
 Use the properties and classes in the schema1: ontology.
 Include individuals, their data, and relationships.
 
 ... the full ontology in TTL format ...

The conversion has to be specified in detail, which pushes the prompt to about 3,567 tokens. With the same user prompt as before, the LLM still converts the text well:

ex:PeterJones rdf:type schema1:CC ;
  :dc "Peter" ;
  :de "Jones" ;
  :oa ex:BrunelUniversity ;
  :oh ex:BethHampton ;
  :of ex:business .
 ex:BethHampton rdf:type schema1:CC ;
  :dc "Beth" ;
  :de "Hampton" ;
  :oa ex:BrunelUniversity ;
  :oh ex:PeterJones ;
  :of ex:business .
 ex:BrunelUniversity rdf:type schema1:CA ;
  rdfs:label "Brunel University" .
 ex:George rdf:type schema1:CC ;
  rdfs:label "George" ;
  :od ex:PeterJones ;
  :od ex:BethHampton .
 ex:James rdf:type schema1:CC ;
  rdfs:label "James" ;
  :od ex:PeterJones ;
  :od ex:BethHampton .
 ex:Harold rdf:type schema1:CC ;
  rdfs:label "Harold" ;
  :od ex:PeterJones ;
  :od ex:BethHampton .
 ex:PeterJones :oh ex:BethHampton .
 ex:BethHampton :oh ex:PeterJones .

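When the custom ontology is included in the prompt, the LLM appears to understand ontologies expressed in RDF, RDFS, and OWL, and it can convert unstructured text into the custom ontology.

However, the prompt is now very long, because the system prompt carries a large token overhead. This raises cost and slows responses, since processing time grows with the number of tokens to process. The result also still needs alignment.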
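To make the overhead concrete, here is a small worked calculation using the two prompt sizes reported in this article (about 41 tokens without the ontology, about 3,567 with it). The batch size of 1,000 requests is an assumption chosen purely for illustration, and no pricing is implied.

```python
# Prompt sizes as reported in the article (both approximate).
TOKENS_WITHOUT_ONTOLOGY = 41
TOKENS_WITH_ONTOLOGY = 3567

# Hypothetical batch size, chosen only to illustrate the scaling.
N_REQUESTS = 1000


def prompt_overhead(tokens_with, tokens_without, n_requests):
    """Extra prompt tokens spent across n_requests when the ontology is in the prompt."""
    return (tokens_with - tokens_without) * n_requests


extra = prompt_overhead(TOKENS_WITH_ONTOLOGY, TOKENS_WITHOUT_ONTOLOGY, N_REQUESTS)
ratio = TOKENS_WITH_ONTOLOGY / TOKENS_WITHOUT_ONTOLOGY

print(f"extra prompt tokens over {N_REQUESTS} requests: {extra:,}")  # 3,526,000
print(f"prompt size ratio: {ratio:.1f}x")                            # 87.0x
```

The overhead is paid again on every request, so it grows linearly with the number of texts converted.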

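Fine-tuning with the ontology

The main problems with the first two methods are that they are either limited to pre-trained ontologies or incur a large overhead when a custom ontology is included in the prompt. We can fine-tune the LLM instead. Fine-tuning an LLM with a KG is straightforward, because a graph is essentially a set of triples: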

{:subject :predicate :object}

These can be mapped to prompts for training. Everything below can be generated automatically from the graph.

{"messages": [
  {"role": "system", "content": "Complete the following graph edge"},
  {"role": "user", "content": "What is <:subject> <:predicate>?"},
  {"role": "assistant", "content": " <:subject> is <:predicate> <:object>."}]
 }
 …
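The template above can be filled in mechanically from a list of triples. The following is a minimal sketch of that generation step, assuming the triples are already available as Python tuples; the helper names are hypothetical, and the sample triples are taken from the first example graph in this article.

```python
import json

SYSTEM_CONTENT = "Complete the following graph edge"


def triple_to_example(subject, predicate, obj):
    """Map one (subject, predicate, object) triple to a chat-format training example."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_CONTENT},
            {"role": "user", "content": f"What is {subject} {predicate}?"},
            {"role": "assistant", "content": f"{subject} is {predicate} {obj}."},
        ]
    }


def triples_to_jsonl(triples):
    """Serialize triples as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(triple_to_example(*t)) for t in triples)


# Sample triples from the first example graph in the article.
triples = [
    ("ex:PeterJones", "ex:studiedAt", "ex:BrunelUniversity"),
    ("ex:PeterJones", "ex:marriedTo", "ex:BethHampton"),
]
print(triples_to_jsonl(triples))
```

Using `json.dumps` rather than string concatenation keeps quotes and other special characters correctly escaped in each line.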

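The problem then becomes training the LLM to translate one language (unstructured English text) into another (a highly structured RDF graph that uses the semantics of the fine-tuned ontology).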

{"messages": [
  {"role": "system", "content": "Translate the following user text to an RDF graph using the Schema1 ontology."},
  {"role": "user", "content": "{example unstructured text}"},
  {"role": "assistant", "content": "{RDF graph using custom Schema1 ontology semantics}"}]
 }
 …

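The task is therefore to create a representative fine-tuning set (100 to 200 examples) of texts paired with their RDF translations. But how should the prompts be designed? Let's ask the LLM directly, since it knows itself best.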

I want to fine-tune an LLM on this Schema1 ontology. 
 This training aims to take an example user text and the assistant translates it into an RDF graph based on the Schema1 ontology. 
 Please create a comprehensive set of 50 example system, user, and assistant messages in JSONL conversational chat message format for fine-tuning an LLM to translate text to Schema1 ontology RDF.
 The system content should be the instruction to translate the user text to the assistant RDF graph response using Schema1 ontology.
 The user text should be examples that cover all the concepts within the Schema1 ontology.
 The assistant content should be an RDF graph using the Schema1 ontology of the translation of the user text examples.
 Use the prefix ex: with IRI <http://example.com/> for any created entities.
 Use the message format as follows:
  {"messages": [
    {"role": "system", "content": "<system_content>"}, 
    {"role": "user", "content": "<user_content>"}, 
    {"role": "assistant", "content": "<assistant_content>"}
  ]}

gpt-3.5-turbo responded in the correct format.

{"messages": [
  {"role": "system", "content": "
      Translate the following user text to an RDF graph using the Schema1 ontology."}, 
  {"role": "user", "content": "
      John Smith is a professor who graduated from a reputed college."}, 
  {"role": "assistant", "content": "
      @prefix ex: <http://example.com/> .
      @prefix schema1: <http://inova8.com/schema/1/> .
      ex:JohnSmith rdf:type schema1:CC ;
        schema1:dc \"John\" ;
        schema1:de \"Smith\" ;
        schema1:of schema1:professor ;
        schema1:oa ex:ReputedCollege .
      ex:ReputedCollege rdf:type schema1:CA ."}]}
 
 … and so on
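Because the training set is itself generated by an LLM, it may be worth checking each line before using it for fine-tuning. The sketch below is one possible check, not part of the original workflow: it verifies that every line parses as JSON and carries a system, user, and assistant message in that order. The `validate_jsonl` helper and its error messages are hypothetical.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_jsonl(text):
    """Return (line_number, problem) tuples; an empty list means every line passed."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list):
            problems.append((number, "missing 'messages' list"))
            continue
        roles = [m.get("role") if isinstance(m, dict) else None for m in messages]
        if roles != EXPECTED_ROLES:
            problems.append((number, f"unexpected roles: {roles}"))
            continue
        # Every message must carry non-empty string content.
        if any(
            not isinstance(m.get("content"), str) or not m["content"].strip()
            for m in messages
        ):
            problems.append((number, "empty or non-string content"))
    return problems


sample = json.dumps({
    "messages": [
        {"role": "system", "content": "Translate the following user text to an RDF graph using the Schema1 ontology."},
        {"role": "user", "content": "John Smith is a professor who graduated from a reputed college."},
        {"role": "assistant", "content": "ex:JohnSmith rdf:type schema1:CC ."},
    ]
})
print(validate_jsonl(sample))  # []
```

A check like this only covers the message structure; it says nothing about whether the RDF in the assistant content is valid or faithful to the ontology.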

To try this out, we used it to generate a reasonably sized (160 examples) translation fine-tuning dataset, and fine-tuned the LLM for 3 and for 10 epochs.

System-prompt (34 tokens):

Translate the following user text to an RDF graph using the Schema1 ontology.
 Use the prefix ex: with IRI <http://example.com/> for any created entities.

User-prompt (28 tokens):

Peter Jones and Beth Hampton both studied at Brunel University, then got married, and have three children: George, James, and Harold.

The LLM's result:

@prefix ex: <http://example.com/> .
 ex:PeterJones a schema1:CC ;
  schema1:dc "Peter" ;
  schema1:de "Jones" ;
  schema1:of schema1:student .
 ex:BethHampton a schema1:CC ;
  schema1:dc "Beth" ;
  schema1:de "Hampton" ;
  schema1:of schema1:student .
 ex:JonesHamptonMarriage a schema1:CG ;
  schema1:og ex:PeterJones ;
  schema1:oh ex:BethHampton .
 ex:GeorgeJones a schema1:CD ;
  schema1:dc "George" ;
  schema1:od ex:PeterJones ;
  schema1:of ex:BethHampton .
 ex:JamesJones a schema1:CD ;
  schema1:dc "James" ;
  schema1:od ex:PeterJones ;
  schema1:of ex:BethHampton .
 ex:HaroldJones a schema1:CD ;
  schema1:dc "Harold" ;
  schema1:od ex:PeterJones ;
  schema1:of ex:BethHampton .

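As you can see, it works quite well: the LLM effectively taught itself, and we only had to ask. There also appears to be little difference in translation quality between the models fine-tuned for 3 epochs and for 10 epochs.

An LLM can be fine-tuned on a custom ontology, and the prompt for the fine-tuned LLM can be very short (about 41 tokens), so most of the token cost comes from the unstructured text being converted.

However, fine-tuning requires metrics for how accurately the LLM understands the custom ontology and how accurately it converts text to a graph. How to judge quality is a question that needs careful study, and the generated graphs still need alignment.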

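Improving the prompt for the fine-tuned model

A single ontology or schema is often not enough to capture the full semantics of a text, so two or more ontologies are needed. The good news is that after fine-tuning the LLM has not forgotten its pre-trained ontologies, so they can be used together.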

Translate the following user text to an RDF graph using both the FOAF, and Schema1 ontologies.
 Use the prefix ex: with IRI <http://example.com/> for any created entities.
 Peter Jones and Beth Hampton both studied at Brunel University, then got married, and have three children: George, James, and Harold. 
 Peter, email address peter.jones@inova8.com, has been friends with Paul Williams because of their shared interest in ornithology.

The result is as follows:

@prefix ex: <http://example.com/> .
 ex:PeterJones a schema1:Thing, foaf:Person ;
  schema1:email "peter.jones@inova8.com" ;
  foaf:name "Peter Jones" ;
  foaf:interest ex:Ornithology .
 ex:BethHampton a schema1:Thing, foaf:Person ;
  foaf:name "Beth Hampton" .
 ex:GeorgeJones a schema1:Thing, foaf:Person ;
  foaf:name "George Jones" ;
  foaf:maker ex:PeterJones .
 ex:JamesJones a schema1:Thing, foaf:Person ;
  foaf:name "James Jones" ;
  foaf:maker ex:PeterJones .
 ex:HaroldJones a schema1:Thing, foaf:Person ;
  foaf:name "Harold Jones" ;
  foaf:maker ex:PeterJones .
 ex:BrunelUniversity a schema1:Thing, foaf:Document ;
  foaf:name "Brunel University" .
 ex:JonesHamptonMarriage a schema1:Thing ;
  schema1:spouse ex:PeterJones, ex:BethHampton .
 ex:JonesHamptonFamily a schema1:Thing ;
  schema1:parent ex:PeterJones, ex:BethHampton ;
  schema1:child ex:GeorgeJones, ex:JamesJones, ex:HaroldJones .
 ex:PeterJones foaf:knows ex:PaulWilliams .
 ex:PaulWilliams a schema1:Thing, foaf:Person ;
  foaf:name "Paul Williams" .

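As you can see, the response draws not only on what the model learned in fine-tuning but also on what it learned in pre-training.

There is one problem, though: when the same concept appears in more than one ontology, we need to control which one the LLM uses.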
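One simple way to see which ontology the model leaned on for a given output is to count how often each prefix appears in the generated Turtle. The sketch below does this with regular expressions rather than a real RDF parser, so it is only an approximation; the function names are hypothetical, and the sample is an excerpt of the output shown in this section.

```python
import re
from collections import Counter

# Matches prefixed names such as foaf:name or schema1:Thing.
PREFIXED_NAME = re.compile(r"(?<![\w:/])([A-Za-z][\w-]*):[A-Za-z_][\w-]*")


def strip_literals_and_iris(ttl):
    """Blank out "..." literals and <...> IRIs so their contents are not counted."""
    ttl = re.sub(r'"(?:[^"\\]|\\.)*"', '""', ttl)
    return re.sub(r"<[^>]*>", "<>", ttl)


def count_prefix_usage(ttl, ignore=("ex",)):
    """Count prefixed-name occurrences per prefix, skipping @prefix declarations."""
    counts = Counter()
    for line in strip_literals_and_iris(ttl).splitlines():
        if line.strip().startswith("@prefix"):
            continue
        for match in PREFIXED_NAME.finditer(line):
            prefix = match.group(1)
            if prefix not in ignore:
                counts[prefix] += 1
    return counts


sample = """@prefix ex: <http://example.com/> .
ex:PeterJones a schema1:Thing, foaf:Person ;
  schema1:email "peter.jones@inova8.com" ;
  foaf:name "Peter Jones" ;
  foaf:interest ex:Ornithology .
"""
print(dict(count_prefix_usage(sample)))  # {'schema1': 2, 'foaf': 3}
```

Counts like these only show where overlaps were resolved one way or the other; steering the choice itself would still have to happen in the prompt or in the fine-tuning data.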

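Summary

We summarized the comparison of the methods above in a chart:

LLMs can effectively convert unstructured text into RDF graphs. A model fine-tuned on a custom ontology is far more token-efficient, because it avoids the overhead of supplying the complete ontology in every conversion request; when many texts need converting, this lowers conversion costs in production.

We have not yet covered how to establish an "accuracy" test for text-to-KG conversion, or how to align entities after conversion. Both will be covered in later articles.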

Editor: Hua Xuan  Source: DeepHub IMBA

