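Five Rarely Mentioned Python Libraries That Can Boost Your NLP Productivity

Author: deephub 2021-12-27 16:09:54

This article shares five great but rarely mentioned Python libraries that can help with a variety of natural language processing (NLP) tasks.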

Contractions

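Contractions expands common English contractions and slang. It is fast and efficient, and it handles most edge cases, such as missing apostrophes.

For example, you previously had to write a long list of regular expressions to expand the contractions in text data (e.g., don't → do not; can't → cannot; haven't → have not). Contractions takes care of this for you.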
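For comparison, the hand-rolled regular-expression approach might look like the minimal sketch below. The mapping and the function name expand_contractions are illustrative assumptions, not part of the contractions library, and the sketch does not cover slang or missing apostrophes such as "ive".

```python
import re

# A tiny hand-written mapping (illustrative only); a real table would be far larger.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "haven't": "have not",
    "i'll": "I will",
}

# One alternation pattern, longest keys first, matched on word boundaries.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(key) for key in sorted(CONTRACTIONS, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Replace every known contraction in `text` with its expanded form."""
    # Normalize curly apostrophes so that "can’t" and "can't" are treated alike.
    normalized = text.replace("\u2019", "'")
    # Look up the lowercase match; the original capitalization is not preserved.
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], normalized)


print(expand_contractions("I don't know, and I can\u2019t say I haven't tried."))
```

Every new contraction, casing rule, or typo variant means another entry or another pattern, which is the maintenance burden the library removes.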

pip install contractions

Usage example

import contractions 
s = "ive gotta go! i'll see yall later." 
text = contractions.fix(s, slang=True) 
print(text)

Result

ORIGINAL: ive gotta go! i’ll see yall later. 
OUTPUT: I have got to go! I will see you all later.

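An important part of text preprocessing is creating consistency and shrinking the word list without losing too much meaning. Bag-of-words models and TF-IDF create large sparse matrices in which each variable is a distinct vocabulary word in the corpus. Expanding contractions further reduces dimensionality and can also help with filtering stop words.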
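The effect on dimensionality can be seen on a toy corpus. The sketch below uses a hypothetical three-entry expansion table in place of the library and counts the distinct tokens, which is the number of columns a bag-of-words matrix would have.

```python
# Hypothetical mini expansion table standing in for the contractions library.
EXPANSIONS = {"don't": "do not", "can't": "cannot", "i'm": "i am"}


def tokenize(text, expand=False):
    """Lowercase and split on whitespace, optionally expanding contractions."""
    tokens = text.lower().split()
    if expand:
        tokens = [word for tok in tokens for word in EXPANSIONS.get(tok, tok).split()]
    return tokens


def vocabulary(docs, expand=False):
    """Return the set of distinct tokens across all documents."""
    return {tok for doc in docs for tok in tokenize(doc, expand=expand)}


docs = [
    "i don't like it",
    "i do not like it",
    "i'm not sure",
    "i am not sure",
]

print(len(vocabulary(docs)))               # columns without expansion
print(len(vocabulary(docs, expand=True)))  # columns with expansion
```

With expansion, "don't" and "i'm" collapse into words that are already in the vocabulary, so the matrix loses two columns, and words such as "not" and "am" become visible to a stop-word filter.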

Distilbert-Punctuator

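Take text that has lost its punctuation, split it into sentences, and restore the punctuation... sounds easy, right? For a computer, it is considerably more complicated.

Distilbert-punctuator is the only Python library I could find that performs this task, and it is very accurate. That is because it uses a distilled variant of BERT. The model was further fine-tuned on a combination of more than 20,000 news articles and 4,000 TED Talk transcripts to detect sentence boundaries. When it inserts end-of-sentence punctuation such as a period, the model also capitalizes the first letter of the next sentence.

Installation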

pip install distilbert-punctuator

This library pulls in quite a few dependencies. If you only want to test it, you can try it on Google Colab.

Usage example

from dbpunctuator.inference import Inference, InferenceArguments 
from dbpunctuator.utils import DEFAULT_ENGLISH_TAG_PUNCTUATOR_MAP 
args = InferenceArguments( 
        model_name_or_path="Qishuai/distilbert_punctuator_en", 
        tokenizer_name="Qishuai/distilbert_punctuator_en", 
        tag2punctuator=DEFAULT_ENGLISH_TAG_PUNCTUATOR_MAP 
    ) 
punctuator_model = Inference(inference_args=args,  
                             verbose=False) 
text = [ 
""" 
however when I am elected I vow to protect our American workforce 
unlike my opponent I have faith in our perseverance our sense of trust and our democratic principles will you support me 
""" 
] 
 
print(punctuator_model.punctuation(text)[0])

Result

ORIGINAL:  
however when I am elected I vow to protect our American workforce 
unlike my opponent I have faith in our perseverance our sense of trust and our democratic principles will you support me 
 
OUTPUT: 
However, when I am elected, I vow to protect our American workforce. Unlike my opponent, I have faith in our perseverance, our sense of trust and our democratic principles. Will you support me?

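If you simply want your text data to be more grammatically correct and presentable, whether the task is cleaning up messy Twitter posts or chatbot messages, this library is for you.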

Textstat

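Textstat is an easy-to-use, lightweight library that provides a variety of metrics about text data, such as reading level, reading time, and word count.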

pip install textstat

Usage example

import textstat 
text = """ 
Love this dress! it's sooo pretty. i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  
""" 
# Flesch reading ease score 
print(textstat.flesch_reading_ease(text)) 
  # 90-100 | Very Easy 
  # 80-89  | Easy 
  # 70-79  | Fairly Easy 
  # 60-69  | Standard 
  # 50-59  | Fairly Difficult 
  # 30-49  | Difficult 
  # <30    | Very Confusing 
 
# Reading time (output in seconds) 
# Assuming 70 milliseconds/character 
 
print(textstat.reading_time(text, ms_per_char=70)) 
# Word count 
print(textstat.lexicon_count(text, removepunct=True))

Result

ORIGINAL: 
Love this dress! it's sooo pretty. i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite. 
 
OUTPUTS: 
74.87 # reading score is considered 'Fairly Easy' 
7.98  # 7.98 seconds to read 
30    # 30 words
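For context on where these numbers come from, the sketch below spells out the standard Flesch reading-ease formula, the score bands listed in the comments above, and a reading-time estimate. The function names are illustrative. The rule that reading time counts only non-whitespace characters is an assumption, chosen because it reproduces the 7.98 seconds shown above. Syllable counting is left out, so the formula takes pre-computed counts.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Standard Flesch formula, computed from pre-counted totals."""
    return (
        206.835
        - 1.015 * (total_words / total_sentences)
        - 84.6 * (total_syllables / total_words)
    )


# Lower bound of each band, matching the table in the usage example.
BANDS = [
    (90, "Very Easy"),
    (80, "Easy"),
    (70, "Fairly Easy"),
    (60, "Standard"),
    (50, "Fairly Difficult"),
    (30, "Difficult"),
]


def difficulty_label(score):
    """Map a Flesch score to its descriptive band."""
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    return "Very Confusing"


def reading_time_seconds(text, ms_per_char):
    """Assumed rule: count non-whitespace characters, convert ms to seconds."""
    n_chars = sum(len(word) for word in text.split())
    return round(n_chars * ms_per_char / 1000, 2)


review = (
    "Love this dress! it's sooo pretty. i happened to find it in a store, "
    "and i'm glad i did bc i never would have ordered it online bc it's petite."
)

print(difficulty_label(74.87))
print(reading_time_seconds(review, ms_per_char=70))
print(len(review.split()))  # whitespace-delimited word count
```

The review has 114 non-whitespace characters, so 114 × 70 ms gives the 7.98 seconds reported above, and splitting on whitespace gives the same 30 words.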

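These metrics also add an extra layer of analysis. Take a dataset of celebrity news articles from a gossip magazine as an example. With textstat, you may find that articles that are quicker and easier to read are more popular and have higher retention.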

Gibberish-Detector

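The main purpose of this low-code library is to detect unintelligible words (gibberish). It relies on a model trained on a large corpus of English words.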

pip install gibberish-detector

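After installation you also need to train the model yourself, but this is simple and takes about a minute. The training steps are:

Download the training corpus named big.txt
Open your CLI and cd to the directory that contains big.txt
Run the following command: gibberish-detector train .\big.txt > gibberish-detector.model

This creates a file named gibberish-detector.model in the current directory.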
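To give a feel for what training on English text can involve, here is a minimal sketch of one common technique: a character-level Markov chain that learns which letter transitions are typical and flags strings whose transitions are unusually unlikely. This is an assumption about the general approach, not a description of the library's internals. The two-sentence corpus stands in for big.txt, and all function names and calibration strings are hypothetical.

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def normalize(text):
    """Lowercase the text and keep only letters and spaces."""
    return "".join(ch for ch in text.lower() if ch in ALPHABET)


def train(corpus):
    """Estimate log P(next_char | char) with add-one smoothing."""
    counts = {a: {b: 1 for b in ALPHABET} for a in ALPHABET}
    clean = normalize(corpus)
    for a, b in zip(clean, clean[1:]):
        counts[a][b] += 1
    return {
        a: {b: math.log(c / sum(row.values())) for b, c in row.items()}
        for a, row in counts.items()
    }


def score(text, model):
    """Average log-probability of the character transitions in `text`."""
    clean = normalize(text)
    pairs = list(zip(clean, clean[1:]))
    if not pairs:
        return float("-inf")  # too short to judge; scored as maximally unlikely
    return sum(model[a][b] for a, b in pairs) / len(pairs)


def pick_threshold(model, good, bad):
    """Midpoint between the worst real word and the best gibberish string."""
    worst_good = min(score(s, model) for s in good)
    best_bad = max(score(s, model) for s in bad)
    return (worst_good + best_bad) / 2


# Toy corpus standing in for big.txt (illustrative only).
CORPUS = (
    "She sells apples and purple plums to people in the little village. "
    "Please leave the apples on the table."
)
model = train(CORPUS)
threshold = pick_threshold(
    model,
    good=["please", "little", "people"],
    bad=["zxqvkj", "wqzzkx", "jjqxzv"],
)


def is_gibberish(text):
    """Flag text whose average transition score falls below the threshold."""
    return score(text, model) < threshold


print(is_gibberish("xdnfklskasqd"))
print(is_gibberish("apples"))
```

A corpus this small only recognizes words built from transitions it has seen, which is why a large corpus matters for real use.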

Usage example

from gibberish_detector import detector 
# load the gibberish detection model 
Detector = detector.create_from_model('gibberish-detector.model') 
 
text1 = "xdnfklskasqd" 
print(Detector.is_gibberish(text1)) 
 
text2 = "apples" 
print(Detector.is_gibberish(text2))

Result

True  # xdnfklskasqd (this is gibberish) 
False # apples (this is not)

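It helps me remove bad observations from a dataset. It can also be used for error handling on user input. For example, if a user enters meaningless gibberish into your web application, you can return an error message.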

NLPAug

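I saved the best for last.

First, what is data augmentation? It is any technique that expands the size of the training set by adding slightly modified copies of existing data. Data augmentation is typically used when the existing data has limited diversity or is imbalanced. For computer vision problems, augmentation creates new samples by cropping, rotating, and changing the brightness of images. For numerical data, synthetic instances can be created with clustering techniques.

But what if we are working with text data? This is where NLPAug comes in. The library can augment text by substituting or inserting semantically related words. Augmenting data with a pretrained language model such as BERT is a powerful approach because it takes the context of each word into account. Depending on the parameters you set, the top n similar words are used to modify the text.

Pretrained word embeddings, such as Word2Vec and GloVe, can also be used to replace words with synonyms.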
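The substitution idea can be illustrated without any model. The minimal sketch below uses a hand-written synonym table, which is a hypothetical stand-in for the candidates an embedding or language model would propose, and replaces each eligible word with probability aug_p. The function name substitute_words is illustrative and not part of NLPAug.

```python
import random

# Hypothetical synonym table standing in for model-suggested candidates.
SYNONYMS = {
    "town": ["city", "downtown"],
    "buy": ["purchase", "get"],
    "food": ["groceries", "lunch"],
}


def substitute_words(text, aug_p=0.4, rng=None):
    """Replace each word that has synonyms with probability `aug_p`."""
    rng = rng or random.Random()
    augmented = []
    for token in text.split():
        core = token.rstrip(".,!?")   # word without trailing punctuation
        tail = token[len(core):]      # the punctuation that was stripped
        candidates = SYNONYMS.get(core.lower())
        if candidates and rng.random() < aug_p:
            augmented.append(rng.choice(candidates) + tail)
        else:
            augmented.append(token)
    return " ".join(augmented)


sentence = "Come into town with me today to buy food!"
for _ in range(3):
    print(substitute_words(sentence, aug_p=0.4))
```

A fixed table ignores context, so it cannot tell which synonym fits a given sentence. That limitation is what a contextual model such as BERT addresses.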

pip install nlpaug

Usage example

import nlpaug.augmenter.word as naw 
 
# main parameters to adjust 
ACTION = 'substitute' # or use 'insert' 
TOP_K = 15 # randomly draw from top 15 suggested words 
AUG_P = 0.40 # augment 40% of words within text 
 
aug_bert = naw.ContextualWordEmbsAug( 
    model_path='bert-base-uncased',  
    action=ACTION,  
    top_k=TOP_K, 
    aug_p=AUG_P 
    ) 
 
text = """ 
Come into town with me today to buy food! 
""" 
augmented_text = aug_bert.augment(text, n=3) # n: num. of outputs 
print(augmented_text)

Result

ORIGINAL: 
Come into town with me today to buy food! 
 
OUTPUTS: 
• drove into denver with me today to purchase groceries! 
• head off town with dad today to buy coffee! 
• come up shop with mom today to buy lunch!

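Suppose you are training a supervised classification model on a dataset with 15k positive reviews and only 4k negative reviews. A heavily imbalanced dataset biases the model toward the majority class (positive reviews) during training.

Simply duplicating examples of the minority class (negative reviews) does not add any new information to the model. Instead, use NLPAug's advanced text augmentation to increase the diversity of the minority class. This technique has been shown to improve AUC and F1-Score.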
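A scaled-down sketch of that balancing step, with 4 negative reviews against 15 positive ones to mirror the 4k versus 15k ratio, could look like the following. The helper oversample_with_augmentation and the stand-in toy_augment are hypothetical. In practice the augment callable would wrap an NLPAug augmenter, for example one that takes the first element of the list returned by its augment method.

```python
import random


def oversample_with_augmentation(minority, target_size, augment, rng=None):
    """Grow the minority class to `target_size` using augmented copies."""
    if not minority:
        raise ValueError("minority class must contain at least one example")
    rng = rng or random.Random(0)
    balanced = list(minority)  # keep every original example
    while len(balanced) < target_size:
        balanced.append(augment(rng.choice(minority)))
    return balanced


def toy_augment(text):
    """Stand-in augmenter (hypothetical): swap one word for a near-synonym."""
    return text.replace("bad", "poor")


negatives = [
    "bad fit and bad fabric",
    "the zipper broke, bad quality",
    "bad color, not as pictured",
    "arrived late and in bad shape",
]
n_positive = 15  # size of the majority class in this toy setup

balanced_negatives = oversample_with_augmentation(negatives, n_positive, toy_augment)
print(len(balanced_negatives))
print(balanced_negatives[-1])
```

Only the training split should be balanced this way. Augmented copies in the validation or test data would distort the reported metrics.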

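Conclusion

As data scientists, Kaggle participants, or programmers in general, it is important that we keep finding tools that streamline our workflow. We can then use these libraries to solve problems, augment our datasets, and spend more time thinking about solutions instead of writing code.

Editor: 華軒. Source: 今日頭條.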

