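How to Train a Final Machine Learning Model
For readers who are new to machine learning, or who are moving into it from another field, "how to train a final model" is a classic question. Dr. Jason Brownlee wrote an article that answers it (original: http://machinelearningmastery.com/train-final-machine-learning-model/). OPEN01 translated and compiled the article in the hope that it helps readers who are still learning.
Original author: Dr. Jason Brownlee
Chinese translation: R.
Invited reviewer: Dr. Xu.Tang
Source: OPEN01 (WeChat official account: open01tech)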
The machine learning model that we use to make predictions on new data is called the final model.
There can be confusion in applied machine learning about how to train a final model.
This confusion shows up with beginners to the field who ask questions such as:
• How do I predict with cross-validation?
• Which model do I choose from cross-validation?
• Do I use the model after preparing it on the training dataset?
This post will clear up the confusion.
In this post, you will discover how to finalize your machine learning model in order to make predictions on new data.
Let’s get started.
What is a Final Model?
A final machine learning model is a model that you use to make predictions on new data.
That is, given new examples of input data, you want to use the model to predict the expected output. This may be a classification (assign a label) or a regression (predict a real value).
For example, whether a photo is a picture of a dog or a cat, or the estimated number of sales for tomorrow.
The goal of your machine learning project is to arrive at a final model that performs the best, where “best” is defined by:
• Data: the historical data that you have available.
• Time: the time you have to spend on the project.
• Procedure: the data preparation steps, algorithm or algorithms, and the chosen algorithm configurations.
In your project, you gather the data, spend the time you have, and discover the data preparation procedures, the algorithm to use, and how to configure it.
The final model is the pinnacle of this process, the end you seek in order to start actually making predictions.
The Purpose of Train/Test Sets
Why do we use train and test sets?
Creating a train and test split of your dataset is one method to quickly evaluate the performance of an algorithm on your problem.
The training dataset is used to prepare a model, to train it.
We pretend the test dataset is new data where the output values are withheld from the algorithm (we do in fact know them). We gather predictions from the trained model on the inputs from the test dataset and compare them to the withheld output values of the test set.
Comparing the predictions and withheld outputs on the test dataset allows us to compute a performance measure for the model on the test dataset. This is an estimate of the skill of the algorithm trained on the problem when making predictions on unseen data.
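The hold-out idea described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the article's own code: the synthetic two-blob dataset, the helper names (`make_data`, `knn_predict`, `train_test_split`), the 33% test fraction, and the seeds are all hypothetical choices made for the example.

```python
import random

def make_data(n, seed):
    """Synthetic two-class data: two Gaussian blobs in 2-D (illustrative only)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 2.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows

def knn_predict(train, x, k=3):
    """Predict the majority label among the k training rows nearest to x."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x))
    )[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def train_test_split(rows, test_fraction, seed):
    """Shuffle a copy of the rows and hold out the last test_fraction of them."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

data = make_data(200, seed=1)
train, test = train_test_split(data, test_fraction=0.33, seed=1)

# Pretend the test labels are unknown: predict from the inputs only...
predictions = [knn_predict(train, x) for x, _ in test]
# ...then compare the predictions with the withheld labels.
accuracy = sum(p == y for p, (_, y) in zip(predictions, test)) / len(test)
print(f"held-out accuracy: {accuracy:.3f}")
```

The printed accuracy is the performance measure on the test set; it is only an estimate of how the same procedure would do on genuinely new data.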
Let’s unpack this further.
When we evaluate an algorithm, we are in fact evaluating all steps in the procedure, including how the training data was prepared (e.g. scaling), the choice of algorithm (e.g. kNN), and how the chosen algorithm was configured (e.g. k=3).
The performance measure calculated on the predictions is an estimate of the skill of the whole procedure.
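One way to make "the whole procedure" concrete is to wrap every step in a single function and score that function. The sketch below is an assumption-laden illustration: min-max scaling stands in for the data preparation step, a hand-written kNN stands in for the algorithm, and the data, helper names, and split sizes are hypothetical.

```python
import random

def fit_minmax(rows):
    """Learn per-feature (min, max) from the training inputs only."""
    columns = list(zip(*(x for x, _ in rows)))
    return [(min(c), max(c)) for c in columns]

def apply_minmax(params, x):
    """Scale one input vector with previously learned (min, max) pairs."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0
            for v, (lo, hi) in zip(x, params)]

def knn_predict(train, x, k):
    """Predict the majority label among the k training rows nearest to x."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x))
    )[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def evaluate_procedure(train, test, k=3):
    """Score the whole procedure: scaling + kNN + the choice of k."""
    params = fit_minmax(train)  # data preparation is fitted on training data
    scaled_train = [(apply_minmax(params, x), y) for x, y in train]
    correct = 0
    for x, y in test:
        prediction = knn_predict(scaled_train, apply_minmax(params, x), k)
        correct += prediction == y
    return correct / len(test)

rng = random.Random(7)
# Hypothetical data: the second feature is on a much larger scale than the first.
data = [([rng.gauss(c, 1.0), rng.gauss(1000 * c, 500.0)], label)
        for label, c in ((0, 0.0), (1, 2.0)) for _ in range(100)]
rng.shuffle(data)
train, test = data[:134], data[134:]

for k in (1, 3, 7):
    print(f"k={k}: accuracy={evaluate_procedure(train, test, k):.3f}")
```

Changing the scaler, the algorithm, or `k` changes the procedure, so each variant gets its own score.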
We generalize the performance measure from:
• “the skill of the procedure on the test set”
to
• “the skill of the procedure on unseen data”.
This is quite a leap and requires that:
• The procedure is sufficiently robust that the estimate of skill is close to what we actually expect on unseen data.
• The choice of performance measure accurately captures what we are interested in measuring in predictions on unseen data.
• The choice of data preparation is well understood and repeatable on new data, and reversible if predictions need to be returned to their original scale or related to the original input values.
• The choice of algorithm makes sense for its intended use and operational environment (e.g. complexity or chosen programming language).
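The requirement that data preparation be repeatable and reversible can be illustrated with min-max scaling of a regression target. This is a minimal sketch; the sales figures and the function names are hypothetical, and min-max scaling is just one example of a reversible transform.

```python
def fit_minmax(values):
    """Record the minimum and maximum so the transform can be repeated and undone."""
    return min(values), max(values)

def scale(value, params):
    """Map a value from its original units into the [0, 1] range."""
    lo, hi = params
    return (value - lo) / (hi - lo)

def unscale(value, params):
    """Map a scaled value back to its original units."""
    lo, hi = params
    return value * (hi - lo) + lo

# Hypothetical daily sales figures used as a regression target.
sales = [120.0, 150.0, 90.0, 200.0, 170.0]
params = fit_minmax(sales)  # learned once, then stored alongside the model

scaled = [scale(v, params) for v in sales]
restored = [unscale(v, params) for v in scaled]

# A prediction made on the scaled target is mapped back to the original units.
scaled_prediction = 0.5
print(unscale(scaled_prediction, params))  # 145.0
```

Because `params` is stored, the same scaling can be applied to new data later, and predictions can be reported in the original units.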
A lot rides on the estimated skill of the whole procedure on the test set.
In fact, using the train/test method of estimating the skill of the procedure on unseen data often has a high variance (unless we have a very large amount of data to split). This means that when it is repeated, it gives different results, often very different results.
The outcome is that we may be quite uncertain about how well the procedure actually performs on unseen data and how one procedure compares to another.
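The variance of a single train/test split is easy to observe by repeating the split with different random seeds on a small dataset. The sketch below assumes a deliberately small synthetic dataset and hypothetical helper names; the exact numbers it prints depend on those choices.

```python
import random
import statistics

def make_data(n, seed):
    """Synthetic two-class data: two Gaussian blobs in 2-D (illustrative only)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 2.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows

def knn_predict(train, x, k=3):
    """Predict the majority label among the k training rows nearest to x."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x))
    )[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def split_accuracy(rows, seed, test_fraction=0.33):
    """Accuracy of kNN on one random train/test split chosen by `seed`."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    train, test = rows[:cut], rows[cut:]
    hits = sum(knn_predict(train, x) == y for x, y in test)
    return hits / len(test)

data = make_data(60, seed=3)  # deliberately small, so each split is noisy
scores = [split_accuracy(data, seed) for seed in range(10)]

print("scores:", [round(s, 2) for s in scores])
print("lowest / highest:", min(scores), max(scores))
print("standard deviation:", round(statistics.stdev(scores), 3))
```

The same data and the same procedure produce a range of scores, which is why a single split can be a shaky basis for comparing procedures.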
Often, time permitting, we prefer to use k-fold cross-validation instead.
The Purpose of k-fold Cross Validation
Why do we use k-fold cross validation?
Like a train-test split, cross-validation is a method to estimate the skill of a procedure on unseen data.
Cross-validation systematically creates and evaluates multiple models on multiple subsets of the dataset.
This, in turn, provides a population of performance measures.
• We can calculate the mean of these measures to get an idea of how well the procedure performs on average.
• We can calculate the standard deviation of these measures to get an idea of how much the skill of the procedure is expected to vary in practice.
This is also helpful for providing a more nuanced comparison of one procedure to another when you are trying to choose which algorithm and data preparation procedures to use.
Also, this information is invaluable as you can use the mean and spread to give a confidence interval on the expected performance of a machine learning procedure in practice.
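A k-fold evaluation that yields a population of scores, their mean, and their standard deviation might be sketched as follows. The data, helper names, and the choice of 10 folds are assumptions made for the example. The final "mean ± 2 standard deviations" line is a rough heuristic summary rather than a rigorous confidence interval, since fold scores are not independent of one another.

```python
import random
import statistics

def make_data(n, seed):
    """Synthetic two-class data: two Gaussian blobs in 2-D (illustrative only)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 2.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows

def knn_predict(train, x, k=3):
    """Predict the majority label among the k training rows nearest to x."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x))
    )[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def k_fold_scores(rows, n_folds, seed):
    """Return one accuracy score per fold; each fold is the test set once."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, test in enumerate(folds):
        # Train on every fold except the one currently held out.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        hits = sum(knn_predict(train, x) == y for x, y in test)
        scores.append(hits / len(test))
    return scores

data = make_data(200, seed=1)
scores = k_fold_scores(data, n_folds=10, seed=1)
mean = statistics.mean(scores)
spread = statistics.stdev(scores)

print(f"mean accuracy: {mean:.3f}")
print(f"standard deviation: {spread:.3f}")
# Heuristic range only; not a rigorous confidence interval.
print(f"rough range: {mean - 2 * spread:.3f} to {mean + 2 * spread:.3f}")
```

All ten models built inside `k_fold_scores` exist only to produce the scores; none of them is kept.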
Both train-test splits and k-fold cross validation are examples of resampling methods.
Why do we use Resampling Methods?
The problem with applied machine learning is that we are trying to model the unknown.
On a given predictive modeling problem, the ideal model is one that performs the best when making predictions on new data.
We don’t have new data, so we have to pretend with statistical tricks.
The train-test split and k-fold cross validation are called resampling methods. Resampling methods are statistical procedures for sampling a dataset and estimating an unknown quantity.
In the case of applied machine learning, we are interested in estimating the skill of a machine learning procedure on unseen data. More specifically, the skill of the predictions made by a machine learning procedure.
Once we have the estimated skill, we are finished with the resampling method.
• If you are using a train-test split, that means you can discard the split datasets and the trained model.
• If you are using k-fold cross-validation, that means you can throw away all of the trained models.
They have served their purpose and are no longer needed.
You are now ready to finalize your model.
How to Finalize a Model?
You finalize a model by applying the chosen machine learning procedure on all of your data.
That’s it.
With the finalized model, you can:
• Save the model for later or operational use.
• Make predictions on new data.
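Finalizing can be sketched as: fit the chosen procedure on all available data, save the result, and later load it to predict. The example below is a minimal illustration with hypothetical names (`fit`, `predict`, `make_data`) and a hand-written kNN; it serializes to bytes in memory with `pickle.dumps` to stay self-contained, whereas in practice the model would more likely be written to a file with `pickle.dump`.

```python
import pickle
import random

def make_data(n, seed):
    """Synthetic two-class data: two Gaussian blobs in 2-D (illustrative only)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 2.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows

def fit(rows, k=3):
    """'Training' kNN just stores the data and the chosen configuration."""
    return {"k": k, "rows": list(rows)}

def predict(model, x):
    """Predict the majority label among the k stored rows nearest to x."""
    nearest = sorted(
        model["rows"],
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)),
    )[:model["k"]]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

data = make_data(200, seed=1)

# Finalize: apply the chosen procedure (kNN, k=3) to ALL data, with no hold-out.
final_model = fit(data, k=3)

# Save the model for later or operational use (in memory here).
blob = pickle.dumps(final_model)

# Later: load the saved model and predict on a new, unlabeled example.
restored = pickle.loads(blob)
new_example = [1.9, 2.2]
print("predicted label:", predict(restored, new_example))
```

No test set remains at this point; the skill estimate for this model comes from the earlier resampling step.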
What about the cross-validation models or the train-test datasets?
They’ve been discarded. They are no longer needed. They have served their purpose to help you choose a procedure to finalize.
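About the author: Dr. Jason Brownlee
Dr. Jason Brownlee is a husband, proud father, academic researcher, author, professional developer and a machine learning practitioner. He is dedicated to helping developers get started and get good at applied machine learning.
Invited reviewer: Dr. Xu.Tang
PhD in statistics from the National University of Singapore; formerly data analysis manager at Dagong Global (大公國際); now senior data mining analyst at OPEN01.
About OPEN01:
OPEN01 is committed to providing users across industries with real-time, efficient, multi-dimensional data analysis products and services, built on world-leading AI and big-data processing technology, a distinctive IT architecture, deep learning, and pattern recognition algorithms. Its core team brings together big-data experts from MIT, Harvard University, the State University of New York, and the University of Cambridge, along with strategy and operations experts from Roland Berger and Accenture.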