Go并發模式：管道和顯式取消

作者：Codefor 編譯 2014-04-25 10:13:00

開發開發工具

Go并發原語使得構建流式數據管道，高效利用I/O和多核變得簡單。這篇文章介紹了幾個管道例子，重點指出在操作失敗時的細微差別，并介紹了優雅處理失敗的技術。

引言

什么是管道？

Go沒有正式的管道定義。管道只是眾多并發程序的一類。一般的，一個管道就是一些列的由channel連接起來的階段。每個階段都有執行相同邏輯的goroutine。在每個階段中，goroutine

從channel讀取上游數據
在數據上執行一些操作，通常會產生新的數據
通過channel將數據發往下游

每個階段都可以有任意個輸入channel和輸出channel，除了第一個和最有一個channel（只有輸入channel或只有輸出channel）。第一個步驟通常叫數據源或者生產者，最后一個叫做存儲池或者消費者。

我們先從一個簡單的管道例子來解釋這些概念和技術，稍后我們會介紹一個更為復雜的例子。

數字的平方

假設管道有三個階段。

第一步，gen函數,是一個將數字列表轉換到一個channel中的函數。Gen函數啟動了一個goroutine，將數字發送到channel，并在所有數字都發送完后關閉channel。

func gen(nums ...int) <-chan int {  
    out := make(chan int)  
    go func() {  
        for _, n := range nums {  
            out <- n  
        }  
        close(out)  
    }()  
    return out  
}

第二個階段，sq，從上面的channel接收數字，并返回一個包含所有收到數字的平方的channel。在上游channel關閉后，這個階段已經往下游發送完所有的結果，然后關閉輸出channel：

func sq(in <-chan int) <-chan int {  
    out := make(chan int)  
    go func() {  
        for n := range in {  
            out <- n * n  
        }  
        close(out)  
    }()  
    return out  
}

main函數建立這個管道，并執行第一個階段，從第二個階段接收結果并逐個打印，直到channel被關閉。

func main() {  
    // Set up the pipeline.  
    c := gen(2, 3)  
    out := sq(c)  
   
    // Consume the output.  
    fmt.Println(<-out) // 4  
    fmt.Println(<-out) // 9  
}

因為sq對輸入channel和輸出channel擁有相同的類型，我們可以任意次的組合他們。我們也可以像其他階段一樣，將main函數重寫成一個循環遍歷。

func main() {  
    // Set up the pipeline and consume the output.  
    for n := range sq(sq(gen(2, 3))) {  
        fmt.Println(n) // 16 then 81  
    }  
}

扇出扇入（Fan-out, fan-in）

多個函數可以從同一個channel讀取數據，直到這個channel關閉，這叫扇出。這是一種多個工作實例分布式地協作以并行利用CPU和I/O的方式。

一個函數可以從多個輸入讀取并處理數據，直到所有的輸入channel都被關閉。這個函數會將所有輸入channel導入一個單一的channel。這個單一的channel在所有輸入channel都關閉后才會關閉。這叫做扇入。

我們可以設置我們的管道執行兩個sq實例，每一個實例都從相同的輸入channel讀取數據。我們引入了一個新的函數，merge，來扇入結果:

func main() {  
    in := gen(2, 3)  
   
    // Distribute the sq work across two goroutines that both read from in.  
    c1 := sq(in)  
    c2 := sq(in)  
   
    // Consume the merged output from c1 and c2.  
    for n := range merge(c1, c2) {  
        fmt.Println(n) // 4 then 9, or 9 then 4  
    }  
}

merge函數為每一個輸入channel啟動一個goroutine，goroutine將數據拷貝到同一個輸出channel。這樣就將多個channel轉換成一個channel。一旦所有的output goroutine啟動起來，merge就啟動另一個goroutine，在所有輸入拷貝完畢后關閉輸出channel。
向一個關閉了的channel發送數據會觸發異常，所以在調用close之前確認所有的發送動作都執行完畢很重要。sync.WaitGroup類型為這種同步提供了一種簡便的方法:

func merge(cs ...<-chan int) <-chan int {  
    var wg sync.WaitGroup  
    out := make(chan int)  
   
    // Start an output goroutine for each input channel in cs.  output  
    // copies values from c to out until c is closed, then calls wg.Done.  
    output := func(c <-chan int) {  
        for n := range c {  
            out <- n  
        }  
        wg.Done()  
    }  
    wg.Add(len(cs))  
    for _, c := range cs {  
        go output(c)  
    }  
   
    // Start a goroutine to close out once all the output goroutines are  
    // done.  This must start after the wg.Add call.  
    go func() {  
        wg.Wait()  
        close(out)  
    }()  
    return out  
}

停止的藝術

我們所有的管道函數都遵循一種模式：

發送者在發送完畢時關閉其輸出channel。
接收者持續從輸入管道接收數據直到輸入管道關閉。

這種模式使得每一個接收函數都能寫成一個range循環，保證所有的goroutine在數據成功發送到下游后就關閉。

但是在真實的案例中，并不是所有的輸入數據都需要被接收處理。有些時候是故意這么設計的：接收者可能只需要數據的子集就夠了；或者更一般的，因為輸入數據有錯誤而導致接收函數提早退出。上面任何一種情況下，接收者都不應該繼續等待后續的數據到來，并且我們希望上游函數停止生成后續步驟已經不需要的數據。

在我們的管道例子中，如果一個階段無法消費所有的輸入數據，那些發送這些數據的goroutine就會一直阻塞下去：

    // Consume the first value from output.  
    out := merge(c1, c2)  
    fmt.Println(<-out) // 4 or 9  
    return 
    // Since we didn't receive the second value from out,  
    // one of the output goroutines is hung attempting to send it.  
}

這是一種資源泄漏：goroutine會占用內存和運行時資源。goroutine棧持有的堆引用會阻止GC回收資源。而且goroutine不能被垃圾回收，必須主動退出。

我們必須重新設計管道中的上游函數，在下游函數無法接收所有輸入數據時退出。一種方法就是讓輸出channel擁有一定的緩存。緩存可以存儲一定數量的數據。如果緩存空間足夠，發送操作就會馬上返回:

c := make(chan int, 2) // buffer size 2  
c <- 1  // succeeds immediately  
c <- 2  // succeeds immediately  
c <- 3  // blocks until another goroutine does <-c and receives 1

如果在channel創建時就知道需要發送數據的數量，帶緩存的channel會簡化代碼。例如，我們可以重寫gen函數，拷貝一系列的整數到一個帶緩存的channel而不是創建一個新的goroutine：

func gen(nums ...int) <-chan int {  
    out := make(chan int, len(nums))  
    for _, n := range nums {  
        out <- n  
    }  
    close(out)  
    return out  
}

反過來我們看管道中被阻塞的goroutine，我們可以考慮為merge函數返回的輸出channel增加一個緩存：

func merge(cs ...<-chan int) <-chan int {  
    var wg sync.WaitGroup  
    out := make(chan int, 1) // enough space for the unread inputs  
    // ... the rest is unchanged ...

雖然這樣可以避免了程序中goroutine的阻塞，但這是很爛的代碼。選擇緩存大小為1取決于知道merge函數接收數字的數量和下游函數消費數字的數量。這是很不穩定的：如果我們向gen多發送了一個數據，或者下游函數少消費了數據，我們就又一次阻塞了goroutine。

然而，我們需要提供一種方式，下游函數可以通知上游發送者下游要停止接收數據。

#p#

顯式取消

當main函數決定在沒有從out接收所有的數據而要退出時，它需要通知上游的goroutine取消即將發送的數據。可以通過向一個叫做done的channel發送數據來實現。因為有兩個潛在阻塞的goroutine，main函數會發送兩個數據：

func main() {  
    in := gen(2, 3)  
   
    // Distribute the sq work across two goroutines that both read from in.  
    c1 := sq(in)  
    c2 := sq(in)  
   
    // Consume the first value from output.  
    done := make(chan struct{}, 2)  
    out := merge(done, c1, c2)  
    fmt.Println(<-out) // 4 or 9  
   
    // Tell the remaining senders we're leaving.  
    done <- struct{}{}  
    done <- struct{}{}  
}

對發送goroutine而言，需要將發送操作替換為一個select語句，要么out發生發送操作，要么從done接收數據。done的數據類型是空的struct，因為其值無關緊要：僅僅表示out需要取消發送操作。output 繼續在輸入channel循環執行，因此上游函數是不會阻塞的。（接下來我們會討論如何讓循環提早退出）

func merge(done <-chan struct{}, cs ...<-chan int) <-chan int {  
    var wg sync.WaitGroup  
    out := make(chan int)  
   
    // Start an output goroutine for each input channel in cs.  output  
    // copies values from c to out until c is closed or it receives a value  
    // from done, then output calls wg.Done.  
    output := func(c <-chan int) {  
        for n := range c {  
            select {  
            case out <- n:  
            case <-done:  
            }  
        }  
        wg.Done()  
    }  
    // ... the rest is unchanged ...

這種方法有一個問題：每一個下游函數需要知道潛在可能阻塞的上游發送者的數量，以發送響應的信號讓其提早退出。跟蹤這些數量是無趣的而且很容易出錯。

我們需要一種能夠讓未知或無界數量的goroutine都能夠停止向下游發送數據的方法。在Go中，我們可以通過關閉一個channel實現。因為從一個關閉了的channel執行接收操作總能馬上成功，并返回相應數據類型的零值。

這意味著main函數僅通過關閉done就能實現將所有的發送者解除阻塞。關閉操作是一個高效的對發送者的廣播信號。我們擴展管道中所有的函數接受done作為一個參數，并通過defer來實現相應channel的關閉操作。因此，無論main函數在哪一行退出都會通知上游退出。

func main() {  
    // Set up a done channel that's shared by the whole pipeline,  
    // and close that channel when this pipeline exits, as a signal  
    // for all the goroutines we started to exit.  
    done := make(chan struct{})  
    defer close(done)  
   
    in := gen(done, 2, 3)  
   
    // Distribute the sq work across two goroutines that both read from in.  
    c1 := sq(done, in)  
    c2 := sq(done, in)  
   
    // Consume the first value from output.  
    out := merge(done, c1, c2)  
    fmt.Println(<-out) // 4 or 9  
   
    // done will be closed by the deferred call.  
}

現在每一個管道函數在done被關閉后就可以馬上返回了。merge函數中的output可以在接收管道的數據消費完之前返回，因為output函數知道上游發送者sq會在done關閉后停止產生數據。同時，output通過defer語句保證wq.Done會在所有退出路徑上調用。

func merge(done <-chan struct{}, cs ...<-chan int) <-chan int {  
    var wg sync.WaitGroup  
    out := make(chan int)  
   
    // Start an output goroutine for each input channel in cs.  output  
    // copies values from c to out until c or done is closed, then calls  
    // wg.Done.  
    output := func(c <-chan int) {  
        defer wg.Done()  
        for n := range c {  
            select {  
            case out <- n:  
            case <-done:  
                return 
            }  
        }  
    }  
    // ... the rest is unchanged ...

類似的，sq也可以在done關閉后馬上返回。sq通過defer語句使得任何退出路徑都能關閉其輸出channel out。

func sq(done <-chan struct{}, in <-chan int) <-chan int {  
    out := make(chan int)  
    go func() {  
        defer close(out)  
        for n := range in {  
            select {  
            case out <- n * n:  
            case <-done:  
                return 
            }  
        }  
    }()  
    return out  
}

管道構建的指導思想如下：

每一個階段在所有發送操作完成后關閉輸出channel。
每一個階段持續從輸入channel接收數據直到輸入channel被關閉或者生產者被解除阻塞（譯者：生產者退出）。

管道解除生產者阻塞有兩種方法：要么保證有足夠的緩存空間存儲將要被生產的數據，要么顯式的通知生產者消費者要取消接收數據。

樹形摘要

讓我們來看一個更為實際的管道。

MD5是一個信息摘要算法，對于文件校驗非常有用。命令行工具md5sum很有用，可以打印一系列文件的摘要值。

% md5sum *.go  
d47c2bbc28298ca9befdfbc5d3aa4e65  bounded.go  
ee869afd31f83cbb2d10ee81b2b831dc  parallel.go  
b88175e65fdcbc01ac08aaf1fd9b5e96  serial.go

我們的例子程序和md5sum類似，但是接受一個單一的文件夾作為參數，打印該文件夾下每一個普通文件的摘要值，并按路徑名稱排序。

% go run serial.go .  
d47c2bbc28298ca9befdfbc5d3aa4e65  bounded.go  
ee869afd31f83cbb2d10ee81b2b831dc  parallel.go  
b88175e65fdcbc01ac08aaf1fd9b5e96  serial.go

我們程序的main函數調用一個工具函數MD5ALL，該函數返回一個從路徑名稱到摘要值的哈希表，然后排序并輸出結果：

func main() {  
    // Calculate the MD5 sum of all files under the specified directory,  
    // then print the results sorted by path name.  
    m, err := MD5All(os.Args[1])  
    if err != nil {  
        fmt.Println(err)  
        return 
    }  
    var paths []string  
    for path := range m {  
        paths = append(paths, path)  
    }  
    sort.Strings(paths)  
    for _, path := range paths {  
        fmt.Printf("%x  %s\n", m[path], path)  
    }  
}

MD5ALL是我們討論的核心。在 serial.go中，沒有采用任何并發，僅僅遍歷文件夾，讀取文件并求出摘要值。

// MD5All reads all the files in the file tree rooted at root and returns a map  
// from file path to the MD5 sum of the file's contents.  If the directory walk  
// fails or any read operation fails, MD5All returns an error.  
func MD5All(root string) (map[string][md5.Size]byte, error) {  
    m := make(map[string][md5.Size]byte)  
    err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {  
        if err != nil {  
            return err  
        }  
        if info.IsDir() {  
            return nil  
        }  
        data, err := ioutil.ReadFile(path)  
        if err != nil {  
            return err  
        }  
        m[path] = md5.Sum(data)  
        return nil  
    })  
    if err != nil {  
        return nil, err  
    }  
    return m, nil  
}

#p#

并行摘要求值

在parallel.go中，我們將MD5ALL分成兩階段的管道。第一個階段，sumFiles，遍歷文件夾，每個文件一個goroutine進行求摘要值，然后將結果發送一個數據類型為result的channel中：

type result struct {  
    path string  
    sum  [md5.Size]byte 
    err  error  
}

sumFiles 返回兩個channel：一個用于生成結果，一個用于filepath.Walk返回錯誤。Walk函數為每一個普通文件啟動一個goroutine，然后檢查done，如果done被關閉，walk馬上就會退出。

func sumFiles(done <-chan struct{}, root string) (<-chan result, <-chan error) {  
    // For each regular file, start a goroutine that sums the file and sends  
    // the result on c.  Send the result of the walk on errc.  
    c := make(chan result)  
    errc := make(chan error, 1)  
    go func() {  
        var wg sync.WaitGroup  
        err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {  
            if err != nil {  
                return err  
            }  
            if info.IsDir() {  
                return nil  
            }  
            wg.Add(1)  
            go func() {  
                data, err := ioutil.ReadFile(path)  
                select {  
                case c <- result{path, md5.Sum(data), err}:  
                case <-done:  
                }  
                wg.Done()  
            }()  
            // Abort the walk if done is closed.  
            select {  
            case <-done:  
                return errors.New("walk canceled")  
            default:  
                return nil  
            }  
        })  
        // Walk has returned, so all calls to wg.Add are done.  Start a  
        // goroutine to close c once all the sends are done.  
        go func() {  
            wg.Wait()  
            close(c)  
        }()  
        // No select needed here, since errc is buffered.  
        errc <- err  
    }()  
    return c, errc  
}

MD5All 從c中接收摘要值。MD5All 在遇到錯誤時提前退出，通過defer關閉done。

func MD5All(root string) (map[string][md5.Size]byte, error) {  
    // MD5All closes the done channel when it returns; it may do so before  
    // receiving all the values from c and errc.  
    done := make(chan struct{})  
    defer close(done)  
   
    c, errc := sumFiles(done, root)  
   
    m := make(map[string][md5.Size]byte)  
    for r := range c {  
        if r.err != nil {  
            return nil, r.err  
        }  
        m[r.path] = r.sum  
    }  
    if err := <-errc; err != nil {  
        return nil, err  
    }  
    return m, nil  
}

有界并行

parallel.go中實現的MD5ALL，對每一個文件啟動了一個goroutine。在一個包含大量大文件的文件夾中，這會導致超過機器可用內存的內存分配。（譯者注：即發生OOM）

我們可以通過限制讀取文件的并發度來避免這種情況發生。在bounded.go中，我們通過創建一定數量的goroutine讀取文件。現在我們的管道現在有三個階段：遍歷文件夾，讀取文件并計算摘要值，收集摘要值。

第一個階段，walkFiles，輸出文件夾中普通文件的文件路徑：

func walkFiles(done <-chan struct{}, root string) (<-chan string, <-chan error) {  
    paths := make(chan string)  
    errc := make(chan error, 1)  
    go func() {  
        // Close the paths channel after Walk returns.  
        defer close(paths)  
        // No select needed for this send, since errc is buffered.  
        errc <- filepath.Walk(root, func(path string, info os.FileInfo, err error) error {  
            if err != nil {  
                return err  
            }  
            if info.IsDir() {  
                return nil  
            }  
            select {  
            case paths <- path:  
            case <-done:  
                return errors.New("walk canceled")  
            }  
            return nil  
        })  
    }()  
    return paths, errc  
}

中間的階段啟動一定數量的digester goroutine，從paths接收文件名稱，并向c發送result結構:

func digester(done <-chan struct{}, paths <-chan string, c chan<- result) {  
    for path := range paths {  
        data, err := ioutil.ReadFile(path)  
        select {  
        case c <- result{path, md5.Sum(data), err}:  
        case <-done:  
            return 
        }  
    }  
}

和前一個例子不同，digester并不關閉其輸出channel，因為輸出channel是共享的，多個goroutine會向同一個channel發送數據。MD5All 會在所有的digesters 結束后關閉響應的channel。

// Start a fixed number of goroutines to read and digest files.  
c := make(chan result)  
var wg sync.WaitGroup  
const numDigesters = 20 
wg.Add(numDigesters)  
for i := 0; i < numDigesters; i++ {  
    go func() {  
        digester(done, paths, c)  
        wg.Done()  
    }()  
}  
go func() {  
    wg.Wait()  
    close(c)  
}()

我們也可以讓每一個digester創建并返回自己的輸出channel，但如果這樣的話，我們需要額外的goroutine來扇入這些結果。
最后一個階段從c中接收所有的result數據，并從errc中檢查錯誤。這種檢查不能在之前的階段做，因為在這之前，walkFiles 可能被阻塞不能往下游發送數據：

    m := make(map[string][md5.Size]byte)  
    for r := range c {  
        if r.err != nil {  
            return nil, r.err  
        }  
        m[r.path] = r.sum  
    }  
    // Check whether the Walk failed.  
    if err := <-errc; err != nil {  
        return nil, err  
    }  
    return m, nil  
}

結論

這篇文章介紹了如果用Go構建流式數據管道的技術。在這樣的管道中處理錯誤有點取巧，因為管道中每一個階段可能被阻塞不能往下游發送數據，下游階段可能已經不關心輸入數據。我們展示了關閉channel如何向所有管道啟動的goroutine廣播一個done信號，并且定義了正確構建管道的指導思想。

深入閱讀：

• Go并發模式（視頻）展示了Go并發原語的基本概念和幾個實現的方法

• 高級Go并發模式（視頻）包含幾個更為復雜的Go并發原語的使用，尤其是select

• Douglas McIlroy的Squinting at Power Series論文展示了類似Go的并發模式如何為復雜的計算提供優雅的支持。

原文鏈接： Golang - Sameer Ajmani 翻譯：伯樂在線 - Codefor

譯文鏈接： http://blog.jobbole.com/65552/

責任編輯：林師授來源：伯樂在線

Go語言并發模式

成人免费xxxxx在线视频软件_久久精品久久久_亚洲国产精品久久久_天天色天天色_亚洲人成一区_欧美一级欧美三级在线观看