A Beginner's Guide to Multithreading and Multiprocessing in Python
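When you analyze data with Python, choosing the right data structures and algorithms can sometimes make a program much faster. Another way to speed things up is to use multithreading or multiprocessing.
In this article we will not go into the internals of multithreading or multiprocessing. Instead, we work through an example: a small Python script that downloads images from Unsplash. We start with a version that downloads one image at a time, and then use threads to improve the execution speed.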
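Multithreading
Simply put, threads let parts of your program run concurrently. Tasks that spend most of their time waiting for external events are usually good candidates for threading. These are known as I/O-bound tasks, for example reading from or writing to files, network operations, or downloading data through an online API. Let's look at an example that shows the benefit of using threads.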
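To make the idea of an I/O-bound task concrete before touching the network, here is a minimal sketch in which `time.sleep` stands in for waiting on an external event. The function names (`fake_io_task`, `run_sequential`, `run_threaded`) and the delay values are assumptions made for illustration; they are not part of the download script.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_task(delay):
    """Simulate an I/O-bound task: the thread only waits, as it would for a network reply."""
    time.sleep(delay)
    return delay


def run_sequential(delays):
    """Run the tasks one after another and return the elapsed time in seconds."""
    start = time.perf_counter()
    for delay in delays:
        fake_io_task(delay)
    return time.perf_counter() - start


def run_threaded(delays):
    """Run the tasks in a thread pool and return the elapsed time in seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(delays)) as executor:
        # list() consumes the iterator so every task has finished before the timer stops
        list(executor.map(fake_io_task, delays))
    return time.perf_counter() - start


delays = [0.2] * 5  # five simulated waits of 0.2 s each (arbitrary values)
sequential_time = run_sequential(delays)
threaded_time = run_threaded(delays)
print(f"Sequential: {sequential_time:.2f} s")
print(f"Threaded:   {threaded_time:.2f} s")
```

Because the waits overlap in the threaded version, its total time should be close to the longest single wait rather than the sum of all of them.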
1. Without threads
In this example we run the program sequentially to see how long it takes to download 15 images from Unsplash:
    import time

    import requests

    img_urls = [
        'https://images.unsplash.com/photo-1516117172878-fd2c41f4a759',
        'https://images.unsplash.com/photo-1532009324734-20a7a5813719',
        'https://images.unsplash.com/photo-1524429656589-6633a470097c',
        'https://images.unsplash.com/photo-1530224264768-7ff8c1789d79',
        'https://images.unsplash.com/photo-1564135624576-c5c88640f235',
        'https://images.unsplash.com/photo-1541698444083-023c97d3f4b6',
        'https://images.unsplash.com/photo-1522364723953-452d3431c267',
        'https://images.unsplash.com/photo-1513938709626-033611b8cc03',
        'https://images.unsplash.com/photo-1507143550189-fed454f93097',
        'https://images.unsplash.com/photo-1493976040374-85c8e12f0c0e',
        'https://images.unsplash.com/photo-1504198453319-5ce911bafcde',
        'https://images.unsplash.com/photo-1530122037265-a5f1f91d3b99',
        'https://images.unsplash.com/photo-1516972810927-80185027ca84',
        'https://images.unsplash.com/photo-1550439062-609e1531270e',
        'https://images.unsplash.com/photo-1549692520-acc6669e2f0c',
    ]

    start = time.perf_counter()  # start timer
    for img_url in img_urls:
        img_name = img_url.split('/')[3]  # last path segment of the URL, used as the file name
        img_bytes = requests.get(img_url).content  # download the image
        with open(img_name, 'wb') as img_file:
            img_file.write(img_bytes)  # save image to disk
    finish = time.perf_counter()  # end timer
    print(f"Finished in {round(finish - start, 2)} seconds")

    # Result reported for this run:
    # Finished in 23.101926751 seconds
The whole run took about 23 seconds.
2. With threads
Now let's see how a thread pool from Python's concurrent.futures module can significantly improve the execution time of our program:
    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    # img_urls is the same list of 15 URLs used in the sequential version

    def download_images(img_url):
        img_name = img_url.split('/')[3]  # last path segment of the URL, used as the file name
        img_bytes = requests.get(img_url).content  # download the image
        with open(img_name, 'wb') as img_file:
            img_file.write(img_bytes)  # save image to disk
        print(f"{img_name} was downloaded")

    start = time.perf_counter()  # start timer
    with ThreadPoolExecutor() as executor:
        # similar to map(func, *iterables), but each call runs in a worker thread
        results = executor.map(download_images, img_urls)
    finish = time.perf_counter()  # end timer
    print(f"Finished in {round(finish - start, 2)} seconds")

    # Result reported for this run:
    # Finished in 5.544147536 seconds
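We can see that the threaded code is significantly faster than the non-threaded code: the run time drops from about 23 seconds to about 5.5 seconds.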
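One detail of `executor.map` is worth knowing: an exception raised inside a worker is re-raised only when the corresponding result is retrieved from the returned iterator, so a failed download can go unnoticed if the results are never read. The sketch below shows one way to collect failures explicitly using `submit` and `as_completed`. To keep it runnable offline, `simulated_download` is a hypothetical stand-in for the real download function, and the file names are invented.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def simulated_download(name):
    """Hypothetical stand-in for a real download; fails for names starting with 'bad'."""
    time.sleep(0.05)  # pretend to wait on the network
    if name.startswith("bad"):
        raise ValueError(f"could not download {name}")
    return f"{name} was downloaded"


def download_all(names):
    """Run every task in a thread pool and return (successes, failures)."""
    successes, failures = [], []
    with ThreadPoolExecutor() as executor:
        # Map each future back to its input so errors can be reported by name
        future_to_name = {executor.submit(simulated_download, n): n for n in names}
        for future in as_completed(future_to_name):
            name = future_to_name[future]
            try:
                successes.append(future.result())  # re-raises the worker's exception, if any
            except ValueError as exc:
                failures.append((name, str(exc)))
    return successes, failures


successes, failures = download_all(["photo-a", "bad-photo", "photo-b"])
print(sorted(successes))
print(failures)
```

With a real download function, the `except` clause would more plausibly catch the HTTP library's own exception types; `ValueError` is used here only because the simulated task raises it.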
Note that creating threads has some overhead, so threading pays off when there are multiple API calls to make, not just a single one.
In addition, for computation-heavy work such as data processing or image processing, multiprocessing performs better than threading.
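As a rough illustration of the multiprocessing side, the sketch below spreads a CPU-bound function across worker processes with `ProcessPoolExecutor`, which mirrors the `ThreadPoolExecutor` interface used earlier. The prime-counting workload, the function names, and the limits are assumptions chosen for the example; they do not come from the download script.

```python
from concurrent.futures import ProcessPoolExecutor


def count_primes(limit):
    """CPU-bound work: count the primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def count_primes_parallel(limits):
    """Spread the calls across worker processes and return one result per limit."""
    with ProcessPoolExecutor() as executor:
        return list(executor.map(count_primes, limits))


if __name__ == "__main__":  # needed on platforms where workers re-import this module
    limits = [20_000, 30_000, 40_000]  # arbitrary workload sizes
    for limit, total in zip(limits, count_primes_parallel(limits)):
        print(f"{total} primes below {limit}")
```

Each worker process has its own interpreter, so the calculations can run on separate CPU cores, whereas threads in the standard CPython interpreter take turns executing Python bytecode.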