
How long does the serial version take to run on your machine?

Next, let's explore how we might use multi-threading to speed-up the unzipping process. We can adapt the program to be multithreaded with very few changes.

Decompressing data in memory is purely algorithmic, and intuitively we might think it is CPU-bound. Saving the decompressed data to file is IO-bound, as it is limited by the speed at which we can move data from main memory onto the hard drive. Intuitively, we would expect the IO-bound part of the task to be slower than the CPU-bound part. That is, we expect to spend more time waiting for the hard drive than waiting for the CPU.

One approach would be to call the ZipFile.extract() function directly, which decompresses the data into memory and then saves it to disk. An alternate approach might be to first decompress the data into memory as bytes using the ZipFile.read() function, then create a path and save the file to disk manually using standard Python file IO functions. This might offer a benefit if we wish to explore possible speed-ups by separating the two elements of unzipping each file, as in the sketch below.
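The following is a rough, illustrative sketch of the two approaches. The archive name 'archive.zip', the output directory names, and the helper function names are assumptions for the example, and the archive is assumed to contain only plain files (no subdirectories).

```python
from os import makedirs
from os.path import join
from zipfile import ZipFile

# Approach 1: let ZipFile.extract() both decompress each file and save it to disk.
def unzip_with_extract(zip_path, out_dir):
    with ZipFile(zip_path) as handle:
        for name in handle.namelist():
            handle.extract(name, out_dir)

# Approach 2: decompress each file into memory with ZipFile.read() (CPU-bound),
# then write the bytes to disk manually with standard file IO (IO-bound).
def unzip_with_read(zip_path, out_dir):
    makedirs(out_dir, exist_ok=True)
    with ZipFile(zip_path) as handle:
        for name in handle.namelist():
            data = handle.read(name)
            with open(join(out_dir, name), 'wb') as out_file:
                out_file.write(data)

if __name__ == '__main__':
    unzip_with_extract('archive.zip', 'tmp1')
    unzip_with_read('archive.zip', 'tmp2')
```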
Next, let's start to explore the use of threads to speed-up file unzipping and saving. The ThreadPoolExecutor provides a pool of worker threads that we can use. Each file in the archive can be submitted as a task to the thread pool, and a worker thread can perform the task of unzipping it. First, we can create the thread pool with 100 worker threads. We will use the context manager to ensure the thread pool is closed once we are finished with it.
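A minimal sketch of this idea, assuming the same hypothetical 'archive.zip' and output directory as above:

```python
from concurrent.futures import ThreadPoolExecutor
from zipfile import ZipFile

# Task executed by a worker thread: unzip a single member of the archive.
def extract_file(handle, name, out_dir):
    handle.extract(name, out_dir)

def main(zip_path='archive.zip', out_dir='tmp'):
    with ZipFile(zip_path) as handle:
        # Create the thread pool with 100 workers; the context manager ensures
        # the pool is shut down once all submitted tasks are finished.
        with ThreadPoolExecutor(100) as executor:
            # Submit the unzipping of each file in the archive as a separate task.
            for name in handle.namelist():
                executor.submit(extract_file, handle, name, out_dir)

if __name__ == '__main__':
    main()
```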
We can also try to unzip files concurrently with processes instead of threads. It is unclear whether processes can offer a speed benefit in this case, but given that we can get a benefit using threads, we know a speed-up is possible.
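One way to sketch a process-based variant (an assumption for illustration, not necessarily the final approach) is to swap in ProcessPoolExecutor. Because the ZipFile handle cannot be pickled and shared between processes, each task opens the archive itself:

```python
from concurrent.futures import ProcessPoolExecutor
from zipfile import ZipFile

# Task executed by a worker process: open the archive and unzip a single member.
def extract_file(zip_path, name, out_dir):
    with ZipFile(zip_path) as handle:
        handle.extract(name, out_dir)

def main(zip_path='archive.zip', out_dir='tmp'):
    # Read the list of members in the parent process.
    with ZipFile(zip_path) as handle:
        names = handle.namelist()
    # Create the process pool; the context manager shuts it down once all tasks are done.
    with ProcessPoolExecutor() as executor:
        for name in names:
            executor.submit(extract_file, zip_path, name, out_dir)

if __name__ == '__main__':
    main()
```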