[PyQt] Loading from DB: Threading vs. Multiprocessing

Matt Newell newellm at blur.com
Mon Jan 3 18:23:39 GMT 2011


On Thursday 30 December 2010 01:37:58 Knacktus wrote:
> Hello everyone,
> 
> Here's the background story ;-):
> ------------------------------
> 
> I'm writing an application which handles a lot of nested data. Imagine a
> QTreeView with 10-20 levels and up to 50000 lines.
> The standard use case would be to expand one level after the other in
> the GUI. For each level I would load the data from the database. This
> can be very slow, as I have a lot of network traffic and DB requests
> with little amounts of data.
> On the other hand, I am able to identify all the data of the tree in the
> database. Therefore I can load all items to the client with one request.
> But this is not always necessary.
> My idea is to load all the data to the client (as bulk) on certain user
> requests only, e.g. if the user clicks "expand all". In the meantime,
> the user should be able to keep on working with the app.
> 
> Here's the question:
> ------------------
> 
> I've read about multithreading and multiprocessing, but have no
> experience with it yet. I've done a lot of MPI programming with Fortran,
> so multiprocessing is more familiar to me. Right now I have these
> options in my mind:
> 
> 1) Multiprocessing: Loading the bulk data from the DB with a separate
> process and merging this data with the existing data in my main process.
> The data is in a huge dict with the item ids as keys. So merging should
> not be a problem. That sounds clean to me, but I'm concerned about the
> fact, that multiprocessing is copying the whole app (with all the data
> already loaded from the DB) in memory.
> 
> 2) Multithreading: I guess trying to write to the same dict with my
> items with two threads could be messy. Therefore I would create a
> temporary dict in the secondary thread that loads the bulk from the DB
> and merge with the main dict later. The question here would be the
> responsiveness of the GUI as Python can't perform "real" multithreading.
> 
> 3) Creating a Subprocess: I would create a helper app that loads the
> data from the DB and saves it temporarily on the local drive as a pickle.
> The main app would read the data from the file and merge with the main
> dict.
> 
> 
> What are your recommendations and experiences? Are there best practices?
> 
I've been using threads (QThread) to run database queries and other tasks, such 
as image loading, for years without any major problems.  I am using a custom-
built ORM written in C++ (with Python bindings), so there are many more 
potential issues there than in a pure-Python implementation that strictly 
executes queries.

Python has a Global Interpreter Lock (GIL), which prevents conflicting access 
to Python data types such as dicts.  This should be sufficient to make your 
specific use case thread safe.  The GIL can get in the way when your threaded 
code does a lot of processing in the Python interpreter, but when the threaded 
code calls a long-running C/C++ function, such as a database select, the GIL is 
released until the call completes.  This all happens automatically within 
Python and PyQt.  This is "real" multithreading.
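To make that concrete, here's a minimal sketch using plain threading, with 
time.sleep() standing in for a long-running C call like a database select 
(both release the GIL while they block), and a fake row dict standing in for 
real query results:

```python
import threading
import time

results = {}

def load_bulk():
    # time.sleep() stands in for a long-running C call such as a database
    # select; calls like these release the GIL while they block, so the
    # main thread keeps running.
    time.sleep(0.5)
    results.update({i: "row %d" % i for i in range(5)})

worker = threading.Thread(target=load_bulk)
worker.start()

ticks = 0
while worker.is_alive():
    ticks += 1            # the main (GUI) thread keeps making progress
    time.sleep(0.01)
worker.join()
```

When the loop finishes, results holds the "loaded" rows and ticks is well 
above zero, showing the main thread was never blocked by the worker.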

Your #1 and #3 are essentially the same, except that you don't explain how you 
would merge the data in #1.  I imagine you could do it via shared memory, but 
I have no experience using shared memory from Python.  You may be thinking of 
using fork for #1, which I would avoid simply because it complicates things 
and provides very little advantage for your use case.  Forking would simplify 
passing the query or other data to the worker process, but that should be 
fairly simple anyway, and you still need to worry about the same thing when 
passing the data back.
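For what it's worth, your #3 is easy to sketch without fork at all.  This is 
an assumption-laden toy: the helper "app" is inlined as a python -c script and 
the fake row dict stands in for a real DB query, but the pickle-then-merge 
round trip is the same shape:

```python
import os
import pickle
import subprocess
import sys
import tempfile

# Stand-in for the separate loader app (option 3).  A real helper would
# run the DB query; this one just pickles some fake rows to the given path.
helper_src = (
    "import pickle, sys\n"
    "data = {i: 'row %d' % i for i in range(3)}\n"
    "with open(sys.argv[1], 'wb') as f:\n"
    "    pickle.dump(data, f)\n"
)

path = os.path.join(tempfile.mkdtemp(), "bulk.pickle")
subprocess.check_call([sys.executable, "-c", helper_src, path])

main_dict = {99: "already loaded"}     # items the app loaded earlier
with open(path, "rb") as f:
    main_dict.update(pickle.load(f))   # merge by item id
os.remove(path)
```

Note the merge is just dict.update(), which is why your "huge dict keyed by 
item id" makes all three options straightforward on the receiving end.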

When forking, file handles and sockets are shared between the parent and child 
processes, which can easily cause problems unless you take care to 
reinitialize them.  Also, forking is not available on Windows.

I was using fork for a Linux daemon at one point, which was built using Python 
and my ORM.  Even after dealing with the obvious stuff like reinitializing the 
database connection, it still caused problems with my logging library, so I 
changed the code to use threads, and that has been working great.

In conclusion, I would try threads to solve your problem.  I think they would 
require the least amount of code and be the most efficient in terms of memory 
and CPU requirements.  If you keep the amount of code that runs in the thread 
to a minimum, you can avoid most of the problems people run into, and many of 
those problems are already a non-issue in Python.
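The "minimum code in the thread" pattern I mean looks roughly like this.  It 
uses plain threading and a Queue so it runs anywhere; in a real PyQt app the 
thread would be a QThread, the merge would run from a signal or QTimer slot 
instead of a join(), and the dict comprehension stands in for the real DB 
select:

```python
import queue
import threading

result_q = queue.Queue()

def fetch(item_ids):
    # Keep the thread body minimal: only the (blocking) query lives here.
    # The dict comprehension stands in for the real DB select.
    result_q.put({i: "row %d" % i for i in item_ids})

main_dict = {1: "row 1"}               # data the app already loaded
worker = threading.Thread(target=fetch, args=([2, 3, 4],))
worker.start()
worker.join()

# The merge happens back on the main/GUI thread, so no two threads ever
# write main_dict concurrently.
main_dict.update(result_q.get())
```

Because only the main thread touches main_dict, you never depend on the GIL's 
per-operation atomicity for correctness, which keeps the design easy to reason 
about.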

Matt Newell
