Tuesday, 15 January 2013

multithreading - Python map function, threads, big structures

My problem is simple: I need to map a function over a list of values and then return the maximum value. So, a reduction after the map: MapReduce, if you want to use the term.

I have read that, being a newbie to Python, I should use multiprocessing instead of threads, and that's OK, although I don't know whether it will cripple my program. The problem is that the function I need to map takes several parameters, and the data structures it needs are huge.

So I'm worried that the big data being passed to the processes will result in new copies being created, which would be unacceptable.

What recommendations do you have for solving this simple problem? Will I face multiple copies? Should I manually create shared memory, or will the VM automagically create it for me?

thanks!

You have a few options to accomplish this, each with its own advantages and disadvantages.

Argument lists

Let's get the easy one out of the way first: passing the data structures as arguments will create a copy of each one in each process. It sounds like that is not what you want.
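
To make the copying concrete, here is a minimal sketch (the worker function and the list size are my own illustration, not from the original answer). Arguments to a Process are pickled in the parent and unpickled in the child, so the child ends up with its own full copy:

import multiprocessing
import os

big = list(range(1_000_000))  # stand-in for your huge structure

def worker(data):
    # 'data' was pickled by the parent and unpickled here, so this
    # process holds a complete, independent copy of the list.
    print("pid {} got {} items".format(os.getpid(), len(data)))

if __name__ == '__main__':
    p = multiprocessing.Process(target=worker, args=(big,))
    p.start()
    p.join()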

Managers and proxies

I recommend trying this method first. The multiprocessing library supports proxy objects. These act as aliases to shared objects and are easy to use if you're dealing with native types like lists and dictionaries. This method even lets you safely modify the shared objects without having to worry about locking details, because the manager takes care of them. Since you mentioned lists, this may be your best bet. You can also create proxies for your own custom objects.
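
As a minimal sketch of the idea (the find_max worker and the sample data are my own invention, not part of the original answer): a manager process owns the list, and each worker receives only a lightweight proxy to it.

import multiprocessing

def find_max(shared_list, results):
    # Each item access goes through the manager process over IPC,
    # so this worker never holds a full copy of the list.
    results.put(max(shared_list))

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    shared = manager.list(range(1000))  # proxy to a list owned by the manager
    results = multiprocessing.Queue()

    p = multiprocessing.Process(target=find_max, args=(shared, results))
    p.start()
    p.join()
    print(results.get())  # -> 999

Note the trade-off: every element access is a round trip to the manager process, so proxies avoid copying at the cost of per-access overhead.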

Global data structures

In some situations, an acceptable solution is to make the data structures global rather than passing them as arguments. They will not be copied between processes as long as you only read from them. This can trip people up when they don't realize that creating a local reference to a global variable counts as writing to it, because the variable's reference count must be incremented. That is how the garbage collector knows when memory can be freed: if an object's reference count is zero, no one is using it and it can safely be removed. Here's a code snippet that demonstrates this:

import sys
import multiprocessing

global_var = [1, 2, 3]

def no_effect1():
    print(global_var[0] + global_var[1])
    print("no effect count: {}".format(sys.getrefcount(global_var)))

def no_effect2():
    new_list = [i for i in global_var]
    print("new list count: {}".format(sys.getrefcount(global_var)))

def change1():
    local_handle = global_var
    print("local handle count: {}".format(sys.getrefcount(global_var)))

def change2():
    new_dict = {'item': global_var}
    print("contained count: {}".format(sys.getrefcount(global_var)))

p_list = [multiprocessing.Process(target=no_effect1),
          multiprocessing.Process(target=no_effect2),
          multiprocessing.Process(target=change1),
          multiprocessing.Process(target=change2)]

for p in p_list:
    p.start()
for p in p_list:
    p.join()

This code produces the following output:

3
no effect count: 2
new list count: 2
local handle count: 3
contained count: 3

In the no_effect1() function, we are able to read and use data from the global structure without increasing its reference count. no_effect2() constructs a new list from the global structure. In both cases we are reading the globals, but not creating any local references to the same underlying memory. If you use your global data structures in this way, you will not cause them to be copied between processes.

However, notice that in change1() and change2() the reference count is incremented, because we bound a local variable to the same data structure. This means we have modified the global structure, and it will be copied. (Under fork, the write to the reference count is enough to trigger copy-on-write of that memory.)

Shared ctypes

If you can finesse your shared data into C arrays, you can use shared ctypes. These are arrays (or single values) allocated in shared memory. You can then pass the wrapper around without the underlying data being copied.
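
A minimal sketch using multiprocessing.Array (the find_max worker and the sample values are my own, not from the original answer):

import multiprocessing

def find_max(arr, results):
    # 'arr' wraps a block of shared memory; passing it to the child
    # copies only the thin wrapper, never the underlying buffer.
    results.put(max(arr))

if __name__ == '__main__':
    data = multiprocessing.Array('d', [3.0, 1.5, 42.0, 2.5])  # 'd' = C double
    results = multiprocessing.Queue()

    p = multiprocessing.Process(target=find_max, args=(data, results))
    p.start()
    p.join()
    print(results.get())  # -> 42.0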

mmap

You can also create a shared memory map to put data into, but it can get complicated, and I would only recommend doing it if the proxy and global options don't work for you. There is a blog post here that has a decent example.
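
For flavor, a minimal sketch of my own (not from the original answer), which is Unix-specific: it relies on the default fork start method, so the anonymous map created before the child starts is shared with it.

import mmap
import multiprocessing
import struct

# An anonymous map created before the child process starts is shared
# with that child on Unix (fork start method); no data is copied.
buf = mmap.mmap(-1, 8)  # room for one C double

def writer():
    buf.seek(0)
    buf.write(struct.pack('d', 42.0))  # visible to the parent

if __name__ == '__main__':
    p = multiprocessing.Process(target=writer)
    p.start()
    p.join()
    buf.seek(0)
    print(struct.unpack('d', buf.read(8))[0])  # -> 42.0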

Other thoughts

One nitpicky point: in your question you referred to a "VM". Since you didn't specify that you're running on a VM, I assume you're referring to the Python interpreter as a VM. Keep in mind that Python is an interpreter, and does not provide a virtual machine environment the way Java does. The line is blurred, and the correct use of the terminology is open to debate, but people generally don't refer to the Python interpreter as a VM. See the excellent answers to this question for a more nuanced explanation of the differences.

python multithreading mapreduce
