c++ - Using extern on Halide with GPU
I am trying to use an extern function in Halide. In my context, I want to do it on the GPU.

I compile with AOT compilation for OpenCL. Of course, OpenCL can still use the CPU, so I use this:

    halide_set_ocl_device_type("gpu");

For now, everything is scheduled at compute_root().

First question: if I use compute_root() and OpenCL on the GPU, will my process be computed on the device with copyHtoD and DtoH? (Or will it stay on the host buffer?)
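As the conclusions at the end of this post note, compute_root() by itself keeps a stage on the host even under an OpenCL target; a gpu() or gpu_tile() directive is what moves it onto the device, and the runtime then takes care of the host/device copies. A minimal sketch with illustrative names (not from the original code):

    ImageParam input(Float(32), 2, "input");
    Func blur("blur");
    Var x("x"), y("y");

    blur(x, y) = input(x, y) * 0.5f;

    // Stays on the host buffer, even when compiled with an OpenCL target:
    // blur.compute_root();

    // Compiled to an OpenCL kernel; the runtime performs the HtoD/DtoH
    // copies around the kernel launch as needed:
    blur.compute_root().gpu_tile(x, y, 16, 16);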
The second question is more related to the extern functions. We use extern calls because some of our algorithms are not in Halide. Extern call:

    foo.define_extern("cool_foo", args, Float(32), 4);

Extern retrieval:

    extern "C" int cool_foo(buffer_t *in, int w, int h, int z, buffer_t *out) { .. }
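For reference, a minimal sketch of how the two sides are usually wired together. The ExternFuncArgument list and the bounds-query branch below are my own illustration (they mirror the working code later in this post), not part of the original snippet:

    // Generator side: `input_func` is a Func or ImageParam; w, h, z are Exprs/Params.
    std::vector<ExternFuncArgument> ext_args = { input_func, w, h, z };
    foo.define_extern("cool_foo", ext_args, Float(32), 4);  // 4-D Float(32) output

    // C side: when Halide only wants bounds inference, it calls the extern with
    // the buffer contents absent (host == NULL, dev == 0). The extern fills in
    // the region of `in` it needs and returns 0; otherwise it does the real work.
    extern "C" int cool_foo(buffer_t *in, int w, int h, int z, buffer_t *out) {
        if (in->host == NULL && in->dev == 0) {
            for (int d = 0; d < 4; d++) {
                in->min[d] = out->min[d];
                in->extent[d] = out->extent[d];
            }
            return 0;
        }
        // ... actual algorithm ...
        return 0;
    }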
But in the cool_foo function, my buffer_t is only loaded in host memory. The dev address is 0 (the default).

If I try to copy the memory before the algorithm:

    halide_copy_to_dev(NULL, &in);

it does nothing.

If I only make the device memory available:

    in.host = NULL;

my host pointer is null, but the device address is still 0.

(dev_dirty is true in that case and host_dirty is false.)
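For context, here is the flag handling the buffer_t runtime expects around halide_copy_to_dev; this is a minimal sketch that mirrors what the working test program at the end of this post ends up doing (exact behaviour can differ between Halide versions):

    buffer_t in = {0};
    in.host = (uint8_t *) host_data;  // host_data: your CPU-side allocation
    // ... fill in extent, stride, elem_size ...
    in.host_dirty = true;             // the host copy is the fresh one

    halide_copy_to_dev(NULL, &in);    // allocates in.dev if needed, then copies HtoD
    in.host_dirty = false;            // both copies now agree
    in.dev_dirty = true;              // treat the device copy as authoritative from here on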
Any idea?
EDIT (to answer dsharlet):
Here's the structure of the code:

Parse the data correctly on the CPU --> send the buffer to the GPU (using halide_copy_to_dev...) --> enter the Halide structure, read a parameter and add a boundary condition --> go into the extern function --> ...

I don't have a valid buffer_t in the extern function. I schedule everything at compute_root(), use HL_TARGET=host-opencl and set OCL to gpu. Before entering Halide, I can read my device address and it's OK.
Here's the code.

Before Halide, the CPU stuff (the pointer) and the transfer to the GPU:
    buffer_t k = { 0, (uint8_t *) k_full, {w_k, h_k, num_patch_x * num_patch_y * 3}, {1, w_k, w_k * h_k}, {0}, sizeof(float), };

    #if defined( USEGPU )
        // Transfer to the GPU
        halide_copy_to_dev(NULL, &k);
        k.host_dirty = false;
        k.dev_dirty = true;
        //k.host = NULL; // It's k_full
    #endif

    halide_func(&k);

Inside Halide:
    ImageParam ...
    Func process;

    process = halide_sub_func(k, width, height, k.channels());
    process.compute_root();

    ...

    Func halide_sub_func(ImageParam k, Expr width, Expr height, Expr patches) {
        Func kBounded("kBounded"), kShifted("kShifted"), kHat("kHat"), kHatTuple("kHatTuple");

        kBounded = repeat_image(constant_exterior(k, 0.0f), 0, width, 0, height, 0, patches);
        kShifted(x, y, pi) = kBounded(x + k.width() / 2, y + k.height() / 2, pi);

        kHat = extern_func(kShifted, width, height, patches);
        kHatTuple(x, y, pi) = Tuple(kHat(0, x, y, pi), kHat(1, x, y, pi));

        kShifted.compute_root();
        kHat.compute_root();

        return kHatTuple;
    }

Outside Halide (extern function):
    inline .... {
        // The buffer_t .dev and .host are 0 and null. I expect a null host, but not a null dev...
    }
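At this point, a small self-contained check can tell whether the dev field really is a live cl_mem handle. This is my own debugging sketch, assuming the OpenCL backend used throughout this post; it is not part of the original code:

    #include <CL/cl.h>
    #include <cstdint>
    #include <cstdio>

    // Returns true if `dev` can be queried as an OpenCL buffer object.
    static bool dev_handle_is_valid(uint64_t dev) {
        if (dev == 0) return false;
        cl_mem mem = (cl_mem) dev;
        size_t size = 0;
        cl_int err = clGetMemObjectInfo(mem, CL_MEM_SIZE, sizeof(size), &size, NULL);
        printf("clGetMemObjectInfo: err = %d, size = %zu\n", (int) err, size);
        return err == CL_SUCCESS && size > 0;
    }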
I found the solution to my problem.

I post the answer in code here. (Since I did a little offline test, the variable names don't match the ones above.)

Inside Halide (halide_func.cpp):
    #include <Halide.h>

    using namespace Halide;
    using namespace Halide::BoundaryConditions;

    Func thirdPartyFunction(ImageParam f);
    Func fourthPartyFunction(ImageParam f);

    Var x, y;

    int main(int argc, char **argv) {
        // Input:
        ImageParam f(Float(32), 2, "f");

        printf(" Argument: %d\n", argc);
        int test = atoi(argv[1]);

        if (test == 1) {
            Func f1;
            f1(x, y) = f(x, y) + 1.0f;

            f1.gpu_tile(x, 256);

            std::vector<Argument> args(1);
            args[0] = f;
            f1.compile_to_file("halide_func", args);
        } else if (test == 2) {
            Func fOutput("fOutput");
            Func fBounded("fBounded");

            fBounded = repeat_image(f, 0, f.width(), 0, f.height());
            fOutput(x, y) = fBounded(x - 1, y) + 1.0f;

            fOutput.gpu_tile(x, 256);

            std::vector<Argument> args(1);
            args[0] = f;
            fOutput.compile_to_file("halide_func", args);
        } else if (test == 3) {
            Func h("hOut");
            h = thirdPartyFunction(f);

            h.gpu_tile(x, 256);

            std::vector<Argument> args(1);
            args[0] = f;
            h.compile_to_file("halide_func", args);
        } else {
            Func h("hOut");
            h = fourthPartyFunction(f);

            std::vector<Argument> args(1);
            args[0] = f;
            h.compile_to_file("halide_func", args);
        }
    }

    Func thirdPartyFunction(ImageParam f) {
        Func g("g");
        Func fBounded("fBounded");
        Func h("h");

        // Boundary condition
        fBounded = repeat_image(f, 0, f.width(), 0, f.height());

        g(x, y) = fBounded(x - 1, y) + 1.0f;
        h(x, y) = g(x, y) - 1.0f;

        // These need to be commented out if you want to use the GPU schedule.
        //g.compute_root(); // At least one stage must be scheduled alone
        //h.compute_root();

        return h;
    }

    Func fourthPartyFunction(ImageParam f) {
        Func fBounded("fBounded");
        Func g("g");
        Func h("h");

        // Boundary condition
        fBounded = repeat_image(f, 0, f.width(), 0, f.height());

        // Preprocess
        g(x, y) = fBounded(x - 1, y) + 1.0f;

        g.compute_root();
        g.gpu_tile(x, y, 256, 1);

        // Extern
        std::vector<ExternFuncArgument> args = { g, f.width(), f.height() };
        h.define_extern("extern_func", args, Int(16), 3);

        h.compute_root();

        return h;
    }

The external function (external_func.h):
    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>
    #include <cassert>
    #include <cinttypes>
    #include <cstring>
    #include <fstream>
    #include <map>
    #include <vector>
    #include <complex>
    #include <chrono>
    #include <iostream>

    #include <clFFT.h> // For the OpenCL includes.

    using namespace std;

    // Useful stuff.
    void completeDetails2D(buffer_t buffer) {
        // Read the elements:
        std::cout << "Buffer information:" << std::endl;
        std::cout << "Extent: " << buffer.extent[0] << ", " << buffer.extent[1] << std::endl;
        std::cout << "Stride: " << buffer.stride[0] << ", " << buffer.stride[1] << std::endl;
        std::cout << "Min: " << buffer.min[0] << ", " << buffer.min[1] << std::endl;
        std::cout << "Elem size: " << buffer.elem_size << std::endl;
        std::cout << "Host dirty: " << buffer.host_dirty << ", Dev dirty: " << buffer.dev_dirty << std::endl;
        printf("Host pointer: %p, Dev pointer: %" PRIu64 "\n\n\n", buffer.host, buffer.dev);
    }

    extern cl_context _ZN6Halide7Runtime8Internal11weak_cl_ctxE;
    extern cl_command_queue _ZN6Halide7Runtime8Internal9weak_cl_qE;

    extern "C" int extern_func(buffer_t *in, int width, int height, buffer_t *out) {
        printf("In extern\n");
        completeDetails2D(*in);

        printf("Out extern\n");
        completeDetails2D(*out);

        if (in->dev == 0) {
            // Boundary stuff
            in->min[0] = 0;
            in->min[1] = 0;
            in->extent[0] = width;
            in->extent[1] = height;
            return 0;
        }

        // Super awesome stuff on the GPU
        // ...

        cl_context &ctx = _ZN6Halide7Runtime8Internal11weak_cl_ctxE;     // Found by zougloub
        cl_command_queue &queue = _ZN6Halide7Runtime8Internal9weak_cl_qE; // Same

        printf("ctx: %p\n", ctx);
        printf("queue: %p\n", queue);

        cl_mem buffer_in;
        buffer_in = (cl_mem) in->dev;
        cl_mem buffer_out;
        buffer_out = (cl_mem) out->dev;

        // Copying the data from one buffer to the other
        int err = clEnqueueCopyBuffer(queue, buffer_in, buffer_out, 0, 0, 256*256*4, 0, NULL, NULL);
        printf("Copy: %d\n", err);

        err = clFinish(queue);
        printf("Finish: %d\n\n", err);

        return 0;
    }

Finally, the non-Halide stuff (halide_test.cpp):
    #include <halide_func.h>
    #include <iostream>
    #include <cinttypes>

    #include <external_func.h> // The extern function is available within the generated .o.

    #include "HalideRuntime.h"

    int main(int argc, char **argv) {
        // Init the kernel on the GPU
        halide_set_ocl_device_type("gpu");

        // Create the buffers
        int width = 256;
        int height = 256;

        float *bufferHostIn = (float *) malloc(sizeof(float) * width * height);
        float *bufferHostOut = (float *) malloc(sizeof(float) * width * height);

        for (int j = 0; j < height; ++j) {
            for (int i = 0; i < width; ++i) {
                bufferHostIn[i + j * width] = i + j;
            }
        }

        buffer_t bufferHalideIn = {0, (uint8_t *) bufferHostIn, {width, height}, {1, width, width * height}, {0, 0}, sizeof(float), true, false};
        buffer_t bufferHalideOut = {0, (uint8_t *) bufferHostOut, {width, height}, {1, width, width * height}, {0, 0}, sizeof(float), true, false};

        printf("IN\n");
        completeDetails2D(bufferHalideIn);
        printf("Data (host): ");
        for (int i = 0; i < 10; ++i) {
            printf(" %f, ", bufferHostIn[i]);
        }
        printf("\n");

        printf("OUT\n");
        completeDetails2D(bufferHalideOut);

        // Send to the GPU
        halide_copy_to_dev(NULL, &bufferHalideIn);
        halide_copy_to_dev(NULL, &bufferHalideOut);
        bufferHalideIn.host_dirty = false;
        bufferHalideIn.dev_dirty = true;
        bufferHalideOut.host_dirty = false;
        bufferHalideOut.dev_dirty = true;

        // Trick to force Halide to use the device.
        bufferHalideIn.host = NULL;
        bufferHalideOut.host = NULL;

        printf("IN after device\n");
        completeDetails2D(bufferHalideIn);

        // Halide function
        halide_func(&bufferHalideIn, &bufferHalideOut);

        // Back to the host
        bufferHalideIn.host = (uint8_t *) bufferHostIn;
        bufferHalideOut.host = (uint8_t *) bufferHostOut;

        halide_copy_to_host(NULL, &bufferHalideOut);
        halide_copy_to_host(NULL, &bufferHalideIn);

        // Validation
        printf("\nOUT\n");
        completeDetails2D(bufferHalideOut);
        printf("Data (host): ");
        for (int i = 0; i < 10; ++i) {
            printf(" %f, ", bufferHostOut[i]);
        }
        printf("\n");

        // Free
        free(bufferHostIn);
        free(bufferHostOut);
    }

You can compile halide_func with test 4 to use the extern functionality.
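If it helps, a small validation loop for test 4 (my own addition, placed just before the free() calls): with the ramp input above, the pipeline computes g(x, y) = f(x - 1, y) + 1 under a repeat_image boundary and the extern simply copies g into the output, so the expected values can be checked directly:

    // Expected for test 4: out(x, y) = in(x - 1, y) + 1, with x - 1 wrapping
    // around because of repeat_image. Assumes the 256x256 ramp input above.
    int mismatches = 0;
    for (int j = 0; j < height; ++j) {
        for (int i = 0; i < width; ++i) {
            int src = (i == 0) ? (width - 1) : (i - 1);
            float expected = bufferHostIn[src + j * width] + 1.0f;
            if (bufferHostOut[i + j * width] != expected) ++mismatches;
        }
    }
    printf("Mismatches: %d\n", mismatches);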
Here are some of the conclusions I have. (Thanks to Zalman and zougloub.)
- compute_root() doesn't call the device if it is used alone. We need gpu() or gpu_tile() in the code to call the GPU routines. (BTW, you need to set the variables inside it.)
- gpu_tile() with less than your number of items will crash your stuff.
- BoundaryConditions work on the GPU.
- Before calling the extern function, the Func that goes in as input needs to be: f.compute_root(); f.gpu_tile(x, y, ..., ...); The compute_root() on the middle stages is not implicit.
- If the dev address is 0, it's normal: the dimensions are resent and the extern is called again.
- The compute_root() on the last stage is implicit.

c++ image-processing halide