c++ - Using extern on Halide with GPU
I am trying to use an extern function in Halide. In my context, I want to do it on the GPU.

I compile with AOT compilation for OpenCL. Of course, OpenCL can still use the CPU, so I use this:

    halide_set_ocl_device_type("gpu");

For now, everything is scheduled at compute_root().

First question: if I use compute_root() and OpenCL on the GPU, will my process be computed on the device with copyHtoD and DtoH? (Or will it stay on the host buffer?)
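As the conclusions at the end of this post note, compute_root() by itself keeps a stage on the host even under an OpenCL target; a gpu() or gpu_tile() directive is what moves it onto the device, and the runtime then takes care of the host/device copies. A minimal sketch with illustrative names (not from the original code):

    ImageParam input(Float(32), 2, "input");
    Func blur("blur");
    Var x("x"), y("y");

    blur(x, y) = input(x, y) * 0.5f;

    // Stays on the host buffer, even when compiled with an OpenCL target:
    // blur.compute_root();

    // Compiled to an OpenCL kernel; the runtime performs the HtoD/DtoH
    // copies around the kernel launch as needed:
    blur.compute_root().gpu_tile(x, y, 16, 16);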
The second question is more related to the extern functions. We use extern calls because some of our algorithms are not in Halide. Extern call:

    foo.define_extern("cool_foo", args, Float(32), 4);

Extern retrieval:

    extern "C" int cool_foo(buffer_t *in, int w, int h, int z, buffer_t *out) { .. }
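For reference, a minimal sketch of how the two sides are usually wired together. The ExternFuncArgument list and the bounds-query branch below are my own illustration (they mirror the working code later in this post), not part of the original snippet:

    // Generator side: `input_func` is a Func or ImageParam; w, h, z are Exprs/Params.
    std::vector<ExternFuncArgument> ext_args = { input_func, w, h, z };
    foo.define_extern("cool_foo", ext_args, Float(32), 4);  // 4-D Float(32) output

    // C side: when Halide only wants bounds inference, it calls the extern with
    // the buffer contents absent (host == NULL, dev == 0). The extern fills in
    // the region of `in` it needs and returns 0; otherwise it does the real work.
    extern "C" int cool_foo(buffer_t *in, int w, int h, int z, buffer_t *out) {
        if (in->host == NULL && in->dev == 0) {
            for (int d = 0; d < 4; d++) {
                in->min[d] = out->min[d];
                in->extent[d] = out->extent[d];
            }
            return 0;
        }
        // ... actual algorithm ...
        return 0;
    }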
But in the cool_foo function, my buffer_t is only loaded in host memory. The dev address is 0 (the default).

If I try to copy the memory before the algorithm:

    halide_copy_to_dev(NULL, &in);

it does nothing.

If I only make the device memory available:

    in.host = NULL;

my host pointer is null, but the device address is still 0.

(dev_dirty is true in that case and host_dirty is false.)
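For context, here is the flag handling the buffer_t runtime expects around halide_copy_to_dev; this is a minimal sketch that mirrors what the working test program at the end of this post ends up doing (exact behaviour can differ between Halide versions):

    buffer_t in = {0};
    in.host = (uint8_t *) host_data;  // host_data: your CPU-side allocation
    // ... fill in extent, stride, elem_size ...
    in.host_dirty = true;             // the host copy is the fresh one

    halide_copy_to_dev(NULL, &in);    // allocates in.dev if needed, then copies HtoD
    in.host_dirty = false;            // both copies now agree
    in.dev_dirty = true;              // treat the device copy as authoritative from here on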
Any idea?
EDIT (to answer dsharlet):
Here's the structure of the code:

Parse the data correctly on the CPU --> send the buffer to the GPU (using halide_copy_to_dev...) --> enter the Halide structure, read a parameter and add a boundary condition --> go into the extern function --> ...

I don't have a valid buffer_t in the extern function. I schedule everything at compute_root(), use HL_TARGET=host-opencl and set OCL to gpu. Before entering Halide, I can read my device address and it's OK.
Here's the code.

Before Halide, the CPU stuff (the pointer) and the transfer to the GPU:
    buffer_t k = { 0, (uint8_t *) k_full, {w_k, h_k, num_patch_x * num_patch_y * 3}, {1, w_k, w_k * h_k}, {0}, sizeof(float), };

    #if defined( USEGPU )
        // Transfer to the GPU
        halide_copy_to_dev(NULL, &k);
        k.host_dirty = false;
        k.dev_dirty = true;
        //k.host = NULL; // It's k_full
    #endif

    halide_func(&k);

Inside Halide:
    ImageParam ...
    Func process;

    process = halide_sub_func(k, width, height, k.channels());
    process.compute_root();

    ...

    Func halide_sub_func(ImageParam k, Expr width, Expr height, Expr patches) {
        Func kBounded("kBounded"), kShifted("kShifted"), kHat("kHat"), kHatTuple("kHatTuple");

        kBounded = repeat_image(constant_exterior(k, 0.0f), 0, width, 0, height, 0, patches);
        kShifted(x, y, pi) = kBounded(x + k.width() / 2, y + k.height() / 2, pi);

        kHat = extern_func(kShifted, width, height, patches);
        kHatTuple(x, y, pi) = Tuple(kHat(0, x, y, pi), kHat(1, x, y, pi));

        kShifted.compute_root();
        kHat.compute_root();

        return kHatTuple;
    }

Outside Halide (extern function):
    inline .... {
        // The buffer_t .dev and .host are 0 and null. I expect a null host, but not a null dev...
    }
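At this point, a small self-contained check can tell whether the dev field really is a live cl_mem handle. This is my own debugging sketch, assuming the OpenCL backend used throughout this post; it is not part of the original code:

    #include <CL/cl.h>
    #include <cstdint>
    #include <cstdio>

    // Returns true if `dev` can be queried as an OpenCL buffer object.
    static bool dev_handle_is_valid(uint64_t dev) {
        if (dev == 0) return false;
        cl_mem mem = (cl_mem) dev;
        size_t size = 0;
        cl_int err = clGetMemObjectInfo(mem, CL_MEM_SIZE, sizeof(size), &size, NULL);
        printf("clGetMemObjectInfo: err = %d, size = %zu\n", (int) err, size);
        return err == CL_SUCCESS && size > 0;
    }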
I found the solution to my problem.

I post the answer in code here. (Since I did a little offline test, the variable names don't match the ones above.)

Inside Halide (halide_func.cpp):
    #include <Halide.h>

    using namespace Halide;
    using namespace Halide::BoundaryConditions;

    Func thirdPartyFunction(ImageParam f);
    Func fourthPartyFunction(ImageParam f);

    Var x, y;

    int main(int argc, char **argv) {
        // Input:
        ImageParam f(Float(32), 2, "f");

        printf(" Argument: %d\n", argc);
        int test = atoi(argv[1]);

        if (test == 1) {
            Func f1;
            f1(x, y) = f(x, y) + 1.0f;

            f1.gpu_tile(x, 256);

            std::vector<Argument> args(1);
            args[0] = f;
            f1.compile_to_file("halide_func", args);
        } else if (test == 2) {
            Func fOutput("fOutput");
            Func fBounded("fBounded");

            fBounded = repeat_image(f, 0, f.width(), 0, f.height());
            fOutput(x, y) = fBounded(x - 1, y) + 1.0f;

            fOutput.gpu_tile(x, 256);

            std::vector<Argument> args(1);
            args[0] = f;
            fOutput.compile_to_file("halide_func", args);
        } else if (test == 3) {
            Func h("hOut");
            h = thirdPartyFunction(f);

            h.gpu_tile(x, 256);

            std::vector<Argument> args(1);
            args[0] = f;
            h.compile_to_file("halide_func", args);
        } else {
            Func h("hOut");
            h = fourthPartyFunction(f);

            std::vector<Argument> args(1);
            args[0] = f;
            h.compile_to_file("halide_func", args);
        }
    }

    Func thirdPartyFunction(ImageParam f) {
        Func g("g");
        Func fBounded("fBounded");
        Func h("h");

        // Boundary condition
        fBounded = repeat_image(f, 0, f.width(), 0, f.height());

        g(x, y) = fBounded(x - 1, y) + 1.0f;
        h(x, y) = g(x, y) - 1.0f;

        // These need to be commented out if you want to use the GPU schedule.
        //g.compute_root(); // At least one stage must be scheduled alone
        //h.compute_root();

        return h;
    }

    Func fourthPartyFunction(ImageParam f) {
        Func fBounded("fBounded");
        Func g("g");
        Func h("h");

        // Boundary condition
        fBounded = repeat_image(f, 0, f.width(), 0, f.height());

        // Preprocess
        g(x, y) = fBounded(x - 1, y) + 1.0f;

        g.compute_root();
        g.gpu_tile(x, y, 256, 1);

        // Extern
        std::vector<ExternFuncArgument> args = { g, f.width(), f.height() };
        h.define_extern("extern_func", args, Int(16), 3);

        h.compute_root();

        return h;
    }

The external function (external_func.h):
    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>
    #include <cassert>
    #include <cinttypes>
    #include <cstring>
    #include <fstream>
    #include <map>
    #include <vector>
    #include <complex>
    #include <chrono>
    #include <iostream>

    #include <clFFT.h> // For the OpenCL includes.

    using namespace std;

    // Useful stuff.
    void completeDetails2D(buffer_t buffer) {
        // Read the elements:
        std::cout << "Buffer information:" << std::endl;
        std::cout << "Extent: " << buffer.extent[0] << ", " << buffer.extent[1] << std::endl;
        std::cout << "Stride: " << buffer.stride[0] << ", " << buffer.stride[1] << std::endl;
        std::cout << "Min: " << buffer.min[0] << ", " << buffer.min[1] << std::endl;
        std::cout << "Elem size: " << buffer.elem_size << std::endl;
        std::cout << "Host dirty: " << buffer.host_dirty << ", Dev dirty: " << buffer.dev_dirty << std::endl;
        printf("Host pointer: %p, Dev pointer: %" PRIu64 "\n\n\n", buffer.host, buffer.dev);
    }

    extern cl_context _ZN6Halide7Runtime8Internal11weak_cl_ctxE;
    extern cl_command_queue _ZN6Halide7Runtime8Internal9weak_cl_qE;

    extern "C" int extern_func(buffer_t *in, int width, int height, buffer_t *out) {
        printf("In extern\n");
        completeDetails2D(*in);

        printf("Out extern\n");
        completeDetails2D(*out);

        if (in->dev == 0) {
            // Boundary stuff
            in->min[0] = 0;
            in->min[1] = 0;
            in->extent[0] = width;
            in->extent[1] = height;
            return 0;
        }

        // Super awesome stuff on the GPU
        // ...

        cl_context &ctx = _ZN6Halide7Runtime8Internal11weak_cl_ctxE;     // Found by zougloub
        cl_command_queue &queue = _ZN6Halide7Runtime8Internal9weak_cl_qE; // Same

        printf("ctx: %p\n", ctx);
        printf("queue: %p\n", queue);

        cl_mem buffer_in;
        buffer_in = (cl_mem) in->dev;
        cl_mem buffer_out;
        buffer_out = (cl_mem) out->dev;

        // Copying the data from one buffer to the other
        int err = clEnqueueCopyBuffer(queue, buffer_in, buffer_out, 0, 0, 256*256*4, 0, NULL, NULL);
        printf("Copy: %d\n", err);

        err = clFinish(queue);
        printf("Finish: %d\n\n", err);

        return 0;
    }

Finally, the non-Halide stuff (halide_test.cpp):
    #include <halide_func.h>
    #include <iostream>
    #include <cinttypes>

    #include <external_func.h> // The extern function is available within the generated .o.

    #include "HalideRuntime.h"

    int main(int argc, char **argv) {
        // Init the kernel on the GPU
        halide_set_ocl_device_type("gpu");

        // Create the buffers
        int width = 256;
        int height = 256;

        float *bufferHostIn = (float *) malloc(sizeof(float) * width * height);
        float *bufferHostOut = (float *) malloc(sizeof(float) * width * height);

        for (int j = 0; j < height; ++j) {
            for (int i = 0; i < width; ++i) {
                bufferHostIn[i + j * width] = i + j;
            }
        }

        buffer_t bufferHalideIn = {0, (uint8_t *) bufferHostIn, {width, height}, {1, width, width * height}, {0, 0}, sizeof(float), true, false};
        buffer_t bufferHalideOut = {0, (uint8_t *) bufferHostOut, {width, height}, {1, width, width * height}, {0, 0}, sizeof(float), true, false};

        printf("IN\n");
        completeDetails2D(bufferHalideIn);
        printf("Data (host): ");
        for (int i = 0; i < 10; ++i) {
            printf(" %f, ", bufferHostIn[i]);
        }
        printf("\n");

        printf("OUT\n");
        completeDetails2D(bufferHalideOut);

        // Send to the GPU
        halide_copy_to_dev(NULL, &bufferHalideIn);
        halide_copy_to_dev(NULL, &bufferHalideOut);
        bufferHalideIn.host_dirty = false;
        bufferHalideIn.dev_dirty = true;
        bufferHalideOut.host_dirty = false;
        bufferHalideOut.dev_dirty = true;

        // Trick to force Halide to use the device.
        bufferHalideIn.host = NULL;
        bufferHalideOut.host = NULL;

        printf("IN after device\n");
        completeDetails2D(bufferHalideIn);

        // Halide function
        halide_func(&bufferHalideIn, &bufferHalideOut);

        // Back to the host
        bufferHalideIn.host = (uint8_t *) bufferHostIn;
        bufferHalideOut.host = (uint8_t *) bufferHostOut;

        halide_copy_to_host(NULL, &bufferHalideOut);
        halide_copy_to_host(NULL, &bufferHalideIn);

        // Validation
        printf("\nOUT\n");
        completeDetails2D(bufferHalideOut);
        printf("Data (host): ");
        for (int i = 0; i < 10; ++i) {
            printf(" %f, ", bufferHostOut[i]);
        }
        printf("\n");

        // Free
        free(bufferHostIn);
        free(bufferHostOut);
    }

You can compile halide_func with test 4 to use the extern functionality.
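If it helps, a small validation loop for test 4 (my own addition, placed just before the free() calls): with the ramp input above, the pipeline computes g(x, y) = f(x - 1, y) + 1 under a repeat_image boundary and the extern simply copies g into the output, so the expected values can be checked directly:

    // Expected for test 4: out(x, y) = in(x - 1, y) + 1, with x - 1 wrapping
    // around because of repeat_image. Assumes the 256x256 ramp input above.
    int mismatches = 0;
    for (int j = 0; j < height; ++j) {
        for (int i = 0; i < width; ++i) {
            int src = (i == 0) ? (width - 1) : (i - 1);
            float expected = bufferHostIn[src + j * width] + 1.0f;
            if (bufferHostOut[i + j * width] != expected) ++mismatches;
        }
    }
    printf("Mismatches: %d\n", mismatches);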
Here are some of the conclusions I have. (Thanks to Zalman and zougloub.)
- compute_root() doesn't call the device if it is used alone. We need gpu() or gpu_tile() in the code to call the GPU routines. (BTW, you need to set the variables inside it.)
- gpu_tile() with less than your number of items will crash your stuff.
- BoundaryConditions work on the GPU.
- Before calling the extern function, the Func that goes in as input needs to be: f.compute_root(); f.gpu_tile(x, y, ..., ...); The compute_root() on the middle stages is not implicit.
- If the dev address is 0, it's normal: the dimensions are resent and the extern is called again.
- The compute_root() on the last stage is implicit.

c++ image-processing halide