Issue when loading data from Cloud Storage, at least an error message improvement is needed
When I try to load multiple files from Cloud Storage, the larger jobs fail. Loading each file individually works, but loading them in batches would be much more convenient.
Snippet from Recent Jobs:

    load 11:24am  gs://albertbigquery.appspot.com/uep/201409/01/wpc_5012_20140901_0002.log.gz  to albertbigquery:uep.201409
    load 11:23am  gs://albertbigquery.appspot.com/uep/201409/01/wpc_5012_20140901_0001.log.gz  to albertbigquery:uep.201409
    load 11:22am  gs://albertbigquery.appspot.com/uep/201409/01/*  to albertbigquery:uep.201409

    Errors:
    File: 40 / Line:1 / Field:1, Bad character (ASCII 0) encountered: field starts with: <�>
    File: 40 / Line:2 / Field:1, Bad character (ASCII 0) encountered: field starts with: <5c���>}�>
    File: 40 / Line:3 / Field:1, Bad character (ASCII 0) encountered: field starts with: <����w�o�>
    File: 40 / Line:4, Too few columns: expected 7 column(s) got 2 column(s). Additional help:
    File: 40 / Line:5, Too few columns: expected 7 column(s) got 1 column(s). Additional help:
    File: 40 / Line:6, Too few columns: expected 7 column(s) got 1 column(s). Additional help:
    File: 40 / Line:7, Too few columns: expected 7 column(s) got 1 column(s). Additional help:
    File: 40 / Line:8 / Field:1, Bad character (ASCII 0) encountered: field starts with: <��hy�>
The worst problem is that I don't know which file "file: 40" refers to; the ordering seems random. Otherwise I could remove that file and load the data, or try to find the error inside the file.
I also doubt there is an actual error in any of the files. To illustrate: in the case above, when I removed files _0001 and _0002 (which loaded fine as single files), I still got this output:
Recent Jobs:

    load 11:44am  gs://albertbigquery.appspot.com/uep/201409/01/*  to albertbigquery:uep.201409

    Errors:
    File: 1 / Line:1 / Field:1, Bad character (ASCII 0) encountered: field starts with: <�>
    File: 1 / Line:2 / Field:3, Bad character (ASCII 0) encountered: field starts with:
    File: 1 / Line:3, Too few columns: expected 7 column(s) got 1 column(s). Additional help:
    File: 1 / Line:4 / Field:3, Bad character (ASCII 0) encountered: field starts with:
Sometimes, though, the files do load fine; otherwise I would expect multiple-file loading to be completely broken.
Info: the average file size is around 20MB, and the directory holds about 70 files, somewhere between 1 and 2 GB in total.
It looks like you're hitting a BigQuery bug.
When BigQuery gets a load job request with a wildcard pattern (i.e. gs://foo/bar*), it first expands the pattern to a list of files, then reads the first one to determine the compression type.
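Roughly speaking, that probe only needs to look at the first few bytes of the first matched object. A minimal Python sketch of that kind of check (looks_gzipped is an illustrative name, not BigQuery's actual code):

    # Minimal sketch of the probe-for-compression idea: gzip streams start with
    # the two magic bytes 0x1f 0x8b, so peeking at the first object's header is
    # enough to guess whether the whole load should be treated as gzipped.
    def looks_gzipped(first_bytes: bytes) -> bool:
        return first_bytes[:2] == b"\x1f\x8b"

    # An empty placeholder "directory" object has no bytes at all, so it fails
    # the check and the load gets treated as uncompressed.
    print(looks_gzipped(b""))                       # False -> treated as uncompressed
    print(looks_gzipped(b"\x1f\x8b\x08\x00 data"))  # True  -> treated as gzip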
One oddity of GCS is that there isn't a real concept of a directory: gs://foo/bar/baz.csv is actually bucket: 'foo', object: 'bar/baz.csv'. It looks like you have empty files as placeholders for your directories (as in gs://albertbigquery.appspot.com/uep/201409/01/).
This empty file doesn't play nicely with the BigQuery probe-for-compression-type step, since when we expand the file pattern, the directory dummy file is the first thing that gets returned. We open that dummy file, and since it doesn't appear to be a gzip file, we assume the compression type of the entire load is uncompressed.
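If you want to confirm that, you can list the objects under the prefix and look for a zero-byte entry. A sketch using the google-cloud-storage Python client (the bucket and prefix are taken from your jobs above; this is just one way to check, not something the web UI does for you):

    from google.cloud import storage

    # List everything under the prefix and flag zero-byte objects; a dummy
    # "directory" placeholder like uep/201409/01/ shows up with size 0 and,
    # sorting lexically, is the first object matched by the wildcard.
    client = storage.Client()
    for blob in client.list_blobs("albertbigquery.appspot.com", prefix="uep/201409/01/"):
        marker = "  <-- zero-byte placeholder?" if blob.size == 0 else ""
        print(f"{blob.name}\t{blob.size} bytes{marker}")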
We've filed a bug and have a fix under testing; it should be out next week. In the meantime, your options are to either expand the pattern yourself, use a longer pattern that won't match the directory placeholder (as in gs://albertbigquery.appspot.com/uep/201409/01/wpc*), or delete the dummy directory file. A sketch of the first option follows below.
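For the expand-the-pattern-yourself option, here is a rough sketch using the google-cloud-storage and google-cloud-bigquery Python clients (the destination table name and the CSV/autodetect settings are assumptions about your logs; adjust them to match your actual schema):

    from google.cloud import bigquery, storage

    # Expand the wildcard ourselves: keep only real, non-empty .log.gz objects,
    # so the empty directory placeholder never reaches the compression probe.
    gcs = storage.Client()
    uris = [
        "gs://albertbigquery.appspot.com/" + blob.name
        for blob in gcs.list_blobs("albertbigquery.appspot.com", prefix="uep/201409/01/")
        if blob.size > 0 and blob.name.endswith(".log.gz")
    ]

    # Hand BigQuery the explicit URI list instead of the wildcard pattern.
    bq = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,  # assumption: delimited 7-column logs
        autodetect=True,                          # assumption: no explicit schema given here
    )
    job = bq.load_table_from_uri(uris, "albertbigquery.uep.201409", job_config=job_config)
    job.result()  # block until the load finishes and raise on errors

Deleting the dummy directory object (the last option) can be done with the same storage client via blob.delete(), assuming nothing else relies on the placeholder.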
google-bigquery google-cloud-storage