Issue when loading data from Cloud Storage, at least an error message improvement is needed
When I try to load multiple files from Cloud Storage, the larger jobs fail. Loading each file individually works, but loading them in batches would be much more convenient.
Snippet from Recent Jobs:

    load 11:24am  gs://albertbigquery.appspot.com/uep/201409/01/wpc_5012_20140901_0002.log.gz  to albertbigquery:uep.201409
    load 11:23am  gs://albertbigquery.appspot.com/uep/201409/01/wpc_5012_20140901_0001.log.gz  to albertbigquery:uep.201409
    load 11:22am  gs://albertbigquery.appspot.com/uep/201409/01/*  to albertbigquery:uep.201409

    Errors:
    File: 40 / Line:1 / Field:1, Bad character (ASCII 0) encountered: field starts with: <�>
    File: 40 / Line:2 / Field:1, Bad character (ASCII 0) encountered: field starts with: <5c���>}�>
    File: 40 / Line:3 / Field:1, Bad character (ASCII 0) encountered: field starts with: <����w�o�>
    File: 40 / Line:4, Too few columns: expected 7 column(s) got 2 column(s). Additional help:
    File: 40 / Line:5, Too few columns: expected 7 column(s) got 1 column(s). Additional help:
    File: 40 / Line:6, Too few columns: expected 7 column(s) got 1 column(s). Additional help:
    File: 40 / Line:7, Too few columns: expected 7 column(s) got 1 column(s). Additional help:
    File: 40 / Line:8 / Field:1, Bad character (ASCII 0) encountered: field starts with: <��hy�>
The worst problem is that I don't know which file "file: 40" refers to; the ordering seems random. Otherwise I could remove that file and load the data, or try to find the error inside the file.
I also doubt there is an actual error in any of the files. To illustrate: in the case above, when I removed files _0001 and _0002 (which loaded fine as single files), I still got this output:
Recent Jobs:

    load 11:44am  gs://albertbigquery.appspot.com/uep/201409/01/*  to albertbigquery:uep.201409

    Errors:
    File: 1 / Line:1 / Field:1, Bad character (ASCII 0) encountered: field starts with: <�>
    File: 1 / Line:2 / Field:3, Bad character (ASCII 0) encountered: field starts with:
    File: 1 / Line:3, Too few columns: expected 7 column(s) got 1 column(s). Additional help:
    File: 1 / Line:4 / Field:3, Bad character (ASCII 0) encountered: field starts with:
Sometimes, though, the files do load fine; otherwise I would expect multiple-file loading to be completely broken.
Info: the average file size is around 20MB, and the directory holds about 70 files, somewhere between 1 and 2 GB in total.
It looks like you're hitting a BigQuery bug.
When BigQuery gets a load job request with a wildcard pattern (i.e. gs://foo/bar*), it first expands the pattern to a list of files, then reads the first one to determine the compression type.
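Roughly speaking, that probe only needs to look at the first few bytes of the first matched object. A minimal Python sketch of that kind of check (looks_gzipped is an illustrative name, not BigQuery's actual code):

    # Minimal sketch of the probe-for-compression idea: gzip streams start with
    # the two magic bytes 0x1f 0x8b, so peeking at the first object's header is
    # enough to guess whether the whole load should be treated as gzipped.
    def looks_gzipped(first_bytes: bytes) -> bool:
        return first_bytes[:2] == b"\x1f\x8b"

    # An empty placeholder "directory" object has no bytes at all, so it fails
    # the check and the load gets treated as uncompressed.
    print(looks_gzipped(b""))                       # False -> treated as uncompressed
    print(looks_gzipped(b"\x1f\x8b\x08\x00 data"))  # True  -> treated as gzip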
One oddity of GCS is that there isn't a real concept of a directory: gs://foo/bar/baz.csv is actually bucket: 'foo', object: 'bar/baz.csv'. It looks like you have empty files as placeholders for your directories (as in gs://albertbigquery.appspot.com/uep/201409/01/).
This empty file doesn't play nicely with the BigQuery probe-for-compression-type step, since when we expand the file pattern, the directory dummy file is the first thing that gets returned. We open that dummy file, and since it doesn't appear to be a gzip file, we assume the compression type of the entire load is uncompressed.
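If you want to confirm that, you can list the objects under the prefix and look for a zero-byte entry. A sketch using the google-cloud-storage Python client (the bucket and prefix are taken from your jobs above; this is just one way to check, not something the web UI does for you):

    from google.cloud import storage

    # List everything under the prefix and flag zero-byte objects; a dummy
    # "directory" placeholder like uep/201409/01/ shows up with size 0 and,
    # sorting lexically, is the first object matched by the wildcard.
    client = storage.Client()
    for blob in client.list_blobs("albertbigquery.appspot.com", prefix="uep/201409/01/"):
        marker = "  <-- zero-byte placeholder?" if blob.size == 0 else ""
        print(f"{blob.name}\t{blob.size} bytes{marker}")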
We've filed a bug and have a fix under testing; it should be out next week. In the meantime, your options are to either expand the pattern yourself, use a longer pattern that won't match the directory placeholder (as in gs://albertbigquery.appspot.com/uep/201409/01/wpc*), or delete the dummy directory file. A sketch of the first option follows below.
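For the expand-the-pattern-yourself option, here is a rough sketch using the google-cloud-storage and google-cloud-bigquery Python clients (the destination table name and the CSV/autodetect settings are assumptions about your logs; adjust them to match your actual schema):

    from google.cloud import bigquery, storage

    # Expand the wildcard ourselves: keep only real, non-empty .log.gz objects,
    # so the empty directory placeholder never reaches the compression probe.
    gcs = storage.Client()
    uris = [
        "gs://albertbigquery.appspot.com/" + blob.name
        for blob in gcs.list_blobs("albertbigquery.appspot.com", prefix="uep/201409/01/")
        if blob.size > 0 and blob.name.endswith(".log.gz")
    ]

    # Hand BigQuery the explicit URI list instead of the wildcard pattern.
    bq = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,  # assumption: delimited 7-column logs
        autodetect=True,                          # assumption: no explicit schema given here
    )
    job = bq.load_table_from_uri(uris, "albertbigquery.uep.201409", job_config=job_config)
    job.result()  # block until the load finishes and raise on errors

Deleting the dummy directory object (the last option) can be done with the same storage client via blob.delete(), assuming nothing else relies on the placeholder.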
google-bigquery google-cloud-storage