Apache Pig - How to validate that my Pig input data matches the DML
How can I validate that my input data conforms to the DML (the declared schema)?
Input data:
jorge posada |yankees| {(catcher,2000),(designated_hitter,2001)}|[games#1594,hit_by_pitch#65,grand_slams#7]
landon powell |oakland|{(catcher,2000),(first_baseman,2001)}|[on_base_percentage#0.297,games#26,home_runs#7]
martin prado |atlanta| {(second_baseman,2002),(infielder,2003),(left_fielder)}|[games#258,hit_by_pitch#3]
Note the (left_fielder) tuple in the last record: I have left out its year field.
bfile = LOAD 'basketball1.txt' USING PigStorage('|') AS (name:chararray, team:chararray, pos:bag{t:tuple(point:chararray, year:int)}, bat:map[]);
DUMP bfile;
(jorge posada ,yankees,{(catcher,2000),(designated_hitter,2001)},[games#1594,hit_by_pitch#65,grand_slams#7])
(landon powell ,oakland,{(catcher,2000),(first_baseman,2001)},[on_base_percentage#0.297,games#26,home_runs#7])
(martin prado ,atlanta,,[games#258,hit_by_pitch#3])
Regards,
Sanjeeb
Here is a regex-based script for your schema that validates all of the fields. Please run it against your inputs and let me know if you need any other validations.
Regex:
'^([a-zA-Z]+\\s+[a-zA-Z]+)\\s*\\|\\s*([a-zA-Z]+)\\s*\\|\\s*(\\{(?:\\([a-zA-Z_]+,[0-9]+\\))(?:,\\([a-zA-Z_]+,[0-9]+\\))*\\})\\s*\\|\\s*(\\[(?:[a-zA-Z_]+#[0-9\\.]+)(?:,[a-zA-Z_]+#[0-9\\.]+)*\\])$'
input.txt (I have marked each input below as valid or invalid):
jorge posada |yankees| {(catcher,2000),(designated_hitter,2001)}|[games#1594,hit_by_pitch#65,grand_slams#7] --> valid
landon powell |oakland|{(catcher,2000),(first_baseman,2001)}|[on_base_percentage#0.297,games#26,home_runs#7] --> valid
martin prado |atlanta| {(second_baseman,2002),(infielder,2003),(left_fielder)}|[games#258,hit_by_pitch#3] --> invalid, year missing
martin prado |atlanta| {(second_baseman,2002)(infielder,2003)}|[games#258,hit_by_pitch#3] --> invalid, no comma between two tuples
martin prado |atlanta| {,(second_baseman,2002),(infielder,2003)}|[games#258,hit_by_pitch#3] --> invalid, comma at the start of the bag
martin prado |atlanta| {(second_baseman,2002),(,2003)}|[games#258,hit_by_pitch#3] --> invalid, position missing
martin prado |atlanta| {(second_baseman,2002),(infielder,2003)}[games#258,hit_by_pitch#3] --> invalid, delimiter | missing
martin prado || {(second_baseman,2002),(infielder,2003)}[games#258,hit_by_pitch#3] --> invalid, team name missing
martin prado |atlanta| {(second_baseman,2002),(infielder,2003)}[games#,hit_by_pitch#3] --> invalid, value missing for key games
landon powell |oakland|{(catcher,2000)}|[on_base_percentage#0.297] --> valid
landon powell |oakland|{(catcher,2000),(first_baseman,2001),(test,3000)}|[on_base_percentage#0.297,games#26,home_runs#7,test#1.2] --> valid
Pig script:
A = LOAD 'input.txt' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, '^([a-zA-Z]+\\s+[a-zA-Z]+)\\s*\\|\\s*([a-zA-Z]+)\\s*\\|\\s*(\\{(?:\\([a-zA-Z_]+,[0-9]+\\))(?:,\\([a-zA-Z_]+,[0-9]+\\))*\\})\\s*\\|\\s*(\\[(?:[a-zA-Z_]+#[0-9\\.]+)(?:,[a-zA-Z_]+#[0-9\\.]+)*\\])$')) AS (name:chararray, team:chararray, pos:bag{t:(p:chararray)}, bat:map[]);
DUMP B;
Output: if an input line doesn't match the schema, the script prints an empty (null) tuple, shown as ().
(jorge posada,yankees,{(catcher,2000),(designated_hitter,2001)},[games#1594,hit_by_pitch#65,grand_slams#7]) --> valid
(landon powell,oakland,{(catcher,2000),(first_baseman,2001)},[on_base_percentage#0.297,games#26,home_runs#7]) --> valid
() --> invalid, year missing
() --> invalid, no comma between two tuples
() --> invalid, comma at the start of the bag
() --> invalid, position missing
() --> invalid, delimiter | missing
() --> invalid, team name missing
() --> invalid, value missing for key games
(landon powell,oakland,{(catcher,2000)},[on_base_percentage#0.297]) --> valid
(landon powell,oakland,{(catcher,2000),(first_baseman,2001),(test,3000)},[on_base_percentage#0.297,games#26,home_runs#7,test#1.2]) --> valid
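An empty tuple tells you that a line failed, but not which one. If you would rather see the rejected lines themselves, one option is to filter on the same pattern with Pig's MATCHES operator. This is only a minimal sketch: it assumes input.txt contains just the records (without the valid/invalid notes), the alias name rejected is made up for illustration, and the backslash escaping is copied unchanged from the script above.

```pig
-- Sketch only: list the raw lines that fail validation.
-- Assumes input.txt holds just the records, without the "--> valid/invalid" notes.
A = LOAD 'input.txt' AS (line:chararray);

-- MATCHES has to match the whole line, so the ^ and $ anchors are not needed here.
-- 'rejected' is a hypothetical alias name.
rejected = FILTER A BY NOT (line MATCHES '([a-zA-Z]+\\s+[a-zA-Z]+)\\s*\\|\\s*([a-zA-Z]+)\\s*\\|\\s*(\\{(?:\\([a-zA-Z_]+,[0-9]+\\))(?:,\\([a-zA-Z_]+,[0-9]+\\))*\\})\\s*\\|\\s*(\\[(?:[a-zA-Z_]+#[0-9\\.]+)(?:,[a-zA-Z_]+#[0-9\\.]+)*\\])');

DUMP rejected;
```

Dropping the NOT gives the complementary relation of lines that pass, which could then be fed to the extraction step.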
apache-pig