Tuesday, 15 February 2011

regex - What regular expression will match this (XML-ish) input pattern



I have a requirement to parse an XML fragment that looks like this:

<tag name="books">books1</tag>
<tag name="textbooks"> textbooks1</tag>
<tag name="textbooks"> textbooks2</tag>
<tag name="textbooks"> textbooks3</tag>
<tag name="textbooks"> textbooks4</tag>
<tag name="textbooks"> textbooks5</tag>
<tag name="books">books2</tag>
<tag name="textbooks"> textbooks1</tag>
<tag name="textbooks"> textbooks2</tag>
<tag name="books">books3</tag>
<tag name="textbooks"> textbooks4</tag>
<tag name="textbooks"> textbooks5</tag>

I need all the tags with name="textbooks", together with the <tag name="books"></tag> that precedes them, up to the last textbooks tag before the next <tag name="books"></tag>.

So the results should be as follows:

<tag name="books">books1</tag>
<tag name="textbooks"> textbooks1</tag>
<tag name="textbooks"> textbooks2</tag>
<tag name="textbooks"> textbooks3</tag>
<tag name="textbooks"> textbooks4</tag>
<tag name="textbooks"> textbooks5</tag>

<tag name="books">books2</tag>
<tag name="textbooks"> textbooks1</tag>
<tag name="textbooks"> textbooks2</tag>

<tag name="books">books3</tag>
<tag name="textbooks"> textbooks4</tag>
<tag name="textbooks"> textbooks5</tag>

If your question is nothing more than "which regular expression will match <tag name="books">", the answer is <tag name="books">.

Your example output looks like you want to insert an empty line before each occurrence except the first, so maybe you are looking for something like

sed '1b;/<tag name="books">/i\ ' xml-fragment.txt
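For readers who are not working in sed, the same idea can be sketched in Python. This is only an illustration, assuming the input has one tag per line; the function name `insert_blank_lines` and its `marker` parameter are made up for this example.

```python
def insert_blank_lines(lines, marker='<tag name="books">'):
    """Insert an empty line before every line containing marker, except line 1."""
    out = []
    for number, line in enumerate(lines, start=1):
        # Like sed's "1b", the first line is passed through untouched.
        if number > 1 and marker in line:
            out.append("")
        out.append(line)
    return out


# Small hypothetical sample, one tag per line.
sample = [
    '<tag name="books">books1</tag>',
    '<tag name="textbooks"> textbooks1</tag>',
    '<tag name="books">books2</tag>',
    '<tag name="textbooks"> textbooks2</tag>',
]
result = insert_blank_lines(sample)
print("\n".join(result))
```

Like the sed command, this only separates the groups visually; it does not capture them.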

If you mean you want to capture each grouping of name="textbooks" tags along with the preceding name="books" tag, and their respective contents, you may be looking for something like

(<tag name="books">[^<>]*(?:</tag>\s*<tag name="textbooks">[^<>]*)*</tag>)

where \s matches whitespace (including newlines) in regex implementations which include the Perl extensions (so, not sed, but most modern programming languages, including PHP [insert here a witty remark about its suitability ... for things] and Python).

Note that many regex implementations are line-oriented by default, so applying the above multi-line regular expression to the input one line at a time will not work. Assuming you are doing something like

lines = file.read()       # read the whole input as a single string
re.match(regex, lines)    # match against the full multi-line text

you should find what you want.
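A fuller sketch of that approach might look like the following. The sample text is a shortened, hypothetical version of the input, and `re.findall` is used here instead of `re.match` because `re.match` only matches at the start of the string and would return just the first group.

```python
import re

# The grouping regex from above, compiled once.
GROUP = re.compile(
    r'(<tag name="books">[^<>]*(?:</tag>\s*<tag name="textbooks">[^<>]*)*</tag>)'
)

# Hypothetical sample; in practice this would come from file.read().
text = """<tag name="books">books1</tag>
<tag name="textbooks"> textbooks1</tag>
<tag name="textbooks"> textbooks2</tag>
<tag name="books">books2</tag>
<tag name="textbooks"> textbooks3</tag>
<tag name="books">books3</tag>
"""

# With a single capture group, findall returns a list of the captured strings.
groups = GROUP.findall(text)
for group in groups:
    print(group)
    print()
```

Each element of `groups` is one books tag followed by its textbooks tags. A books tag with no textbooks after it still matches on its own.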

As indicated in the comments, you should use XML tools for XML input. If your input isn't proper XML, maybe you can preprocess it so that it is, and postprocess to remove whatever the preprocessor had to add in order to make it acceptable to the XML processing pipeline.
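As a rough sketch of the preprocessing idea, assuming the only problem with the fragment is that it lacks a single root element, it can be wrapped in a made-up `<root>` element and parsed with Python's standard `xml.etree.ElementTree`. The wrapper name and the sample data are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample fragment with several top-level elements.
fragment = """<tag name="books">books1</tag>
<tag name="textbooks"> textbooks1</tag>
<tag name="textbooks"> textbooks2</tag>
<tag name="books">books2</tag>
<tag name="textbooks"> textbooks3</tag>
"""

# Preprocess: add a single root element so the parser accepts the input.
root = ET.fromstring("<root>" + fragment + "</root>")

# Walk the tags in document order and attach each textbooks entry
# to the most recent books entry.
groups = []
for element in root.iter("tag"):
    text = (element.text or "").strip()
    if element.get("name") == "books":
        groups.append((text, []))
    elif element.get("name") == "textbooks" and groups:
        groups[-1][1].append(text)

for book, textbooks in groups:
    print(book, textbooks)
```

Because only the child elements are read, the wrapper never shows up in the results, so no separate postprocessing step is needed in this simple case.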
