regex - What regular expression will match this (XML-ish) input pattern?
I have a requirement where I need to parse an XML fragment that looks like this:
<tag name="books">books1</tag>
<tag name="textbooks"> textbooks1</tag>
<tag name="textbooks"> textbooks2</tag>
<tag name="textbooks"> textbooks3</tag>
<tag name="textbooks"> textbooks4</tag>
<tag name="textbooks"> textbooks5</tag>
<tag name="books">books2</tag>
<tag name="textbooks"> textbooks1</tag>
<tag name="textbooks"> textbooks2</tag>
<tag name="books">books3</tag>
<tag name="textbooks"> textbooks4</tag>
<tag name="textbooks"> textbooks5</tag>
I need all the tags with name="textbooks", including the <tag name="books"></tag> that precedes them, up to the last textbooks tag before the next <tag name="books"></tag>.
So the results should be as follows:
<tag name="books">books1</tag>
<tag name="textbooks"> textbooks1</tag>
<tag name="textbooks"> textbooks2</tag>
<tag name="textbooks"> textbooks3</tag>
<tag name="textbooks"> textbooks4</tag>
<tag name="textbooks"> textbooks5</tag>

<tag name="books">books2</tag>
<tag name="textbooks"> textbooks1</tag>
<tag name="textbooks"> textbooks2</tag>

<tag name="books">books3</tag>
<tag name="textbooks"> textbooks4</tag>
<tag name="textbooks"> textbooks5</tag>
If the question is nothing more than "which regular expression will match <tag name="books">", the reply is <tag name="books">.
Your output illustration looks like you want to insert an empty line before each occurrence except the first, so maybe you are looking for something like
sed '1b;/<tag name="books">/i\ ' xml-fragment.txt
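For readers not using sed, the same idea (an empty line before every books tag except one on the first line) can be sketched in Python. This is only an illustration: the function name add_blank_lines and its marker parameter are made up here and do not come from the original answer.

```python
def add_blank_lines(lines, marker='<tag name="books">'):
    """Return a copy of lines with an empty line before each marker line.

    The first line is never preceded by an empty line, mirroring the
    "except the first" behaviour described above.
    """
    result = []
    for index, line in enumerate(lines):
        if index > 0 and marker in line:
            result.append("")
        result.append(line)
    return result


sample_lines = [
    '<tag name="books">books1</tag>',
    '<tag name="textbooks"> textbooks1</tag>',
    '<tag name="books">books2</tag>',
    '<tag name="textbooks"> textbooks1</tag>',
]
spaced_lines = add_blank_lines(sample_lines)
print("\n".join(spaced_lines))
```

With the four sample lines, a single empty line is added before the books2 line and nothing else changes.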
If you mean that you want to capture each grouping of name="textbooks" tags along with the preceding name="books" tag, and their respective contents, you are looking for something like
(<tag name="books">[^<>]*(?:</tag>\s*<tag name="textbooks">[^<>]*)*</tag>)
where \s matches whitespace (including newlines) in regex implementations which include the Perl extensions (so, not sed, but most modern programming languages, including PHP [for which, include here a witty remark about its suitability for ... things], and Python).
Note that many regex implementations are line-oriented by default; applying the above multi-line regular expression to the input a single line at a time will not work. Assuming you are doing something like
lines = file.read()
re.match(regex, lines)
you should find what you want.
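As a minimal sketch of applying that expression to the whole input at once in Python: the names BOOK_GROUP, SAMPLE and find_book_groups are illustrative, and the sample text is a shortened version of the input from the question. Note that re.match only tries to match at the very start of the string, so this sketch uses findall to collect every group.

```python
import re

# The regular expression from the answer, compiled once.
BOOK_GROUP = re.compile(
    r'(<tag name="books">[^<>]*(?:</tag>\s*<tag name="textbooks">[^<>]*)*</tag>)'
)

# Shortened sample input; the real data would come from file.read().
SAMPLE = """<tag name="books">books1</tag>
<tag name="textbooks"> textbooks1</tag>
<tag name="textbooks"> textbooks2</tag>
<tag name="books">books2</tag>
<tag name="textbooks"> textbooks1</tag>
<tag name="books">books3</tag>
"""


def find_book_groups(text):
    """Return each books tag together with the textbooks tags that follow it."""
    return BOOK_GROUP.findall(text)


groups = find_book_groups(SAMPLE)
for group in groups:
    print(group)
    print()  # empty line between groups
```

Because \s also matches newlines, each match spans several lines of the input, which is why the text has to be read as a single string instead of line by line.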
As indicated in the comments, you should use XML tools for XML input. If your input isn't proper XML, maybe you can preprocess it so that it is, and postprocess to remove whatever the preprocessor had to add in order to make it acceptable to the XML processing pipeline.
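One possible way to do that in Python, assuming the only thing wrong with the fragment is that it lacks a single root element: wrap it in a synthetic root, parse it with the standard library's xml.etree.ElementTree, and group the results. The function name group_by_book and the wrapper element name root are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET


def group_by_book(fragment):
    """Group textbooks entries under the books entry that precedes them.

    Preprocessing step: the fragment is wrapped in a synthetic <root>
    element so that it becomes well-formed XML. The wrapper never appears
    in the returned data, so no separate postprocessing is needed here.
    """
    root = ET.fromstring("<root>" + fragment + "</root>")
    groups = []
    for element in root.iter("tag"):
        name = element.get("name")
        text = (element.text or "").strip()
        if name == "books":
            groups.append((text, []))
        elif name == "textbooks" and groups:
            groups[-1][1].append(text)
    return groups


fragment = (
    '<tag name="books">books1</tag> '
    '<tag name="textbooks"> textbooks1</tag> '
    '<tag name="textbooks"> textbooks2</tag> '
    '<tag name="books">books2</tag> '
    '<tag name="textbooks"> textbooks1</tag>'
)
book_groups = group_by_book(fragment)
print(book_groups)
```

Unlike the regular expression, this approach does not depend on how the whitespace between tags is laid out, and the parser raises an error on malformed input.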