Sunday, 15 September 2013

regex - Splitting a string containing special characters, words, numbers and a URL



I have a .txt file that contains:

"'the url address i checked is: https://www.google.com/ for 2times and it's awesome!."

After parsing, the expected output should be:

['"',"'",'the','url','address','i','checked','is',':','https://www.google.com/','for','2','times','and',"it's",'awesome','!','.','"']

How can I split the string into this list using the re module?

I came up with this pattern:

pattern = re.compile(r"\d+|[a-zA-Z]+[a-zA-Z']*|[^\w\s]")

But it also splits the URL into pieces. Can anyone please help?

Just pick a URL regex from somewhere and put it first in the alternation, ahead of the number, word and punctuation alternatives. For example:

# (?!mailto:)(?:(?:https?|ftp)://)?(?:\S+(?::\S*)?@)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))|localhost)(?::\d{2,5})?(?:/[^\s]*)?|\d+|[a-zA-Z]+[a-zA-Z']*|[^\w\s]

The same pattern, expanded for readability:

 (?!mailto:)
 (?:
      (?: https? | ftp )
      ://
 )?
 (?:
      \S+
      (?: : \S* )?
      @
 )?
 (?:
      (?:
           (?: [1-9] \d? | 1 \d\d | 2 [01] \d | 22 [0-3] )
           (?:
                \.
                (?: 1? \d{1,2} | 2 [0-4] \d | 25 [0-5] )
           ){2}
           (?:
                \.
                (?: [1-9] \d? | 1 \d\d | 2 [0-4] \d | 25 [0-4] )
           )
        |
           (?:
                (?: [a-z\u00a1-\uffff0-9]+ -? )*
                [a-z\u00a1-\uffff0-9]+
           )
           (?:
                \.
                (?: [a-z\u00a1-\uffff0-9]+ -? )*
                [a-z\u00a1-\uffff0-9]+
           )*
           (?:
                \.
                (?: [a-z\u00a1-\uffff]{2,} )
           )
      )
   |  localhost
 )
 (?: : \d{2,5} )?
 (?: / [^\s]* )?
 |  \d+
 |  [a-zA-Z]+ [a-zA-Z']*
 |  [^\w\s]

Output:

['"',"'",'the','url','address','i','checked','is',':','https://www.google.com/','for','2','times','and',"it's",'awesome','!','.','"']
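
The "URL first" idea can be tried out with a much shorter pattern. The following is only a minimal sketch, not the full URL regex shown above: it assumes every URL starts with an explicit scheme (http, https or ftp) and runs up to the next whitespace. The names URL and TOKEN are made up for this illustration, and the sample sentence is spelled out so that it matches the expected token list.

```python
import re

# Simplified stand-in for a full URL regex (an assumption of this sketch):
# a scheme, "://", then everything up to the next whitespace character.
URL = r"(?:https?|ftp)://\S+"

# The URL alternative is listed first, so at any position where a URL starts
# it is tried before the number, word and punctuation alternatives.
TOKEN = re.compile(URL + r"|\d+|[a-zA-Z]+[a-zA-Z']*|[^\w\s]")

text = "\"'the url address i checked is: https://www.google.com/ for 2times and it's awesome!.\""

# findall returns whole matches here because the pattern has no capturing groups.
tokens = TOKEN.findall(text)
print(tokens)
```

Because regex alternation tries the alternatives left to right at each position, the URL is consumed as one token before the word and punctuation branches can break it apart. One limitation of this simplified URL pattern: \S+ also swallows punctuation that directly follows a URL (for example a trailing comma), which a stricter URL regex would handle better.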

regex python-3.x
