I've just started with PySpark, and here is my task:
I have this input:
I need to use a regex to remove punctuation, leading and trailing spaces, and underscores. The output should be lowercase.
What I came up with is not complete:
sentence = regexp_replace(trim(lower(column)), '\\*\s\w\s*\\*_', '') and the result is:
How do I fix the regex here? I need to use regexp_replace.
Thank you very much.
You may use
^\W+|\W+$|[^\w\s]+|_ where the ^ and $ anchors must match at line start/end.
If the pattern must not overflow across lines, replace \W+$ with [^\w\n]+$ and the ^\W+ pattern with ^[^\w\n]+:
^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_ See the regex demo.
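Here is a minimal sketch of how the pattern behaves, using Python's re module as a stand-in for Spark's Java regex engine (the constructs used here should behave the same way in both). The clean helper and the sample strings are made up for illustration. In PySpark, the same pattern string would be passed as the second argument of regexp_replace, applied to trim(lower(column)).

```python
import re

# Pattern from the answer: leading/trailing non-word chars,
# any remaining punctuation, and underscores.
PATTERN = r"^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_"


def clean(text: str) -> str:
    """Hypothetical helper: lowercase, trim, then strip everything PATTERN matches."""
    return re.sub(PATTERN, "", text.strip().lower())


print(clean("  Hello, World_!  "))
```

Inner whitespace is kept on purpose: only punctuation and underscores are removed between words.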
Explanation:
^ - start of line (if the multiline option is on by default; otherwise, try adding (?m) at the pattern start)
[^\w\n]+ - 1 or more non-word chars (non-[a-zA-Z0-9_]) except a newline
| - or
[^\w\n]+$ - 1 or more non-word chars, except a newline, at the end of a line ($)
| - or
[^\w\s]+ - 1 or more non-word chars except whitespace
| - or
_ - an underscore.
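To illustrate the note about the multiline option: without it, ^ and $ anchor only at the start and end of the whole string, so whitespace around inner line breaks survives. This is a small sketch with Python's re module standing in for Java regex (where the inline (?m) flag has the same meaning); the sample text is invented.

```python
import re

PATTERN = r"^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_"
text = "  first line  \n  second line  "

# Default mode: ^ and $ anchor only at the ends of the whole string,
# so the spaces next to the inner newline are left alone.
no_flag = re.sub(PATTERN, "", text)

# With (?m) at the pattern start, ^ and $ anchor at every line,
# so each line is trimmed individually.
with_flag = re.sub("(?m)" + PATTERN, "", text)

print(repr(no_flag))
print(repr(with_flag))
```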
If you do not care about Unicode (I used \w and \s, which can be made Unicode-aware), you may use a shorter, simpler pattern:
^[^a-zA-Z\n]+|[^a-zA-Z\n]+$|[^a-zA-Z\s]+ (note that this one also removes digits). See this regex demo.
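A quick sketch comparing the two patterns on an invented sample string, again with Python's re module as a stand-in. One caveat: in Python 3, \w is Unicode-aware by default for str patterns, whereas in Java it is ASCII-only unless a flag such as (?U) is set, so treat the non-ASCII part of the result as illustrative only.

```python
import re

WORD_PATTERN = r"^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_"
ASCII_PATTERN = r"^[^a-zA-Z\n]+|[^a-zA-Z\n]+$|[^a-zA-Z\s]+"

# "caf\u00e9" is "cafe" with an accented e, to show the Unicode difference.
sample = "caf\u00e9_no.9!"

# \w-based pattern: keeps letters (including the accented one here) and digits.
word_result = re.sub(WORD_PATTERN, "", sample)

# ASCII-only pattern: keeps a-z/A-Z only, so digits and accented letters go too.
ascii_result = re.sub(ASCII_PATTERN, "", sample)

print(word_result)
print(ascii_result)
```

If digits should survive with the shorter pattern, the character classes would need 0-9 added; that is a variation, not what the pattern above does.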

