How to remove star * from string using regex in PySpark

I just started with PySpark, and here is the task:

I have an input of:

I need to use a regex to remove punctuation, leading or trailing spaces, and underscores. The output should be lowercase.

What I came up with is not complete:

sentence = regexp_replace(trim(lower(column)), '\\*\s\w\s*\\*_', '')

And the result is:

How do I fix the regex here? I need to use regexp_replace here.

Thank you very much.

You may use

^\W+|\W+$|[^\w\s]+|_

The ^ and $ anchors must match line start/end.

If the pattern must not overflow across lines, replace \W+$ with [^\w\n]+$ and ^\W+ with ^[^\w\n]+:

^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_

Explanation:

^ - start of line (if the multiline option is on by default; otherwise, try adding (?m) at the pattern start)
[^\w\n]+ - 1 or more non-word chars (non-[a-zA-Z0-9_]) except newline
| - or
[^\w\n]+$ - 1 or more non-word chars except newline at the end of a line ($)
| - or
[^\w\s]+ - 1 or more non-word chars except whitespace
| - or
_ - underscore.
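
The alternation above can be tried out locally before running it on a cluster. The following is only a sketch using Python's built-in re module, not Spark itself: the function name clean and the sample strings are made up for illustration, and (?m) is added explicitly on the assumption that multiline mode is not on by default. Note that Python's \w and \s are Unicode-aware by default, while the Java regex engine Spark uses treats them as ASCII unless told otherwise, so results may differ for non-ASCII input.

```python
import re

# Line-safe pattern from the answer. (?m) is added so that ^ and $
# match at every line start/end instead of only at the string ends.
PATTERN = r"(?m)^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_"


def clean(text: str) -> str:
    """Lowercase the text, then delete everything the pattern matches."""
    return re.sub(PATTERN, "", text.lower())


# Hypothetical sample inputs, not taken from the question.
print(repr(clean("  *Hello, World!*  ")))    # 'hello world'
print(repr(clean("snake_case_name")))        # 'snakecasename'
print(repr(clean("  *foo*  \n  _bar_  ")))   # 'foo\nbar'

# Assumed PySpark equivalent (the column name "sentence" is hypothetical):
#   from pyspark.sql.functions import col, lower, regexp_replace
#   df.withColumn("sentence",
#                 regexp_replace(lower(col("sentence")), PATTERN, ""))
```

Because the first two alternatives already strip leading and trailing whitespace, a separate trim call should not be needed with this pattern. Inner whitespace is left untouched, so removing punctuation that sits between two spaces leaves both spaces behind.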

If you do not care about Unicode (I used \w and \s, which can be made Unicode-aware), you may use a shorter, simpler pattern:

^[^a-zA-Z\n]+|[^a-zA-Z\n]+$|[^a-zA-Z\s]+
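
To see what the shorter pattern trades away, here is a small sketch, again with Python's re module rather than Spark. It writes the character range as a-zA-Z and adds (?m) explicitly; the name clean_ascii and the sample strings are hypothetical. Because anything that is not an ASCII letter or whitespace is deleted, digits, underscores, and accented letters are removed as well.

```python
import re

# ASCII-only variant: everything except a-z, A-Z and inner whitespace is
# removed. (?m) makes ^ and $ apply to every line.
ASCII_PATTERN = r"(?m)^[^a-zA-Z\n]+|[^a-zA-Z\n]+$|[^a-zA-Z\s]+"


def clean_ascii(text: str) -> str:
    """Lowercase the text, then delete everything the pattern matches."""
    return re.sub(ASCII_PATTERN, "", text.lower())


# Hypothetical sample inputs, not taken from the question.
print(repr(clean_ascii("  *Hello, World!*  ")))  # 'hello world'
print(repr(clean_ascii("abc_123_def")))          # 'abcdef' (digits are dropped too)
print(repr(clean_ascii("Caf\u00e9!")))           # 'caf' (the accented letter is dropped)
```

If digits or non-English letters must survive, the \w-based pattern is the safer choice.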
