How to remove star * from string using regex in PySpark


I have just started with PySpark, and here is my task:

I have the following input (posted as an image):

I need to use a regex to remove punctuation, any leading or trailing spaces, and underscores. The output should be lowercase.

What I came up with is not complete:

sentence = regexp_replace(trim(lower(column)), '\\*\s\w\s*\\*_', '') 

And this is the result (posted as an image):

How can I fix the regex here? I need to use regexp_replace.

Thank you very much.

You may use

^\W+|\W+$|[^\w\s]+|_ 

The ^ and $ anchors must match at line start/end.
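As a minimal sketch of how that pattern (with uppercase \W for non-word chars) could be dropped into the question's expression: clean_column is a hypothetical helper name, the (?m) prefix is an assumption added so that ^ and $ match at every line, and the re.sub line is only a local preview with Python's re module, not Spark's Java regex engine.

```python
import re

# First pattern from the answer. The (?m) prefix is an addition (assumption)
# that turns on multiline mode so ^ and $ match at every line boundary.
PATTERN = r"(?m)^\W+|\W+$|[^\w\s]+|_"


def clean_column(name):
    """Return a Column that lowercases, trims, and strips punctuation/underscores.

    Hypothetical helper: it keeps the call order used in the question.
    """
    # Imported here so the pattern above can be checked without Spark installed.
    from pyspark.sql.functions import col, lower, regexp_replace, trim

    return regexp_replace(trim(lower(col(name))), PATTERN, "")


# Quick local preview with Python's re module (a stand-in, not Spark itself).
preview = re.sub(PATTERN, "", "**Hello_ World!".lower())
print(preview)  # hello world
```

With a DataFrame df that has a string column named sentence (both names are placeholders), this could be applied as df.withColumn("sentence", clean_column("sentence")).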

If the pattern must not overflow across lines, replace \W+$ with [^\w\n]+$ and ^\W+ with ^[^\w\n]+:

^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_ 

See the regex demo.

Explanation:

  • ^ - start of a line (if the multiline option is on by default; otherwise, try adding (?m) at the pattern start)
  • [^\w\n]+ - 1 or more non-word chars (non-[a-zA-Z0-9_]) except a newline
  • | - or
  • [^\w\n]+$ - 1 or more non-word chars except a newline at the end of a line ($)
  • | - or
  • [^\w\s]+ - 1 or more non-word chars except whitespace
  • | - or
  • _ - an underscore.
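To see what "overflow across lines" means in practice, here is a small check that uses Python's re module as a stand-in for Spark's Java regex engine (an assumption: re.MULTILINE plays the role of (?m), and re.ASCII mimics Java's default ASCII-only \w and \s). The helper name and the sample text are made up for illustration.

```python
import re

# \W also matches newlines, so a match can run past the end of a line.
OVERFLOWING = r"^\W+|\W+$|[^\w\s]+|_"
# [^\w\n] excludes newlines, so every match stays inside its own line.
LINE_SAFE = r"^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_"


def strip_punct(pattern, text):
    """Remove everything the pattern matches (hypothetical helper)."""
    # MULTILINE: ^ and $ match at every line boundary.
    # ASCII: \w and \s only cover ASCII, as in Java regex by default.
    return re.sub(pattern, "", text, flags=re.MULTILINE | re.ASCII)


text = "**hello_ world!\n\n  spark_sql?  "

overflowing_result = strip_punct(OVERFLOWING, text)
line_safe_result = strip_punct(LINE_SAFE, text)

print(repr(overflowing_result))  # 'hello worldsparksql'
print(repr(line_safe_result))    # 'hello world\n\nsparksql'
```

The first pattern swallows the line breaks together with the punctuation, which glues the two lines together; the second one leaves the line structure intact.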

If you do not care about Unicode (I used \w and \s, which can be made Unicode-aware), you may use a shorter, simpler pattern:

^[^a-zA-Z\n]+|[^a-zA-Z\n]+$|[^a-zA-Z\s]+ 

See this regex demo.
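A rough comparison of the two variants, sketched with Python's re module rather than Spark (an assumption: for str patterns Python's \w and \s are Unicode-aware by default, which stands in for a Unicode-aware Java pattern, typically enabled there with the (?U) flag). The sample text is made up. Note that the letters-only pattern also drops digits and non-ASCII letters, not just punctuation.

```python
import re

# \w-based pattern; Unicode-aware here because Python str patterns are by default.
UNICODE_AWARE = r"^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_"
# Shorter pattern that keeps only ASCII letters and inner whitespace.
ASCII_ONLY = r"^[^a-zA-Z\n]+|[^a-zA-Z\n]+$|[^a-zA-Z\s]+"

text = "  **café_123!  "

kept_unicode = re.sub(UNICODE_AWARE, "", text, flags=re.MULTILINE)
kept_ascii = re.sub(ASCII_ONLY, "", text, flags=re.MULTILINE)

print(kept_unicode)  # café123
print(kept_ascii)    # caf
```

So the shorter pattern is only a safe swap when the data is known to contain plain ASCII letters and nothing else worth keeping.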

