I just started with PySpark; here is the task:
I have an input of:
I need to use a regex to remove punctuation, leading or trailing spaces, and underscores. The output should be lowercase.
What I came up with is not complete:
sentence = regexp_replace(trim(lower(column)), '\\*\s\w\s*\\*_', '')
and the result is:
How do I fix the regex here? I need to use regexp_replace.
Thank you very much.
You may use
^\W+|\W+$|[^\w\s]+|_
The ^ and $ anchors must match at line start/end.
If the pattern must not overflow across lines, replace \W+$ with [^\w\n]+$ and ^\W+ with ^[^\w\n]+:
^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_
See the regex demo.
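A minimal sketch of how the line-safe pattern behaves, using Python's re module as a stand-in for the regex engine behind regexp_replace (an assumption: the two engines agree on this pattern for plain ASCII input). The clean_sentence helper and the sample string are hypothetical, not from the question.

```python
import re

# Line-safe pattern from the answer; (?m) makes ^ and $ match at every line.
PATTERN = r"(?m)^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_"

def clean_sentence(text: str) -> str:
    """Lowercase, then drop punctuation, underscores and edge whitespace."""
    return re.sub(PATTERN, "", text.lower())

sample = "  Hello, World_! How are you?  "  # hypothetical input
print(repr(clean_sentence(sample)))  # 'hello world how are you'

# In PySpark the same pattern string would be passed roughly like this
# (the column name "sentence" is an assumption):
#   regexp_replace(trim(lower(col("sentence"))), PATTERN, "")
```

Inner whitespace is kept on purpose: only whitespace that touches the start or end of a line is removed.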
Explanation:
^ - start of line (if the multiline option is ON by default; otherwise, try adding (?m) at the pattern start)
[^\w\n]+ - 1 or more non-word chars (chars other than [a-zA-Z0-9_]) except a newline
| - or
[^\w\n]+$ - 1 or more non-word chars except a newline at the end of a line ($)
| - or
[^\w\s]+ - 1 or more non-word chars except whitespace
| - or
_ - an underscore.
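To make the note about (?m) concrete, here is a small sketch comparing the pattern with and without the multiline flag on a two-line string. It uses Python's re module, where multiline mode is off by default; Java's regex engine, which Spark relies on, also has it off by default as far as I know. The input string is a made-up example.

```python
import re

BASE = r"^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_"
text = " foo! \n bar? "  # hypothetical two-line input

# Without (?m): ^ and $ only anchor at the start/end of the whole string,
# so whitespace around the inner line break survives.
without_flag = re.sub(BASE, "", text)

# With (?m): ^ and $ anchor at every line, so each line is trimmed.
with_flag = re.sub("(?m)" + BASE, "", text)

print(repr(without_flag))  # 'foo \n bar'
print(repr(with_flag))     # 'foo\nbar'
```

If every value is a single line, both variants give the same result, and the flag only matters for multi-line strings.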
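If you do not care about Unicode (I used \w and \s, which can be made Unicode aware), you may use a shorter, simpler pattern:
^[^a-zA-Z\n]+|[^a-zA-Z\n]+$|[^a-zA-Z\s]+
Note that this pattern also removes digits and underscores, since neither is in a-zA-Z.
See this regex demo.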
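The trade-off between the two patterns can be sketched as follows. This again uses Python's re module, where \w is already Unicode-aware for str input; in a Java-based engine that behaviour would need to be switched on, so treat the first result as what the "Unicode aware" variant would give. The sample string is hypothetical.

```python
import re

UNICODE_AWARE = r"^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_"
ASCII_ONLY = r"^[^a-zA-Z\n]+|[^a-zA-Z\n]+$|[^a-zA-Z\s]+"

sample = "  Héllo, World_2!  ".lower()  # hypothetical input

# \w keeps accented letters and digits; only punctuation and "_" go.
print(re.sub(UNICODE_AWARE, "", sample))  # héllo world2

# [^a-zA-Z...] treats everything outside a-z/A-Z as removable,
# so the accented letter and the digit are dropped too.
print(re.sub(ASCII_ONLY, "", sample))     # hllo world
```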