Python - BeautifulSoup with get_text - handle spaces

I am using bs4 (Python 3) to extract text from an HTML file. The file looks like this:

<body> <p>hello         world!</p> </body> </html>

When I call the get_text() method, the output is "hello         world!", with every space preserved. Because this is HTML, I expected "hello world!" (in HTML, a run of two or more spaces is rendered as a single space).

The same applies to this situation, where the text contains a line break:

<body>
<p>hello 
 world!</p>
</body>
</html>

I expected to get "hello world!" and not "hello \n world!".

How can I achieve this?

The problem is that neither get_text(strip=True) nor joining .stripped_strings works here, because in the second case there is a single NavigableString in the p element, and its value is hello\n world!. In other words, the newline is inside the text node.
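To make this concrete, here is a minimal check, assuming the second snippet from the question is parsed with html.parser. The variable names are illustrative only:

```python
from bs4 import BeautifulSoup

# The second snippet from the question, with the newline inside the <p>.
html = "<body>\n<p>hello \n world!</p>\n</body>\n</html>"
soup = BeautifulSoup(html, "html.parser")

# The p element holds exactly one text node, newline included.
children = list(soup.p.children)
print(len(children))
print(repr(children[0]))

# strip=True only trims the ends of each string, so the inner newline stays.
stripped = soup.p.get_text(strip=True)
print(repr(stripped))

# stripped_strings yields that same single string, trimmed at the ends.
joined = " ".join(soup.p.stripped_strings)
print(repr(joined))
```

Both calls only trim whitespace at the boundaries of each string, so whitespace in the middle of a text node is left as it is.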

In this case, you have to replace the newlines manually:

soup.p.get_text().replace("\n", "")
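Removing only the newline can still leave two adjacent spaces behind, and it does nothing for runs of plain spaces. One possible way to get closer to what a browser displays is sketched below, assuming every run of whitespace should collapse to a single space. The helper name collapse_whitespace is hypothetical and not part of bs4:

```python
from bs4 import BeautifulSoup


def collapse_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and ignores leading/trailing whitespace.
    return " ".join(text.split())


html = "<p>hello         world!</p><p>hello \n world!</p>"
soup = BeautifulSoup(html, "html.parser")

results = [collapse_whitespace(p.get_text()) for p in soup.find_all("p")]
print(results)
```

This treats the whole text of an element as one flow, so it would not suit content where whitespace is significant, such as pre elements.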

Or, to also handle br elements (replacing them with newlines), you can write a conversion function that prepares the text for you:

from bs4 import BeautifulSoup, NavigableString

data = """
<body>
<p>hello
 world!</p>
<p>hello <br/>
 world!</p>
</body>
</html>
"""

def replace_with_newlines(element):
    # Build the text from the element's direct children only.
    text = ''
    for elem in element.children:
        if isinstance(elem, NavigableString):
            # Text node: drop embedded newlines, then trim both ends.
            text += elem.replace("\n", "").strip()
        elif elem.name == 'br':
            # A br element becomes an explicit line break.
            text += '\n'
    return text

soup = BeautifulSoup(data, "html.parser")
for p in soup.find_all("p"):
    print(replace_with_newlines(p))

This prints (no newlines in the first case, a single newline in the second):

hello world!
hello
world!
