I am using bs4 (Python 3) to extract text from an HTML file. My file looks like this:
<body> <p>Hello     World!</p> </body> </html>
When I call the get_text() method, my output is "Hello     World!". Because it's HTML, I expected "Hello World!" (in HTML, two or more consecutive spaces are rendered as one space).
The same applies to this situation:
<body> <p>Hello
 World!</p> </body> </html>
I expected to find "Hello World!", but I get "Hello \n World!".
How can I achieve this?
The problem is that neither get_text(strip=True) nor joining .stripped_strings will work here, because in the second case there is a single NavigableString in the p element, and its value is "Hello\n World!". In other words, the newline is inside the text node.
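A quick way to check this is to inspect the children of the p element directly. The snippet below is a minimal sketch; the `html` string is a made-up stand-in for the file from the question, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the second case in the question
html = "<body><p>Hello\n World!</p></body>"
soup = BeautifulSoup(html, "html.parser")

# The <p> element holds exactly one text node
children = list(soup.p.children)
print(len(children))

# Stripping only trims the ends of each text node,
# so the newline in the middle survives both approaches
stripped = soup.p.get_text(strip=True)
joined = " ".join(soup.p.stripped_strings)
print(repr(stripped))
print(repr(joined))
```

Both results still contain the newline, because it sits between the two words rather than at either end of the string.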
In this case, you have to replace the newlines manually:
soup.p.get_text().replace("\n", "")
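Note that removing newlines leaves runs of spaces untouched, so the first case from the question would still come out with several spaces. One possible way to cover both cases is to collapse every run of whitespace into a single space, roughly as a browser does for normal text. This is a sketch, not part of the approach above; `collapse_whitespace` is a hypothetical helper name, and the `html` string is made up for illustration.

```python
from bs4 import BeautifulSoup

def collapse_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops leading/trailing whitespace
    return " ".join(text.split())

# Hypothetical input covering both cases from the question
html = "<body><p>Hello     World!</p><p>Hello \n World!</p></body>"
soup = BeautifulSoup(html, "html.parser")

results = [collapse_whitespace(p.get_text()) for p in soup.find_all("p")]
print(results)  # ['Hello World!', 'Hello World!']
```

One caveat: str.split() also treats non-breaking spaces as whitespace, which HTML rendering does not collapse, and it ignores elements such as pre where whitespace is significant. Treat it as an approximation of browser behavior.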
Or, to also handle br elements (replacing them with newlines), you can write a conversion function that prepares the text for you:
from bs4 import BeautifulSoup, NavigableString

# The "\n" escape in the first <p> puts a newline inside its text node
data = """
<body>
<p>Hello\n World!</p>
<p>Hello <br/> World!</p>
</body>
</html>
"""

def replace_with_newlines(element):
    text = ''
    for elem in element.children:
        if isinstance(elem, NavigableString):
            # Drop newlines inside the text node, then trim its ends
            text += elem.replace("\n", "").strip()
        elif elem.name == 'br':
            # A <br/> element becomes an explicit newline
            text += '\n'
    return text

soup = BeautifulSoup(data, "html.parser")
for p in soup.find_all("p"):
    print(replace_with_newlines(p))
Prints (no newlines in the first case, a single newline in the second):
Hello World!
Hello
World!