I am using bs4 (Python 3) to extract text from an HTML file. The file looks like this:
    <body>
    <p>hello     world!</p>
    </body>
    </html>

When I call the get_text() method, the output is "hello     world!". Because it's HTML, I expected "hello world!" (in HTML, two or more consecutive spaces are collapsed into one).
This is also relevant in this situation:

    <body>
    <p>hello
    world!</p>
    </body>
    </html>

I expected to find "hello world!" but got "hello \n world!".
How can I achieve my goal?
The problem is that neither get_text(strip=True) nor joining .stripped_strings will work here, because in the second case there is a single NavigableString in the p element, and its value is "hello\n world!". In other words, the newline is inside the text node.
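To make this concrete, here is a minimal sketch that inspects the text node directly. The `html` string is an assumed stand-in for the second file in the question, with the newline written as an escape so it is visible:

```python
from bs4 import BeautifulSoup

# Assumed sample input: one <p> whose text contains a newline.
html = "<body><p>hello \n world!</p></body>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# The paragraph holds exactly one text node.
strings = list(p.strings)

# Both approaches only strip the ends of each text node,
# so the newline in the middle survives.
stripped = p.get_text(strip=True)
joined = " ".join(p.stripped_strings)

print(len(strings))
print(repr(stripped))
print(repr(joined))
```

Stripping works per text node, so it can only remove whitespace at the start and end of that single string.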
In this case, you have to replace the newlines manually:
    soup.p.get_text().replace("\n", "")

Or, to also handle br elements (replacing them with newlines), you can write a conversion function that prepares the text for you:
    from bs4 import BeautifulSoup, NavigableString

    # The "\n" in the first paragraph is a real newline inside its text node.
    data = """
    <body>
        <p>hello \nworld!</p>
        <p>hello <br/> world!</p>
    </body>
    </html>
    """


    def replace_with_newlines(element):
        text = ''
        for elem in element.children:
            if isinstance(elem, NavigableString):
                # drop newlines embedded in the text node, then trim both ends
                text += elem.replace("\n", "").strip()
            elif elem.name == 'br':
                # a br element becomes an explicit newline
                text += '\n'
        return text


    soup = BeautifulSoup(data, "html.parser")
    for p in soup.find_all("p"):
        print(replace_with_newlines(p))

This prints (no newlines in the first case, a single newline in the second):
    hello world!
    hello
    world!
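Removing newlines does not address the first case in the question, where several spaces in a row should become one. A minimal sketch of one way to handle both cases is to split on whitespace and rejoin with single spaces. The helper name `collapse_whitespace` is hypothetical, and this only approximates what a browser does: `str.split()` treats any Unicode whitespace, including non-breaking spaces, as collapsible.

```python
def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with one space and trim both ends."""
    # str.split() with no arguments splits on runs of any whitespace
    # (spaces, tabs, newlines) and discards empty pieces.
    return " ".join(text.split())


# Sample strings standing in for what get_text() returns in the question.
print(collapse_whitespace("hello     world!"))
print(collapse_whitespace("hello \n world!"))
```

With a parsed document, you would pass it the result of get_text(), for example collapse_whitespace(soup.p.get_text()).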