I'm using BeautifulSoup 4 with Python 3.5+ to extract web data. I have the following HTML, from which I am extracting:
<div class="the-one-i-want">
    <p> content </p>
    <p> content </p>
    <p> content </p>
    <p> content </p>
    <ol>
        <li> list item </li>
        <li> list item </li>
    </ol>
    <div class="something-i-dont-want"> content </div>
    <script class="something-else-i-dont-want"> script </script>
    <p> content </p>
</div>
All of the content I want to extract is found within the <div class="the-one-i-want"> element. Right now, I'm using the following method, which works most of the time:
soup = BeautifulSoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').findAll('p')
This excludes scripts, weird inserted divs, and otherwise unpredictable content such as ads or "recommended content" type stuff.
Now, there are instances in which there are elements other than <p> tags that have content contextually important to the main content, such as lists.
Is there a way to get the content of <div class="the-one-i-want"> in a manner such as:
soup = BeautifulSoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').findAll(desired-content-elements)
where desired-content-elements would be inclusive of every element that I deem fit for that particular content? For example, all <p> tags, all <ol> and <li> tags, but no <div> or <script> tags.
Perhaps noteworthy is my method of saving the content:
content_string = ''
for p in content:
    content_string += str(p)
This approach collects the data in order of occurrence, which would prove difficult to manage if I found the different element types through separate iteration processes. If possible, I'd like to avoid having to re-assemble split lists back into the order in which each element occurred in the content.
You can pass a list of the tags you want:
content = soup.find('div', class_='the-one-i-want').find_all(["p", "ol", "whatever"])
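To illustrate the ordering, here is a minimal sketch. The sample markup is made up for this example (modelled on the question's HTML), and it uses the built-in "html.parser" instead of lxml only so it runs without extra dependencies. The matches should come back in document order, so no re-assembly is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the HTML in the question.
html = """
<div class="the-one-i-want">
  <p>intro</p>
  <ol><li>one</li><li>two</li></ol>
  <script>ignored()</script>
  <p>outro</p>
</div>
"""

# "html.parser" is an assumption made to keep the sketch dependency-free;
# the question itself uses 'lxml'.
soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="the-one-i-want")

# One call with a list of tag names returns every match in document order.
matches = container.find_all(["p", "ol", "li"])
tag_names = [el.name for el in matches]
print(tag_names)  # ['p', 'ol', 'li', 'li', 'p']
```

Note that str() of an <ol> already includes its <li> children, so if you concatenate the matches into one string you probably want to list only "ol" and not "li", otherwise the list items appear twice.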
If we run something similar on your question's URL, looking for p and pre tags, you can see we get both:
   ...: for ele in soup.select_one("td.postcell").find_all(["pre", "p"]):
   ...:     print(ele)
   ...:
<p>I'm using BeautifulSoup 4 with Python 3.5+ to extract web data. I have the following HTML, from which I am extracting:</p>
<pre><code><div class="the-one-i-want">
    <p> content </p>
    <p> content </p>
    <p> content </p>
    <p> content </p>
    <ol>
        <li> list item </li>
        <li> list item </li>
    </ol>
    <div class="something-i-dont-want"> content </div>
    <script class="something-else-i-dont-want"> script </script>
    <p> content </p>
</div>
</code></pre>
<p>All of the content I want to extract is found within the <code><div class="the-one-i-want"></code> element. Right now, I'm using the following method, which works most of the time:</p>
<pre><code>soup = BeautifulSoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').findAll('p')
</code></pre>
<p>This excludes scripts, weird inserted <code>div</code>s, and otherwise unpredictable content such as ads or "recommended content" type stuff.</p>
<p>Now, there are instances in which there are elements other than <code><p></code> tags that have content contextually important to the main content, such as lists.</p>
<p>Is there a way to get the content of <code><div class="the-one-i-want"></code> in a manner such as:</p>
<pre><code>soup = BeautifulSoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').findAll(desired-content-elements)
</code></pre>
<p>where <code>desired-content-elements</code> would be inclusive of every element that I deem fit for that particular content? For example, all <code><p></code> tags, all <code><ol></code> and <code><li></code> tags, but no <code><div></code> or <code><script></code> tags.</p>
<p>Perhaps noteworthy is my method of saving the content:</p>
<pre><code>content_string = ''
for p in content:
    content_string += str(p)
</code></pre>
<p>This approach collects the data in order of occurrence, which would prove difficult to manage if I found the different element types through separate iteration processes. If possible, I'd like to avoid having to re-assemble split lists back into the order in which each element occurred in the content.</p>
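One caveat: find_all searches all descendants by default, so a <p> nested inside one of the unwanted divs would still be matched. If the content you want always sits directly under the container, passing recursive=False may help. The following is only a sketch under that assumption: the sample markup, the extract_content helper, and the WANTED_TAGS name are made up for illustration, and "html.parser" stands in for lxml so the example has no extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; note the <p> hidden inside the unwanted <div>.
SAMPLE_HTML = """
<div class="the-one-i-want">
  <p>first</p>
  <ol><li>item one</li><li>item two</li></ol>
  <div class="something-i-dont-want"><p>ad</p></div>
  <script class="something-else-i-dont-want">var x = 1;</script>
  <p>last</p>
</div>
"""

# Only top-level tags are listed; <li> comes along inside str(<ol>).
WANTED_TAGS = ["p", "ol"]


def extract_content(html, wanted=WANTED_TAGS):
    """Hypothetical helper: join the wanted direct children of the container."""
    soup = BeautifulSoup(html, "html.parser")  # assumption: stdlib parser
    container = soup.find("div", class_="the-one-i-want")
    if container is None:
        return ""
    # recursive=False limits the search to direct children of the container,
    # so tags nested inside the unwanted <div> are skipped.
    elements = container.find_all(wanted, recursive=False)
    # Elements come back in document order, so a plain join preserves it.
    return "".join(str(el) for el in elements)


content_string = extract_content(SAMPLE_HTML)
print(content_string)
```

If the wanted elements can be nested more deeply on some pages, this shortcut will miss them, and you would need to drop recursive=False and filter out the unwanted parents some other way.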