Python - With BeautifulSoup, Extract Tags of Element Except Those Specified

I'm using BeautifulSoup 4 with Python 3.5+ to extract web data. I have the following HTML, from which I am extracting:

    <div class="the-one-i-want">
        <p>
            content
        </p>
        <p>
            content
        </p>
        <p>
            content
        </p>
        <p>
            content
        </p>
        <ol>
            <li>
                list item
            </li>
            <li>
                list item
            </li>
        </ol>
        <div class="something-i-dont-want">
            content
        </div>
        <script class="something-else-i-dont-want">
            script
        </script>
        <p>
            content
        </p>
    </div>

All of the content I want to extract is found within the <div class="the-one-i-want"> element. Right now, I'm using the following method, which works most of the time:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html.text, 'lxml')
    content = soup.find('div', class_='the-one-i-want').find_all('p')

This excludes scripts, odd inserted divs, and otherwise unpredictable content such as ads or "recommended content" type stuff.

Now, there are instances in which elements other than <p> tags hold content that is contextually important to the main content, such as lists.

Is there a way to get the content from the <div class="the-one-i-want"> in a manner such as:

    soup = BeautifulSoup(html.text, 'lxml')
    content = soup.find('div', class_='the-one-i-want').find_all(desired-content-elements)

where desired-content-elements would be inclusive of every element that I deem fit for that particular content? For example, all <p> tags and all <ol> and <li> tags, but no <div> or <script> tags.

Perhaps noteworthy is my method of saving the content:

    content_string = ''
    for p in content:
        content_string += str(p)

This approach collects the data in order of occurrence, which would prove difficult to manage if I found the different element types through separate iteration passes. If possible, I'd like to avoid having to re-construct the original order from split lists of elements.

You can pass a list of the tags you want:

 content = soup.find('div', class_='the-one-i-want').find_all(["p", "ol", "whatever"])
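A minimal, self-contained sketch of the list form of find_all follows. It assumes a trimmed-down copy of the HTML from the question and the built-in html.parser instead of lxml; both are assumptions made only so the snippet runs on its own. find_all walks the tree in document order, so the matches come back in the order they occur and nothing has to be re-assembled. Two things to watch: listing both "ol" and "li" would emit each list item twice once the tags are converted to strings, and the default recursive search would also pick up a <p> nested inside an unwanted <div>. Passing recursive=False restricts the search to direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, shortened from the question's HTML.
html = """
<div class="the-one-i-want">
    <p>first</p>
    <ol><li>item one</li><li>item two</li></ol>
    <div class="something-i-dont-want"><p>ad</p></div>
    <script>var x = 1;</script>
    <p>last</p>
</div>
"""

# html.parser is used here only to avoid the lxml dependency (assumption).
soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="the-one-i-want")

# Only direct children named p or ol are returned, in document order.
# "li" is left out on purpose: str(ol) already contains its <li> children.
wanted = container.find_all(["p", "ol"], recursive=False)

content_string = "".join(str(tag) for tag in wanted)
print(content_string)
```

If the wanted tags can sit deeper in the tree, drop recursive=False and filter out matches whose ancestors are unwanted instead.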

If we run something similar on the URL of your question, looking for p and pre tags, you can see we get both:

       ...: for ele in soup.select_one("td.postcell").find_all(["pre","p"]):
       ...:     print(ele)
       ...:
    <p>I'm using BeautifulSoup 4 with Python 3.5+ to extract web data. I have the following HTML, from which I am extracting:</p>
    <pre><code>&lt;div class="the-one-i-want"&gt;
        &lt;p&gt;
            content
        &lt;/p&gt;
        &lt;p&gt;
            content
        &lt;/p&gt;
        &lt;p&gt;
            content
        &lt;/p&gt;
        &lt;p&gt;
            content
        &lt;/p&gt;
        &lt;ol&gt;
            &lt;li&gt;
                list item
            &lt;/li&gt;
            &lt;li&gt;
                list item
            &lt;/li&gt;
        &lt;/ol&gt;
        &lt;div class="something-i-dont-want"&gt;
            content
        &lt;/div&gt;
        &lt;script class="something-else-i-dont-want"&gt;
            script
        &lt;/script&gt;
        &lt;p&gt;
            content
        &lt;/p&gt;
    &lt;/div&gt;
    </code></pre>
    <p>All of the content I want to extract is found within the <code>&lt;div class="the-one-i-want"&gt;</code> element. Right now, I'm using the following method, which works most of the time:</p>
    <pre><code>soup = BeautifulSoup(html.text, 'lxml')
    content = soup.find('div', class_='the-one-i-want').find_all('p')
    </code></pre>
    <p>This excludes scripts, odd inserted <code>div</code>s, and otherwise unpredictable content such as ads or "recommended content" type stuff.</p>
    <p>Now, there are instances in which elements other than <code>&lt;p&gt;</code> tags hold content that is contextually important to the main content, such as lists.</p>
    <p>Is there a way to get the content from the <code>&lt;div class="the-one-i-want"&gt;</code> in a manner such as:</p>
    <pre><code>soup = BeautifulSoup(html.text, 'lxml')
    content = soup.find('div', class_='the-one-i-want').find_all(desired-content-elements)
    </code></pre>
    <p>where <code>desired-content-elements</code> would be inclusive of every element that I deem fit for that particular content? For example, all <code>&lt;p&gt;</code> tags and all <code>&lt;ol&gt;</code> and <code>&lt;li&gt;</code> tags, but no <code>&lt;div&gt;</code> or <code>&lt;script&gt;</code> tags.</p>
    <p>Perhaps noteworthy is my method of saving the content:</p>
    <pre><code>content_string = ''
    for p in content:
        content_string += str(p)
    </code></pre>
    <p>This approach collects the data in order of occurrence, which would prove difficult to manage if I found the different element types through separate iteration passes. If possible, I'd like to avoid having to re-construct the original order from split lists of elements.</p>
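The title asks for the inverse as well: keep everything except a few named tags. This is not what the answer above does, but one possible sketch is to remove the unwanted tags with decompose() and then collect whatever direct children remain. The sample markup and the choice of html.parser are assumptions made so the snippet is self-contained, and note that decompose() modifies the parsed tree in place.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, shortened from the question's HTML.
html = """
<div class="the-one-i-want">
    <p>first</p>
    <ol><li>item one</li><li>item two</li></ol>
    <div class="something-i-dont-want"><p>ad</p></div>
    <script>var x = 1;</script>
    <p>last</p>
</div>
"""

# html.parser is used here only to avoid the lxml dependency (assumption).
soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="the-one-i-want")

# Remove every unwanted tag (and everything inside it) from the tree.
for unwanted in container.find_all(["div", "script"]):
    unwanted.decompose()

# find_all(True) matches any tag; recursive=False keeps only direct children,
# so nested tags such as <li> are not emitted a second time.
remaining = container.find_all(True, recursive=False)

content_string = "".join(str(tag) for tag in remaining)
print(content_string)
```

Which variant fits better depends on whether the list of wanted tags or the list of unwanted tags is shorter and more stable for the pages being scraped.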
