$ pip3 install requests
$ pip3 install beautifulsoup4
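301: The server redirects the request to a different endpoint, for example because the domain has changed or the endpoint has been renamed.
400: The user sent a bad request.
401: The user is not authenticated.
403: The user is trying to access a forbidden resource.
404: The resource the user is trying to access is not available on the server.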
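The status codes listed above can be turned into readable messages before any parsing is attempted. The following is a minimal sketch, not part of the original text: the helper name `describe_status` and the `STATUS_NOTES` table are hypothetical, and the fallback uses the standard library's `http.HTTPStatus` for codes that are not in the list.

```python
from http import HTTPStatus

# Notes paraphrased from the status code list above (hypothetical table).
STATUS_NOTES = {
    301: "redirected to another endpoint",
    400: "bad request",
    401: "not authenticated",
    403: "access to the resource is forbidden",
    404: "resource not available on the server",
}


def describe_status(code):
    """Return a short human-readable note for an HTTP status code."""
    if code in STATUS_NOTES:
        return f"{code}: {STATUS_NOTES[code]}"
    try:
        # Fall back to the standard reason phrase, e.g. "OK" for 200.
        return f"{code}: {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know.
        return f"{code}: unknown status code"


for code in (200, 301, 404):
    print(describe_status(code))
```

With requests, the code to pass in would be the `status_code` attribute of the response returned by `requests.get()`.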
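import requests
from bs4 import BeautifulSoup

# Download the page and parse the raw bytes with Python's built-in HTML parser.
page_result = requests.get('https://www.news.baidu.com')
parse_obj = BeautifulSoup(page_result.content, 'html.parser')

# Print the whole parsed document.
print(parse_obj)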
student@ubuntu:~/work$ python3 parse_web_page.py
Output:
<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
if (typeof uet == 'function') {
uet("bb", "LoadTitle", {wb: 1});
}
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Top News - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
if (typeof uet == 'function') {
uet("be", "LoadTitle", {wb: 1});
}
</script>
<script>
if (typeof uex == 'function') {
uex("ld", "LoadTitle", {wb: 1});
}
</script>
<link href="https://www.imdb.com/news/top" rel="canonical"/>
<meta content="http://www.imdb.com/news/top" property="og:url">
<script>
if (typeof uet == 'function') {
uet("bb", "LoadIcons", {wb: 1});
}
import requests
from bs4 import BeautifulSoup

page_result = requests.get('https://www.news.baidu.com')
parse_obj = BeautifulSoup(page_result.content, 'html.parser')

# find() returns the first element with this class, or None if there is none.
top_news = parse_obj.find(class_='news-article__content')
print(top_news)
student@ubuntu:~/work$ python3 extract_from_class.py
Output:
<div class="news-article__content">
<a href="/name/nm4793987/">Issa Rae</a> and <a href="/name/nm0000368/">Laura Dern</a> are teaming up to star in a limited series called "The Dolls" currently in development at <a href="/company/co0700043/">HBO</a>.<br/><br/>
Inspired by true events, the series recounts the aftermath of Christmas Eve riots in two small Arkansas towns in 1983, riots which erupted over Cabbage Patch Dolls. The series explores class, race, privilege and what it takes to be a "good mother."<br/><br/>
Rae will serve as a writer and executive producer on the series in addition to starring, with Dern also executive producing. <a href="/name/nm3308450/">Laura Kittrell</a> and <a href="/name/nm4276354/">Amy Aniobi</a> will also serve as writers and coexecutive producers. <a href="/name/nm0501536/">Jayme Lemons</a> of Dern’s <a href="/company/co0641481/">Jaywalker Pictures</a> and <a href="/name/nm3973260/">Deniese Davis</a> of <a href="/company/co0363033/">Issa Rae Productions</a> will also executive produce.<br/><br/>
Both Rae and Dern currently star in HBO shows, with Dern appearing in the acclaimed drama "<a href="/title/tt3920596/">Big Little Lies</a>" and Rae starring in and having created the hit comedy "<a href="/title/tt5024912/">Insecure</a>." Dern also recently starred in the film "<a href="/title/tt4015500/">The Tale</a>,
</div>
import requests
from bs4 import BeautifulSoup

page_result = requests.get('https://www.news.baidu.com/news')
parse_obj = BeautifulSoup(page_result.content, 'html.parser')

# Narrow the search to the first element with this class,
# then collect every <a> tag inside it.
top_news = parse_obj.find(class_='news-article__content')
top_news_a_content = top_news.find_all('a')
print(top_news_a_content)
student@ubuntu:~/work$ python3 extract_from_tag.py
Output:
[<a href="/name/nm4793987/">Issa Rae</a>, <a href="/name/nm0000368/">Laura Dern</a>, <a href="/company/co0700043/">HBO</a>, <a href="/name/nm3308450/">Laura Kittrell</a>, <a href="/name/nm4276354/">Amy Aniobi</a>, <a href="/name/nm0501536/">Jayme Lemons</a>, <a href="/company/co0641481/">Jaywalker Pictures</a>, <a href="/name/nm3973260/">Deniese Davis</a>, <a href="/company/co0363033/">Issa Rae Productions</a>, <a href="/title/tt3920596/">Big Little Lies</a>, <a href="/title/tt5024912/">Insecure</a>, <a href="/title/tt4015500/">The Tale</a>]
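The href values in the output above are relative paths, so they usually have to be joined with the page URL before they can be followed. The sketch below uses `urllib.parse.urljoin` from the standard library; the helper name `absolutize` is hypothetical, and the base URL is an assumption taken from the canonical link shown in the earlier page output.

```python
from urllib.parse import urljoin

# Assumption: the page the links came from, per the canonical link
# in the earlier output.
BASE_URL = 'https://www.imdb.com/news/top'

# A few of the relative hrefs from the output above.
RELATIVE_HREFS = [
    '/name/nm4793987/',
    '/company/co0700043/',
    '/title/tt3920596/',
]


def absolutize(base_url, hrefs):
    """Resolve each relative href against the URL of the page it came from."""
    return [urljoin(base_url, href) for href in hrefs]


for url in absolutize(BASE_URL, RELATIVE_HREFS):
    print(url)
```

When working with BeautifulSoup tags directly, the href of each tag is available through `tag.get('href')`.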
import requests
from bs4 import BeautifulSoup

page_result = requests.get('https://en.wikipedia.org/wiki/Portal:History')
parse_obj = BeautifulSoup(page_result.content, 'html.parser')

# Select the element whose class attribute is "hlist noprint",
# then collect every <a> tag inside it.
h_obj = parse_obj.find(class_='hlist noprint')
h_obj_a_content = h_obj.find_all('a')
print(h_obj)
print(h_obj_a_content)

Run the script as follows.
student@ubuntu:~/work$ python3 extract_from_wikipedia.py
The output is as follows.
<div class="hlist noprint" id="portals-browsebar" style="text-align: center;">
<dl><dt><a href="/wiki/Portal:Contents/Portals" title="Portal:Contents/Portals">Portal topics</a></dt>
<dd><a href="/wiki/Portal:Contents/Portals#Human_activities" title="Portal:Contents/Portals">Activities</a></dd>
<dd><a href="/wiki/Portal:Contents/Portals#Culture_and_the_arts" title="Portal:Contents/Portals">Culture</a></dd>
<dd><a href="/wiki/Portal:Contents/Portals#Geography_and_places" title="Portal:Contents/Portals">Geography</a></dd>
<dd><a href="/wiki/Portal:Contents/Portals#Health_and_fitness" title="Portal:Contents/Portals">Health</a></dd>
<dd><a href="/wiki/Portal:Contents/Portals#History_and_events" title="Portal:Contents/Portals">History</a></dd>
<dd><a href="/wiki/Portal:Contents/Portals#Mathematics_and_logic" title="Portal:Contents/Portals">Mathematics</a></dd>
<dd><a href="/wiki/Portal:Contents/Portals#Natural_and_physical_sciences" title="Portal:Contents/Portals">Nature</a></dd>
<dd><a href="/wiki/Portal:Contents/Portals#People_and_self" title="Portal:Contents/Portals">People</a></dd>

In the preceding example, we extracted content from Wikipedia. Here too, we extracted the content by class as well as by tag.
....
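The class-then-tag extraction used in the preceding examples can also be tried without a network connection by parsing an inline HTML string. The following is a minimal sketch under stated assumptions: `SAMPLE_HTML` is a trimmed stand-in for a downloaded page, and the helper name `links_in_class` is hypothetical. It also guards against `find()` returning `None`, which would otherwise raise an `AttributeError` when `find_all()` is called on the result.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened stand-in for the downloaded page, trimmed from
# the Wikipedia output above.
SAMPLE_HTML = """
<div class="hlist noprint" id="portals-browsebar">
<dl><dt><a href="/wiki/Portal:Contents/Portals">Portal topics</a></dt>
<dd><a href="/wiki/Portal:Contents/Portals#Culture_and_the_arts">Culture</a></dd>
<dd><a href="/wiki/Portal:Contents/Portals#History_and_events">History</a></dd>
</dl></div>
"""


def links_in_class(html, class_name):
    """Return (text, href) pairs for each <a> inside the first element
    matching class_name, or an empty list if no element matches."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find(class_=class_name)
    if container is None:
        # find() returns None when nothing matches.
        return []
    return [(a.get_text(strip=True), a.get('href'))
            for a in container.find_all('a')]


for text, href in links_in_class(SAMPLE_HTML, 'hlist noprint'):
    print(text, href)
```

Returning an empty list for a missing class keeps the calling code simple when a site changes its markup; raising an exception instead would be an equally reasonable choice.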