A Python Data Analysis That Will Make You Blush, Hehehe (Part 2)


① Upper-left quadrant: affordable merchants with good reviews
② Upper-right quadrant: a bit pricey, but you get what you pay for
③ Lower-right quadrant: expensive, yet mediocre quality
④ Lower-left quadrant: cheap goods that are cheap for a reason
So with this scatter plot, picking a merchant to buy from gets much easier.
Customers can pick a merchant that suits their preferences. But as a merchant, how do you improve?
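The quadrant split above can be sketched in a few lines. The merchant figures below are made up for illustration; the real analysis plots each scraped merchant's average price against its average review score:

```python
# Minimal sketch of the four-quadrant merchant split (hypothetical numbers).
merchants = {
    'A': {'avg_price': 9.9,  'avg_rating': 4.6},
    'B': {'avg_price': 25.0, 'avg_rating': 4.5},
    'C': {'avg_price': 30.0, 'avg_rating': 3.2},
    'D': {'avg_price': 8.0,  'avg_rating': 3.0},
}

# Split the plane at the mean price and the mean rating
price_mid = sum(m['avg_price'] for m in merchants.values()) / len(merchants)
rating_mid = sum(m['avg_rating'] for m in merchants.values()) / len(merchants)

def quadrant(m):
    cheap = m['avg_price'] <= price_mid
    good = m['avg_rating'] >= rating_mid
    if cheap and good:
        return 'cheap & well rated'      # upper left
    if not cheap and good:
        return 'pricey but worth it'     # upper right
    if not cheap:
        return 'pricey and mediocre'     # lower right
    return 'cheap for a reason'          # lower left

for name, m in merchants.items():
    print(name, quadrant(m))
```

Medians work just as well as means here; the point is only that two cut lines turn the scatter plot into four actionable buckets.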
⑦ Word-frequency analysis
During the earlier crawl we also scraped the review tags. A word-frequency analysis of those tags shows what customers care about most, in order:
1. Fit: words like "size" and "fit" appear repeatedly and rank high
2. Quality: "good", "well made", "soft and …" are endorsements of the material
3. Style: "cute", "sexy", "like the …" (you get the idea)
4. Price: "made" barely counts as a price signal, and reads more like doubt about the quality
5. Word of mouth: …; the reviews themselves are still well worth consulting
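The tag counting itself is a one-liner with `collections.Counter`. A minimal sketch, with an illustrative tag list standing in for the tags collected during crawling:

```python
from collections import Counter

# Illustrative review tags; the real list comes from the crawl above.
review_tags = [
    'true to size', 'fits well', 'good quality', 'fits well',
    'true to size', 'cute', 'fits well', 'soft and comfortable',
]

tag_counts = Counter(review_tags)
print(tag_counts.most_common(3))
# fit-related tags dominate the top of the ranking
```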
The review tags are fairly few in number, so we go a step further: a word-frequency analysis of the full 24k reviews, rendered as a word cloud.
The most prominent terms are, again, about fit, plus quality and style. So let's continue the analysis with the Size & Color of the products customers bought.
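Preparing 24k free-text reviews for a word cloud mostly means tokenizing, dropping stopwords, and counting. A minimal sketch, with illustrative sample reviews and a tiny stopword list (the resulting frequencies can then be fed to a word-cloud library, e.g. wordcloud's `generate_from_frequencies`):

```python
from collections import Counter
import re

# Illustrative review texts standing in for the 24k scraped reviews.
review_texts = [
    'Fits perfectly and the material is soft',
    'Runs small, order a size up',
    'Great quality for the price, fits well',
]

# A tiny stopword list for the sketch; a real one would be much longer.
stopwords = {'the', 'and', 'a', 'is', 'for', 'up'}

word_freq = Counter(
    w
    for text in review_texts
    for w in re.findall(r'[a-z]+', text.lower())
    if w not in stopwords
)
print(word_freq.most_common(3))
```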
The Size & Color word-frequency data has a few problems:
1. The sample is small, only about 6,000 entries
2. Size and Color cannot be cleanly separated, so they are analyzed together
3. Merchants use different naming conventions: for the same black item, one merchant calls it "black" while another may use … (so the odd numeric codes are actually merchants' style numbers)
4. Some strange tokens such as "trim" may come from crawling errors or formatting glitches when exporting the csv
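Problem 3 can be partly worked around by mapping raw Size&Color strings onto canonical color names before counting. A minimal sketch; the alias table is illustrative and would need to grow as more merchant naming quirks turn up:

```python
# Map merchant-specific color names onto canonical colors (illustrative).
COLOR_ALIASES = {
    'black': 'black', 'jet black': 'black',
    'red': 'red', 'wine red': 'red',
    'blue': 'blue', 'navy': 'blue',
}

def normalize_color(raw):
    raw = raw.strip().lower()
    for alias, canonical in COLOR_ALIASES.items():
        if alias in raw:
            return canonical
    return 'unknown'   # merchant style codes like 'A1021' land here

print(normalize_color('Jet Black - Large'))
print(normalize_color('Navy M'))
print(normalize_color('A1021'))
```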
A few things stand out:
Size: large, …, and small are all well covered, but so are …. Amazon's customers are mainly European and American and may have larger builds on average, so merchants should develop and stock more products for larger customers.
Color: very clear-cut: black > red > blue > green > white > … So black and red are always safe bets; green surprised me, and merchants could experiment with it boldly.
Style: "trim" and "lace" show up in the word frequencies, and lace ranks highest!!!
Full code
Product reviews
```python
# 0. Imports
from bs4 import BeautifulSoup
import requests
import random
import time
from multiprocessing import Pool
import csv
import pymongo

# 0. Set up the database
client = pymongo.MongoClient('localhost', 27017)
Amazon = client['Amazon']
reviews_info_M = Amazon['reviews_info_M']

# 0. Anti-scraping measures
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'
}
# http://cn-proxy.com/
proxy_list = [
    'http://117.177.250.151:8081',
    'http://111.85.219.250:3129',
    'http://122.70.183.138:8118',
]
proxy_ip = random.choice(proxy_list)   # pick a random proxy IP
proxies = {'http': proxy_ip}

# 1. Read 'Rank', 'item_name', 'reviews', 'reviews_link' from the csv
csv_file = csv.reader(open('C:/Users/zbd/Desktop/3.csv', 'r'))
reviews_datalst = []
for i in csv_file:
    reviews_data = {
        'Rank': i[10],
        'item_name': i[8],
        'reviews': i[6],
        'reviews_link': i[5],
    }
    reviews_datalst.append(reviews_data)
del reviews_datalst[0]   # drop the header row
# print(reviews_datalst)

# Collect the review-page links into a list
reviews_links = list(i['reviews_link'] for i in reviews_datalst)

# Clean the review counts: some are empty, others look like "1,234"
reviews = []
for i in reviews_datalst:
    if i['reviews']:
        reviews.append(int(i['reviews'].replace(',', '')))
    else:
        reviews.append(0)
print(reviews_links)
print(reviews)

# 2. Build the review-page links for each product
# Product 1
# page 1: https://www.amazon.com/Avidlove-Lingerie-Babydoll-Sleepwear-Chemise/product-reviews/B0712188H2/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews
# page 2: https://www.amazon.com/Avidlove-Lingerie-Babydoll-Sleepwear-Chemise/product-reviews/B0712188H2/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2
# page 3: https://www.amazon.com/Avidlove-Lingerie-Babydoll-Sleepwear-Chemise/product-reviews/B0712188H2/ref=cm_cr_getr_d_paging_btm_next_3?ie=UTF8&reviewerType=all_reviews&pageNumber=3
# Product 2
# page 1: https://www.amazon.com/Avidlove-Women-Lingerie-Babydoll-Bodysuit/product-reviews/B077CLFWVN/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews
# page 2: https://www.amazon.com/Avidlove-Women-Lingerie-Babydoll-Bodysuit/product-reviews/B077CLFWVN/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2
# Each page holds 8 reviews, so pages = reviews // 8 + 1
# Target format: https://www.amazon.com/Avidlove-Lingerie-Babydoll-Sleepwear-Chemise/product-reviews/B0712188H2/pageNumber=1
url = 'https://www.amazon.com/Avidlove-Lingerie-Babydoll-Sleepwear-Chemise/product-reviews/B0712188H2/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews'
counts = 0

def get_item_reviews(reviews_link, reviews):
    global counts
    if reviews_link:
        pages = reviews // 8   # 8 reviews per page; the (partial) last page is not crawled
        for i in range(1, pages + 1):
            full_url = reviews_link.split('ref=')[0] + '?pageNumber={}'.format(i)
            # full_url = 'https://www.amazon.com/Avidlove-Lingerie-Babydoll-Sleepwear-Chemise/product-reviews/B0712188H2/?pageNumber=10'
            wb_data = requests.get(full_url, headers=headers, proxies=proxies)
            soup = BeautifulSoup(wb_data.text, 'lxml')
            every_page_reviews_num = len(soup.select('div.a-row.a-spacing-small.review-data > span'))
            for j in range(every_page_reviews_num):
                reviews_info = {
                    'customer_name': soup.select('div:nth-child(1) > a > div.a-profile-content > span')[j].text,
                    'star': soup.select('div.a-row > a.a-link-normal > i > span')[j].text.split('out')[0],
                    'review_date': soup.select('div.a-section.review > div > div > span.a-size-base.a-color-secondary.review-date')[j].text,
                    'review_title': soup.select('a.a-size-base.a-link-normal.review-title.a-color-base.a-text-bold')[j].text,
                    'review_text': soup.select('div.a-row.a-spacing-small.review-data > span')[j].text,
                    'item_name': soup.title.text.split(':')[-1],
                }
                reviews_info_M.insert_one(reviews_info)
                counts += 1
                print('Review #{}'.format(counts), reviews_info)
                yield reviews_info

'''
# This variant scrapes size and color separately, since that data is largely
# missing from the reviews above. It is almost identical to the code above;
# the key difference is counting the size&color entries on each page.
# Writing to the database and csv needs matching changes, done the same way.
def get_item_reviews(reviews_link, reviews):
    if reviews_link:
        # 8 reviews per page; the last page is skipped, so a check for
        # pages with fewer than 8 reviews is still needed
        pages = reviews // 8
        for i in range(1, pages + 1):
            full_url = reviews_link.split('ref=')[0] + '?pageNumber={}'.format(i)
            wb_data = requests.get(full_url, headers=headers, proxies=proxies)
            soup = BeautifulSoup(wb_data.text, 'lxml')
            # number of size&color entries on this page
            every_page_reviews_num = len(soup.select('div.a-row.a-spacing-mini.review-data.review-format-strip > a'))
            for j in range(every_page_reviews_num):
                reviews_info = {
                    'item_name': soup.title.text.split(':')[-1],
                    'size_color': soup.select('div.a-row.a-spacing-mini.review-data.review-format-strip > a')[j].text,
                }
                yield reviews_info
                print(reviews_info)
                reviews_size_color.insert_one(reviews_info)
'''

# 3. Crawl and store the data
all_reviews = []
def get_all_reviews(reviews_links, reviews):
    for i in range(100):
        for n in get_item_reviews(reviews_links[i], reviews[i]):
            all_reviews.append(n)

get_all_reviews(reviews_links, reviews)
# print(all_reviews)

# 4. Write to csv
headers = ['_id', 'item_name', 'customer_name', 'star', 'review_date', 'review_title', 'review_text']
with open('C:/Users/zbd/Desktop/4.csv', 'w', newline='', encoding='utf-8') as f:
    f_csv = csv.DictWriter(f, headers)
    f_csv.writeheader()
    f_csv.writerows(all_reviews)
print('Done writing!')
```
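The crawl above uses `pages = reviews // 8`, which silently drops the final, partial page, and the commented-out variant notes that a "fewer than 8 reviews" check is still needed. A minimal sketch of covering the partial last page with ceiling division:

```python
import math

def review_pages(total_reviews, per_page=8):
    """Number of review pages, counting a partial final page."""
    if total_reviews <= 0:
        return 0
    return math.ceil(total_reviews / per_page)

print(review_pages(24))   # 3 full pages
print(review_pages(25))   # 3 full pages plus 1 partial page -> 4
print(review_pages(0))    # 0
```

Crawling `range(1, review_pages(n) + 1)` then visits every page, including the last one with fewer than 8 reviews; the per-page loop already adapts, since it counts the review elements actually present on each page.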