Analyzing Hangzhou Rental Listings with pandas (Part 2)


import requests


def googlelocatebyLatLng(lat, lng, pois=0):
    '''Look up the address for a latitude/longitude pair.'''
    # The Amap API takes the location as "lng,lat"; replace XXXXXX with your own key.
    items = {'location': str(lng) + ',' + str(lat), 'key': 'XXXXXX'}
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Cookie': 'id58=c5/njVpymhR0X0thDRuHAg==; commontopbar_new_city_info=79%7C%E6%9D%AD%E5%B7%9E%7Chz; 58tj_uuid=763a5398-da95-4db2-9a54-ba7f4171f17c; new_uv=1; wmda_uuid=80797b44db9604b45dfbf4807417e58f; wmda_new_uuid=1; wmda_visited_projects=%3B2385390625025; commontopbar_ipcity=hz%7C%E6%9D%AD%E5%B7%9E%7C0; commontopbar_myfeet_tooltip=end; als=0; xxzl_deviceid=d7wGUAUqik8MomhIMsEH98iyUnHRBDyrCJYsasv1uq9biXZ%2F%2Bxav%2BhZr%2FQQmLjYF; wmda_session_id_2385390625025=1517477544470-6db397e1-9d59-3e58'
        # 'Host': 'cdata.58.com',
        # 'Referer': 'http://webim.58.com/index?p=rb'
    }
    res = requests.get('http://restapi.amap.com/v3/geocode/regeo', params=items, headers=headers)
    result = res.json()
    # print('--------------------------------------------')
    # result = result['result']['formatted_address'] + ',' + result['result']['sematic_description']
    result = result['regeocode']['addressComponent']['district']
    print(result)
    return result


# Split the "lng,lat" string into separate longitude and latitude columns.
info['lng'] = info.lngandlat.str.split(',', expand=True)[0]
info['lat'] = info.lngandlat.str.split(',', expand=True)[1]

# With axis=1, each row is passed to the function as a Series, so the function
# can combine several fields of the row and return a single result.
info['district'] = info[info['lngandlat'].notnull()].apply(
    lambda row: googlelocatebyLatLng(row['lat'], row['lng']), axis=1)
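The row-wise pattern above can be tried out without calling the remote API. The following is a minimal sketch on toy data: the coordinates are made up, and `fake_district` is a hypothetical stand-in for the geocoding request. Only the `lngandlat`, `lng`, `lat` and `district` column names come from the code above.

```python
import pandas as pd

# Toy data; the coordinates are illustrative, not taken from the data set.
toy = pd.DataFrame({'lngandlat': ['120.15,30.28', None, '120.21,30.25']})

# Split once and reuse the result instead of splitting twice.
parts = toy['lngandlat'].str.split(',', expand=True)
toy['lng'] = parts[0]
toy['lat'] = parts[1]


def fake_district(lat, lng):
    """Hypothetical stand-in for the geocoding call; no network access."""
    return 'west' if float(lng) < 120.2 else 'east'


# Only rows that have coordinates are passed to the function. On assignment
# pandas aligns on the index, so the skipped row ends up as NaN.
has_coords = toy['lngandlat'].notnull()
toy['district'] = toy[has_coords].apply(
    lambda row: fake_district(row['lat'], row['lng']), axis=1)
print(toy)
```

Because the result is aligned on the index, rows without coordinates need no special handling inside the function itself.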
Inspecting the data
info.info()
The output above shows that one field is missing in more than 100 records, so those records are dropped.
Handling missing values
info1 = info[info.department1.notnull()][[
    'title', 'type', 'm_area', 'm_room_type', 'm_rent_origin',
    'm_department', 'm_address', 'price_int', 'fee_avg_square',
    'district', 'lngandlat']]
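To see which columns are affected before filtering, the missing values can be counted per column. This is a minimal sketch on toy data: the values are made up, and only the column names `title` and `department1` come from the article.

```python
import pandas as pd

demo = pd.DataFrame({
    'title': ['a', 'b', 'c'],
    'department1': ['x', None, 'z'],
})

# Number of missing values in each column.
missing = demo.isnull().sum()
print(missing)

# Keep only the rows where department1 is present. dropna(subset=...) selects
# the same rows as a notnull() boolean mask on that column.
kept = demo.dropna(subset=['department1'])
print(len(kept))
```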
Detecting and handling outliers
info1.describe()
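The summary above shows a maximum area of 800, and the maximum rent is likewise extreme; both are clearly outliers. Next, a box plot is used to locate the outlier range of the area and rent fields.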
import numpy as np
import matplotlib.pyplot as plt  # plotting library
import seaborn as sns

# In a Jupyter notebook, also run the magic below:
# %matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly

# Outlier bounds from the 1.5 * IQR rule
Percentile = np.percentile(info1['m_area'], [0, 25, 50, 75, 100])
IQR = Percentile[3] - Percentile[1]
UpLimit = Percentile[3] + IQR * 1.5
DownLimit = Percentile[1] - IQR * 1.5
print('Upper bound:', UpLimit, 'Lower bound:', DownLimit)

f, ax = plt.subplots(figsize=(10, 8))
sns.boxplot(data=info1[['m_area']])
plt.show()

Upper bound: 123.0 Lower bound: -45.0
The same procedure applied to rent gives:
Upper bound: 4650.0 Lower bound: -550.0
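Since the rent bounds follow the same 1.5 × IQR rule as the area bounds, the calculation can be wrapped in a small helper. This is only a sketch: `iqr_bounds` is a name introduced here for illustration, and the sample rents are made up.

```python
import numpy as np


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier bounds using the k * IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up rents, for illustration only.
rents = [1500, 2000, 2500, 3000, 3500, 60000]
lower, upper = iqr_bounds(rents)
print('Upper bound:', upper, 'Lower bound:', lower)
```

On the real data this would presumably be called as `iqr_bounds(info1['price_int'])`, assuming the column contains no missing values.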

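Based on the box plots and on what is realistic for this market, listings with an area above 300 or a rent above 20000 are treated as outliers and removed.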
info2=info1[(info1.m_area<=300)&(info1.price_int<=20000)]
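A look at the room-layout types also turned up abnormal entries, such as 9 bedrooms with 9 bathrooms, or 0 bathrooms; these are removed as well.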
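The article does not show the code for this step. One possible approach is sketched below, under two assumptions that are not confirmed by the source: that `m_room_type` holds strings such as `'3室1厅1卫'` (bedrooms, living rooms, bathrooms), and that the thresholds chosen here are suitable.

```python
import pandas as pd

# Assumed format: '<n>室<n>厅<n>卫'. The rows are made up for illustration.
demo = pd.DataFrame({'m_room_type': ['3室1厅1卫', '9室0厅9卫', '1室1厅0卫']})

# Pull out the bedroom and bathroom counts; rows that do not match become NaN.
counts = demo['m_room_type'].str.extract(r'(\d+)室.*?(\d+)卫').astype(float)
counts.columns = ['rooms', 'baths']

# Illustrative thresholds, not the article's: at least one bathroom and
# fewer than nine bedrooms.
ok = (counts['baths'] >= 1) & (counts['rooms'] < 9)
cleaned = demo[ok]
print(cleaned)
```

Rows whose layout string does not match the assumed pattern compare as False against both thresholds and are dropped too, which may or may not be desirable on the real data.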