Scraping second-hand housing listings from Ganji with Scrapy and plotting them with matplotlib
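A note up front: this is my first blog post on CSDN, so here is a quick introduction to this small demo.
It was a homework assignment for a data mining course. Technologies used: Python, Scrapy, pymysql, numpy, and matplotlib.
The steps: crawl the housing listings with Scrapy, store them in a MySQL database, export the data from the database, then process it with numpy and plot it with matplotlib.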
1. Crawling Ganji with Scrapy:
① In the spider, parse the contents of the tags with XPath:
    # Listing name
    house_name = response.xpath("//dd[@class='dd-item title']/a/text()").extract()
    # Layout (number of rooms)
    house_type = response.xpath("//dd[@class='dd-item size']/span[1]/text()").extract()
    # Floor area
    house_area = response.xpath("//dd[@class='dd-item size']/span[3]/text()").extract()
    # Total price
    house_cost = response.xpath("//dd[@class='dd-item info']/div[@class='price']/span[@class='num']/text()").extract()
    # Unit price
    house_price = response.xpath("//dd[@class='dd-item info']/div[@class='time']/text()").extract()
    # County-level city or district where the house is located
    house_add = response.xpath("//dd[@class='dd-item address']//a[1]/text()").extract()
    # Street or neighbourhood where the house is located
    house_add_area = response.xpath("//dd[@class='dd-item address']//a[2]/span/text()").extract()
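Each XPath call above returns a list covering every listing on the page, while the pipeline further down reads one value per field from each item. Here is a minimal sketch of how the parallel lists might be combined into per-listing records. The helper name `build_items` and the sample values are made up for illustration and are not taken from the original spider.

```python
FIELDS = ("house_name", "house_type", "house_area", "house_cost",
          "house_price", "house_add", "house_add_area")


def build_items(*columns):
    """Turn parallel column lists into a list of per-listing dicts.

    Hypothetical helper. zip() stops at the shortest list, so if one
    field is missing for some listing the columns can drift out of
    alignment; real code may want to check the lengths first.
    """
    if len(columns) != len(FIELDS):
        raise ValueError("expected one list per field")
    return [dict(zip(FIELDS, (value.strip() for value in row)))
            for row in zip(*columns)]


# Made-up sample values standing in for the .extract() results.
items = build_items(
    [" Listing A ", "Listing B"],   # house_name
    ["3 rooms", "2 rooms"],         # house_type
    ["120", "89"],                  # house_area
    ["150", "98"],                  # house_cost
    ["12500", "11011"],             # house_price
    ["Chongchuan", "Gangzha"],      # house_add
    ["Street 1", "Street 2"],       # house_add_area
)
```

In a spider, each dict could then be yielded from the parse callback so that the pipeline sees one listing at a time.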
② Pass the scraped items to the pipeline, which is responsible for writing the data to the database:
    def process_item(self, item, spider):
        # The leading 0 fills the first column, presumably an auto-increment id.
        insert_sql = "insert into ershoufang values (0, %s, %s, %s, %s, %s, %s, %s);"
        print('Writing to the database...')
        self.cur.execute(insert_sql, (
            item['house_name'], item['house_type'], item['house_area'],
            item['house_cost'], item['house_price'], item['house_add'],
            item['house_add_area']))
        self.conn.commit()
        print('success ! ')
        # Return the item so that any later pipeline still receives it.
        return item
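The snippet above uses `self.conn` and `self.cur` without showing where they come from. One common arrangement is to open the connection in `open_spider` and close it in `close_spider`. The sketch below follows that shape but swaps in the standard-library `sqlite3` with an in-memory database in place of pymysql and MySQL, so that it runs without a server. The class name, the column types, and the sample item are assumptions, not the author's actual schema.

```python
import sqlite3


class SqliteHousePipeline:
    """Same shape as a Scrapy item pipeline, with sqlite3 standing in for MySQL."""

    FIELDS = ("house_name", "house_type", "house_area", "house_cost",
              "house_price", "house_add", "house_add_area")

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection here.
        self.conn = sqlite3.connect(":memory:")
        self.cur = self.conn.cursor()
        # Assumed schema: an auto-increment id followed by seven text columns.
        self.cur.execute(
            "create table ershoufang ("
            "id integer primary key autoincrement, "
            "house_name text, house_type text, house_area text, "
            "house_cost text, house_price text, house_add text, "
            "house_add_area text)"
        )

    def process_item(self, item, spider):
        # sqlite3 uses ? placeholders where pymysql uses %s.
        self.cur.execute(
            "insert into ershoufang values (null, ?, ?, ?, ?, ?, ?, ?)",
            tuple(item[name] for name in self.FIELDS),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: release the connection.
        self.conn.close()


# Drive the pipeline by hand with one made-up item.
pipeline = SqliteHousePipeline()
pipeline.open_spider(spider=None)
returned = pipeline.process_item(
    {"house_name": "Listing A", "house_type": "3 rooms", "house_area": "120",
     "house_cost": "150", "house_price": "12500", "house_add": "Chongchuan",
     "house_add_area": "Street 1"},
    spider=None,
)
row_count = pipeline.cur.execute("select count(*) from ershoufang").fetchone()[0]
pipeline.close_spider(spider=None)
```

With pymysql the structure would stay the same; only the connect call and the placeholder style differ.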
2. Plotting the second-hand house price lines
① Fetch the data from the database
    def get_house_info(name):
        # Fetch the rows whose house_add matches name.
        query = "select * from ershoufang where house_add like %s ;"
        # Pass the query parameters as a tuple, the usual DB-API form.
        cursor.execute(query, (name,))
        house_info = tuple_to_list(cursor.fetchall())
        return house_info

    def get_nt_house_info(name):
        # name is not used here: this returns every row, i.e. all of Nantong.
        cursor.execute("select * from ershoufang;")
        nt_house_info = tuple_to_list(cursor.fetchall())
        print('Fetched the Nantong second-hand housing data from the database!')
        return nt_house_info
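`tuple_to_list` is called above but not shown. One plausible version, purely a guess at what the helper does, converts the tuple of row tuples that `fetchall()` returns into a list of lists. The `like_pattern` helper is also hypothetical. It illustrates that `like` only matches substrings when the argument carries `%` wildcards, so the `name` passed to `get_house_info` probably needs them. The sample row is made up.

```python
def tuple_to_list(rows):
    """Convert a sequence of row tuples into a list of row lists (assumed behaviour)."""
    return [list(row) for row in rows]


def like_pattern(name):
    """Wrap name in % wildcards so that LIKE matches it anywhere in the column."""
    return "%" + name + "%"


# Made-up row in the assumed column order:
# id, name, layout, area, total price, unit price, district, street.
rows = ((1, "Listing A", "3 rooms", "120", "150", "12500", "Chongchuan", "Street 1"),)
house_info = tuple_to_list(rows)
pattern = like_pattern("Chongchuan")
```

Lists of lists are convenient here because `numpy.array` can turn them directly into a 2-D array in the next step.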
② Clean the data with numpy and build a one-variable linear regression model
    def numpy_deal(info):
        """Clean the data with numpy and fit two one-variable linear regression models."""
        info = numpy.array(info)
        # Columns 3, 4, 5: area (m²), total price (10,000 yuan), unit price (yuan per m²).
        # Each becomes a 1-D float array (the values arrive as strings).
        info_s = info[:, 3].astype(float)
        info_c = info[:, 4].astype(float)
        info_p = info[:, 5].astype(float)
        # Fit two lines: model 1 is area vs. total price, model 2 is area vs. unit price.
        mean_s = numpy.mean(info_s)
        mean_c = numpy.mean(info_c)
        mean_p = numpy.mean(info_p)
        # Model 1: area vs. total price
        w1 = numpy.sum((info_s - mean_s) * (info_c - mean_c)) / numpy.sum(numpy.power(info_s - mean_s, 2))
        b1 = mean_c - w1 * mean_s
        # Model 2: area vs. unit price
        w2 = numpy.sum((info_s - mean_s) * (info_p - mean_p)) / numpy.sum(numpy.power(info_s - mean_s, 2))
        b2 = mean_p - w2 * mean_s
        return info_s, info_c, info_p, w1, b1, w2, b2
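The slope and intercept above are the usual closed-form least-squares estimates: the slope is the sum of the products of the deviations from the means divided by the sum of squared deviations of the area, and the intercept is the mean of the target minus the slope times the mean area. As a sanity check, the sketch below compares that formula with `numpy.polyfit` on made-up data that lies exactly on a known line. The function name `fit_line` and the numbers are illustrative only.

```python
import numpy


def fit_line(x, y):
    """Closed-form least squares for y = w * x + b."""
    x = numpy.asarray(x, dtype=float)
    y = numpy.asarray(y, dtype=float)
    mean_x, mean_y = x.mean(), y.mean()
    w = numpy.sum((x - mean_x) * (y - mean_y)) / numpy.sum((x - mean_x) ** 2)
    b = mean_y - w * mean_x
    return w, b


# Made-up data lying exactly on y = 2x + 1.
area = [50, 80, 100, 120, 150]
cost = [101, 161, 201, 241, 301]

w, b = fit_line(area, cost)
# polyfit with degree 1 returns the coefficients as [slope, intercept].
w_ref, b_ref = numpy.polyfit(area, cost, 1)
```

Both approaches should agree up to floating-point error, which makes `numpy.polyfit` a handy cross-check for the hand-written formula.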
③ Plot the data
    # One fitted total-price line per area: all of Nantong (thicker), then each district or county-level city.
    plt.plot(s_list, nantong_pre, label='南通', color='black', linewidth=3)  # Nantong
    plt.plot(s_list, chongchuan_pre, label='崇川', color='red')      # Chongchuan
    plt.plot(s_list, gangzha_pre, label='港闸', color='orange')      # Gangzha
    plt.plot(s_list, tongzhou_pre, label='通州', color='yellow')     # Tongzhou
    plt.plot(s_list, haian_pre, label='海安', color='green')         # Hai'an
    plt.plot(s_list, rugao_pre, label='如皋', color='blue')          # Rugao
    plt.plot(s_list, rudong_pre, label='如东', color='indigo')       # Rudong
    plt.plot(s_list, haimen_pre, label='海门', color='violet')       # Haimen
    plt.plot(s_list, qidong_pre, label='启东', color='pink')         # Qidong
    # Title: "House price curves for the county-level cities of Nantong"
    plt.title('南通市县级市房价曲线', fontproperties=my_font)
    plt.xlabel('面积/m²', fontproperties=my_font)    # area / m²
    plt.ylabel('总价/万元', fontproperties=my_font)  # total price / 10,000 yuan
    plt.yticks(range(0, 501, 25))  # a tick every 25 from 0 to 500
    plt.grid(alpha=0.4)
    plt.legend(prop=my_font)
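The plotting code relies on `s_list` and the `*_pre` arrays, which the snippet does not define. A reasonable guess is that each `*_pre` holds a fitted line evaluated over a shared range of areas. The sketch below shows that idea. The area range, the coefficients, and the `predict` helper are all illustrative assumptions rather than the author's values.

```python
import numpy


def predict(s_values, w, b):
    """Evaluate the fitted line w * s + b at each area in s_values."""
    return w * numpy.asarray(s_values, dtype=float) + b


# Hypothetical area grid in m²: 50, 100, 150, 200.
s_list = numpy.arange(50, 201, 50)
# Hypothetical coefficients standing in for the w1, b1 returned by numpy_deal.
nantong_pre = predict(s_list, w=1.2, b=10.0)
```

Because every region is evaluated on the same `s_list`, the resulting lines can be drawn on one set of axes and compared directly.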
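A closing note: I am still a beginner myself, and I hope to become an expert one day. :)
If anything is unclear or could be improved, let's discuss. :)
Please credit the source if you repost. :)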