The proejct1(6.17) folder documents the data science project I worked on from 6.17 to 8.19. It contains the news article data I scraped from the ChinaDaily website, along with the NLP and model building done on that data.
scrapy: contains the Scrapy project (the tool I mainly used for scraping the data), with 37 spiders that scrape office, shop, and residential listings from multiple sources. Rotating IP proxies and fake user agents are used to deal with anti-crawling detection and blocking. All scraped data can be further loaded into a database.
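The project's actual middleware lives in the scrapy folder; below is only a minimal sketch of the rotating user-agent/proxy idea using Scrapy's downloader-middleware hook. The middleware name and the user-agent/proxy pools are illustrative assumptions, not the project's real lists.

```python
import random

# Hypothetical pools; the real project loads its own user agents and proxies.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

class RotatingAgentProxyMiddleware:
    """Downloader middleware: attach a random user agent and proxy to each request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        request.meta["proxy"] = random.choice(PROXIES)
        return None  # let Scrapy continue handling the request

# Enable it in settings.py, e.g.:
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RotatingAgentProxyMiddleware": 543}
```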
BUY: all real estate listings that are for sale, covering 5 cities (Beijing, Shanghai, Chengdu, Shenzhen, Guangzhou) and 3 building types. Also connected to the Gaode (AMap) API to obtain location coordinates (see the sketch below).
RENT: all real estate listings that are for rent, covering the same 5 cities (Beijing, Shanghai, Chengdu, Shenzhen, Guangzhou) and 3 building types. Also connected to the Gaode (AMap) API to obtain location coordinates.
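A minimal sketch of geocoding a listing address through Gaode's public web geocoding endpoint, as used to attach coordinates to the BUY/RENT data. The key placeholder and error handling are assumptions; the response fields follow Gaode's public geocoding documentation and may need adjusting.

```python
import requests

AMAP_KEY = "YOUR_GAODE_KEY"  # assumption: a valid Gaode Web API key

def geocode(address, city):
    """Return a "lng,lat" string for an address via the Gaode geocoding API, or None."""
    resp = requests.get(
        "https://restapi.amap.com/v3/geocode/geo",
        params={"key": AMAP_KEY, "address": address, "city": city},
        timeout=10,
    )
    data = resp.json()
    if data.get("status") == "1" and data.get("geocodes"):
        return data["geocodes"][0]["location"]  # e.g. "121.50,31.24"
    return None

print(geocode("陆家嘴环路1000号", "上海"))
```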
For all the files (including data) and the reference used for model building, see the link: https://pan.baidu.com/s/1rraiuGeCXP5Oe5Xo5QK9VA
Conventional neural network models, built following the reference, are used to predict house prices in the Shanghai area.
The results show a significant improvement in prediction accuracy compared with traditional machine learning models.
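A minimal sketch of the kind of fully connected regression network described above. The framework (Keras), layer sizes, and column names (total_price, avg_price) are assumptions for illustration; the actual model in the repo may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

# Assumption: data_r.csv holds the 27 numeric input attributes plus the two price columns.
df = pd.read_csv("data_r.csv")
X = df.drop(columns=["total_price", "avg_price"]).values  # hypothetical column names
y = df["total_price"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Simple fully connected network regressing the total price
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(X_train.shape[1],)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.fit(X_train, y_train, epochs=50, batch_size=64, validation_split=0.1)
print(model.evaluate(X_test, y_test))
```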
1. tocsv.py — selects the useful attributes, extracts the data into data_c.csv, and filters out unreasonable points and outliers. Some attributes (attrs = ['decoration_condition', 'elevator', 'ownership', 'framework']) have unknown values; these were initially replaced with -1 and treated as a new category, but the resulting prediction performance was not ideal. The unknown values were then predicted with a random forest (RandomForestClassifier; see the sketch after this list), and the processed data was written to data_r.csv.
(27 input attributes + 2 output attributes, the average house price and the total house price; the total house price is the prediction target.)
(61,609 records)
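A minimal sketch of the random forest imputation step described in item 1, assuming the unknown values are still marked as -1 in data_c.csv and the four attributes are already numerically encoded; the exact preprocessing in tocsv.py may differ.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("data_c.csv")
attrs = ['decoration_condition', 'elevator', 'ownership', 'framework']

# Use the other numeric columns as predictors (illustrative choice).
feature_cols = [c for c in df.columns if c not in attrs and df[c].dtype != object]

for col in attrs:
    known = df[df[col] != -1]      # rows where the attribute is known
    unknown = df[df[col] == -1]    # rows where -1 was used as a placeholder
    if unknown.empty:
        continue
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(known[feature_cols], known[col])
    df.loc[df[col] == -1, col] = clf.predict(unknown[feature_cols])

df.to_csv("data_r.csv", index=False)
```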