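1. Introduction
Data cleaning mainly consists of removing irrelevant and duplicate data from the raw dataset, smoothing noisy data, filtering out data unrelated to the mining topic, and handling missing values and outliers.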
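2. Handling Missing Values
There are three broad approaches to missing values: deleting the affected records, imputing the missing data, and leaving the data untreated. Common imputation methods are shown in the figure below:
[Figure: common data imputation methods]
Two interpolation methods are introduced here: Lagrange interpolation and Newton interpolation.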
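The three approaches can be sketched with pandas on a toy column. The data, the column name `value`, and the two imputation choices (mean fill and linear interpolation) are assumptions picked for illustration, not necessarily the methods listed in the figure.

```python
import numpy as np
import pandas as pd

# Toy data with two missing entries (illustrative values only).
df = pd.DataFrame({"value": [1.0, np.nan, 3.0, np.nan, 5.0]})

# 1) Delete records that contain a missing value.
dropped = df.dropna()

# 2) Impute: replace missing values with the column mean ...
mean_filled = df.fillna(df["value"].mean())

# ... or estimate them from neighbouring rows by linear interpolation.
linear_filled = df.interpolate(method="linear")

# 3) Leave the data untreated; some models can handle NaN themselves.
untouched = df

print(dropped)
print(mean_filled)
print(linear_filled)
```

Deleting records is the simplest option but discards information, so imputation is usually preferred when the dataset is small.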
2.1 Lagrange interpolation
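For reference, the standard textbook form of the Lagrange interpolating polynomial is given below. The notation is chosen here for illustration: (x_0, y_0), ..., (x_n, y_n) are the known points with distinct x_j, and L(x) is the unique polynomial of degree at most n that passes through all of them.

```latex
L(x) = \sum_{j=0}^{n} y_j \, \ell_j(x),
\qquad
\ell_j(x) = \prod_{\substack{m=0 \\ m \neq j}}^{n} \frac{x - x_m}{x_j - x_m}
```

Each basis polynomial \ell_j equals 1 at x_j and 0 at every other node, so L(x_j) = y_j. A missing value at position x is estimated as L(x).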
2.2 Newton interpolation
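Newton's form writes the same interpolating polynomial as N(x) = f[x0] + f[x0,x1](x - x0) + ... + f[x0,...,xn](x - x0)...(x - x(n-1)), where the coefficients are divided differences. The following is a minimal sketch of that idea; the function names `divided_differences` and `newton_eval` and the sample points are hypothetical, introduced only for this example.

```python
def divided_differences(xs, ys):
    """Return the Newton coefficients f[x0], f[x0,x1], ..., f[x0..xn]."""
    coef = [float(y) for y in ys]
    n = len(xs)
    for order in range(1, n):
        # Work from the end so lower-order entries are still available.
        for i in range(n - 1, order - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - order])
    return coef


def newton_eval(xs, coef, x):
    """Evaluate the Newton form at x using nested multiplication."""
    result = coef[-1]
    for i in range(len(coef) - 2, -1, -1):
        result = result * (x - xs[i]) + coef[i]
    return result


# Sample points taken from y = x**2 (illustrative only).
xs = [1, 2, 4]
ys = [1, 4, 16]
coef = divided_differences(xs, ys)
print(coef)                      # [1.0, 3.0, 1.0]
print(newton_eval(xs, coef, 3))  # 9.0
```

Because the interpolating polynomial through a given set of points is unique, Newton's form yields the same values as Lagrange's. Its practical advantage is that adding a new data point only requires one additional coefficient rather than recomputing everything.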
3. Example: Lagrange interpolation
Out[35]:
     日期          销量
0    2015-03-01    51.0
1    2015-02-28    2618.2
2    2015-02-27    2608.4
3    2015-02-26    2651.9
4    2015-02-25    3442.1
5    2015-02-24    3393.1
6    2015-02-23    3136.6
7    2015-02-22    3744.1
8    2015-02-21    6607.4
9    2015-02-20    4060.3
10   2015-02-19    3614.7
11   2015-02-18    3295.5
12   2015-02-16    2332.1
13   2015-02-15    2699.3
14   2015-02-14    NaN
15   2015-02-13    3036.8
16   2015-02-12    865.0
17   2015-02-11    3014.3
18   2015-02-10    2742.8
19   2015-02-09    2173.5
20   2015-02-08    3161.8
21   2015-02-07    3023.8
22   2015-02-06    2998.1
23   2015-02-05    2805.9
24   2015-02-04    2383.4
25   2015-02-03    2620.2
26   2015-02-02    2600.0
27   2015-02-01    2358.6
28   2015-01-31    2682.2
29   2015-01-30    2766.8
...  ...           ...
171  2014-08-31    3494.7
172  2014-08-30    3691.9
173  2014-08-29    2929.5
174  2014-08-28    2760.6
175  2014-08-27    2593.7
176  2014-08-26    2884.4
177  2014-08-25    2591.3
178  2014-08-24    3022.6
179  2014-08-23    3052.1
180  2014-08-22    2789.2
181  2014-08-21    2909.8
182  2014-08-20    2326.8
183  2014-08-19    2453.1
184  2014-08-18    2351.2
185  2014-08-17    3279.1
186  2014-08-16    3381.9
187  2014-08-15    2988.1
188  2014-08-14    2577.7
189  2014-08-13    2332.3
190  2014-08-12    2518.6
191  2014-08-11    2697.5
192  2014-08-10    3244.7
193  2014-08-09    3346.7
194  2014-08-08    2900.6
195  2014-08-07    2759.1
196  2014-08-06    2915.8
197  2014-08-05    2618.1
198  2014-08-04    2993.0
199  2014-08-03    3436.4
200  2014-08-02    2261.7
201 rows × 2 columns
D:\Anaconda3\lib\site-packages\ipykernel\__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
if __name__ == '__main__':
Out[43]:
1 2618.200000
2 2608.400000
3 2651.900000
4 3442.100000
5 3393.100000
6 3136.600000
7 3744.100000
8 4275.254762
9 4060.300000
Name: 销量, dtype: float64
D:\Anaconda3\lib\site-packages\ipykernel\__main__.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
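The SettingWithCopyWarning messages above typically come from chained assignment of the form `df[col][mask] = value`. The sketch below shows one way to obtain output like Out[43] without that warning by assigning through `.loc`. It uses rows 1 to 13 of the 销量 (sales) column from Out[35]. The outlier thresholds (400 and 5000), the window of five neighbours on each side, and the helper names `lagrange_at` and `interpolate_missing` are assumptions made for this example, not taken from the original code. With this window, the value for row 8 agrees with the 4275.254762 shown in Out[43].

```python
import pandas as pd


def lagrange_at(xs, ys, x):
    """Evaluate the Lagrange polynomial through (xs, ys) at the point x."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        weight = 1.0
        for m, xm in enumerate(xs):
            if m != j:
                weight *= (x - xm) / (xj - xm)
        total += yj * weight
    return total


def interpolate_missing(s, k=5):
    """Fill each NaN in s from up to k non-missing neighbours on each side."""
    filled = s.copy()
    for pos in range(len(s)):
        if pd.isna(s.iloc[pos]):
            # The window includes the missing entry itself; dropna removes it.
            window = s.iloc[max(0, pos - k):pos + k + 1].dropna()
            filled.iloc[pos] = lagrange_at(
                list(window.index), list(window.values), s.index[pos]
            )
    return filled


# Rows 1..13 of the sales column as shown in Out[35].
sales = pd.DataFrame(
    {"销量": [2618.2, 2608.4, 2651.9, 3442.1, 3393.1, 3136.6, 3744.1,
              6607.4, 4060.3, 3614.7, 3295.5, 2332.1, 2699.3]},
    index=range(1, 14),
)

# Assumed thresholds: treat values outside [400, 5000] as outliers.
low, high = 400, 5000
outlier = (sales["销量"] < low) | (sales["销量"] > high)

# Assigning through .loc modifies the frame directly (no chained indexing).
sales.loc[outlier, "销量"] = float("nan")
sales["销量"] = interpolate_missing(sales["销量"])

print(sales.loc[1:9, "销量"])  # row 8 is now about 4275.2548
```

`scipy.interpolate.lagrange` could be used in place of the hand-written `lagrange_at`. A small window is advisable either way, because high-degree Lagrange polynomials can oscillate strongly between nodes.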