Commit f59f4b3

Merge pull request #82 from zillow/feature/window_density_model_improvements
Feature/window density model improvements
2 parents 3bcc374 + ee24284 commit f59f4b3

5 files changed

+216
-125
lines changed

docs/tutorial/streaming.rst

Lines changed: 191 additions & 105 deletions
@@ -6,124 +6,210 @@ Luminaire *WindowDensityModel* implements the idea of monitoring data over compa
66
.. image:: windows.png
77
:scale: 40%
88

9-
Although *WindowDensityModel* is designed to track anomalies over streaming data, it can be used to track any sustained fluctuations over a window for any frequency. This detection type is suggested for up to hourly data frequency.
9+
Although *WindowDensityModel* is designed to track anomalies over streaming data, it can be used to track anomalies even for low frequency time series. This detection type is suggested for up to hourly data frequency.
1010

11-
Anomaly Detection: Pre-Configured Settings
12-
------------------------------------------
11+
This window-based anomaly detection feature in Luminaire operates fully automatically: the underlying model detects the frequency at which the data has been observed, the optimal size of the window (using the periodic signals in the data) and the optimal detection method given some identified characteristics of the input time series. Moreover, the user can override the configuration for custom use cases.
1312

14-
Luminaire provides the capability to configure model parameters based on the frequency that the data has been observed and the methods that can be applied (please refer to the Window density Model user guide for detailed configuration options). Luminaire settings for the window density model are already pre-configured for some typical pandas frequency types and settings for any other frequency types should be configured manually (see the API reference for `Streaming Anomaly Detection Models <https://zillow.github.io/luminaire/api_reference/streaming.html>`_).
13+
Fully Automated Anomaly Detection using Time-windows
14+
----------------------------------------------------
15+
16+
Luminaire provides a fully automated anomaly detection method that tracks time series abnormalities over time-windows. Luminaire is capable of selecting the best possible setting by studying different characteristics of the input time series. Unlike the Luminaire outlier detection module, the window-based anomaly detection does not require running any separate configuration optimization to obtain the best hyperparameters. Rather, the automation process is embedded within the data exploration and the training process.
17+
18+
Similar to the outlier detection module, Luminaire Window Density Model comes with a streaming data profiling module to extract different characteristics about the high-frequency time series.
1519

1620
>>> from luminaire.model.window_density import WindowDensityHyperParams, WindowDensityModel
21+
>>> from luminaire.exploration.data_exploration import DataExploration
1722
>>> print(data)
18-
raw interpolated
19-
index
20-
2020-05-25 00:00:00 10585.0 10585.0
21-
2020-05-25 00:01:00 10996.0 10996.0
22-
2020-05-25 00:02:00 10466.0 10466.0
23-
2020-05-25 00:03:00 10064.0 10064.0
24-
2020-05-25 00:04:00 10221.0 10221.0
25-
... ... ...
26-
2020-06-16 23:55:00 11356.0 11356.0
27-
2020-06-16 23:56:00 10852.0 10852.0
28-
2020-06-16 23:57:00 11114.0 11114.0
29-
2020-06-16 23:58:00 10663.0 10663.0
30-
2020-06-16 23:59:00 11034.0 11034.0
31-
32-
>>> hyper_params = WindowDensityHyperParams(freq='T').params
33-
>>> wdm_obj = WindowDensityModel(hyper_params=hyper_params)
34-
>>> success, model = wdm_obj.train(data=data)
35-
>>> print(success, model)
36-
(True, <luminaire_models.model.window_density.WindowDensityModel object at 0x7f8cda42dcc0>)
37-
38-
The model object contains the data density structure over a pre-specified window, given the frequency. Luminaire sets the following defaults for some typical pandas frequencies (any custom requirements can be updated in the hyperparameter object instance):
39-
40-
- 'S': Hourly windows
41-
- 'T': 24 hours windows
42-
- '15T': 24 hours windows
43-
- 'H': 24 hours windows
44-
- 'D': 4 weeks windows
45-
- 'custom': User specified windows
46-
47-
In order to score a new window innovation given the trained model object, we have to provide a equal sized window that represents a similar time interval. For example, if each of the windows in the training data represents a 24 hour window between 9 AM to 8:59:59 AM (next day) for last few days, the scoring data should represent the same interval of a different day and should have the same window size.
23+
raw
24+
index
25+
2020-06-04 00:00:00 227798
26+
2020-06-04 00:10:00 224593
27+
2020-06-04 00:20:00 229400
28+
2020-06-04 00:30:00 217813
29+
2020-06-04 00:40:00 217862
30+
... ...
31+
2020-07-02 23:20:00 221226
32+
2020-07-02 23:30:00 218762
33+
2020-07-02 23:40:00 225726
34+
2020-07-02 23:50:00 220783
35+
2020-07-03 00:00:00 260981
36+
37+
>>> config = WindowDensityHyperParams().params
38+
>>> de_obj = DataExploration(**config)
39+
>>> data, pre_prc = de_obj.stream_profile(df=data)
40+
>>> print(data, pre_prc)
41+
raw interpolated
42+
2020-06-04 00:10:00 224593 224593.0
43+
2020-06-04 00:20:00 229400 229400.0
44+
2020-06-04 00:30:00 217813 217813.0
45+
2020-06-04 00:40:00 217862 217862.0
46+
2020-06-04 00:50:00 226861 226861.0
47+
... ... ...
48+
2020-07-02 23:20:00 221226 221226.0
49+
2020-07-02 23:30:00 218762 218762.0
50+
2020-07-02 23:40:00 225726 225726.0
51+
2020-07-02 23:50:00 220783 220783.0
52+
2020-07-03 00:00:00 260981 260981.0
53+
[4176 rows x 2 columns]
54+
{'success': True, 'freq': '0 days 00:10:00', 'window_length': 144, 'min_window_length': 10, 'max_window_length': 100000}
55+
56+
Luminaire *stream_profile* performs missing data imputation if necessary, extracts the frequency information and obtains the optimal size of the window to be monitored (if not specified by the user). All the information obtained by the profiler can be used to update the configuration for the actual training process.
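The numbers in the profiler output above can be cross-checked with plain date arithmetic. The sketch below uses only the Python standard library and the values printed above; it does not call Luminaire. It shows that a window of 144 observations at a 10-minute frequency spans one day, and that the 4176 profiled rows correspond to a whole number of such windows.

```python
from datetime import datetime, timedelta

# Values copied from the profiler output shown above.
freq = timedelta(minutes=10)   # 'freq': '0 days 00:10:00'
window_length = 144            # 'window_length': 144

# 144 observations at a 10-minute frequency span exactly one day.
window_span = freq * window_length
print(window_span)  # 1 day, 0:00:00

# First and last timestamps of the profiled data shown above.
start = datetime(2020, 6, 4, 0, 10)
end = datetime(2020, 7, 3, 0, 0)

# Number of observations on a regular grid, endpoints included.
n_rows = (end - start) // freq + 1
n_windows, remainder = divmod(n_rows, window_length)
print(n_rows, n_windows, remainder)  # 4176 29 0
```

The zero remainder is consistent with the profiled data starting at 00:10:00 rather than 00:00:00, although the exact trimming rule applied by the profiler is not stated here.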
57+
58+
>>> config.update(pre_prc)
59+
>>> wdm_obj = WindowDensityModel(hyper_params=config)
60+
>>> success, training_end, model = wdm_obj.train(data=data)
61+
>>> print(success, training_end, model)
62+
True 2020-07-03 00:00:00 <luminaire.model.window_density.WindowDensityModel object at 0x7fb6fab80b00>
63+
64+
The training process returns the success flag, the model timestamp and the trained model. The trained model is a collection of several sub-models that can score any equal-length time segment of the day, regardless of where in the day that segment starts.
65+
To score a new window given the trained model object, we have to provide an equal-sized time window. Moreover, Luminaire allows the user to perform basic processing (such as imputing missing index values) of the scoring window to get the data ready for scoring.
4866

4967
.. image:: window_train_score_auto.png
50-
:scale: 45%
51-
52-
>>> scoring_data
53-
raw interpolated
54-
index
55-
2020-06-17 00:00:00 11021.0 11021.0
56-
2020-06-17 00:01:00 10931.0 10931.0
57-
2020-06-17 00:02:00 10637.0 10637.0
58-
2020-06-17 00:03:00 10845.0 10845.0
59-
2020-06-17 00:04:00 10163.0 10163.0
60-
... ... ...
61-
2020-06-17 23:55:00 9680.0 9680.0
62-
2020-06-17 23:56:00 9985.0 9985.0
63-
2020-06-17 23:57:00 9363.0 9363.0
64-
2020-06-17 23:58:00 9686.0 9686.0
65-
2020-06-17 23:59:00 9220.0 9220.0
66-
67-
>>> scores = model.score(scoring_data)
68-
>>> print(scores)
69-
{'Success': True, 'ConfLevel': 99.9, 'IsAnomaly': False, 'AnomalyProbability': 0.6956745734841678}
70-
71-
Anomaly Detection: Manual Configuration
72-
---------------------------------------
73-
74-
There are several options in the *WindowDensityHyperParams* class that can be manually configured. The configuration should be selected mostly based on the frequency that the data has been observed.
68+
:scale: 100%
69+
70+
>>> print(scoring_data)
71+
raw
72+
index
73+
2020-07-03 00:00:00 260981
74+
2020-07-03 00:10:00 274249
75+
2020-07-03 00:20:00 293194
76+
2020-07-03 00:30:00 272722
77+
2020-07-03 00:40:00 276930
78+
... ...
79+
2020-07-03 23:10:00 287773
80+
2020-07-03 23:20:00 255438
81+
2020-07-03 23:30:00 277127
82+
2020-07-03 23:40:00 266263
83+
2020-07-03 23:50:00 275432
84+
>>> freq = model._params['freq']
85+
>>> de_obj = DataExploration(freq=freq)
86+
>>> processed_data, pre_prc = de_obj.stream_profile(df=scoring_data, impute_only=True, impute_zero=True)
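As a rough illustration of what this preprocessing step is for, the sketch below places a few observations on a regular 10-minute grid and writes 0 wherever a timestamp is missing. It assumes that ``impute_zero=True`` means zero-filling of missing index values; the helper ``fill_missing_with_zero`` and the gap at 00:20 are hypothetical and are not part of Luminaire.

```python
from datetime import datetime, timedelta


def fill_missing_with_zero(observations, start, end, freq):
    """Place observations on a regular grid, using 0 for missing timestamps.

    Hypothetical helper for illustration only; not Luminaire's implementation.
    """
    filled = {}
    ts = start
    while ts <= end:
        filled[ts] = observations.get(ts, 0)
        ts += freq
    return filled


freq = timedelta(minutes=10)
raw = {
    datetime(2020, 7, 3, 0, 0): 260981,
    datetime(2020, 7, 3, 0, 10): 274249,
    # 00:20 is missing in this made-up feed
    datetime(2020, 7, 3, 0, 30): 272722,
}

window = fill_missing_with_zero(
    raw, datetime(2020, 7, 3, 0, 0), datetime(2020, 7, 3, 0, 30), freq
)
print(list(window.values()))  # [260981, 274249, 0, 272722]
```

After this step every scoring window has the same number of observations, which is what the equal-sized window requirement needs.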
87+
88+
The processed data can be used to score as:
89+
90+
>>> score, scored_window = model.score(processed_data)
91+
>>> print(score)
92+
{'Success': True, 'ConfLevel': 99.9, 'IsAnomaly': True, 'AnomalyProbability': 1.0}
93+
94+
Users can also score rolling (overlapping) windows instead of sequential windows for use cases that need more frequent anomaly detection.
95+
96+
>>> print(scoring_data)
97+
raw
98+
index
99+
2020-07-02 12:10:00 203836
100+
2020-07-02 12:20:00 209813
101+
2020-07-02 12:30:00 206271
102+
2020-07-02 12:40:00 209135
103+
2020-07-02 12:50:00 207085
104+
... ...
105+
2020-07-03 11:20:00 255009
106+
2020-07-03 11:30:00 260246
107+
2020-07-03 11:40:00 248541
108+
2020-07-03 11:50:00 246094
109+
2020-07-03 12:00:00 252223
110+
>>> freq = model._params['freq']
111+
>>> de_obj = DataExploration(freq=freq)
112+
>>> processed_data, pre_prc = de_obj.stream_profile(df=scoring_data, impute_only=True, impute_zero=True)
113+
>>> score, scored_window = model.score(processed_data)
114+
>>> print(score)
115+
{'Success': True, 'ConfLevel': 99.9, 'IsAnomaly': True, 'AnomalyProbability': 0.9999867236}
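The difference between sequential and rolling scoring windows can be sketched with index arithmetic alone. The helper ``window_bounds`` below is hypothetical and independent of Luminaire. It assumes a window of 144 observations and, for the rolling case, a step of 72 observations (12 hours at a 10-minute frequency), which matches the 12:10 start of the scoring window shown above.

```python
def window_bounds(n_obs, window_length, step):
    """Return (start, stop) index pairs for fixed-length windows.

    Hypothetical helper for illustration only.
    """
    return [
        (s, s + window_length)
        for s in range(0, n_obs - window_length + 1, step)
    ]


window_length = 144
n_obs = 3 * window_length  # three days of 10-minute data

# Sequential windows: the step equals the window length, so nothing overlaps.
sequential = window_bounds(n_obs, window_length, step=window_length)
# Rolling windows: a smaller step makes consecutive windows overlap.
rolling = window_bounds(n_obs, window_length, step=window_length // 2)

print(sequential)  # [(0, 144), (144, 288), (288, 432)]
print(rolling)     # [(0, 144), (72, 216), (144, 288), (216, 360), (288, 432)]
```

Every window has the same length in both cases; only the scoring cadence changes.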
116+
117+
Reusing Past Trained Model
118+
^^^^^^^^^^^^^^^^^^^^^^^^^^
119+
120+
The Luminaire Window Density model can also ingest a previously trained model in future model trainings. This can be part of a sequential process that always passes the last trained model to the next training. Doing so accumulates richer data for more reliable scores, especially when the training history is limited to a fixed-length rolling window. This way, the model is able to keep a longer history as metadata even though the actual training history is limited.
121+
122+
>>> past_model = model  # the WindowDensityModel object trained above
123+
>>> print(new_training_data)
124+
raw
125+
index
126+
2020-06-04 00:00:00 227798
127+
2020-06-04 00:10:00 224593
128+
2020-06-04 00:20:00 229400
129+
2020-06-04 00:30:00 217813
130+
2020-06-04 00:40:00 217862
131+
... ...
132+
2020-07-03 23:10:00 287773
133+
2020-07-03 23:20:00 255438
134+
2020-07-03 23:30:00 277127
135+
2020-07-03 23:40:00 266263
136+
2020-07-03 23:50:00 275432
137+
>>> success, training_end, model = wdm_obj.train(data=new_training_data, past_model=past_model)
138+
139+
Anomaly Detection using Time-windows: Manual Configuration
140+
----------------------------------------------------------
141+
142+
There are several options in the *WindowDensityHyperParams* class that can be configured manually. The user can select the desired window size, whether all previous windows or only the last window should be used to identify anomalies, the detection method, how to handle nonstationarity and periodicity present in the data, and so on. Please refer to the API reference for `Streaming Anomaly Detection Models <https://zillow.github.io/luminaire/api_reference/streaming.html>`_.
75143

76144
>>> from luminaire.model.window_density import WindowDensityHyperParams, WindowDensityModel
77145
>>> print(data)
78-
raw interpolated
79-
index
80-
2020-05-20 00:03:00 6393.451190 6393.451190
81-
2020-05-20 00:13:00 6491.426190 6491.426190
82-
2020-05-20 00:23:00 6770.469444 6770.469444
83-
2020-05-20 00:33:00 6490.798810 6490.798810
84-
2020-05-20 00:43:00 6273.786508 6273.786508
85-
... ... ...
86-
2020-06-09 23:13:00 5619.341270 5619.341270
87-
2020-06-09 23:23:00 5573.001190 5573.001190
88-
2020-06-09 23:33:00 5745.400000 5745.400000
89-
2020-06-09 23:43:00 5761.355556 5761.355556
90-
2020-06-09 23:53:00 5558.577778 5558.577778
91-
>>>hyper_params = WindowDensityHyperParams(freq='custom',
92-
detection_method='kldiv',
93-
baseline_type="last_window",
94-
min_window_length=6*12,
95-
max_window_length=6*24*84,
96-
window_length=6*24,
97-
ma_window_length=24,
98-
).params
99-
>>> wdm_obj = WindowDensityModel(hyper_params=hyper_params)
100-
>>> success, model = wdm_obj.train(data=data)
101-
>>> print(success, model)
102-
(True, <luminaire_models.model.window_density.WindowDensityModel object at 0x7f8d5f1a6940>)
103-
104-
The trained model object can be used to score data representing the same interval from a different day and having the same window size.
146+
raw
147+
index
148+
2020-06-04 00:00:00 227798
149+
2020-06-04 00:10:00 224593
150+
2020-06-04 00:20:00 229400
151+
2020-06-04 00:30:00 217813
152+
2020-06-04 00:40:00 217862
153+
... ...
154+
2020-07-02 23:20:00 221226
155+
2020-07-02 23:30:00 218762
156+
2020-07-02 23:40:00 225726
157+
2020-07-02 23:50:00 220783
158+
2020-07-03 00:00:00 218315
159+
>>> config = WindowDensityHyperParams(freq='10T',
160+
detection_method='kldiv',
161+
baseline_type="last_window",
162+
window_length=6*6,
163+
detrend_method='modeling'
164+
).params
165+
>>> de_obj = DataExploration(**config)
166+
>>> data, pre_prc = de_obj.stream_profile(df=data)
167+
>>> print(data, pre_prc)
168+
raw interpolated
169+
2020-06-05 00:10:00 227504 227504.0
170+
2020-06-05 00:20:00 225664 225664.0
171+
2020-06-05 00:30:00 227586 227586.0
172+
2020-06-05 00:40:00 223805 223805.0
173+
2020-06-05 00:50:00 222679 222679.0
174+
... ... ...
175+
2020-07-02 23:20:00 221226 221226.0
176+
2020-07-02 23:30:00 218762 218762.0
177+
2020-07-02 23:40:00 225726 225726.0
178+
2020-07-02 23:50:00 220783 220783.0
179+
2020-07-03 00:00:00 218315 218315.0
180+
[4032 rows x 2 columns]
181+
{'success': True, 'freq': '10T', 'window_length': 36, 'min_window_length': 10, 'max_window_length': 100000}
182+
>>> config.update(pre_prc)
183+
>>> wdm_obj = WindowDensityModel(hyper_params=config)
184+
>>> success, training_end, model = wdm_obj.train(data=data)
185+
>>> print(success, training_end, model)
186+
True 2020-07-03 00:00:00 <luminaire.model.window_density.WindowDensityModel object at 0x7ff33ef74550>
187+
188+
The trained model object can be used to score the data of a similar window size.
105189

106190
.. image:: window_train_score_manual.png
107-
:scale: 45%
108-
109-
>>> scoring_data
110-
raw interpolated
111-
index
112-
2020-06-10 00:00:00 5532.556746 5532.556746
113-
2020-06-10 00:10:00 5640.711905 5640.711905
114-
2020-06-10 00:20:00 5880.368254 5880.368254
115-
2020-06-10 00:30:00 5842.397222 5842.397222
116-
2020-06-10 00:40:00 5827.231746 5827.231746
117-
... ... ...
118-
2020-06-10 23:10:00 7210.905952 7210.905952
119-
2020-06-10 23:20:00 5739.459524 5739.459524
120-
2020-06-10 23:30:00 5590.413889 5590.413889
121-
2020-06-10 23:40:00 5608.291270 5608.291270
122-
2020-06-10 23:50:00 5753.794444 5753.794444
123-
>>> scores = model.score(scoring_data)
124-
>>> print(scores)
125-
{'Success': True, 'ConfLevel': 99.9, 'IsAnomaly': True, 'AnomalyProbability': 0.9999999851834622}
191+
:scale: 100%
126192

193+
>>> print(data)
194+
raw
195+
index
196+
2020-07-03 06:10:00 222985
197+
2020-07-03 06:20:00 210951
198+
2020-07-03 06:30:00 210094
199+
2020-07-03 06:40:00 215166
200+
2020-07-03 06:50:00 212968
201+
... ...
202+
2020-07-03 11:20:00 209008
203+
2020-07-03 11:30:00 211170
204+
2020-07-03 11:40:00 203302
205+
2020-07-03 11:50:00 204498
206+
2020-07-03 12:00:00 203234
207+
>>> freq = model._params['freq']
208+
>>> de_obj = DataExploration(freq=freq)
209+
>>> processed_data, pre_prc = de_obj.stream_profile(df=data, impute_only=True, impute_zero=True)
210+
>>> score, scored_window = model.score(processed_data)
211+
>>> print(score)
212+
{'Success': True, 'ConfLevel': 99.9, 'IsAnomaly': False, 'AnomalyProbability': 0.330817121756509}
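To give some intuition for a Kullback-Leibler style comparison between a scoring window and a baseline window, which is what the ``detection_method='kldiv'`` and ``baseline_type="last_window"`` options suggest, here is a minimal standard-library sketch. It is not Luminaire's implementation: the binning, the add-one smoothing, the helper names and the toy values are all assumptions made for illustration.

```python
import math


def histogram(values, lo, hi, n_bins):
    """Normalized histogram with add-one smoothing (illustrative choice)."""
    width = (hi - lo) / n_bins
    counts = [1.0] * n_bins
    for v in values:
        # Clamp so that v == hi falls into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Discrete KL divergence D(p || q) for strictly positive p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Toy windows (made-up values): a baseline, a similar window, a shifted window.
baseline = [210, 212, 215, 211, 209, 214, 213, 210]
similar = [211, 213, 214, 210, 209, 215, 212, 211]
shifted = [260, 274, 293, 272, 276, 287, 255, 277]

lo = min(baseline + similar + shifted)
hi = max(baseline + similar + shifted)
n_bins = 6

p_base = histogram(baseline, lo, hi, n_bins)
kl_similar = kl_divergence(histogram(similar, lo, hi, n_bins), p_base)
kl_shifted = kl_divergence(histogram(shifted, lo, hi, n_bins), p_base)

print(kl_similar < kl_shifted)  # True
```

A window whose distribution resembles the baseline yields a divergence near zero, while a shifted window yields a larger one. How Luminaire turns such a quantity into ``AnomalyProbability`` is not covered by this sketch.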
127213

128214

129215
