Anomaly Models

Models which implement an .anomaly(X, y) method and can be served under the model server's /anomaly/prediction endpoint.

AnomalyDetectorBase

The base class for all other anomaly detector models.

class gordo.machine.model.anomaly.base.AnomalyDetectorBase(**kwargs)[source]

Bases: sklearn.base.BaseEstimator, gordo.machine.model.base.GordoBase

Initialize the model

abstract anomaly(X: Union[pandas.core.frame.DataFrame, xarray.core.dataarray.DataArray], y: Union[pandas.core.frame.DataFrame, xarray.core.dataarray.DataArray], frequency: Optional[datetime.timedelta] = None) → Union[pandas.core.frame.DataFrame, xarray.core.dataset.Dataset][source]

Takes X, y and, optionally, frequency, and returns a dataframe containing anomaly score(s).

DiffBasedAnomalyDetector

Calculates the absolute differences between y and the prediction yhat, as well as the total error between the two matrices via numpy.linalg.norm(..., axis=1).
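The two quantities described above can be illustrated with plain NumPy. This is a minimal sketch, not the gordo implementation: the arrays are made up, and it assumes the norm is taken over the difference y - yhat, which the description elides.

```python
import numpy as np

# Made-up target and prediction matrices: rows are samples, columns are tags.
y = np.array([[1.0, 2.0], [3.0, 4.0]])
yhat = np.array([[1.0, 4.0], [0.0, 0.0]])

# Per-tag absolute differences, same shape as y.
abs_diff = np.abs(y - yhat)

# One aggregate error per sample: Euclidean norm across the tag axis.
# Assumption: the norm is applied to the difference y - yhat.
total_error = np.linalg.norm(y - yhat, axis=1)
```

With axis=1 the norm collapses the columns, so each row (timestamp) gets a single error value alongside its per-tag differences.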

class gordo.machine.model.anomaly.diff.DiffBasedAnomalyDetector(base_estimator: sklearn.base.BaseEstimator = tensorflow.keras.wrappers.scikit_learn.KerasRegressor, scaler: sklearn.base.TransformerMixin = MinMaxScaler(), require_thresholds: bool = True, shuffle: bool = False, window: Optional[int] = None, smoothing_method: Optional[str] = None)[source]

Bases: gordo.machine.model.anomaly.base.AnomalyDetectorBase

Estimator which wraps a base_estimator and provides a diff error based approach to anomaly detection.

It trains a scaler to the target after training, purely for error calculations. The underlying base_estimator is trained with the original, unscaled, y.
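The role of the scaler can be sketched as follows. This is an illustration under assumptions, not the library's code: the data is invented, and MinMaxScaler is used because it is the default shown in the signature above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented target and prediction; the two columns live on very different scales.
y = np.array([[0.0, 100.0], [10.0, 300.0]])
yhat = np.array([[1.0, 150.0], [9.0, 300.0]])

# The scaler is fitted on the target only and is used purely for error calculation.
scaler = MinMaxScaler().fit(y)

# Errors in scaled space are comparable across columns.
scaled_diff = np.abs(scaler.transform(yhat) - scaler.transform(y))

# Errors in the original units are dominated by the large-valued column.
unscaled_diff = np.abs(yhat - y)
```

Scaling only the error calculation leaves the base_estimator free to learn on the original y while keeping per-tag errors on a common scale.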

Threshold calculation is based on a rolling statistic of the validation errors on the last fold of cross-validation.

Parameters
  • base_estimator (sklearn.base.BaseEstimator) – The model whose normal .fit and .predict methods will be used. Defaults to gordo.machine.model.models.KerasAutoEncoder with kind='feedforward_hourglass'.

  • scaler (sklearn.base.TransformerMixin) – Used for transforming the model output and the original y to calculate the difference/error between model output and expected output. Defaults to sklearn.preprocessing.MinMaxScaler, as shown in the signature above.

  • require_thresholds (bool) – Requires calculating thresholds_ via a call to cross_validate(). If this is set (default True) and cross_validate() has not been called before anomaly(), an AttributeError is raised.

  • shuffle (bool) – Whether to shuffle the data in .fit so that the model, if relevant, is trained on a sample of data across the whole time range and not just the last elements, as determined by the model argument validation_split.

  • window (int) – Window size for smoothed thresholds

  • smoothing_method (str) – Method to be used together with window to smooth metrics. Must be one of ‘smm’ (simple moving median), ‘sma’ (simple moving average) or ‘ewma’ (exponential weighted moving average).
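The three smoothing methods listed above map naturally onto pandas operations. The sketch below is illustrative only: the error series is invented, and using span=window for the exponential weighting is an assumption, since the parameterization is not stated here.

```python
import pandas as pd

# Invented error series with one spike at position 2.
errors = pd.Series([1.0, 2.0, 9.0, 4.0, 5.0])
window = 3

# 'smm': simple moving median, robust to isolated spikes.
smm = errors.rolling(window).median()

# 'sma': simple moving average.
sma = errors.rolling(window).mean()

# 'ewma': exponential weighted moving average.
# Assumption: the window is passed as the span of the exponential weights.
ewma = errors.ewm(span=window).mean()
```

The rolling variants yield NaN until a full window is available, whereas the exponentially weighted average produces a value from the first sample onwards.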

anomaly(X: Union[pandas.core.frame.DataFrame, xarray.core.dataarray.DataArray], y: Union[pandas.core.frame.DataFrame, xarray.core.dataarray.DataArray], frequency: Optional[datetime.timedelta] = None) → Union[pandas.core.frame.DataFrame, xarray.core.dataset.Dataset][source]

Create an anomaly dataframe from the provided base dataframe.

Parameters
  • X (pd.DataFrame) – Dataframe representing the data to go into the model.

  • y (pd.DataFrame) – Dataframe representing the target output of the model.

Returns

A superset of the original base dataframe with added anomaly specific features

Return type

pd.DataFrame
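The interaction between require_thresholds, cross_validate() and anomaly() can be pictured with a toy class. Everything in this sketch is hypothetical: ThresholdGuardSketch is not part of gordo, and taking the maximum validation error as the threshold is a placeholder rather than the library's calculation.

```python
class ThresholdGuardSketch:
    """Hypothetical stand-in that mimics only the call-order requirement."""

    def __init__(self, require_thresholds=True):
        self.require_thresholds = require_thresholds

    def cross_validate(self, errors):
        # Placeholder threshold: the largest validation error seen.
        self.thresholds_ = max(errors)
        return {"thresholds": self.thresholds_}

    def anomaly(self, errors):
        # Refuse to score if thresholds are required but were never computed.
        if self.require_thresholds and not hasattr(self, "thresholds_"):
            raise AttributeError("call cross_validate() before anomaly()")
        threshold = getattr(self, "thresholds_", None)
        # Flag each error against the threshold when one is available.
        return [
            {"error": e, "above_threshold": threshold is not None and e > threshold}
            for e in errors
        ]


detector = ThresholdGuardSketch(require_thresholds=True)
try:
    detector.anomaly([0.1, 0.9])
    raised = False
except AttributeError:
    raised = True

detector.cross_validate([0.2, 0.5])
result = detector.anomaly([0.1, 0.9])
```

Setting require_thresholds=False in this sketch lets anomaly() run without thresholds, at the cost of having nothing to compare the errors against.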

cross_validate(*, X: Union[pandas.core.frame.DataFrame, numpy.ndarray], y: Union[pandas.core.frame.DataFrame, numpy.ndarray], cv=TimeSeriesSplit(max_train_size=None, n_splits=3), **kwargs)[source]

Runs time series cross-validation on the model and updates the model’s threshold values based on the cross-validation folds.

Parameters
  • X (Union[pd.DataFrame, np.ndarray]) – Input data to the model

  • y (Union[pd.DataFrame, np.ndarray]) – Target data

  • kwargs (dict) – Any additional kwargs to be passed to sklearn.model_selection.cross_validate()

Returns

Return type

dict
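To see what the default cv splitter does, the following sketch runs scikit-learn's TimeSeriesSplit on a small invented array. It shows only the fold layout, not the threshold calculation that cross_validate() performs on top of it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve invented, time-ordered samples.
X = np.arange(12).reshape(-1, 1)

# Same splitter configuration as the default cv argument above.
cv = TimeSeriesSplit(n_splits=3)

# Each fold trains on an expanding prefix and validates on the block that follows.
folds = [(train.tolist(), test.tolist()) for train, test in cv.split(X)]
```

Because training indices always precede validation indices, no fold validates on data older than what it was trained on, and the last fold validates on the most recent block.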

fit(X: numpy.ndarray, y: numpy.ndarray)[source]

get_metadata()[source]

Generates model metadata.

Returns

Return type

dict

get_params(deep=True)[source]

Get parameters for this estimator.

Returns

Return type

dict

score(X: Union[numpy.ndarray, pandas.core.frame.DataFrame], y: Union[numpy.ndarray, pandas.core.frame.DataFrame], sample_weight: Optional[numpy.ndarray] = None) → float[source]

Score the model; must implement the correct default scorer based on model type

class gordo.machine.model.anomaly.diff.DiffBasedKFCVAnomalyDetector(base_estimator: sklearn.base.BaseEstimator = tensorflow.keras.wrappers.scikit_learn.KerasRegressor, scaler: sklearn.base.TransformerMixin = MinMaxScaler(), require_thresholds: bool = True, shuffle: bool = True, window: int = 144, smoothing_method: str = 'smm', threshold_percentile: float = 0.99)[source]

Bases: gordo.machine.model.anomaly.diff.DiffBasedAnomalyDetector

Estimator which wraps a base_estimator and provides a diff error based approach to anomaly detection.

It trains a scaler to the target after training, purely for error calculations. The underlying base_estimator is trained with the original, unscaled, y.

Threshold calculation is based on a percentile of the smoothed validation errors as calculated from cross-validation predictions.

Parameters
  • base_estimator (sklearn.base.BaseEstimator) – The model whose normal .fit and .predict methods will be used. Defaults to gordo.machine.model.models.KerasAutoEncoder with kind='feedforward_hourglass'.

  • scaler (sklearn.base.TransformerMixin) – Used for transforming the model output and the original y to calculate the difference/error between model output and expected output. Defaults to sklearn.preprocessing.MinMaxScaler, as shown in the signature above.

  • require_thresholds (bool) – Requires calculating thresholds_ via a call to cross_validate(). If this is set (default True) and cross_validate() has not been called before anomaly(), an AttributeError is raised.

  • shuffle (bool) – Whether to shuffle the data in .fit so that the model, if relevant, is trained on a sample of data across the whole time range and not just the last elements, as determined by the model argument validation_split.

  • window (int) – Window size for smoothing metrics and for threshold calculation.

  • smoothing_method (str) – Method to be used together with window to smooth metrics. Must be one of ‘smm’ (simple moving median), ‘sma’ (simple moving average) or ‘ewma’ (exponential weighted moving average).

  • threshold_percentile (float) – Percentile of the validation data to be used to calculate the threshold.
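The combination of window, smoothing_method and threshold_percentile can be sketched with pandas. This is an illustration under assumptions rather than the library's code: the validation errors are invented, a small window is used for readability, and threshold_percentile is treated as a fraction passed to Series.quantile.

```python
import numpy as np
import pandas as pd

# Invented validation errors, e.g. collected from cross-validation predictions.
validation_errors = pd.Series(np.arange(1.0, 11.0))

window = 3                   # small window for readability; the default above is 144
threshold_percentile = 0.99  # same value as the default above

# 'smm' smoothing: simple moving median over the window.
smoothed = validation_errors.rolling(window).median()

# Threshold: the chosen percentile of the smoothed errors (NaN values are skipped).
threshold = smoothed.quantile(threshold_percentile)
```

Smoothing before taking the percentile keeps isolated spikes in the validation errors from inflating the threshold.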

cross_validate(*, X: Union[pandas.core.frame.DataFrame, numpy.ndarray], y: Union[pandas.core.frame.DataFrame, numpy.ndarray], cv=KFold(n_splits=5, random_state=0, shuffle=True), **kwargs)[source]

Runs K-fold cross-validation on the model and updates the model’s threshold values based on a percentile of the validation metrics.

Parameters
  • X (Union[pd.DataFrame, np.ndarray]) – Input data to the model

  • y (Union[pd.DataFrame, np.ndarray]) – Target data

  • kwargs (dict) – Any additional kwargs to be passed to sklearn.model_selection.cross_validate()

Returns

Return type

dict
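The default cv splitter here differs from the time series splitter used by the parent class. The sketch below, on an invented array, shows the fold layout that scikit-learn's KFold produces with the configuration from the signature above.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten invented samples.
X = np.arange(10).reshape(-1, 1)

# Same splitter configuration as the default cv argument above.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Collect the validation indices of each fold.
test_folds = [test.tolist() for _, test in cv.split(X)]

# Flatten to check how the folds cover the data.
all_test_indices = sorted(i for fold in test_folds for i in fold)
```

Every sample lands in exactly one validation fold, so validation metrics are available for the whole data set rather than only for the most recent block.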

get_metadata()[source]

Generates model metadata.

Returns

Return type

dict

get_params(deep=True)[source]

Get parameters for this estimator.

Returns

Return type

dict