Builder

Model builder

class gordo.builder.build_model.ModelBuilder(machine: gordo.machine.machine.Machine)[source]

Bases: object

Build a model for a given gordo.machine.Machine

Parameters

machine (Machine) –

Example

>>> from gordo_dataset.sensor_tag import SensorTag
>>> from gordo.machine import Machine
>>> from gordo.dependencies import configure_once
>>> configure_once()
>>> machine = Machine(
...     name="special-model-name",
...     model={"sklearn.decomposition.PCA": {"svd_solver": "auto"}},
...     dataset={
...         "type": "RandomDataset",
...         "train_start_date": "2017-12-25 06:00:00Z",
...         "train_end_date": "2017-12-30 06:00:00Z",
...         "tag_list": [SensorTag("Tag 1", None), SensorTag("Tag 2", None)],
...         "target_tag_list": [SensorTag("Tag 3", None), SensorTag("Tag 4", None)]
...     },
...     project_name='test-proj',
... )
>>> builder = ModelBuilder(machine=machine)
>>> model, machine = builder.build()

build(output_dir: Union[os.PathLike, str, None] = None, model_register_dir: Union[os.PathLike, str, None] = None, replace_cache=False) → Tuple[sklearn.base.BaseEstimator, gordo.machine.machine.Machine][source]

Always return a model and its metadata.

If output_dir is supplied, the model will be saved there. model_register_dir points to the model cache directory, from which it will attempt to read the model. Supplying both has the combined effect: the model is read from the cache and that cached model is then saved to the new output directory.

Parameters
  • output_dir (Optional[Union[os.PathLike, str]]) – A path to where the model will be deposited.

  • model_register_dir (Optional[Union[os.PathLike, str]]) – A path to a register; see gordo.util.disk_registry. If this is None the model is always built, otherwise the model is resolved from the registry when possible.

  • replace_cache (bool) – Forces a rebuild of the model, and replaces the entry in the cache with the new model.

Returns

Built model and an updated Machine

Return type

Tuple[sklearn.base.BaseEstimator, Machine]
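
As an illustrative sketch (not part of the original docstring), a build that both reads from a cache registry and saves to an output directory could look like the following, reusing the builder from the class-level example above; the temporary directory is purely for demonstration:

>>> import tempfile
>>> from pathlib import Path
>>> with tempfile.TemporaryDirectory() as tmp:  # doctest: +SKIP
...     model, machine = builder.build(
...         output_dir=Path(tmp) / "model-output",
...         model_register_dir=Path(tmp) / "model-registry",
...     )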

static build_metrics_dict(metrics_list: list, y: pandas.core.frame.DataFrame, scaler: Union[sklearn.base.TransformerMixin, str, None] = None) → dict[source]

Given a list of metrics that accept true_y and pred_y as inputs, this returns a dictionary with keys of the form '{score}-{tag_name}' for each given target tag, plus '{score}' for the average score across all target tags and folds; the values are the callables make_scorer(metric_wrapper(score)). Note: score in '{score}-{tag_name}' is the sklearn score function name with '_' replaced by '-', and tag_name is the given target tag name with ' ' replaced by '-'.

Parameters
  • metrics_list (list) – List of sklearn score functions

  • y (pd.DataFrame) – Target data

  • scaler (Optional[Union[TransformerMixin, str]]) – Scaler which will be fitted on y, and used to transform the data before scoring. Useful when the metrics are sensitive to the amplitude of the data, and you have multiple targets.

Returns

Dictionary of scorers keyed by '{score}' and '{score}-{tag_name}' as described above.

Return type

dict
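
A usage sketch (assumed, not from the docstring), scoring two target tags with explained_variance_score and scaling y with a MinMaxScaler; the key names follow the convention described above:

>>> import pandas as pd
>>> from sklearn.metrics import explained_variance_score
>>> from sklearn.preprocessing import MinMaxScaler
>>> y = pd.DataFrame({"Tag 3": [1.0, 2.0, 3.0], "Tag 4": [2.0, 4.0, 6.0]})
>>> scorers = ModelBuilder.build_metrics_dict([explained_variance_score], y, scaler=MinMaxScaler())
>>> sorted(scorers)  # doctest: +SKIP
['explained-variance-score', 'explained-variance-score-Tag-3', 'explained-variance-score-Tag-4']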

static build_split_dict(X: pandas.core.frame.DataFrame, split_obj: Type[sklearn.model_selection._split.BaseCrossValidator]) → dict[source]

Get dictionary of cross-validation training dataset split metadata

Parameters
  • X (pd.DataFrame) – The training dataset that will be split during cross-validation.

  • split_obj (Type[sklearn.model_selection.BaseCrossValidator]) – The cross-validation object that returns train, test indices for splitting.

Returns

split_metadata – Dictionary of cross-validation train/test split metadata

Return type

Dict[str,Any]
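
An assumed sketch; the type hint reads Type[BaseCrossValidator], but the description speaks of a cross-validation object, so an instantiated TimeSeriesSplit is passed here:

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.model_selection import TimeSeriesSplit
>>> X = pd.DataFrame(np.random.random((12, 2)))
>>> split_metadata = ModelBuilder.build_split_dict(X, TimeSeriesSplit(n_splits=3))  # doctest: +SKIP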

property cache_key
property cached_model_path
static calculate_cache_key(machine: gordo.machine.machine.Machine) → str[source]

Calculates a hash-key from the model and data-config.

Returns

A 128-character (512-bit) hex string derived from the content of the parameters.

Return type

str

Examples

>>> from gordo.machine import Machine
>>> from gordo_dataset.sensor_tag import SensorTag
>>> from gordo.dependencies import configure_once
>>> configure_once()
>>> machine = Machine(
...     name="special-model-name",
...     model={"sklearn.decomposition.PCA": {"svd_solver": "auto"}},
...     dataset={
...         "type": "RandomDataset",
...         "train_start_date": "2017-12-25 06:00:00Z",
...         "train_end_date": "2017-12-30 06:00:00Z",
...         "tag_list": [SensorTag("Tag 1", None), SensorTag("Tag 2", None)],
...         "target_tag_list": [SensorTag("Tag 3", None), SensorTag("Tag 4", None)]
...     },
...     project_name='test-proj'
... )
>>> builder = ModelBuilder(machine)
>>> len(builder.cache_key)
128

check_cache(model_register_dir: Union[os.PathLike, str])[source]

Checks if the model is cached, and returns its path if it exists.

Parameters

model_register_dir (Union[os.PathLike, str]) – The register dir where the model lies.

Returns

The path to the cached model, or None if it does not exist.

Return type

Union[os.PathLike, None]
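
For illustration (the registry path below is hypothetical):

>>> cached_path = builder.check_cache("/path/to/model-registry")  # doctest: +SKIP
>>> if cached_path is not None:  # doctest: +SKIP
...     print(f"Reusing cached model at {cached_path}")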

static metrics_from_list(metric_list: Optional[List[str]] = None) → List[Callable][source]

Given a list of metric function paths, e.g. sklearn.metrics.r2_score, or simple function names which are expected to be in the sklearn.metrics module, this will return a list of those loaded functions.

Parameters

metric_list (Optional[List[str]]) – List of function paths to use as metrics for the model. Defaults to those specified in gordo.workflow.config_components.NormalizedConfig: sklearn.metrics.explained_variance_score, sklearn.metrics.r2_score, sklearn.metrics.mean_squared_error and sklearn.metrics.mean_absolute_error.

Returns

A list of the functions loaded

Return type

List[Callable]

Raises

AttributeError – If the function cannot be loaded.
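
A usage sketch, mixing a full function path with a bare name resolved from sklearn.metrics:

>>> scorers = ModelBuilder.metrics_from_list(["sklearn.metrics.r2_score", "mean_absolute_error"])
>>> [fn.__name__ for fn in scorers]
['r2_score', 'mean_absolute_error']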

set_seed(seed: int)[source]

Local Model builder

This is meant to provide a good way to validate a configuration file as well as to enable creating and testing models locally with little overhead.

gordo.builder.local_build.local_build(config_str: str) → Iterable[Tuple[Optional[sklearn.base.BaseEstimator], gordo.machine.machine.Machine]][source]

Build model(s) from a bare Gordo config file locally.

This follows much the same steps as normal workflow generation and the subsequent Gordo deployment process. It should help with developing locally, as well as give a good indication that your config is valid for deployment with Gordo.

Parameters

config_str (str) – The raw yaml config file in string format.

Examples

>>> import numpy as np
>>> from gordo.dependencies import configure_once
>>> configure_once()
>>> config = '''
... machines:
...       - dataset:
...           tags:
...             - SOME-TAG1
...             - SOME-TAG2
...           target_tag_list:
...             - SOME-TAG3
...             - SOME-TAG4
...           train_end_date: '2019-03-01T00:00:00+00:00'
...           train_start_date: '2019-01-01T00:00:00+00:00'
...           asset: asgb
...           data_provider:
...             type: RandomDataProvider
...         metadata:
...           information: Some sweet information about the model
...         model:
...           gordo.machine.model.anomaly.diff.DiffBasedAnomalyDetector:
...             base_estimator:
...               sklearn.pipeline.Pipeline:
...                 steps:
...                 - sklearn.decomposition.PCA
...                 - sklearn.multioutput.MultiOutputRegressor:
...                     estimator: sklearn.linear_model.LinearRegression
...         name: crazy-sweet-name
... '''
>>> models_n_metadata = local_build(config)
>>> assert len(list(models_n_metadata)) == 1

Returns

A generator yielding tuples of models and their metadata.

Return type

Iterable[Tuple[Union[BaseEstimator, None], Machine]]
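
Since a generator is returned, the models and machines can also be consumed one at a time; for example, continuing with the config above:

>>> model, machine = next(iter(local_build(config)))
>>> machine.name
'crazy-sweet-name'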