Builder

Model builder

class gordo.builder.build_model.ModelBuilder(machine: gordo.machine.machine.Machine)[source]

Bases: object

Build a model for a given gordo.machine.Machine

Parameters

machine (Machine) –

Example

>>> from gordo_dataset.sensor_tag import SensorTag
>>> from gordo.machine import Machine
>>> from gordo.dependencies import configure_once
>>> configure_once()
>>> machine = Machine(
...     name="special-model-name",
...     model={"sklearn.decomposition.PCA": {"svd_solver": "auto"}},
...     dataset={
...         "type": "RandomDataset",
...         "train_start_date": "2017-12-25 06:00:00Z",
...         "train_end_date": "2017-12-30 06:00:00Z",
...         "tag_list": [SensorTag("Tag 1", None), SensorTag("Tag 2", None)],
...         "target_tag_list": [SensorTag("Tag 3", None), SensorTag("Tag 4", None)]
...     },
...     project_name='test-proj',
... )
>>> builder = ModelBuilder(machine=machine)
>>> model, machine = builder.build()

build(output_dir: Union[os.PathLike, str, None] = None, model_register_dir: Union[os.PathLike, str, None] = None, replace_cache=False) → Tuple[sklearn.base.BaseEstimator, gordo.machine.machine.Machine][source]

Always return a model and its metadata.

If output_dir is supplied, the model will be saved there. model_register_dir points to the model cache directory, from which it will attempt to read the model. Supplying both has the combined effect: the model is read from the cache and that cached model is then saved to the new output directory.

Parameters
  • output_dir (Optional[Union[os.PathLike, str]]) – A path to where the model will be deposited.

  • model_register_dir (Optional[Union[os.PathLike, str]]) – A path to a register; see gordo.util.disk_registry. If this is None the model is always built, otherwise the model is resolved from the registry when possible.

  • replace_cache (bool) – Forces a rebuild of the model, and replaces the entry in the cache with the new model.

Returns

Built model and an updated Machine

Return type

Tuple[sklearn.base.BaseEstimator, Machine]
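
As an illustrative sketch (not part of the original docstring), a build that both reads from a cache registry and saves to an output directory could look like the following, reusing the builder from the class-level example above; the temporary directory is purely for demonstration:

>>> import tempfile
>>> from pathlib import Path
>>> with tempfile.TemporaryDirectory() as tmp:  # doctest: +SKIP
...     model, machine = builder.build(
...         output_dir=Path(tmp) / "model-output",
...         model_register_dir=Path(tmp) / "model-registry",
...     )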

static build_metrics_dict(metrics_list: list, y: pandas.core.frame.DataFrame, scaler: Union[sklearn.base.TransformerMixin, str, None] = None) → dict[source]

Given a list of metrics that accept true_y and pred_y as inputs, this returns a dictionary with keys of the form '{score}-{tag_name}' for each given target tag, plus '{score}' for the average score across all target tags and folds; the values are the callables make_scorer(metric_wrapper(score)). Note: score in '{score}-{tag_name}' is the sklearn score function name with '_' replaced by '-', and tag_name is the given target tag name with ' ' replaced by '-'.

Parameters
  • metrics_list (list) – List of sklearn score functions

  • y (pd.DataFrame) – Target data

  • scaler (Optional[Union[TransformerMixin, str]]) – Scaler which will be fitted on y, and used to transform the data before scoring. Useful when the metrics are sensitive to the amplitude of the data, and you have multiple targets.

Returns

Dictionary of scorers keyed by '{score}' and '{score}-{tag_name}' as described above.

Return type

dict
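
A usage sketch (assumed, not from the docstring), scoring two target tags with explained_variance_score and scaling y with a MinMaxScaler; the key names follow the convention described above:

>>> import pandas as pd
>>> from sklearn.metrics import explained_variance_score
>>> from sklearn.preprocessing import MinMaxScaler
>>> y = pd.DataFrame({"Tag 3": [1.0, 2.0, 3.0], "Tag 4": [2.0, 4.0, 6.0]})
>>> scorers = ModelBuilder.build_metrics_dict([explained_variance_score], y, scaler=MinMaxScaler())
>>> sorted(scorers)  # doctest: +SKIP
['explained-variance-score', 'explained-variance-score-Tag-3', 'explained-variance-score-Tag-4']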

static build_split_dict(X: pandas.core.frame.DataFrame, split_obj: Type[sklearn.model_selection._split.BaseCrossValidator]) → dict[source]

Get dictionary of cross-validation training dataset split metadata

Parameters
  • X (pd.DataFrame) – The training dataset that will be split during cross-validation.

  • split_obj (Type[sklearn.model_selection.BaseCrossValidator]) – The cross-validation object that returns train, test indices for splitting.

Returns

split_metadata – Dictionary of cross-validation train/test split metadata

Return type

Dict[str,Any]
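
An assumed sketch; the type hint reads Type[BaseCrossValidator], but the description speaks of a cross-validation object, so an instantiated TimeSeriesSplit is passed here:

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.model_selection import TimeSeriesSplit
>>> X = pd.DataFrame(np.random.random((12, 2)))
>>> split_metadata = ModelBuilder.build_split_dict(X, TimeSeriesSplit(n_splits=3))  # doctest: +SKIP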

property cache_key
property cached_model_path
static calculate_cache_key(machine: gordo.machine.machine.Machine) → str[source]

Calculates a hash-key from the model and data-config.

Returns

A 128-character (512-bit) hex string derived from the content of the parameters.

Return type

str

Examples

>>> from gordo.machine import Machine
>>> from gordo_dataset.sensor_tag import SensorTag
>>> from gordo.dependencies import configure_once
>>> configure_once()
>>> machine = Machine(
...     name="special-model-name",
...     model={"sklearn.decomposition.PCA": {"svd_solver": "auto"}},
...     dataset={
...         "type": "RandomDataset",
...         "train_start_date": "2017-12-25 06:00:00Z",
...         "train_end_date": "2017-12-30 06:00:00Z",
...         "tag_list": [SensorTag("Tag 1", None), SensorTag("Tag 2", None)],
...         "target_tag_list": [SensorTag("Tag 3", None), SensorTag("Tag 4", None)]
...     },
...     project_name='test-proj'
... )
>>> builder = ModelBuilder(machine)
>>> len(builder.cache_key)
128

check_cache(model_register_dir: Union[os.PathLike, str])[source]

Checks if the model is cached, and returns its path if it exists.

Parameters

model_register_dir (Union[os.PathLike, str]) – The register dir where the model lies.

Returns

The path to the cached model, or None if it does not exist.

Return type

Union[os.PathLike, None]
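
For illustration (the registry path below is hypothetical):

>>> cached_path = builder.check_cache("/path/to/model-registry")  # doctest: +SKIP
>>> if cached_path is not None:  # doctest: +SKIP
...     print(f"Reusing cached model at {cached_path}")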

static metrics_from_list(metric_list: Optional[List[str]] = None) → List[Callable][source]

Given a list of metric function paths, e.g. sklearn.metrics.r2_score, or simple function names which are expected to be in the sklearn.metrics module, this will return a list of those loaded functions.

Parameters

metric_list (Optional[List[str]]) – List of function paths to use as metrics for the model. Defaults to those specified in gordo.workflow.config_components.NormalizedConfig: sklearn.metrics.explained_variance_score, sklearn.metrics.r2_score, sklearn.metrics.mean_squared_error and sklearn.metrics.mean_absolute_error.

Returns

A list of the functions loaded

Return type

List[Callable]

Raises

AttributeError – If the function cannot be loaded.
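
A usage sketch, mixing a full function path with a bare name resolved from sklearn.metrics:

>>> scorers = ModelBuilder.metrics_from_list(["sklearn.metrics.r2_score", "mean_absolute_error"])
>>> [fn.__name__ for fn in scorers]
['r2_score', 'mean_absolute_error']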

set_seed(seed: int)[source]

Local Model builder

This is meant to provide a good way to validate a configuration file as well as to enable creating and testing models locally with little overhead.

gordo.builder.local_build.local_build(config_str: str) → Iterable[Tuple[Optional[sklearn.base.BaseEstimator], gordo.machine.machine.Machine]][source]

Build model(s) from a bare Gordo config file locally.

This follows much the same steps as normal workflow generation and the subsequent Gordo deployment process. It should help with developing locally, as well as give a good indication that your config is valid for deployment with Gordo.

Parameters

config_str (str) – The raw yaml config file in string format.

Examples

>>> import numpy as np
>>> from gordo.dependencies import configure_once
>>> configure_once()
>>> config = '''
... machines:
...       - dataset:
...           tags:
...             - SOME-TAG1
...             - SOME-TAG2
...           target_tag_list:
...             - SOME-TAG3
...             - SOME-TAG4
...           train_end_date: '2019-03-01T00:00:00+00:00'
...           train_start_date: '2019-01-01T00:00:00+00:00'
...           asset: asgb
...           data_provider:
...             type: RandomDataProvider
...         metadata:
...           information: Some sweet information about the model
...         model:
...           gordo.machine.model.anomaly.diff.DiffBasedAnomalyDetector:
...             base_estimator:
...               sklearn.pipeline.Pipeline:
...                 steps:
...                 - sklearn.decomposition.PCA
...                 - sklearn.multioutput.MultiOutputRegressor:
...                     estimator: sklearn.linear_model.LinearRegression
...         name: crazy-sweet-name
... '''
>>> models_n_metadata = local_build(config)
>>> assert len(list(models_n_metadata)) == 1

Returns

A generator yielding tuples of models and their metadata.

Return type

Iterable[Tuple[Union[BaseEstimator, None], Machine]]
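
Since a generator is returned, the models and machines can also be consumed one at a time; for example, continuing with the config above:

>>> model, machine = next(iter(local_build(config)))
>>> machine.name
'crazy-sweet-name'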