Builder
Model builder
-
class gordo.builder.build_model.ModelBuilder(machine: gordo.machine.machine.Machine)
Bases: object
Build a model for a given gordo.machine.machine.Machine.
- Parameters
machine (Machine) –
Example
>>> from gordo_dataset.sensor_tag import SensorTag
>>> from gordo.machine import Machine
>>> from gordo.builder.build_model import ModelBuilder
>>> from gordo.dependencies import configure_once
>>> configure_once()
>>> machine = Machine(
...     name="special-model-name",
...     model={"sklearn.decomposition.PCA": {"svd_solver": "auto"}},
...     dataset={
...         "type": "RandomDataset",
...         "train_start_date": "2017-12-25 06:00:00Z",
...         "train_end_date": "2017-12-30 06:00:00Z",
...         "tag_list": [SensorTag("Tag 1", None), SensorTag("Tag 2", None)],
...         "target_tag_list": [SensorTag("Tag 3", None), SensorTag("Tag 4", None)]
...     },
...     project_name='test-proj',
... )
>>> builder = ModelBuilder(machine=machine)
>>> model, machine = builder.build()
-
build(output_dir: Union[os.PathLike, str, None] = None, model_register_dir: Union[os.PathLike, str, None] = None, replace_cache=False) → Tuple[sklearn.base.BaseEstimator, gordo.machine.machine.Machine]
Always return a model and its metadata.
If output_dir is supplied, the model is saved there. model_register_dir points to the model cache directory, from which the builder will attempt to read the model. Supplying both has the effect of both: the model is read from the cache and that cached model is saved to the new output directory.
- Parameters
output_dir (Optional[Union[os.PathLike, str]]) – A path to where the model will be deposited.
model_register_dir (Optional[Union[os.PathLike, str]]) – A path to a register, see gordo.util.disk_registry. If this is None, the model is always built; otherwise the builder first tries to resolve the model from the registry.
replace_cache (bool) – Forces a rebuild of the model, and replaces the entry in the cache with the new model.
- Returns
Built model and an updated Machine
- Return type
Tuple[sklearn.base.BaseEstimator, Machine]
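The interaction between output_dir, model_register_dir and replace_cache can be sketched with the standard library alone. This is a minimal illustration of the behaviour described above, not gordo's implementation: build_with_cache, train, the one-file-per-key register layout and the model.pkl file name are all hypothetical stand-ins, and the "model" is just a picklable dict.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Callable, Optional


def build_with_cache(
    cache_key: str,
    train: Callable[[], dict],
    output_dir: Optional[str] = None,
    model_register_dir: Optional[str] = None,
    replace_cache: bool = False,
) -> dict:
    """Always return a model; read from and write to the cache as requested."""
    model = None
    # Hypothetical layout: one pickle file per cache key inside the register.
    entry = Path(model_register_dir) / cache_key if model_register_dir else None

    # Try the cache only when a register is given and no rebuild is forced.
    if entry is not None and not replace_cache and entry.exists():
        model = pickle.loads(entry.read_bytes())

    if model is None:
        model = train()
        if entry is not None:
            # Create the cache entry, or replace the existing one.
            entry.parent.mkdir(parents=True, exist_ok=True)
            entry.write_bytes(pickle.dumps(model))

    if output_dir is not None:
        # Save the (possibly cached) model to the requested location.
        out = Path(output_dir)
        out.mkdir(parents=True, exist_ok=True)
        (out / "model.pkl").write_bytes(pickle.dumps(model))
    return model


calls = []


def train() -> dict:
    """Stand-in for model training; records how often it runs."""
    calls.append(1)
    return {"weights": [0.1, 0.2]}


with tempfile.TemporaryDirectory() as tmp:
    register, out = f"{tmp}/register", f"{tmp}/out"
    build_with_cache("abc", train, model_register_dir=register)  # builds
    build_with_cache("abc", train, out, register)  # cache hit, saved to out
    build_with_cache("abc", train, model_register_dir=register, replace_cache=True)  # rebuilds
    n_builds = len(calls)
    saved = (Path(out) / "model.pkl").exists()
```

In this sketch the second call never trains: it reads the cached entry and copies it to the output directory, which mirrors the "effect of both" case.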
-
static build_metrics_dict(metrics_list: list, y: pandas.core.frame.DataFrame, scaler: Union[sklearn.base.TransformerMixin, str, None] = None) → dict
Given a list of metrics that accept true_y and pred_y as inputs, return a dictionary whose keys have the form '{score}-{tag_name}' for each given target tag, plus '{score}' for the average score across all target tags and folds. Each value is the callable make_scorer(metric_wrapper(score)).
Note: score in '{score}-{tag_name}' is the name of the sklearn score function with '_' replaced by '-', and tag_name is the given target tag name with ' ' replaced by '-'.
- Parameters
metrics_list (list) – List of sklearn score functions
y (pd.DataFrame) – Target data
scaler (Optional[Union[TransformerMixin, str]]) – Scaler which will be fitted on y, and used to transform the data before scoring. Useful when the metrics are sensitive to the amplitude of the data, and you have multiple targets.
- Returns
- Return type
dict
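The key naming rule can be illustrated without scikit-learn. The helper metric_keys below is hypothetical and only produces the dictionary keys; the scorer callables that build_metrics_dict stores as values are left out.

```python
from typing import List


def metric_keys(metric_names: List[str], tag_names: List[str]) -> List[str]:
    """Return the '{score}-{tag_name}' and '{score}' keys for each metric."""
    keys = []
    for metric in metric_names:
        score = metric.replace("_", "-")
        # One key per target tag, with spaces in the tag name replaced by '-'.
        keys.extend(f"{score}-{tag.replace(' ', '-')}" for tag in tag_names)
        # One key for the score averaged across all target tags.
        keys.append(score)
    return keys


keys = metric_keys(["r2_score"], ["Tag 3", "Tag 4"])
print(keys)  # ['r2-score-Tag-3', 'r2-score-Tag-4', 'r2-score']
```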
-
static build_split_dict(X: pandas.core.frame.DataFrame, split_obj: Type[sklearn.model_selection._split.BaseCrossValidator]) → dict
Get a dictionary of cross-validation training dataset split metadata.
- Parameters
X (pd.DataFrame) – The training dataset that will be split during cross-validation.
split_obj (Type[sklearn.model_selection.BaseCrossValidator]) – The cross-validation object that returns train, test indices for splitting.
- Returns
split_metadata – Dictionary of cross-validation train/test split metadata
- Return type
Dict[str,Any]
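As a rough idea of what such split metadata can look like, the sketch below records the time span and size of every fold. Everything in it is an assumption made for illustration: the key names, the split_metadata helper, and the expanding_splits generator, which stands in for a scikit-learn cross-validator so the example runs with the standard library only.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, Iterable, List, Sequence, Tuple

Split = Tuple[Sequence[int], Sequence[int]]


def expanding_splits(n_samples: int, n_splits: int) -> Iterable[Split]:
    """Yield (train, test) index pairs where the training window grows each fold."""
    fold = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        yield range(0, i * fold), range(i * fold, (i + 1) * fold)


def split_metadata(index: List[datetime], splits: Iterable[Split]) -> Dict[str, Any]:
    """Record the time span and sample count of every train/test fold."""
    meta: Dict[str, Any] = {}
    for n, (train, test) in enumerate(splits, start=1):
        meta[f"fold-{n}-train-start"] = index[train[0]]
        meta[f"fold-{n}-train-end"] = index[train[-1]]
        meta[f"fold-{n}-test-start"] = index[test[0]]
        meta[f"fold-{n}-test-end"] = index[test[-1]]
        meta[f"fold-{n}-n-train"] = len(train)
        meta[f"fold-{n}-n-test"] = len(test)
    return meta


# Twelve hourly samples, split into three folds.
index = [datetime(2017, 12, 25, 6) + timedelta(hours=h) for h in range(12)]
meta = split_metadata(index, expanding_splits(len(index), n_splits=3))
```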
-
property cache_key
-
property cached_model_path
-
static calculate_cache_key(machine: gordo.machine.machine.Machine) → str
Calculates a hash key from the model and data config.
- Returns
A 512-bit hex value (128 hexadecimal characters) as a string, based on the content of the parameters.
- Return type
str
Examples
>>> from gordo.machine import Machine
>>> from gordo.builder.build_model import ModelBuilder
>>> from gordo_dataset.sensor_tag import SensorTag
>>> from gordo.dependencies import configure_once
>>> configure_once()
>>> machine = Machine(
...     name="special-model-name",
...     model={"sklearn.decomposition.PCA": {"svd_solver": "auto"}},
...     dataset={
...         "type": "RandomDataset",
...         "train_start_date": "2017-12-25 06:00:00Z",
...         "train_end_date": "2017-12-30 06:00:00Z",
...         "tag_list": [SensorTag("Tag 1", None), SensorTag("Tag 2", None)],
...         "target_tag_list": [SensorTag("Tag 3", None), SensorTag("Tag 4", None)]
...     },
...     project_name='test-proj'
... )
>>> builder = ModelBuilder(machine)
>>> len(builder.cache_key)
128
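The length of 128 in the example is what a 512-bit digest looks like when written as hexadecimal. A minimal sketch of hashing configuration content into such a key follows; the choice of SHA3-512, the JSON serialisation and the config_hash helper are assumptions made for illustration and are not confirmed by this page.

```python
import hashlib
import json


def config_hash(model_config: dict, data_config: dict) -> str:
    """Hash two config dicts into a 128-character hex string (512 bits)."""
    # sort_keys makes the serialisation independent of dict insertion order;
    # default=str covers values json cannot encode natively (e.g. datetimes).
    payload = json.dumps(
        {"model": model_config, "data": data_config}, sort_keys=True, default=str
    )
    return hashlib.sha3_512(payload.encode("utf-8")).hexdigest()


model = {"sklearn.decomposition.PCA": {"svd_solver": "auto"}}
data = {"type": "RandomDataset", "train_start_date": "2017-12-25 06:00:00Z"}

key = config_hash(model, data)
# The same content in a different key order gives the same hash.
same = config_hash(model, dict(reversed(list(data.items()))))
```

Sorting the keys before hashing matters for a cache key: two configs with identical content should resolve to the same cached model regardless of how the dict was built.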
-
check_cache(model_register_dir: Union[os.PathLike, str])
Checks if the model is cached, and returns its path if it exists.
- Parameters
model_register_dir (Union[os.PathLike, str]) – The register dir where the model lies.
cache_key (str) – A 512-bit hex value as a string, based on the content of the parameters.
- Returns
The path to the cached model, or None if it does not exist.
- Return type
Union[os.PathLike, None]
-
static metrics_from_list(metric_list: Optional[List[str]] = None) → List[Callable]
Given a list of metric function paths, e.g. sklearn.metrics.r2_score, or simple function names which are expected to be in the sklearn.metrics module, return a list of those loaded functions.
- Parameters
metric_list (Optional[List[str]]) – List of function paths to use as metrics for the model. Defaults to those specified in gordo.workflow.config_components.NormalizedConfig: sklearn.metrics.explained_variance_score, sklearn.metrics.r2_score, sklearn.metrics.mean_squared_error, sklearn.metrics.mean_absolute_error.
- Returns
A list of the functions loaded
- Return type
List[Callable]
- Raises
AttributeError – If the function cannot be loaded.
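Resolving a mix of dotted paths and bare names can be sketched with importlib. The load_functions helper is hypothetical, and the standard-library math module stands in for sklearn.metrics as the default module so that the example runs without scikit-learn installed.

```python
import importlib
import math
from typing import Callable, List


def load_functions(names: List[str], default_module: str = "math") -> List[Callable]:
    """Resolve dotted paths, or bare names looked up in default_module."""
    loaded = []
    for name in names:
        # "math.sqrt" -> ("math", "sqrt"); "floor" -> ("", "floor")
        module_path, _, func_name = name.rpartition(".")
        module = importlib.import_module(module_path or default_module)
        # getattr raises AttributeError when the module lacks the function.
        loaded.append(getattr(module, func_name))
    return loaded


funcs = load_functions(["math.sqrt", "floor"])
```

A name that does not exist in the target module, such as load_functions(["no_such_function"]), raises AttributeError, matching the Raises entry above.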
Local Model builder
This is meant to provide a good way to validate a configuration file as well as to enable creating and testing models locally with little overhead.
-
gordo.builder.local_build.local_build(config_str: str) → Iterable[Tuple[Optional[sklearn.base.BaseEstimator], gordo.machine.machine.Machine]]
Build model(s) from a bare Gordo config file locally.
This follows nearly the same steps as the normal workflow generation and subsequent Gordo deployment process. It should help with developing locally, and gives a good indication that your config is valid for deployment with Gordo.
- Parameters
config_str (str) – The raw yaml config file in string format.
Examples
>>> import numpy as np
>>> from gordo.builder.local_build import local_build
>>> from gordo.dependencies import configure_once
>>> configure_once()
>>> config = '''
... machines:
...   - dataset:
...       tags:
...         - SOME-TAG1
...         - SOME-TAG2
...       target_tag_list:
...         - SOME-TAG3
...         - SOME-TAG4
...       train_end_date: '2019-03-01T00:00:00+00:00'
...       train_start_date: '2019-01-01T00:00:00+00:00'
...       asset: asgb
...       data_provider:
...         type: RandomDataProvider
...     metadata:
...       information: Some sweet information about the model
...     model:
...       gordo.machine.model.anomaly.diff.DiffBasedAnomalyDetector:
...         base_estimator:
...           sklearn.pipeline.Pipeline:
...             steps:
...               - sklearn.decomposition.PCA
...               - sklearn.multioutput.MultiOutputRegressor:
...                   estimator: sklearn.linear_model.LinearRegression
...     name: crazy-sweet-name
... '''
>>> models_n_metadata = local_build(config)
>>> assert len(list(models_n_metadata)) == 1
- Returns
A generator yielding tuples of models and their metadata.
- Return type
Iterable[Tuple[Union[BaseEstimator, None], Machine]]