Welcome to Gordo's documentation!¶
Overview¶
Gordo is a collection of tools to create a distributed ML service represented by a specific pipeline. Generally, any sklearn.pipeline.Pipeline object can be defined within a config file and deployed as a REST API on Kubernetes.
Quick start¶
The concept of Gordo is (as of now) to process only timeseries datasets, which are comprised of sensor/tag identifiers. The workflow launches the collection of these tags, the building of a defined model, and the subsequent deployment of an ML server which acts as a REST interface in front of the model.
A typical config file might look like this:
apiVersion: equinor.com/v1
kind: Gordo
metadata:
  name: test-project
spec:
  deploy-version: 0.39.0
  config:
    machines:
      # This machine specifies all keys, and will train a model on one month
      # worth of data, as shown in its train_start/end_date dataset keys.
      - name: some-name-here
        dataset:
          train_start_date: 2018-01-01T00:00:00Z
          train_end_date: 2018-02-01T00:00:00Z
          resolution: 2T  # Resample timeseries at 2min intervals (pandas freq strings)
          tags:
            - tag-1
            - tag-2
        model:
          sklearn.pipeline.Pipeline:
            steps:
              - sklearn.preprocessing.MinMaxScaler
              - gordo.model.models.KerasAutoEncoder:
                  kind: feedforward_hourglass
        metadata:
          key1: some-value
      # This machine does NOT specify all keys; it is missing 'model', but will
      # have the 'model' under 'globals' inserted as its default.
      # It will train a model on one month of data as well.
      - name: some-name-here
        dataset:
          train_start_date: 2018-01-01T00:00:00Z
          train_end_date: 2018-02-01T00:00:00Z
          resolution: 2T  # Resample timeseries at 2min intervals (pandas freq strings)
          tags:
            - tag-1
            - tag-2
        metadata:
          key1: some-different-value-if-you-want
          nested-keys-allowed:
            - correct: true
    globals:
      model:
        sklearn.pipeline.Pipeline:
          steps:
            - sklearn.preprocessing.MinMaxScaler
            - gordo.model.models.KerasAutoEncoder:
                kind: feedforward_model
      metadata:
        what-does-this-do: "This metadata will get mapped to every machine's metadata!"
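The 'globals' defaulting described in the comments above can be sketched in a few lines of Python. This is an illustrative approximation, not Gordo's actual config-parsing code; the merge here is a shallow setdefault, whereas the real workflow parser may merge more deeply:

```python
# Sketch: keys under 'globals' act as defaults for machines that omit them.
import yaml

config_str = """
machines:
  - name: fully-specified
    model:
      sklearn.pipeline.Pipeline:
        steps:
          - sklearn.preprocessing.MinMaxScaler
  - name: uses-global-model
globals:
  model:
    gordo.model.models.KerasAutoEncoder:
      kind: feedforward_model
"""

config = yaml.safe_load(config_str)

merged = []
for machine in config["machines"]:
    # Insert any global key the machine itself does not define.
    for key, value in config.get("globals", {}).items():
        machine.setdefault(key, value)
    merged.append(machine)
```

After merging, the first machine keeps its own model, while the second picks up the global KerasAutoEncoder definition.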
One can experiment locally with Gordo through the Jupyter Notebooks provided in the examples directory of the repository.
Architecture¶
Gordo is based on parsing a config file written in yaml that is converted into an Argo workflow. This is deployed with ArgoCD onto a Kubernetes cluster. The main interface after building the models is a set of REST APIs. To illustrate the architecture, we use the C4 approach.
Endpoints¶
Project index page¶
Going to the base path of the project, ie. /gordo/v0/my-project/, will return the project-level index, which returns a collection of the metadata surrounding the models currently deployed and their status. Each endpoint key has an associated endpoint-metadata key, which is the direct transferal of metadata returned from the ML servers at their /metadata/ route. This returns a considerable amount of metadata for each deployed model.
Machine Learning Server Routes¶
When a model is deployed from a config file, it results in a ML server capable of the following paths:
Under normal Equinor deployments, the paths listed below should be prefixed with /gordo/v0/<project-name>/<model-name>. Otherwise, the paths listed below are the raw exposed endpoints from the server's perspective.
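As a small illustration of the prefixing scheme, a client might build the full route like this (model_endpoint is a hypothetical helper for this example, not part of Gordo's API):

```python
# Hypothetical helper illustrating the Equinor-style path prefixing described above.
def model_endpoint(project_name: str, model_name: str, path: str) -> str:
    """Build the full URL path for a model route, e.g. '/prediction/'."""
    return f"/gordo/v0/{project_name}/{model_name}{path}"

url = model_endpoint("my-project", "my-model", "/prediction/")
```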
/¶
This is the Swagger UI for the given model. Allows for manual testing of endpoints via a GUI interface.
/prediction/¶
The /prediction endpoint will return the basic values a model is capable of returning. Namely, this will be:
model-output
    The raw model output, after calling .predict on the model or pipeline, or .transform if the pipeline/model does not have a .predict method.
original-input
    Represents the data supplied to the Pipeline, the raw untransformed values.
Sample response:
{'data': {'end': {'end': {'0': None, '1': None}},
'model-input': {'TAG-1': {'0': 0.7149938815135232,
'1': 0.5804863352453888},
'TAG-2': {'0': 0.724091483437877,
'1': 0.9307866320901698},
'TAG-3': {'0': 0.018676439423681468,
'1': 0.3389969016787632},
'TAG-4': {'0': 0.285813103358881,
'1': 0.12008312306966606}},
'model-output': {'TARGET-TAG-1': {'0': 31.12387466430664,
'1': 31.12371063232422},
'TARGET-TAG-2': {'0': 30.122753143310547,
'1': 30.122438430786133},
'TARGET-TAG-3': {'0': 20.38254737854004,
'1': 20.382972717285156}},
'start': {'start': {'0': None, '1': None}}}}
The endpoint only accepts POST requests, which take raw data:
>>> import requests
>>>
>>> # Single sample:
>>> requests.post("https://my-server.io/prediction", json={"X": [1, 2, 3, 4]})
>>>
>>> # Multiple samples:
>>> requests.post("https://my-server.io/prediction", json={"X": [[1, 2, 3, 4], [5, 6, 7, 8]]})
NOTE: The client must provide the correct number of input features, ie. if the model was trained on 4 features, the client should provide 4 feature sample(s).
You may also supply a dataframe using gordo.server.utils.dataframe_to_dict():
>>> import requests
>>> import pprint
>>> from gordo.server import utils
>>> import pandas as pd
>>> X = pd.DataFrame({"TAG-1": range(4),
... "TAG-2": range(4),
... "TAG-3": range(4),
... "TAG-4": range(4)},
... index=pd.date_range('2019-01-01', '2019-01-02', periods=4)
... )
>>> resp = requests.post("https://my-server.io/gordo/v0/project-name/model-name/prediction",
... json={"X": utils.dataframe_to_dict(X)}
... )
>>> pprint.pprint(resp.json())
{'data': {'end': {'end': {'2019-01-01 00:00:00': None,
'2019-01-01 08:00:00': None,
'2019-01-01 16:00:00': None,
'2019-01-02 00:00:00': None}},
'model-input': {'TAG-1': {'2019-01-01 00:00:00': 0,
'2019-01-01 08:00:00': 1,
'2019-01-01 16:00:00': 2,
'2019-01-02 00:00:00': 3},
'TAG-2': {'2019-01-01 00:00:00': 0,
'2019-01-01 08:00:00': 1,
'2019-01-01 16:00:00': 2,
'2019-01-02 00:00:00': 3},
'TAG-3': {'2019-01-01 00:00:00': 0,
'2019-01-01 08:00:00': 1,
'2019-01-01 16:00:00': 2,
'2019-01-02 00:00:00': 3},
'TAG-4': {'2019-01-01 00:00:00': 0,
'2019-01-01 08:00:00': 1,
'2019-01-01 16:00:00': 2,
'2019-01-02 00:00:00': 3}},
'model-output': {'TARGET-TAG-1': {'2019-01-01 00:00:00': 31.123781204223633,
'2019-01-01 08:00:00': 31.122915267944336,
'2019-01-01 16:00:00': 31.12187385559082,
'2019-01-02 00:00:00': 31.120620727539062},
'TARGET-TAG-2': {'2019-01-01 00:00:00': 30.122575759887695,
'2019-01-01 08:00:00': 30.120899200439453,
'2019-01-01 16:00:00': 30.11887550354004,
'2019-01-02 00:00:00': 30.116445541381836},
'TARGET-TAG-3': {'2019-01-01 00:00:00': 20.382783889770508,
'2019-01-01 08:00:00': 20.385055541992188,
'2019-01-01 16:00:00': 20.38779640197754,
'2019-01-02 00:00:00': 20.391088485717773}},
'start': {'start': {'2019-01-01 00:00:00': '2019-01-01T00:00:00',
'2019-01-01 08:00:00': '2019-01-01T08:00:00',
'2019-01-01 16:00:00': '2019-01-01T16:00:00',
'2019-01-02 00:00:00': '2019-01-02T00:00:00'}}}}
>>> # Alternatively, you can convert the json back into a dataframe with:
>>> df = utils.dataframe_from_dict(resp.json())
Furthermore, you can increase efficiency by instead converting your data to parquet with the following:
>>> resp = requests.post("https://my-server.io/gordo/v0/project-name/model-name/prediction?format=parquet", # <- note the '?format=parquet'
... files={"X": utils.dataframe_into_parquet_bytes(X)}
... )
>>> resp.ok
True
>>> df = utils.dataframe_from_parquet_bytes(resp.content)
/anomaly/prediction/¶
The /anomaly/prediction endpoint will return the data supplied by the /prediction endpoint, but is reserved for models which inherit from gordo.model.anomaly.base.AnomalyDetectorBase. With this restriction, additional features are calculated and returned (depending on the AnomalyDetector model being served). For example, the gordo.model.anomaly.diff.DiffBasedAnomalyDetector will return the following:
tag-anomaly-scaled & tag-anomaly-unscaled
    Anomaly per feature/tag, calculated from the expected tag input (y) and the model's output for those tags (yhat), using scaled and unscaled values.
total-anomaly-scaled & total-anomaly-unscaled
    The total anomaly for the given point as calculated by the model, using scaled and unscaled values.
Sample response:
{'data': {'end': {'end': {'2019-01-01 00:00:00': '2019-01-01T00:10:00',
'2019-01-01 08:00:00': '2019-01-01T08:10:00',
'2019-01-01 16:00:00': '2019-01-01T16:10:00',
'2019-01-02 00:00:00': '2019-01-02T00:10:00'}},
'model-input': {'TAG-1': {'2019-01-01 00:00:00': 0,
'2019-01-01 08:00:00': 1,
'2019-01-01 16:00:00': 2,
'2019-01-02 00:00:00': 3},
'TAG-2': {'2019-01-01 00:00:00': 0,
'2019-01-01 08:00:00': 1,
'2019-01-01 16:00:00': 2,
'2019-01-02 00:00:00': 3},
'TAG-3': {'2019-01-01 00:00:00': 0,
'2019-01-01 08:00:00': 1,
'2019-01-01 16:00:00': 2,
'2019-01-02 00:00:00': 3},
'TAG-4': {'2019-01-01 00:00:00': 0,
'2019-01-01 08:00:00': 1,
'2019-01-01 16:00:00': 2,
'2019-01-02 00:00:00': 3}},
'model-output': {'TARGET-TAG-1': {'2019-01-01 00:00:00': 31.123781204223633,
'2019-01-01 08:00:00': 31.122915267944336,
'2019-01-01 16:00:00': 31.12187385559082,
'2019-01-02 00:00:00': 31.120620727539062},
'TARGET-TAG-2': {'2019-01-01 00:00:00': 30.122575759887695,
'2019-01-01 08:00:00': 30.120899200439453,
'2019-01-01 16:00:00': 30.11887550354004,
'2019-01-02 00:00:00': 30.116445541381836},
'TARGET-TAG-3': {'2019-01-01 00:00:00': 20.382783889770508,
'2019-01-01 08:00:00': 20.385055541992188,
'2019-01-01 16:00:00': 20.38779640197754,
'2019-01-02 00:00:00': 20.391088485717773}},
'start': {'start': {'2019-01-01 00:00:00': '2019-01-01T00:00:00',
'2019-01-01 08:00:00': '2019-01-01T08:00:00',
'2019-01-01 16:00:00': '2019-01-01T16:00:00',
'2019-01-02 00:00:00': '2019-01-02T00:00:00'}},
'tag-anomaly-scaled': {'TARGET-TAG-1': {'2019-01-01 00:00:00': 43.9791088965509,
'2019-01-01 08:00:00': 42.564846544761124,
'2019-01-01 16:00:00': 41.15033623847873,
'2019-01-02 00:00:00': 39.73552676971069},
'TARGET-TAG-2': {'2019-01-01 00:00:00': 42.73147969197182,
'2019-01-01 08:00:00': 41.310514834943056,
'2019-01-01 16:00:00': 39.88905753340811,
'2019-01-02 00:00:00': 38.46702390945659},
'TARGET-TAG-3': {'2019-01-01 00:00:00': 26.2922285259887,
'2019-01-01 08:00:00': 25.005235450434874,
'2019-01-01 16:00:00': 23.71884761692332,
'2019-01-02 00:00:00': 22.43317081979476}},
'total-anomaly-scaled': {'total-anomaly-scaled': {'2019-01-01 00:00:00': 66.71898273252445,
'2019-01-01 08:00:00': 64.37069672792737,
'2019-01-01 16:00:00': 62.024759698996235,
'2019-01-02 00:00:00': 59.68141393388054}}},
'time-seconds': '0.1623'}
This endpoint accepts only POST requests. Requests are exactly the same as for /prediction/, but require a y to compare the anomaly against.
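A request body for /anomaly/prediction can therefore be prepared just like the /prediction examples above, with a y entry added. This sketch uses pandas' own to_dict as a self-contained stand-in for gordo.server.utils.dataframe_to_dict; the URL and tag names are placeholders:

```python
# Sketch: preparing an /anomaly/prediction request body with both X and y.
import pandas as pd

index = pd.date_range("2019-01-01", periods=4, freq="8H")
X = pd.DataFrame({f"TAG-{i}": range(4) for i in range(1, 5)}, index=index)
y = pd.DataFrame({f"TARGET-TAG-{i}": range(4) for i in range(1, 4)}, index=index)

# Stringify the datetime index so the payload is JSON-serializable;
# gordo.server.utils.dataframe_to_dict handles this for you in practice.
payload = {
    "X": X.set_axis(X.index.astype(str)).to_dict(),
    "y": y.set_axis(y.index.astype(str)).to_dict(),
}

# The request itself would then be (not executed here):
# requests.post(
#     "https://my-server.io/gordo/v0/project-name/model-name/anomaly/prediction",
#     json=payload,
# )
```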
/download-model/¶
Returns the current model being served, loadable via gordo.serializer.loads(downloaded_bytes).
/metadata/¶
Various metadata surrounding the current model and environment.
Machine¶
A Machine is the central unit of a model, dataset, metadata and everything needed to create and build an ML model to be served by a deployment. An example of a Machine in the context of a YAML config could be the following:
- name: ct-23-0001
  dataset:
    tags:
      - TAG 1
      - TAG 2
      - TAG 3
    train_start_date: 2016-11-07T09:11:30+01:00
    train_end_date: 2018-09-15T03:01:00+01:00
  metadata:
    arbitrary-key: arbitrary-value
  model:
    gordo.machine.model.anomaly.diff.DiffBasedAnomalyDetector:
      base_estimator:
        sklearn.pipeline.Pipeline:
          steps:
            - sklearn.preprocessing.MinMaxScaler
            - gordo.machine.model.models.KerasAutoEncoder:
                kind: feedforward_hourglass
And to construct this into a python object:
>>> from gordo.machine import Machine
>>> # `config` is the result of the parsed and loaded yaml element above
>>> machine = Machine.from_config(config, project_name='test-proj')
>>> machine.name
'ct-23-0001'
class gordo.machine.machine.Machine(name: str, model: dict, dataset: Union[gordo_dataset.base.GordoBaseDataset, dict], project_name: str, evaluation: Optional[dict] = None, metadata: Union[dict, gordo.machine.metadata.metadata.Metadata, None] = None, runtime=None)
    Bases: object
    Represents a single machine in a config file.
dataset
    Descriptor for attributes requiring type gordo.workflow.config_elements.Dataset
classmethod from_config(config: Dict[str, Any], project_name: str, config_globals=None)
    Construct an instance from a block of YAML config file which represents a single Machine; loaded as a dict.
    Parameters:
        config (dict) – The loaded block of config which represents a 'Machine' in YAML
        project_name (str) – Name of the project this Machine belongs to.
        config_globals – The block of config within the YAML file under globals
classmethod from_dict(d: dict) → gordo.machine.machine.Machine
    Get an instance from a dict taken from to_dict()
host
    Descriptor for use in objects which require valid URL values, where 'valid URL values' is Gordo's version: alphanumeric with dashes.
    Use:
        class MySpecialClass:
            url_attribute = ValidUrlString()

        myspecialclass = MySpecialClass()
        myspecialclass.url_attribute = 'this-is-ok'
        myspecialclass.url_attribute = 'this will r@ise a ValueError'
metadata
    Descriptor for attributes requiring type Optional[dict]
model
    Descriptor for attributes requiring type Union[dict, str]
name
    Descriptor for use in objects which require valid URL values, where 'valid URL values' is Gordo's version: alphanumeric with dashes.
    Use:
        class MySpecialClass:
            url_attribute = ValidUrlString()

        myspecialclass = MySpecialClass()
        myspecialclass.url_attribute = 'this-is-ok'
        myspecialclass.url_attribute = 'this will r@ise a ValueError'
Finding assets for all of the tags according to information from the dataset metadata.
    Parameters:
        tag_list (TagsList) –
    Returns:
    Return type:
        List[SensorTag]
project_name
    Descriptor for use in objects which require valid URL values, where 'valid URL values' is Gordo's version: alphanumeric with dashes.
    Use:
        class MySpecialClass:
            url_attribute = ValidUrlString()

        myspecialclass = MySpecialClass()
        myspecialclass.url_attribute = 'this-is-ok'
        myspecialclass.url_attribute = 'this will r@ise a ValueError'
report()
    Run any reporters in the machine's runtime for the current state.
    Reporters implement gordo.reporters.base.BaseReporter and can be specified in the config file of the machine, for example:
        runtime:
          reporters:
            - gordo.reporters.postgres.PostgresReporter:
                host: my-special-host
runtime
    Descriptor for the runtime dict in a machine object. Must be a valid runtime, and must contain server.resources.limits/requests.memory/cpu to be valid.
to_dict()
    Convert to a dict representation along with all attributes which can also be converted to a dict. Can reload with from_dict().
class gordo.machine.machine.MachineEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)
    Bases: json.encoder.JSONEncoder
    A JSONEncoder for machine objects, handling datetime.datetime objects as strings and any numpy numeric instances, both of which are common in the dict representation of a Machine.
    Example:
        >>> import json
        >>> from datetime import datetime
        >>> from pytz import UTC
        >>> s = json.dumps({"now": datetime.now(tz=UTC)}, cls=MachineEncoder, indent=4)
        >>> # s will look like: '{"now": "2019-11-22 08:34:41.636356+00:00"}'
Constructor for JSONEncoder, with sensible defaults.
If skipkeys is false, then it is a TypeError to attempt encoding of keys that are not str, int, float or None. If skipkeys is True, such items are simply skipped.
If ensure_ascii is true, the output is guaranteed to be str objects with all incoming non-ASCII characters escaped. If ensure_ascii is false, the output can contain non-ASCII characters.
If check_circular is true, then lists, dicts, and custom encoded objects will be checked for circular references during encoding to prevent an infinite recursion (which would cause an OverflowError). Otherwise, no such check takes place.
If allow_nan is true, then NaN, Infinity, and -Infinity will be encoded as such. This behavior is not JSON specification compliant, but is consistent with most JavaScript based encoders and decoders. Otherwise, it will be a ValueError to encode such floats.
If sort_keys is true, then the output of dictionaries will be sorted by key; this is useful for regression tests to ensure that JSON serializations can be compared on a day-to-day basis.
If indent is a non-negative integer, then JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0 will only insert newlines. None is the most compact representation.
If specified, separators should be an (item_separator, key_separator) tuple. The default is (', ', ': ') if indent is None and (',', ': ') otherwise. To get the most compact JSON representation, you should specify (',', ':') to eliminate whitespace.
If specified, default is a function that gets called for objects that can't otherwise be serialized. It should return a JSON encodable version of the object or raise a TypeError.
default(o)
    Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).
    For example, to support arbitrary iterators, you could implement default like this:
        def default(self, o):
            try:
                iterable = iter(o)
            except TypeError:
                pass
            else:
                return list(iterable)
            # Let the base class default method raise the TypeError
            return JSONEncoder.default(self, o)
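The datetime handling described for MachineEncoder can be sketched with a plain json.JSONEncoder subclass. This is a simplified stand-in for illustration only; it omits the numpy-numeric handling the real class provides:

```python
# Sketch: serializing datetimes the way MachineEncoder is described to.
import json
from datetime import datetime, timezone

class DatetimeEncoder(json.JSONEncoder):
    def default(self, o):
        # Serialize datetimes as strings; defer everything else to the base
        # class, which raises TypeError for unknown types.
        if isinstance(o, datetime):
            return str(o)
        return super().default(o)

s = json.dumps(
    {"now": datetime(2019, 11, 22, 8, 34, 41, tzinfo=timezone.utc)},
    cls=DatetimeEncoder,
)
```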
Descriptors¶
Collection of descriptors to verify types and conditions of the Machine attributes when loading. An example of which is if the machine name is set to a value which isn't a valid URL string, causing early failure before k8s itself discovers that the name isn't valid. (See gordo.machine.validators.ValidUrlString)
class gordo.machine.validators.BaseDescriptor
    Bases: object
    Base descriptor class.
    New objects should override __set__(self, instance, value) to check if 'value' meets required needs.
class gordo.machine.validators.ValidDataProvider
    Bases: gordo.machine.validators.BaseDescriptor
    Descriptor for DataProvider
class gordo.machine.validators.ValidDataset
    Bases: gordo.machine.validators.BaseDescriptor
    Descriptor for attributes requiring type gordo.workflow.config_elements.Dataset
class gordo.machine.validators.ValidDatasetKwargs
    Bases: gordo.machine.validators.BaseDescriptor
    Descriptor for attributes requiring type gordo.workflow.config_elements.Dataset
class gordo.machine.validators.ValidDatetime
    Bases: gordo.machine.validators.BaseDescriptor
    Descriptor for attributes requiring a valid datetime.datetime attribute
class gordo.machine.validators.ValidMachineRuntime
    Bases: gordo.machine.validators.BaseDescriptor
    Descriptor for the runtime dict in a machine object. Must be a valid runtime, and must contain server.resources.limits/requests.memory/cpu to be valid.
class gordo.machine.validators.ValidMetadata
    Bases: gordo.machine.validators.BaseDescriptor
    Descriptor for attributes requiring type Optional[dict]
class gordo.machine.validators.ValidModel
    Bases: gordo.machine.validators.BaseDescriptor
    Descriptor for attributes requiring type Union[dict, str]
class gordo.machine.validators.ValidTagList
    Bases: gordo.machine.validators.BaseDescriptor
    Descriptor for attributes requiring a non-empty list of strings
class gordo.machine.validators.ValidUrlString
    Bases: gordo.machine.validators.BaseDescriptor
    Descriptor for use in objects which require valid URL values, where 'valid URL values' is Gordo's version: alphanumeric with dashes.
    Use:
        class MySpecialClass:
            url_attribute = ValidUrlString()

        myspecialclass = MySpecialClass()
        myspecialclass.url_attribute = 'this-is-ok'
        myspecialclass.url_attribute = 'this will r@ise a ValueError'
gordo.machine.validators.fix_resource_limits(resources: dict) → dict
    Resource limits must be higher than or equal to resource requests, if both are specified. This bumps any limit to the corresponding request if both are set.
    Parameters:
        resources (dict) – Dictionary with possible requests/limits
    Examples:
        >>> fix_resource_limits({"requests": {"cpu": 10}, "limits": {"cpu": 9}})
        {'requests': {'cpu': 10}, 'limits': {'cpu': 10}}
        >>> fix_resource_limits({"requests": {"cpu": 10}})
        {'requests': {'cpu': 10}}
    Returns:
        A copy of resources with any limits bumped to the corresponding request if both are set.
    Return type:
        dict
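The bumping rule that fix_resource_limits implements can be re-expressed as a short sketch (a re-implementation for illustration, not the real function):

```python
# Sketch of the limit-bumping rule: a limit lower than its corresponding
# request is raised to match the request; everything else is left alone.
import copy

def bump_limits(resources: dict) -> dict:
    out = copy.deepcopy(resources)  # return a copy, as the real function does
    requested = out.get("requests", {})
    limits = out.get("limits", {})
    for key, request_value in requested.items():
        if key in limits and limits[key] < request_value:
            limits[key] = request_value
    return out

result = bump_limits({"requests": {"cpu": 10}, "limits": {"cpu": 9}})
```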
Models¶
Models are a collection of Scikit-Learn-like models, built specifically to fulfill a need; one example is the KerasAutoEncoder. Other scikit-learn compliant models can be used within the config files without any additional configuration.
Base Model¶
The base model is designed to be inherited by any other models which need to be implemented within Gordo due to special model requirements, e.g. PyTorch, Keras, etc.
Custom Gordo models¶
These models are already implemented and ready to be used within config files by simply specifying their full path, for example: gordo.machine.model.models.KerasAutoEncoder
class gordo.machine.model.models.KerasAutoEncoder(kind: Union[str, Callable[[int, Dict[str, Any]], tensorflow.keras.models.Model]], **kwargs)
    Bases: gordo.machine.model.models.KerasBaseEstimator, sklearn.base.TransformerMixin
    Subclass of the KerasBaseEstimator to allow fitting to just X, without requiring y.
    Initializes a Scikit-Learn API compatible Keras model with a pre-registered function, or with a builder function passed directly.
    Parameters:
        kind (Union[callable, str]) – The structure of the model to build, as designated by any registered builder functions (registered with gordo.machine.model.register.register_model_builder). Alternatively, one may pass a builder function directly to this argument. Such a function should accept n_features as its first argument, and pass any additional parameters to **kwargs.
        kwargs (dict) – Any additional args which are passed to the factory building function and/or any additional args to be passed to Keras' fit() method.
score(X: Union[numpy.ndarray, pandas.core.frame.DataFrame], y: Union[numpy.ndarray, pandas.core.frame.DataFrame], sample_weight: Optional[numpy.ndarray] = None) → float
    Returns the explained variance score between the auto encoder's input vs output.
    Parameters:
        X (Union[np.ndarray, pd.DataFrame]) – Input data to the model
        y (Union[np.ndarray, pd.DataFrame]) – Target
        sample_weight (Optional[np.ndarray]) – Sample weights
    Returns:
        score – Returns the explained variance score
    Return type:
        float
class gordo.machine.model.models.KerasBaseEstimator(kind: Union[str, Callable[[int, Dict[str, Any]], tensorflow.keras.models.Model]], **kwargs)
    Bases: tensorflow.keras.wrappers.scikit_learn.KerasRegressor, gordo.machine.model.base.GordoBase, sklearn.base.BaseEstimator
    Initializes a Scikit-Learn API compatible Keras model with a pre-registered function, or with a builder function passed directly.
    Parameters:
        kind (Union[callable, str]) – The structure of the model to build, as designated by any registered builder functions (registered with gordo.machine.model.register.register_model_builder). Alternatively, one may pass a builder function directly to this argument. Such a function should accept n_features as its first argument, and pass any additional parameters to **kwargs.
        kwargs (dict) – Any additional args which are passed to the factory building function and/or any additional args to be passed to Keras' fit() method.
classmethod extract_supported_fit_args(kwargs)
    Filter out only the fit-related kwargs.
    Parameters:
        kwargs (dict) –
fit(X: Union[numpy.ndarray, pandas.core.frame.DataFrame, xarray.core.dataarray.DataArray], y: Union[numpy.ndarray, pandas.core.frame.DataFrame, xarray.core.dataarray.DataArray], **kwargs)
    Fit the model to X given y.
    Parameters:
        X (Union[np.ndarray, pd.DataFrame, xr.Dataset]) – numpy array or pandas dataframe
        y (Union[np.ndarray, pd.DataFrame, xr.Dataset]) – numpy array or pandas dataframe
        sample_weight (np.ndarray) – array-like weight to assign to samples
        kwargs – Any additional kwargs to supply to the keras fit method.
    Returns:
        'KerasAutoEncoder'
    Return type:
        self
classmethod from_definition(definition: dict)
    Handler for gordo.serializer.from_definition
    Parameters:
        definition (dict) –
get_metadata()
    Get metadata for the KerasBaseEstimator. Includes a dictionary with the key "history", whose value is a dictionary with a key "params" pointing to another dictionary of various parameters. The metrics are defined in the params dictionary under "metrics"; for each metric there is a key whose value is a list of values for that metric per epoch.
    Returns:
        Metadata dictionary, including a history object if present
    Return type:
        Dict
static get_n_features(X: Union[numpy.ndarray, pandas.core.frame.DataFrame, xarray.core.dataarray.DataArray]) → Union[int, tuple]
static get_n_features_out(y: Union[numpy.ndarray, pandas.core.frame.DataFrame, xarray.core.dataarray.DataArray]) → Union[int, tuple]
get_params(**params)
    Gets the parameters for this estimator.
    Parameters:
        params – ignored (exists for API compatibility).
    Returns:
        Parameters used in this estimator
    Return type:
        Dict[str, Any]
into_definition() → dict
    Handler for gordo.serializer.into_definition
    Returns:
    Return type:
        dict
predict(X: numpy.ndarray, **kwargs) → numpy.ndarray
    Parameters:
        X (np.ndarray) – Input data
        kwargs (dict) – kwargs which are passed to Keras' predict method
    Returns:
        results
    Return type:
        np.ndarray
property sk_params
    Parameters used for scikit-learn kwargs
supported_fit_args = ['batch_size', 'epochs', 'verbose', 'callbacks', 'validation_split', 'shuffle', 'class_weight', 'initial_epoch', 'steps_per_epoch', 'validation_batch_size', 'max_queue_size', 'workers', 'use_multiprocessing']
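extract_supported_fit_args is described as filtering fit-related kwargs against the supported_fit_args whitelist above. The idea reduces to a pair of dict comprehensions; this is a sketch with a truncated whitelist, not the actual classmethod:

```python
# Sketch: split kwargs into fit-related and everything else, by whitelist.
SUPPORTED_FIT_ARGS = ["batch_size", "epochs", "verbose", "callbacks"]  # truncated for brevity

def split_fit_kwargs(kwargs: dict) -> tuple:
    fit_kwargs = {k: v for k, v in kwargs.items() if k in SUPPORTED_FIT_ARGS}
    other_kwargs = {k: v for k, v in kwargs.items() if k not in SUPPORTED_FIT_ARGS}
    return fit_kwargs, other_kwargs

fit_kwargs, other = split_fit_kwargs({"epochs": 10, "kind": "feedforward_hourglass"})
```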
class gordo.machine.model.models.KerasLSTMAutoEncoder(kind: Union[Callable, str], lookback_window: int = 1, batch_size: int = 32, **kwargs)
    Bases: gordo.machine.model.models.KerasLSTMBaseEstimator
    Parameters:
        kind (Union[Callable, str]) – The structure of the model to build, as designated by any registered builder functions (registered with gordo.machine.model.register.register_model_builder). Alternatively, one may pass a builder function directly to this argument. Such a function should accept n_features as its first argument, and pass any additional parameters to **kwargs.
        lookback_window (int) – Number of timestamps (lags) used to train the model.
        batch_size (int) – Number of training examples used in one epoch.
        epochs (int) – Number of epochs to train the model. An epoch is an iteration over the entire data provided.
        verbose (int) – Verbosity mode. Possible values are 0, 1, or 2, where 0 = silent, 1 = progress bar, 2 = one line per epoch.
        kwargs (dict) – Any arguments which are passed to the factory building function and/or any additional args to be passed to the intermediate fit method.
    property lookahead
        Steps ahead in y the model should target
class gordo.machine.model.models.KerasLSTMBaseEstimator(kind: Union[Callable, str], lookback_window: int = 1, batch_size: int = 32, **kwargs)
    Bases: gordo.machine.model.models.KerasBaseEstimator, sklearn.base.TransformerMixin
    Abstract base class allowing training of a many-to-one LSTM autoencoder and an LSTM 1-step forecast.
    Parameters:
        kind (Union[Callable, str]) – The structure of the model to build, as designated by any registered builder functions (registered with gordo.machine.model.register.register_model_builder). Alternatively, one may pass a builder function directly to this argument. Such a function should accept n_features as its first argument, and pass any additional parameters to **kwargs.
        lookback_window (int) – Number of timestamps (lags) used to train the model.
        batch_size (int) – Number of training examples used in one epoch.
        epochs (int) – Number of epochs to train the model. An epoch is an iteration over the entire data provided.
        verbose (int) – Verbosity mode. Possible values are 0, 1, or 2, where 0 = silent, 1 = progress bar, 2 = one line per epoch.
        kwargs (dict) – Any arguments which are passed to the factory building function and/or any additional args to be passed to the intermediate fit method.
fit(X: numpy.ndarray, y: numpy.ndarray, **kwargs) → gordo.machine.model.models.KerasLSTMForecast
    This fits a one-step-forecast LSTM architecture.
    Parameters:
        X (np.ndarray) – 2D numpy array of dimension n_samples x n_features. Input data to train.
        y (np.ndarray) – 2D numpy array representing the target
        kwargs (dict) – Any additional args to be passed to the Keras fit_generator method.
    Returns:
        KerasLSTMForecast
    Return type:
        class
get_metadata()
    Add number of forecast steps to metadata.
    Returns:
        metadata – Metadata dictionary, including forecast steps.
    Return type:
        dict
abstract property lookahead
    Steps ahead in y the model should target
predict(X: numpy.ndarray, **kwargs) → numpy.ndarray
    Parameters:
        X (np.ndarray) – Data to predict/transform. 2D numpy array of dimension n_samples x n_features, where n_samples must be > lookback_window.
    Returns:
        results – 2D numpy array of dimension (n_samples - lookback_window) x 2*n_features. The first half of the array (results[:, :n_features]) corresponds to X offset by lookback_window+1 (i.e., X[lookback_window:, :]) whereas the second half corresponds to the predicted values of X[lookback_window:, :].
    Return type:
        np.ndarray
    Example:
        >>> import numpy as np
        >>> from gordo.machine.model.factories.lstm_autoencoder import lstm_model
        >>> from gordo.machine.model.models import KerasLSTMForecast
        >>> # Define train/test data
        >>> X_train = np.array([[1, 1], [2, 3], [0.5, 0.6], [0.3, 1], [0.6, 0.7]])
        >>> X_test = np.array([[2, 3], [1, 1], [0.1, 1], [0.5, 2]])
        >>> # Initiate model, fit and transform
        >>> lstm_ae = KerasLSTMForecast(kind="lstm_model",
        ...                             lookback_window=2,
        ...                             verbose=0)
        >>> model_fit = lstm_ae.fit(X_train, y=X_train.copy())
        >>> model_transform = lstm_ae.predict(X_test)
        >>> model_transform.shape
        (2, 2)
score(X: Union[numpy.ndarray, pandas.core.frame.DataFrame], y: Union[numpy.ndarray, pandas.core.frame.DataFrame], sample_weight: Optional[numpy.ndarray] = None) → float
    Returns the explained variance score between the 1-step forecasted input and the true input at the next time step (note: for LSTM, X is offset by lookback_window).
    Parameters:
        X (Union[np.ndarray, pd.DataFrame]) – Input data to the model.
        y (Union[np.ndarray, pd.DataFrame]) – Target
        sample_weight (Optional[np.ndarray]) – Sample weights
    Returns:
        score – Returns the explained variance score.
    Return type:
        float
class gordo.machine.model.models.KerasLSTMForecast(kind: Union[Callable, str], lookback_window: int = 1, batch_size: int = 32, **kwargs)
    Bases: gordo.machine.model.models.KerasLSTMBaseEstimator
    Parameters:
        kind (Union[Callable, str]) – The structure of the model to build, as designated by any registered builder functions (registered with gordo.machine.model.register.register_model_builder). Alternatively, one may pass a builder function directly to this argument. Such a function should accept n_features as its first argument, and pass any additional parameters to **kwargs.
        lookback_window (int) – Number of timestamps (lags) used to train the model.
        batch_size (int) – Number of training examples used in one epoch.
        epochs (int) – Number of epochs to train the model. An epoch is an iteration over the entire data provided.
        verbose (int) – Verbosity mode. Possible values are 0, 1, or 2, where 0 = silent, 1 = progress bar, 2 = one line per epoch.
        kwargs (dict) – Any arguments which are passed to the factory building function and/or any additional args to be passed to the intermediate fit method.
    property lookahead
        Steps ahead in y the model should target
-
class
gordo.machine.model.models.
KerasRawModelRegressor
(kind: Union[str, Callable[[int, Dict[str, Any]], tensorflow.keras.models.Model]], **kwargs)[source]¶ Bases:
gordo.machine.model.models.KerasAutoEncoder
Create a scikit-learn-like model with an underlying tensorflow.keras model from a raw config.
Examples
>>> import yaml
>>> import numpy as np
>>> config_str = '''
... # Arguments to the .compile() method
... compile:
...     loss: mse
...     optimizer: adam
...
... # The architecture of the model itself.
... spec:
...     tensorflow.keras.models.Sequential:
...         layers:
...             - tensorflow.keras.layers.Dense:
...                 units: 4
...             - tensorflow.keras.layers.Dense:
...                 units: 1
... '''
>>> config = yaml.safe_load(config_str)
>>> model = KerasRawModelRegressor(kind=config)
>>> X, y = np.random.random((10, 4)), np.random.random((10, 1))
>>> model.fit(X, y, verbose=0)
KerasRawModelRegressor(kind: {'compile': {'loss': 'mse', 'optimizer': 'adam'}, 'spec': {'tensorflow.keras.models.Sequential': {'layers': [{'tensorflow.keras.layers.Dense': {'units': 4}}, {'tensorflow.keras.layers.Dense': {'units': 1}}]}}})
>>> out = model.predict(X)
Initializes a scikit-learn API compatible Keras model with a pre-registered function or a builder function passed directly.
- Parameters
kind (Union[callable, str]) – The structure of the model to build, as designated by any builder functions registered with gordo.machine.model.register.register_model_builder. Alternatively, one may pass a builder function directly to this argument. Such a function should accept n_features as its first argument, and pass any additional parameters to **kwargs.
kwargs (dict) – Any additional args which are passed to the factory building function and/or any additional args to be passed to Keras’ fit() method
-
gordo.machine.model.models.
create_keras_timeseriesgenerator
(X: numpy.ndarray, y: Optional[numpy.ndarray], batch_size: int, lookback_window: int, lookahead: int) → tensorflow.keras.preprocessing.sequence.TimeseriesGenerator[source]¶ Provides a keras.preprocessing.sequence.TimeseriesGenerator for use with LSTMs, but with the added ability to specify the lookahead of the target in y.
If lookahead == 0 then each generated sample in X will have as its last element the value corresponding to its Y. If lookahead is 1 then the values in Y are shifted one step into the future relative to the last value in each sample of X, and similarly for larger values.
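The windowing described above can be sketched in plain NumPy. This is only an illustration of the pairing logic, not gordo's implementation; make_pairs is a hypothetical helper introduced for the example:

```python
import numpy as np

# Each sample is a window of `lookback_window` rows of X; its target is
# the y value `lookahead` steps after the window's last row.
def make_pairs(X, y, lookback_window, lookahead):
    samples, targets = [], []
    for end in range(lookback_window, len(X) - lookahead + 1):
        samples.append(X[end - lookback_window:end])
        targets.append(y[end - 1 + lookahead])
    return np.array(samples), np.array(targets)

X = np.arange(10).reshape(10, 1)
y = np.arange(10)

# lookahead=0: the target equals the value at the window's last row
s0, t0 = make_pairs(X, y, lookback_window=3, lookahead=0)
# lookahead=1: the target is one step past the window
s1, t1 = make_pairs(X, y, lookback_window=3, lookahead=1)
```

With the first window covering rows 0..2, lookahead=0 pairs it with y[2] while lookahead=1 pairs it with y[3], matching the shift described above.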
- Parameters
X (np.ndarray) – 2d array of values, each row being one sample.
y (Optional[np.ndarray]) – array representing the target.
batch_size (int) – How big should the generated batches be?
lookback_window (int) – How far back each sample sees; 1 means each sample contains a single measurement.
lookahead (int) – How much is Y shifted relative to X
- Returns
3d matrix with a list of batchX/batchY pairs, where batchX is a batch of X-values and batchY the corresponding batch of y-values. A batch consists of batch_size pairs of samples (or y-values), and each sample is a list of length lookback_window.
- Return type
TimeseriesGenerator
Examples
>>> import numpy as np
>>> X, y = np.random.rand(100, 2), np.random.rand(100, 2)
>>> gen = create_keras_timeseriesgenerator(X, y,
...                                        batch_size=10,
...                                        lookback_window=20,
...                                        lookahead=0)
>>> len(gen)  # 9 = (100-20+1)/10
9
>>> len(gen[0])  # batchX and batchY
2
>>> len(gen[0][0])  # batch_size=10
10
>>> len(gen[0][0][0])  # a single sample, lookback_window=20
20
>>> len(gen[0][0][0][0])  # n_features=2
2
Model factories¶
Model factories are standalone functions which take an arbitrary number of primitive parameters (int, float, list, dict, etc.) and return a model which can then be used as the kind parameter of some Scikit-Learn-like wrapper model.
An example of this is KerasAutoEncoder, which accepts a kind argument (as all custom gordo models do) and can be given feedforward_model, meaning that function will be used to create the underlying Keras model for KerasAutoEncoder.
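As a rough sketch of this pattern (hypothetical names, much simplified compared to gordo's actual registry and Keras-based factories), a factory can be registered under its name and later resolved from a kind string by a wrapper model:

```python
# Minimal registry: builders are stored under their function name, and a
# wrapper model resolves its `kind` argument to a builder at fit time.
_BUILDERS = {}

def register_model_builder(func):
    """Register a factory function under its name."""
    _BUILDERS[func.__name__] = func
    return func

@register_model_builder
def tiny_model(n_features, hidden=4, **kwargs):
    # Stand-in for a real Keras-building factory: returns a description
    # instead of an actual model, to keep the sketch self-contained.
    return {"inputs": n_features, "hidden": hidden, "outputs": n_features}

class WrapperModel:
    def __init__(self, kind, **kwargs):
        # `kind` may be a registered name or a builder function directly
        self.builder = _BUILDERS[kind] if isinstance(kind, str) else kind
        self.kwargs = kwargs

    def fit(self, X):
        n_features = len(X[0])
        self.model = self.builder(n_features, **self.kwargs)
        return self

model = WrapperModel(kind="tiny_model", hidden=8).fit([[1.0, 2.0, 3.0]])
direct = WrapperModel(kind=tiny_model).fit([[1.0, 2.0]])
```

The point of the indirection is that the config file only needs to carry a string and primitive kwargs, while the wrapper defers building until the number of input features is known.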
-
gordo.machine.model.factories.feedforward_autoencoder.
feedforward_hourglass
(n_features: int, n_features_out: int = None, encoding_layers: int = 3, compression_factor: float = 0.5, func: str = 'tanh', optimizer: Union[str, tensorflow.keras.optimizers.Optimizer] = 'Adam', optimizer_kwargs: Dict[str, Any] = {}, compile_kwargs: Dict[str, Any] = {}, **kwargs) → tensorflow.keras.models.Sequential[source]¶ Builds an hourglass shaped neural network, with decreasing number of neurons as one gets deeper into the encoder network and increasing number of neurons as one gets out of the decoder network.
- Parameters
n_features (int) – Number of input and output neurons.
n_features_out (Optional[int]) – Number of features the model will output, defaults to n_features.
encoding_layers (int) – Number of layers from the input layer (exclusive) to the narrowest layer (inclusive). Must be > 0. The total number of layers including input and output layer will be 2*encoding_layers + 1.
compression_factor (float) – How small the smallest layer is as a ratio of n_features (smallest layer is rounded up to nearest integer). Must satisfy 0 <= compression_factor <= 1.
func (str) – Activation function for the internal layers
optimizer (Union[str, Optimizer]) – If a str, the name of the optimizer must be provided (e.g. "Adam"); the arguments of the optimizer can then be supplied in optimizer_kwargs. If a Keras optimizer, pass an instance of the respective class (e.g. Adam(lr=0.01, beta_1=0.9, beta_2=0.999)). If no arguments are provided, Keras default values will be used.
optimizer_kwargs (Dict[str, Any]) – The arguments for the chosen optimizer. If not provided, Keras' default values will be used.
compile_kwargs (Dict[str, Any]) – Parameters to pass to keras.Model.compile.
Notes
The resulting model will look like this when n_features = 10, encoding_layers = 3, and compression_factor = 0.3:

    * * * * * * * * * *
      * * * * * * * *
         * * * * *
           * * *
         * * * * *
      * * * * * * * *
    * * * * * * * * * *
- Returns
- Return type
keras.models.Sequential
Examples
>>> model = feedforward_hourglass(10)
>>> len(model.layers)
7
>>> [model.layers[i].units for i in range(len(model.layers))]
[8, 7, 5, 5, 7, 8, 10]
>>> model = feedforward_hourglass(5)
>>> [model.layers[i].units for i in range(len(model.layers))]
[4, 4, 3, 3, 4, 4, 5]
>>> model = feedforward_hourglass(10, compression_factor=0.2)
>>> [model.layers[i].units for i in range(len(model.layers))]
[7, 5, 2, 2, 5, 7, 10]
>>> model = feedforward_hourglass(10, encoding_layers=1)
>>> [model.layers[i].units for i in range(len(model.layers))]
[5, 5, 10]
-
gordo.machine.model.factories.feedforward_autoencoder.
feedforward_model
(n_features: int, n_features_out: int = None, encoding_dim: Tuple[int, ...] = (256, 128, 64), encoding_func: Tuple[str, ...] = ('tanh', 'tanh', 'tanh'), decoding_dim: Tuple[int, ...] = (64, 128, 256), decoding_func: Tuple[str, ...] = ('tanh', 'tanh', 'tanh'), out_func: str = 'linear', optimizer: Union[str, tensorflow.keras.optimizers.Optimizer] = 'Adam', optimizer_kwargs: Dict[str, Any] = {}, compile_kwargs: Dict[str, Any] = {}, **kwargs) → tensorflow.keras.models.Sequential[source]¶ Builds a customized keras neural network auto-encoder based on a config dict
- Parameters
n_features (int) – Number of features the dataset X will contain.
n_features_out (Optional[int]) – Number of features the model will output, defaults to n_features.
encoding_dim (tuple) – Tuple of numbers with the number of neurons in the encoding part.
decoding_dim (tuple) – Tuple of numbers with the number of neurons in the decoding part.
encoding_func (tuple) – Activation functions for the encoder part.
decoding_func (tuple) – Activation functions for the decoder part.
out_func (str) – Activation function for the output layer
optimizer (Union[str, Optimizer]) – If a str, the name of the optimizer must be provided (e.g. "Adam"); the arguments of the optimizer can then be supplied in optimizer_kwargs. If a Keras optimizer, pass an instance of the respective class (e.g. Adam(lr=0.01, beta_1=0.9, beta_2=0.999)). If no arguments are provided, Keras default values will be used.
optimizer_kwargs (Dict[str, Any]) – The arguments for the chosen optimizer. If not provided, Keras' default values will be used.
compile_kwargs (Dict[str, Any]) – Parameters to pass to keras.Model.compile.
- Returns
- Return type
keras.models.Sequential
-
gordo.machine.model.factories.feedforward_autoencoder.
feedforward_symmetric
(n_features: int, n_features_out: int = None, dims: Tuple[int, ...] = (256, 128, 64), funcs: Tuple[str, ...] = ('tanh', 'tanh', 'tanh'), optimizer: Union[str, tensorflow.keras.optimizers.Optimizer] = 'Adam', optimizer_kwargs: Dict[str, Any] = {}, compile_kwargs: Dict[str, Any] = {}, **kwargs) → tensorflow.keras.models.Sequential[source]¶ Builds a symmetrical feedforward model
- Parameters
n_features (int) – Number of input and output neurons.
n_features_out (Optional[int]) – Number of features the model will output, defaults to n_features.
dims (Tuple[int, ...]) – Number of neurons per layer for the encoder, reversed for the decoder. Must have len > 0.
funcs (List[str]) – Activation functions for the internal layers
optimizer (Union[str, Optimizer]) – If a str, the name of the optimizer must be provided (e.g. "Adam"); the arguments of the optimizer can then be supplied in optimizer_kwargs. If a Keras optimizer, pass an instance of the respective class (e.g. Adam(lr=0.01, beta_1=0.9, beta_2=0.999)). If no arguments are provided, Keras default values will be used.
optimizer_kwargs (Dict[str, Any]) – The arguments for the chosen optimizer. If not provided, Keras' default values will be used.
compile_kwargs (Dict[str, Any]) – Parameters to pass to keras.Model.compile.
- Returns
- Return type
keras.models.Sequential
-
gordo.machine.model.factories.lstm_autoencoder.
lstm_hourglass
(n_features: int, n_features_out: int = None, lookback_window: int = 1, encoding_layers: int = 3, compression_factor: float = 0.5, func: str = 'tanh', out_func: str = 'linear', optimizer: Union[str, tensorflow.keras.optimizers.Optimizer] = 'Adam', optimizer_kwargs: Dict[str, Any] = {}, compile_kwargs: Dict[str, Any] = {}, **kwargs) → tensorflow.keras.models.Sequential[source]¶ Builds an hourglass shaped neural network, with decreasing number of neurons as one gets deeper into the encoder network and increasing number of neurons as one gets out of the decoder network.
- Parameters
n_features (int) – Number of input and output neurons.
n_features_out (Optional[int]) – Number of features the model will output, defaults to n_features.
encoding_layers (int) – Number of layers from the input layer (exclusive) to the narrowest layer (inclusive). Must be > 0. The total number of layers including input and output layer will be 2*encoding_layers + 1.
compression_factor (float) – How small the smallest layer is as a ratio of n_features (smallest layer is rounded up to nearest integer). Must satisfy 0 <= compression_factor <= 1.
func (str) – Activation function for the internal layers.
out_func (str) – Activation function for the output Dense layer.
optimizer (Union[str, Optimizer]) – If a str, the name of the optimizer must be provided (e.g. "Adam"); the arguments of the optimizer can then be supplied in optimizer_kwargs. If a Keras optimizer, pass an instance of the respective class (e.g. Adam(lr=0.01, beta_1=0.9, beta_2=0.999)). If no arguments are provided, Keras default values will be used.
optimizer_kwargs (Dict[str, Any]) – The arguments for the chosen optimizer. If not provided, Keras' default values will be used.
compile_kwargs (Dict[str, Any]) – Parameters to pass to keras.Model.compile.
- Returns
- Return type
keras.models.Sequential
Examples
>>> model = lstm_hourglass(10)
>>> len(model.layers)
7
>>> [model.layers[i].units for i in range(len(model.layers))]
[8, 7, 5, 5, 7, 8, 10]
>>> model = lstm_hourglass(5)
>>> [model.layers[i].units for i in range(len(model.layers))]
[4, 4, 3, 3, 4, 4, 5]
>>> model = lstm_hourglass(10, compression_factor=0.2)
>>> [model.layers[i].units for i in range(len(model.layers))]
[7, 5, 2, 2, 5, 7, 10]
>>> model = lstm_hourglass(10, encoding_layers=1)
>>> [model.layers[i].units for i in range(len(model.layers))]
[5, 5, 10]
-
gordo.machine.model.factories.lstm_autoencoder.
lstm_model
(n_features: int, n_features_out: int = None, lookback_window: int = 1, encoding_dim: Tuple[int, ...] = (256, 128, 64), encoding_func: Tuple[str, ...] = ('tanh', 'tanh', 'tanh'), decoding_dim: Tuple[int, ...] = (64, 128, 256), decoding_func: Tuple[str, ...] = ('tanh', 'tanh', 'tanh'), out_func: str = 'linear', optimizer: Union[str, tensorflow.keras.optimizers.Optimizer] = 'Adam', optimizer_kwargs: Dict[str, Any] = {}, compile_kwargs: Dict[str, Any] = {}, **kwargs) → tensorflow.keras.models.Sequential[source]¶ Builds a customized Keras LSTM neural network auto-encoder based on a config dict.
- Parameters
n_features (int) – Number of features the dataset X will contain.
n_features_out (Optional[int]) – Number of features the model will output, defaults to n_features.
lookback_window (int) – Number of timesteps used to train the model. One timestep = the current observation in the sample; two timesteps = the current plus the previous observation, and so on.
encoding_dim (tuple) – Tuple of numbers with the number of neurons in the encoding part.
decoding_dim (tuple) – Tuple of numbers with the number of neurons in the decoding part.
encoding_func (tuple) – Activation functions for the encoder part.
decoding_func (tuple) – Activation functions for the decoder part.
out_func (str) – Activation function for the output Dense layer.
optimizer (Union[str, Optimizer]) – If a str, the name of the optimizer must be provided (e.g. "Adam"); the arguments of the optimizer can then be supplied in optimizer_kwargs. If a Keras optimizer, pass an instance of the respective class (e.g. Adam(lr=0.01, beta_1=0.9, beta_2=0.999)). If no arguments are provided, Keras default values will be used.
optimizer_kwargs (Dict[str, Any]) – The arguments for the chosen optimizer. If not provided, Keras' default values will be used.
compile_kwargs (Dict[str, Any]) – Parameters to pass to keras.Model.compile.
- Returns
Returns Keras sequential model.
- Return type
keras.models.Sequential
-
gordo.machine.model.factories.lstm_autoencoder.
lstm_symmetric
(n_features: int, n_features_out: int = None, lookback_window: int = 1, dims: Tuple[int, ...] = (256, 128, 64), funcs: Tuple[str, ...] = ('tanh', 'tanh', 'tanh'), out_func: str = 'linear', optimizer: Union[str, tensorflow.keras.optimizers.Optimizer] = 'Adam', optimizer_kwargs: Dict[str, Any] = {}, compile_kwargs: Dict[str, Any] = {}, **kwargs) → tensorflow.keras.models.Sequential[source]¶ Builds a symmetrical LSTM model
- Parameters
n_features (int) – Number of input and output neurons.
n_features_out (Optional[int]) – Number of features the model will output, defaults to n_features.
lookback_window (int) – Number of timesteps used to train the model. One timestep = the sample contains the current observation; two timesteps = the sample contains the current and previous observation, and so on.
dims (Tuple[int,..]) – Number of neurons per layers for the encoder, reversed for the decoder. Must have len > 0
funcs (List[str]) – Activation functions for the internal layers.
out_func (str) – Activation function for the output Dense layer.
optimizer (Union[str, Optimizer]) – If a str, the name of the optimizer must be provided (e.g. "Adam"); the arguments of the optimizer can then be supplied in optimizer_kwargs. If a Keras optimizer, pass an instance of the respective class (e.g. Adam(lr=0.01, beta_1=0.9, beta_2=0.999)). If no arguments are provided, Keras default values will be used.
optimizer_kwargs (Dict[str, Any]) – The arguments for the chosen optimizer. If not provided, Keras' default values will be used.
compile_kwargs (Dict[str, Any]) – Parameters to pass to keras.Model.compile.
- Returns
Returns Keras sequential model.
- Return type
keras.models.Sequential
Transformer Functions¶
A collection of functions which can be referenced within the
sklearn.preprocessing.FunctionTransformer
transformer.
Functions to be used within sklearn’s FunctionTransformer https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html
Each function SHALL take an X, and optionally a y.
Functions CAN take additional arguments which should be given during the initialization of the FunctionTransformer
Example:
>>> from sklearn.preprocessing import FunctionTransformer
>>> import numpy as np
>>> def my_function(X, another_arg):
... # Some fancy X manipulation...
... return X
>>> transformer = FunctionTransformer(func=my_function, kw_args={'another_arg': 'this thing'})
>>> out = transformer.fit_transform(np.random.random(100).reshape(10, 10))
Transformers¶
Specialized transformers to address Gordo specific problems.
These transformers work just like Scikit-Learn's transformers and can thus be inserted into Pipeline objects.
-
class
gordo.machine.model.transformers.imputer.
InfImputer
(inf_fill_value=None, neg_inf_fill_value=None, strategy='minmax', delta: float = 2.0)[source]¶ Bases:
sklearn.base.TransformerMixin
Fills inf/-inf values of a 2d array/dataframe with imputed or provided values. By default it will find the min and max of each feature/column and fill -infs/infs with those values +/- delta.
- Parameters
inf_fill_value (numeric) – Value to fill ‘inf’ values
neg_inf_fill_value (numeric) – Value to fill ‘-inf’ values
strategy (str) – How to fill values; irrelevant if a fill value is provided. Choices: 'extremes', 'minmax'. 'extremes' will use the min and max values for the current datatype, such that 'inf' in a float32 dataset will have float32's largest value inserted. 'minmax' will look at the min and max values in the feature where the -inf/inf appears and fill with the max/min found in that feature.
delta (float) – Only applicable if strategy='minmax'. Will add/subtract this delta to/from the max/min value, by feature. If the max value in a feature was 10 and delta=2, any inf value will be filled with 12. Likewise, if the min of a feature was -10, any -inf will be filled with -12.
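The 'minmax' strategy can be illustrated with plain NumPy. This is a sketch of the described behavior, not the InfImputer implementation itself; fill_infs_minmax is a hypothetical helper:

```python
import numpy as np

# Per feature, replace +inf with (column max + delta) and -inf with
# (column min - delta), where min/max ignore the infs themselves.
def fill_infs_minmax(X, delta=2.0):
    X = X.astype(float).copy()
    for col in range(X.shape[1]):
        finite = X[np.isfinite(X[:, col]), col]
        X[X[:, col] == np.inf, col] = finite.max() + delta
        X[X[:, col] == -np.inf, col] = finite.min() - delta
    return X

X = np.array([[1.0, np.inf],
              [10.0, 5.0],
              [-np.inf, 2.0]])
filled = fill_infs_minmax(X)
# Column 0: -inf -> 1 - 2 = -1; column 1: inf -> 5 + 2 = 7
```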
Anomaly Models¶
Models which implement an .anomaly(X, y) method and can be served under the model server's /anomaly/prediction endpoint.
The base class for all other anomaly detector models
-
class
gordo.machine.model.anomaly.base.
AnomalyDetectorBase
(**kwargs)[source]¶ Bases:
sklearn.base.BaseEstimator
,gordo.machine.model.base.GordoBase
Initialize the model
-
abstract
anomaly
(X: Union[pandas.core.frame.DataFrame, xarray.core.dataarray.DataArray], y: Union[pandas.core.frame.DataFrame, xarray.core.dataarray.DataArray], frequency: Optional[datetime.timedelta] = None) → Union[pandas.core.frame.DataFrame, xarray.core.dataset.Dataset][source]¶ Takes X, y and optionally frequency; returns a dataframe containing anomaly score(s)
-
abstract
Calculates the absolute prediction differences between y and yhat, as well
as the per-row error between both matrices via numpy.linalg.norm(..., axis=1)
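The described computation can be sketched directly with NumPy:

```python
import numpy as np

# Per-feature absolute differences between target and prediction,
# plus a single per-row error via the L2 norm across features.
y = np.array([[1.0, 2.0], [3.0, 4.0]])
yhat = np.array([[1.0, 1.0], [0.0, 4.0]])

abs_diff = np.abs(y - yhat)                     # per-feature differences
total_error = np.linalg.norm(y - yhat, axis=1)  # one error value per row
```

The per-row norm collapses the per-feature differences into a single anomaly score per timestamp, which is what gets compared against the calculated thresholds.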
-
class
gordo.machine.model.anomaly.diff.
DiffBasedAnomalyDetector
(base_estimator: sklearn.base.BaseEstimator = tensorflow.keras.wrappers.scikit_learn.KerasRegressor, scaler: sklearn.base.TransformerMixin = MinMaxScaler(), require_thresholds: bool = True, shuffle: bool = False, window: Optional[int] = None, smoothing_method: Optional[str] = None)[source]¶ Bases:
gordo.machine.model.anomaly.base.AnomalyDetectorBase
Estimator which wraps a base_estimator and provides a diff-error based approach to anomaly detection.
It trains a scaler on the target after training, purely for error calculations. The underlying base_estimator is trained with the original, unscaled y.
Threshold calculation is based on a rolling statistic of the validation errors on the last fold of cross-validation.
- Parameters
base_estimator (sklearn.base.BaseEstimator) – The model whose normal .fit and .predict methods will be used. Defaults to gordo.machine.model.models.KerasAutoEncoder with kind='feedforward_hourglass'.
scaler (sklearn.base.TransformerMixin) – Defaults to sklearn.preprocessing.RobustScaler. Used for transforming model output and the original y to calculate the difference/error between model output and expected output.
require_thresholds (bool) – Requires calculating thresholds_ via a call to cross_validate(). If this is set (default True) but cross_validate() was not called before calling anomaly(), an AttributeError will be raised.
shuffle (bool) – Flag to shuffle (or not) the data in .fit so that the model, if relevant, will be trained on a sample of data across the time range and not just the last elements according to the model arg validation_split.
window (int) – Window size for smoothed thresholds.
smoothing_method (str) – Method to be used together with window to smooth metrics. Must be one of 'smm' (simple moving median), 'sma' (simple moving average) or 'ewma' (exponential weighted moving average).
-
anomaly
(X: Union[pandas.core.frame.DataFrame, xarray.core.dataarray.DataArray], y: Union[pandas.core.frame.DataFrame, xarray.core.dataarray.DataArray], frequency: Optional[datetime.timedelta] = None) → Union[pandas.core.frame.DataFrame, xarray.core.dataset.Dataset][source]¶ Create an anomaly dataframe from the base provided dataframe.
- Parameters
X (pd.DataFrame) – Dataframe representing the data to go into the model.
y (pd.DataFrame) – Dataframe representing the target output of the model.
- Returns
A superset of the original base dataframe with added anomaly specific features
- Return type
pd.DataFrame
-
cross_validate
(*, X: Union[pandas.core.frame.DataFrame, numpy.ndarray], y: Union[pandas.core.frame.DataFrame, numpy.ndarray], cv=TimeSeriesSplit(max_train_size=None, n_splits=3), **kwargs)[source]¶ Runs time-series cross-validation on the model and updates the model's threshold values based on the cross-validation folds.
- Parameters
X (Union[pd.DataFrame, np.ndarray]) – Input data to the model
y (Union[pd.DataFrame, np.ndarray]) – Target data
kwargs (dict) – Any additional kwargs to be passed to
sklearn.model_selection.cross_validate()
- Returns
- Return type
dict
-
class
gordo.machine.model.anomaly.diff.
DiffBasedKFCVAnomalyDetector
(base_estimator: sklearn.base.BaseEstimator = tensorflow.keras.wrappers.scikit_learn.KerasRegressor, scaler: sklearn.base.TransformerMixin = MinMaxScaler(), require_thresholds: bool = True, shuffle: bool = True, window: int = 144, smoothing_method: str = 'smm', threshold_percentile: float = 0.99)[source]¶ Bases:
gordo.machine.model.anomaly.diff.DiffBasedAnomalyDetector
Estimator which wraps a base_estimator and provides a diff-error based approach to anomaly detection.
It trains a scaler on the target after training, purely for error calculations. The underlying base_estimator is trained with the original, unscaled y.
Threshold calculation is based on a percentile of the smoothed validation errors as calculated from cross-validation predictions.
- Parameters
base_estimator (sklearn.base.BaseEstimator) – The model whose normal .fit and .predict methods will be used. Defaults to gordo.machine.model.models.KerasAutoEncoder with kind='feedforward_hourglass'.
scaler (sklearn.base.TransformerMixin) – Defaults to sklearn.preprocessing.RobustScaler. Used for transforming model output and the original y to calculate the difference/error between model output and expected output.
require_thresholds (bool) – Requires calculating thresholds_ via a call to cross_validate(). If this is set (default True) but cross_validate() was not called before calling anomaly(), an AttributeError will be raised.
shuffle (bool) – Flag to shuffle (or not) the data in .fit so that the model, if relevant, will be trained on a sample of data across the time range and not just the last elements according to the model arg validation_split.
window (int) – Window size for smoothing metrics and threshold calculation.
smoothing_method (str) – Method to be used together with window to smooth metrics. Must be one of 'smm' (simple moving median), 'sma' (simple moving average) or 'ewma' (exponential weighted moving average).
threshold_percentile (float) – Percentile of the validation data to be used to calculate the threshold.
-
cross_validate
(*, X: Union[pandas.core.frame.DataFrame, numpy.ndarray], y: Union[pandas.core.frame.DataFrame, numpy.ndarray], cv=KFold(n_splits=5, random_state=0, shuffle=True), **kwargs)[source]¶ Runs k-fold cross-validation on the model and updates the model's threshold values based on a percentile of the validation metrics.
- Parameters
X (Union[pd.DataFrame, np.ndarray]) – Input data to the model
y (Union[pd.DataFrame, np.ndarray]) – Target data
kwargs (dict) – Any additional kwargs to be passed to
sklearn.model_selection.cross_validate()
- Returns
- Return type
dict
Utils¶
Shared utility functions used by models and other components interacting with the models.
-
gordo.machine.model.utils.
make_base_dataframe
(tags: Union[List[gordo_dataset.sensor_tag.SensorTag], List[str]], model_input: numpy.ndarray, model_output: numpy.ndarray, target_tag_list: Union[List[gordo_dataset.sensor_tag.SensorTag], List[str], None] = None, index: Optional[numpy.ndarray] = None, frequency: Optional[datetime.timedelta] = None) → pandas.core.frame.DataFrame[source]¶ Construct a dataframe which has a MultiIndex column consisting of top-level keys 'model-input' and 'model-output'. Takes care of aligning model output if its length differs from the model input, as well as setting column names based on the passed tags and target_tag_list.
- Parameters
tags (List[Union[str, SensorTag]]) – Tags which will be assigned to model-input and/or model-output if the shapes match.
model_input (np.ndarray) – Original input given to the model.
model_output (np.ndarray) – Raw model output.
target_tag_list (Optional[Union[List[SensorTag], List[str]]]) – Tags to be assigned to model-output; if not assigned but the model output matches the model input, tags will be used.
index (Optional[np.ndarray]) – The index which should be assigned to the resulting dataframe; will be clipped to the length of model_output, should the model output less than its input.
frequency (Optional[datetime.timedelta]) – The spacing of the time between points.
- Returns
- Return type
pd.DataFrame
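The resulting column layout can be sketched directly with pandas. This builds the described MultiIndex by hand as an illustration, not via make_base_dataframe itself:

```python
import numpy as np
import pandas as pd

tags = ["tag-1", "tag-2"]
model_input = np.array([[1.0, 2.0], [3.0, 4.0]])
model_output = np.array([[0.9, 2.1], [2.8, 4.2]])

# Top-level keys 'model-input' and 'model-output', second level = tag names
columns = pd.MultiIndex.from_product([["model-input", "model-output"], tags])
df = pd.DataFrame(np.hstack([model_input, model_output]), columns=columns)
```

Selecting df["model-input"] or df["model-output"] then yields a plain per-tag dataframe, which is how downstream consumers separate the two blocks.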
-
gordo.machine.model.utils.
metric_wrapper
(metric, scaler: Optional[sklearn.base.TransformerMixin] = None)[source]¶ Ensures that a given metric works properly when the model returns a y which is shorter than the target y, and allows scaling the data before applying the metric.
- Parameters
metric – Metric which must accept y_true and y_pred of the same length
scaler (Optional[TransformerMixin]) – Transformer which will be applied on y and y_pred before the metric is calculated. Must have a transform method, so for most scalers it must already be fitted on y.
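The core idea can be sketched as follows. This is a simplified stand-in for metric_wrapper that ignores the optional scaler, assumes the model drops rows from the start of the series, and uses the hypothetical name wrap_metric:

```python
import numpy as np

# Truncate y_true so the metric sees arrays of equal length when the
# model outputs fewer rows than it was given (e.g. due to a lookback window).
def wrap_metric(metric):
    def wrapped(y_true, y_pred):
        return metric(y_true[-len(y_pred):], y_pred)
    return wrapped

mae = wrap_metric(lambda yt, yp: float(np.mean(np.abs(yt - yp))))
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([3.0, 5.0])   # model produced only the last two rows
score = mae(y_true, y_pred)     # compares [3, 4] against [3, 5]
```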
Metadata¶
Each Machine may have metadata, set at the Machine.metadata
level inside the config; this results in a standardized metadata output
under user_defined and build_metadata. user_defined can go
arbitrarily deep, depending on how much metadata the user wishes to enter,
while build_metadata is more predictable: during the build of a Machine,
the system inserts metadata about the build time and model
metrics (depending on configuration).
-
class
gordo.machine.metadata.metadata.
Metadata
(user_defined: Dict[str, Any] = <factory>, build_metadata: gordo.machine.metadata.metadata.BuildMetadata = <factory>)[source]¶ Bases:
object
-
build_metadata
: BuildMetadata = None¶
-
classmethod
from_dict
(kvs: Union[dict, list, str, int, float, bool, None], *, infer_missing=False) → A¶
-
classmethod
from_json
(s: Union[str, bytes, bytearray], *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) → A¶
-
classmethod
schema
(*, infer_missing: bool = False, only=None, exclude=(), many: bool = False, context=None, load_only=(), dump_only=(), partial: bool = False, unknown=None) → dataclasses_json.mm.SchemaF[~A][A]¶
-
to_dict
(encode_json=False) → Dict[str, Union[dict, list, str, int, float, bool, None]]¶
-
to_json
(*, skipkeys: bool = False, ensure_ascii: bool = True, check_circular: bool = True, allow_nan: bool = True, indent: Union[int, str, None] = None, separators: Tuple[str, str] = None, default: Callable = None, sort_keys: bool = False, **kw) → str¶
-
user_defined
: Dict[str, Any] = None¶
-
-
class
gordo.machine.metadata.metadata.
BuildMetadata
(model: gordo.machine.metadata.metadata.ModelBuildMetadata = <factory>, dataset: gordo.machine.metadata.metadata.DatasetBuildMetadata = <factory>)[source]¶ Bases:
object
-
dataset
: DatasetBuildMetadata = None¶
-
classmethod
from_dict
(kvs: Union[dict, list, str, int, float, bool, None], *, infer_missing=False) → A¶
-
classmethod
from_json
(s: Union[str, bytes, bytearray], *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) → A¶
-
model
: ModelBuildMetadata = None¶
-
classmethod
schema
(*, infer_missing: bool = False, only=None, exclude=(), many: bool = False, context=None, load_only=(), dump_only=(), partial: bool = False, unknown=None) → dataclasses_json.mm.SchemaF[~A][A]¶
-
to_dict
(encode_json=False) → Dict[str, Union[dict, list, str, int, float, bool, None]]¶
-
to_json
(*, skipkeys: bool = False, ensure_ascii: bool = True, check_circular: bool = True, allow_nan: bool = True, indent: Union[int, str, None] = None, separators: Tuple[str, str] = None, default: Callable = None, sort_keys: bool = False, **kw) → str¶
-
-
class
gordo.machine.metadata.metadata.
ModelBuildMetadata
(model_offset: int = 0, model_creation_date: Union[str, NoneType] = None, model_builder_version: str = '1.10.5', cross_validation: gordo.machine.metadata.metadata.CrossValidationMetaData = <factory>, model_training_duration_sec: Union[float, NoneType] = None, model_meta: Dict[str, Any] = <factory>)[source]¶ Bases:
object
-
cross_validation
: CrossValidationMetaData = None¶
-
classmethod
from_dict
(kvs: Union[dict, list, str, int, float, bool, None], *, infer_missing=False) → A¶
-
classmethod
from_json
(s: Union[str, bytes, bytearray], *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) → A¶
-
model_builder_version
: str = '1.10.5'¶
-
model_creation_date
: Optional[str] = None¶
-
model_meta
: Dict[str, Any] = None¶
-
model_offset
: int = 0¶
-
model_training_duration_sec
: Optional[float] = None¶
-
classmethod
schema
(*, infer_missing: bool = False, only=None, exclude=(), many: bool = False, context=None, load_only=(), dump_only=(), partial: bool = False, unknown=None) → dataclasses_json.mm.SchemaF[~A][A]¶
-
to_dict
(encode_json=False) → Dict[str, Union[dict, list, str, int, float, bool, None]]¶
-
to_json
(*, skipkeys: bool = False, ensure_ascii: bool = True, check_circular: bool = True, allow_nan: bool = True, indent: Union[int, str, None] = None, separators: Tuple[str, str] = None, default: Callable = None, sort_keys: bool = False, **kw) → str¶
-
-
class
gordo.machine.metadata.metadata.
CrossValidationMetaData
(scores: Dict[str, Any] = <factory>, cv_duration_sec: Union[float, NoneType] = None, splits: Dict[str, Any] = <factory>)[source]¶ Bases:
object
-
cv_duration_sec
: Optional[float] = None¶
-
classmethod
from_dict
(kvs: Union[dict, list, str, int, float, bool, None], *, infer_missing=False) → A¶
-
classmethod
from_json
(s: Union[str, bytes, bytearray], *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) → A¶
-
classmethod
schema
(*, infer_missing: bool = False, only=None, exclude=(), many: bool = False, context=None, load_only=(), dump_only=(), partial: bool = False, unknown=None) → dataclasses_json.mm.SchemaF[~A][A]¶
-
scores
: Dict[str, Any] = None¶
-
splits
: Dict[str, Any] = None¶
-
to_dict
(encode_json=False) → Dict[str, Union[dict, list, str, int, float, bool, None]]¶
-
to_json
(*, skipkeys: bool = False, ensure_ascii: bool = True, check_circular: bool = True, allow_nan: bool = True, indent: Union[int, str, None] = None, separators: Tuple[str, str] = None, default: Callable = None, sort_keys: bool = False, **kw) → str¶
-
-
class
gordo.machine.metadata.metadata.
DatasetBuildMetadata
(query_duration_sec: Union[float, NoneType] = None, dataset_meta: Dict[str, Any] = <factory>)[source]¶ Bases:
object
-
dataset_meta
: Dict[str, Any] = None¶
-
classmethod
from_dict
(kvs: Union[dict, list, str, int, float, bool, None], *, infer_missing=False) → A¶
-
classmethod
from_json
(s: Union[str, bytes, bytearray], *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) → A¶
-
query_duration_sec
: Optional[float] = None¶
-
classmethod
schema
(*, infer_missing: bool = False, only=None, exclude=(), many: bool = False, context=None, load_only=(), dump_only=(), partial: bool = False, unknown=None) → dataclasses_json.mm.SchemaF[~A][A]¶
-
to_dict
(encode_json=False) → Dict[str, Union[dict, list, str, int, float, bool, None]]¶
-
to_json
(*, skipkeys: bool = False, ensure_ascii: bool = True, check_circular: bool = True, allow_nan: bool = True, indent: Union[int, str, None] = None, separators: Tuple[str, str] = None, default: Callable = None, sort_keys: bool = False, **kw) → str¶
-
Builder¶
Model builder¶
-
class
gordo.builder.build_model.
ModelBuilder
(machine: gordo.machine.machine.Machine)[source]¶ Bases:
object
Build a model for a given
gordo.workflow.config_elements.machine.Machine
- Parameters
machine (Machine) –
Example
>>> from gordo_dataset.sensor_tag import SensorTag
>>> from gordo.machine import Machine
>>> from gordo.dependencies import configure_once
>>> configure_once()
>>> machine = Machine(
...     name="special-model-name",
...     model={"sklearn.decomposition.PCA": {"svd_solver": "auto"}},
...     dataset={
...         "type": "RandomDataset",
...         "train_start_date": "2017-12-25 06:00:00Z",
...         "train_end_date": "2017-12-30 06:00:00Z",
...         "tag_list": [SensorTag("Tag 1", None), SensorTag("Tag 2", None)],
...         "target_tag_list": [SensorTag("Tag 3", None), SensorTag("Tag 4", None)]
...     },
...     project_name='test-proj',
... )
>>> builder = ModelBuilder(machine=machine)
>>> model, machine = builder.build()
-
build
(output_dir: Union[os.PathLike, str, None] = None, model_register_dir: Union[os.PathLike, str, None] = None, replace_cache=False) → Tuple[sklearn.base.BaseEstimator, gordo.machine.machine.Machine][source]¶ Always returns a model and its metadata.
If output_dir is supplied, the model will be saved there. model_register_dir points to the model cache directory from which the builder will attempt to read the model. Supplying both combines the two behaviours: the model is read from the cache and that cached model is saved to the new output directory.
- Parameters
output_dir (Optional[Union[os.PathLike, str]]) – A path to where the model will be deposited.
model_register_dir (Optional[Union[os.PathLike, str]]) – A path to a register, see :func:gordo.util.disk_registry. If this is None then always build the model, otherwise try to resolve the model from the registry.
replace_cache (bool) – Forces a rebuild of the model, and replaces the entry in the cache with the new model.
- Returns
Built model and an updated
Machine
- Return type
Tuple[sklearn.base.BaseEstimator, Machine]
-
static
build_metrics_dict
(metrics_list: list, y: pandas.core.frame.DataFrame, scaler: Union[sklearn.base.TransformerMixin, str, None] = None) → dict[source]¶ Given a list of metrics that accept a true_y and pred_y as inputs, this returns a dictionary with keys of the form ‘{score}-{tag_name}’ for each given target tag, and ‘{score}’ for the average score across all target tags and folds, with values being the callable make_scorer(metric_wrapper(score)). Note: score in ‘{score}-{tag_name}’ is the sklearn score function name with ‘_’ replaced by ‘-’, and tag_name is the given target tag name with ‘ ’ replaced by ‘-’.
- Parameters
metrics_list (list) – List of sklearn score functions
y (pd.DataFrame) – Target data
scaler (Optional[Union[TransformerMixin, str]]) – Scaler which will be fitted on y, and used to transform the data before scoring. Useful when the metrics are sensitive to the amplitude of the data, and you have multiple targets.
- Returns
- Return type
dict
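The ‘{score}-{tag_name}’ key convention described above can be sketched with plain string handling; metric_key is a hypothetical helper for illustration, not part of Gordo’s API:

```python
# Sketch of the '{score}-{tag_name}' key convention described above.
# metric_key is a hypothetical helper, not part of Gordo's API.

def metric_key(score_fn_name: str, tag_name: str) -> str:
    """Build a '{score}-{tag_name}' key: '_' -> '-' in the score
    function name, ' ' -> '-' in the tag name."""
    return f"{score_fn_name.replace('_', '-')}-{tag_name.replace(' ', '-')}"

keys = [metric_key("r2_score", tag) for tag in ("Tag 1", "Tag 2")]
print(keys)  # ['r2-score-Tag-1', 'r2-score-Tag-2']
```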
-
static
build_split_dict
(X: pandas.core.frame.DataFrame, split_obj: Type[sklearn.model_selection._split.BaseCrossValidator]) → dict[source]¶ Get dictionary of cross-validation training dataset split metadata
- Parameters
X (pd.DataFrame) – The training dataset that will be split during cross-validation.
split_obj (Type[sklearn.model_selection.BaseCrossValidator]) – The cross-validation object that returns train, test indices for splitting.
- Returns
split_metadata – Dictionary of cross-validation train/test split metadata
- Return type
Dict[str,Any]
-
property
cache_key
¶
-
property
cached_model_path
¶
-
static
calculate_cache_key
(machine: gordo.machine.machine.Machine) → str[source]¶ Calculates a hash-key from the model and data-config.
- Returns
A 512-bit hash rendered as a 128-character hex string, based on the content of the parameters.
- Return type
str
Examples
>>> from gordo.machine import Machine
>>> from gordo_dataset.sensor_tag import SensorTag
>>> from gordo.dependencies import configure_once
>>> configure_once()
>>> machine = Machine(
...     name="special-model-name",
...     model={"sklearn.decomposition.PCA": {"svd_solver": "auto"}},
...     dataset={
...         "type": "RandomDataset",
...         "train_start_date": "2017-12-25 06:00:00Z",
...         "train_end_date": "2017-12-30 06:00:00Z",
...         "tag_list": [SensorTag("Tag 1", None), SensorTag("Tag 2", None)],
...         "target_tag_list": [SensorTag("Tag 3", None), SensorTag("Tag 4", None)]
...     },
...     project_name='test-proj'
... )
>>> builder = ModelBuilder(machine)
>>> len(builder.cache_key)
128
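The 128-character key in the example is consistent with a 512-bit digest rendered as hex. A minimal stdlib sketch of hashing a config dict into such a key — an illustration only, not Gordo’s actual implementation (which hashes more inputs than shown here):

```python
import hashlib
import json

def config_cache_key(config: dict) -> str:
    """Hash a JSON-serializable config into a stable hex key.
    A 512-bit digest yields 128 hex characters, matching the
    key length shown in the doctest above."""
    payload = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha3_512(payload).hexdigest()

key = config_cache_key({"model": {"sklearn.decomposition.PCA": {"svd_solver": "auto"}}})
print(len(key))  # 128
```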
-
check_cache
(model_register_dir: Union[os.PathLike, str])[source]¶ Checks if the model is cached, and returns its path if it exists.
- Parameters
model_register_dir ([os.PathLike, None]) – The register dir where the model lies.
cache_key (str) –
A 512-bit hash rendered as a hex string, based on the content of the parameters.
- Returns
The path to the cached model, or None if it does not exist.
- Return type
Union[os.PathLike, None]
-
static
metrics_from_list
(metric_list: Optional[List[str]] = None) → List[Callable][source]¶ Given a list of metric function paths, e.g. sklearn.metrics.r2_score, or simple function names which are expected to be in the
sklearn.metrics
module, this will return a list of those loaded functions.
- Parameters
metric_list (Optional[List[str]]) – List of function paths to use as metrics for the model. Defaults to those specified in
gordo.workflow.config_components.NormalizedConfig
: sklearn.metrics.explained_variance_score, sklearn.metrics.r2_score, sklearn.metrics.mean_squared_error, sklearn.metrics.mean_absolute_error
- Returns
A list of the functions loaded
- Return type
List[Callable]
- Raises
AttributeError: – If the function cannot be loaded.
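Resolving dotted function paths like those above can be sketched with importlib; functions_from_paths is a hypothetical helper, and math functions stand in for sklearn.metrics:

```python
from importlib import import_module
from typing import Callable, List

def functions_from_paths(paths: List[str]) -> List[Callable]:
    """Resolve dotted paths like 'math.sqrt' into callables.
    Raises AttributeError if the attribute does not exist in the module."""
    loaded = []
    for path in paths:
        module_path, _, func_name = path.rpartition(".")
        module = import_module(module_path)
        loaded.append(getattr(module, func_name))  # may raise AttributeError
    return loaded

# math functions stand in here for e.g. 'sklearn.metrics.r2_score'
sqrt, ceil = functions_from_paths(["math.sqrt", "math.ceil"])
print(sqrt(9.0))  # 3.0
```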
Local Model builder¶
This is meant to provide a good way to validate a configuration file as well as to enable creating and testing models locally with little overhead.
-
gordo.builder.local_build.
local_build
(config_str: str) → Iterable[Tuple[Optional[sklearn.base.BaseEstimator], gordo.machine.machine.Machine]][source]¶ Build model(s) from a bare Gordo config file locally.
This follows much the same steps as the normal workflow generation and subsequent Gordo deployment process. It should help with developing locally, as well as giving a good indication that your config is valid for deployment with Gordo.
- Parameters
config_str (str) – The raw yaml config file in string format.
Examples
>>> import numpy as np
>>> from gordo.dependencies import configure_once
>>> configure_once()
>>> config = '''
... machines:
...   - dataset:
...       tags:
...         - SOME-TAG1
...         - SOME-TAG2
...       target_tag_list:
...         - SOME-TAG3
...         - SOME-TAG4
...       train_end_date: '2019-03-01T00:00:00+00:00'
...       train_start_date: '2019-01-01T00:00:00+00:00'
...       asset: asgb
...       data_provider:
...         type: RandomDataProvider
...     metadata:
...       information: Some sweet information about the model
...     model:
...       gordo.machine.model.anomaly.diff.DiffBasedAnomalyDetector:
...         base_estimator:
...           sklearn.pipeline.Pipeline:
...             steps:
...               - sklearn.decomposition.PCA
...               - sklearn.multioutput.MultiOutputRegressor:
...                   estimator: sklearn.linear_model.LinearRegression
...     name: crazy-sweet-name
... '''
>>> models_n_metadata = local_build(config)
>>> assert len(list(models_n_metadata)) == 1
- Returns
A generator yielding tuples of models and their metadata.
- Return type
Iterable[Tuple[Union[BaseEstimator, None], Machine]]
Serializer¶
The serializer is the core component used in the conversion of a Gordo config file into Python objects which interact in order to construct a full ML model capable of being served on Kubernetes.
Things like the dataset
and model
keys within the YAML config represent
objects which will be (de)serialized by the serializer to complete this goal.
-
gordo.serializer.serializer.
dump
(obj: object, dest_dir: Union[os.PathLike, str], metadata: dict = None)[source]¶ Serialize an object into a directory; the object must be pickle-able.
- Parameters
obj – The object to dump. Must be pickle-able.
dest_dir (Union[os.PathLike, str]) – The directory in which to save the model.
metadata (Optional[dict]) – Any additional metadata to be saved alongside the model; it will be serialized to a file together with the model and loaded again by
load_metadata()
.
- Returns
- Return type
None
Example
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.decomposition import PCA
>>> from gordo.machine.model.models import KerasAutoEncoder
>>> from gordo import serializer
>>> from tempfile import TemporaryDirectory
>>> pipe = Pipeline([
...     ('pca', PCA(3)),
...     ('model', KerasAutoEncoder(kind='feedforward_hourglass'))])
>>> with TemporaryDirectory() as tmp:
...     serializer.dump(obj=pipe, dest_dir=tmp)
...     pipe_clone = serializer.load(source_dir=tmp)
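The dump/load round trip can be approximated with a stdlib-only sketch. The model.pkl file name and plain-pickle layout here are assumptions for illustration, not Gordo’s actual on-disk format:

```python
import json
import pickle
from pathlib import Path
from tempfile import TemporaryDirectory

def dump_obj(obj, dest_dir, metadata=None):
    """Pickle an object into a directory, optionally alongside metadata.json.
    'model.pkl' is a hypothetical file name, not Gordo's real layout."""
    dest = Path(dest_dir)
    (dest / "model.pkl").write_bytes(pickle.dumps(obj))
    if metadata is not None:
        (dest / "metadata.json").write_text(json.dumps(metadata))

def load_obj(source_dir):
    """Load the pickled object back from the directory."""
    return pickle.loads((Path(source_dir) / "model.pkl").read_bytes())

with TemporaryDirectory() as tmp:
    dump_obj({"weights": [1, 2, 3]}, tmp, metadata={"trained": True})
    clone = load_obj(tmp)
print(clone)  # {'weights': [1, 2, 3]}
```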
-
gordo.serializer.serializer.
dumps
(model: Union[sklearn.pipeline.Pipeline, gordo.machine.model.base.GordoBase]) → bytes[source]¶ Dump a model into a bytes representation suitable for loading from
gordo.serializer.loads
- Parameters
model (Union[Pipeline, GordoBase]) – A gordo model/pipeline
- Returns
Serialized model which supports loading via
serializer.loads()
- Return type
bytes
Example
>>> from gordo.machine.model.models import KerasAutoEncoder
>>> from gordo import serializer
>>>
>>> model = KerasAutoEncoder('feedforward_symmetric')
>>> serialized = serializer.dumps(model)
>>> assert isinstance(serialized, bytes)
>>>
>>> model_clone = serializer.loads(serialized)
>>> assert isinstance(model_clone, KerasAutoEncoder)
-
gordo.serializer.serializer.
load
(source_dir: Union[os.PathLike, str]) → Any[source]¶ Load an object from a directory, saved by
gordo.serializer.pipeline_serializer.dump
This takes a directory which is either top-level, meaning it contains a sub-directory following the naming scheme “n_step=<int>-class=<path.to.Class>”, or is such a naming-scheme directory itself. It will return the deserialized object.
- Parameters
source_dir (Union[os.PathLike, str]) – Location of the top level dir the pipeline was saved
- Returns
- Return type
Union[GordoBase, Pipeline, BaseEstimator]
-
gordo.serializer.serializer.
load_metadata
(source_dir: Union[os.PathLike, str]) → dict[source]¶ Load the metadata.json file which was saved during
serializer.dump
. Returns the loaded metadata as a dict, or an empty dict if no file was found.
- Parameters
source_dir (Union[os.PathLike, str]) – Directory of the saved model. As with serializer.load(source_dir), this source_dir can be the top level, or the first directory into the serialized model.
- Returns
- Return type
dict
- Raises
FileNotFoundError – If a ‘metadata.json’ file isn’t found in or above the supplied
source_dir
-
gordo.serializer.serializer.
loads
(bytes_object: bytes) → gordo.machine.model.base.GordoBase[source]¶ Load a GordoBase model from bytes dumped from
gordo.serializer.dumps
- Parameters
bytes_object (bytes) – Bytes to be loaded, should be the result of serializer.dumps(model)
- Returns
Custom gordo model, scikit learn pipeline or other scikit learn like object.
- Return type
Union[GordoBase, Pipeline, BaseEstimator]
From Definition¶
The ability to take a ‘raw’ representation of an object in dict
form
and load it into a Python object.
-
gordo.serializer.from_definition.
from_definition
(pipe_definition: Union[str, Dict[str, Dict[str, Any]]]) → Union[sklearn.pipeline.FeatureUnion, sklearn.pipeline.Pipeline][source]¶ Construct a Pipeline or FeatureUnion from a definition.
Example
>>> import yaml
>>> from gordo import serializer
>>> raw_config = '''
... sklearn.pipeline.Pipeline:
...     steps:
...       - sklearn.decomposition.PCA:
...           n_components: 3
...       - sklearn.pipeline.FeatureUnion:
...           - sklearn.decomposition.PCA:
...               n_components: 3
...           - sklearn.pipeline.Pipeline:
...               - sklearn.preprocessing.MinMaxScaler
...               - sklearn.decomposition.TruncatedSVD:
...                   n_components: 2
...       - sklearn.ensemble.RandomForestClassifier:
...           max_depth: 3
... '''
>>> config = yaml.safe_load(raw_config)
>>> scikit_learn_pipeline = serializer.from_definition(config)
- Parameters
pipe_definition – List of steps for the Pipeline / FeatureUnion
constructor_class – What to place the list of transformers into, either sklearn.pipeline.Pipeline/FeatureUnion
- Returns
pipeline
- Return type
sklearn.pipeline.Pipeline
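A simplified, hypothetical take on the core of such a from_definition: resolve a dotted class path and instantiate it with keyword arguments. The real serializer also understands pipeline steps and nested lists; collections.Counter merely stands in for an sklearn class here:

```python
from importlib import import_module

def obj_from_definition(definition):
    """Instantiate an object from {'dotted.path.Class': {kwargs}} or a bare
    'dotted.path.Class' string -- a simplified sketch of what a serializer's
    from_definition must do (the real one also handles steps/nested lists).
    Caveat of this sketch: any dict-valued kwarg is treated as a nested
    definition."""
    if isinstance(definition, str):
        path, params = definition, {}
    else:
        path, params = next(iter(definition.items()))
    module_path, _, cls_name = path.rpartition(".")
    cls = getattr(import_module(module_path), cls_name)
    params = {
        k: obj_from_definition(v) if isinstance(v, dict) else v
        for k, v in params.items()
    }
    return cls(**params)

# collections.Counter stands in for e.g. an sklearn transformer
d = obj_from_definition({"collections.Counter": {"a": 2, "b": 1}})
od = obj_from_definition("collections.OrderedDict")
print(d)  # Counter({'a': 2, 'b': 1})
```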
Into Definition¶
The ability to take a Python object, such as a scikit-learn
pipeline, and convert it into a primitive dict
, which can then be inserted
into a YAML config file.
-
gordo.serializer.into_definition.
into_definition
(pipeline: sklearn.pipeline.Pipeline, prune_default_params: bool = False) → dict[source]¶ Convert an instance of
sklearn.pipeline.Pipeline
into a dict definition capable of being reconstructed withgordo.serializer.from_definition
- Parameters
pipeline (sklearn.pipeline.Pipeline) – Instance of pipeline to decompose
prune_default_params (bool) – Whether to prune the default parameters found in current instance of the transformers vs what their default params are.
- Returns
definitions for the pipeline, compatible to be reconstructed with
gordo.serializer.from_definition()
- Return type
dict
Example
>>> import yaml
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.decomposition import PCA
>>> from gordo.machine.model.models import KerasAutoEncoder
>>>
>>> pipe = Pipeline([('pca', PCA(4)), ('ae', KerasAutoEncoder(kind='feedforward_model'))])
>>> pipe_definition = into_definition(pipe)  # It is now a standard python dict of primitives.
>>> print(yaml.dump(pipe_definition))
sklearn.pipeline.Pipeline:
  memory: null
  steps:
  - sklearn.decomposition._pca.PCA:
      copy: true
      iterated_power: auto
      n_components: 4
      random_state: null
      svd_solver: auto
      tol: 0.0
      whiten: false
  - gordo.machine.model.models.KerasAutoEncoder:
      kind: feedforward_model
      verbose: false
ML Server¶
The ML Server is responsible for giving different “views” into the model being served.
Server¶
This module contains code for generating the Gordo server Flask application.
Running this module will run the application using Flask’s development webserver.
Gunicorn can be used to run the application as gevent async workers by using the
run_server()
function.
-
gordo.server.server.
adapt_proxy_deployment
(wsgi_app: Callable) → Callable[source]¶ Decorator specific to fixing behind-proxy-issues when on Kubernetes and using Envoy proxy.
- Parameters
wsgi_app (typing.Callable) – The underlying WSGI application of a flask app, for example
Notes
Special note about deploying behind Ambassador, or prefixed proxy paths in general:
When deployed on kubernetes/ambassador there is a prefix in front of the server, e.g.:
/gordo/v0/some-project-name/some-target
The server itself only knows about routes to the right of such a prefix: such as
/metadata
or/predictions
when in reality, the full path is:/gordo/v0/some-project-name/some-target/metadata
This is solved by getting the current application’s assigned prefix, where
HTTP_X_ENVOY_ORIGINAL_PATH
is the full path, including the prefix. andPATH_INFO
is the actual relative path the server knows about.This function wraps the WSGI app itself to map the current full path to the assigned route function.
ie.
/metadata
-> metadata route function, by default, but updates/gordo/v0/some-project-name/some-target/metadata
-> metadata route function- Returns
- Return type
Callable
Example
>>> app = Flask(__name__)
>>> app.wsgi_app = adapt_proxy_deployment(app.wsgi_app)
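The prefix mapping described in the notes can be sketched as a plain WSGI wrapper. This is a simplified stand-in for adapt_proxy_deployment that only derives the proxy prefix into SCRIPT_NAME; the real decorator remaps route resolution:

```python
# Stdlib-only sketch of the prefix-mapping idea described above: when the
# proxy supplies the full original path via HTTP_X_ENVOY_ORIGINAL_PATH,
# everything to the left of PATH_INFO is the proxy prefix.

def adapt_prefix(wsgi_app):
    def wrapper(environ, start_response):
        full_path = environ.get("HTTP_X_ENVOY_ORIGINAL_PATH")
        relative = environ.get("PATH_INFO", "")
        if full_path and full_path.endswith(relative):
            # Store the prefix so the app can still route on the relative path
            environ["SCRIPT_NAME"] = full_path[: len(full_path) - len(relative)]
        return wsgi_app(environ, start_response)
    return wrapper

def app(environ, start_response):
    # Dummy app that just echoes the derived prefix
    return [environ.get("SCRIPT_NAME", "").encode()]

wrapped = adapt_prefix(app)
body = wrapped(
    {"HTTP_X_ENVOY_ORIGINAL_PATH": "/gordo/v0/some-project-name/some-target/metadata",
     "PATH_INFO": "/metadata"},
    lambda status, headers: None,
)
print(body)  # [b'/gordo/v0/some-project-name/some-target']
```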
-
gordo.server.server.
build_app
(config: Optional[Dict[str, Any]] = None, prometheus_registry: Optional[prometheus_client.registry.CollectorRegistry] = None)[source]¶ Build app and any associated routes
-
gordo.server.server.
create_prometheus_metrics
(project: Optional[str] = None, registry: Optional[prometheus_client.registry.CollectorRegistry] = None) → gordo.server.prometheus.metrics.GordoServerPrometheusMetrics[source]¶
-
gordo.server.server.
run_cmd
(cmd)[source]¶ Run a shell command and handle CalledProcessError and OSError types
Note
This function is abstracted from
run_server()
in order to test the calling of commands that would allow the subprocess call to break, depending on how it is parameterized. For example, calling this without sending stderr to stdout will cause a segmentation fault when calling an executable that does not exist.
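The behaviour described in the note — folding stderr into stdout and handling both CalledProcessError and OSError — can be sketched as follows; this is a hypothetical stand-in, not Gordo’s run_cmd:

```python
import subprocess

def run_cmd(cmd):
    """Run a command, folding stderr into stdout (as the note above
    suggests), and normalize the two failure modes into a RuntimeError.
    A hypothetical stand-in, not Gordo's implementation."""
    try:
        return subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(f"{cmd} failed with code {exc.returncode}") from exc
    except OSError as exc:  # e.g. the executable does not exist
        raise RuntimeError(f"could not launch {cmd}") from exc

try:
    run_cmd(["definitely-not-a-real-binary"])
except RuntimeError as err:
    print("caught:", err)
```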
-
gordo.server.server.
run_server
(host: str, port: int, workers: int, log_level: str, config_module: Optional[str] = None, worker_connections: Optional[int] = None, threads: Optional[int] = None, worker_class: str = 'gthread', server_app: str = 'gordo.server.server:build_app()')[source]¶ Run application with Gunicorn server using Gevent Async workers
- Parameters
host (str) – The host to run the server on.
port (int) – The port to run the server on.
workers (int) – The number of worker processes for handling requests.
log_level (str) – The log level for the gunicorn webserver. Valid log level names can be found in the [gunicorn documentation](http://docs.gunicorn.org/en/stable/settings.html#loglevel).
config_module (str) – The config module. Will be passed with python: [prefix](https://docs.gunicorn.org/en/stable/settings.html#config).
worker_connections (int) – The maximum number of simultaneous clients per worker process.
threads (int) – The number of worker threads for handling requests.
worker_class (str) – The type of workers to use.
server_app (str) – The application to run
Views¶
A collection of implemented views into the Model being served.
Base¶
Provides the most basic view into the model. This view will
simply apply the model to the provided data and return the
model-output
along with the original-input
-
class
gordo.server.views.base.
BaseModelView
(api=None, *args, **kwargs)[source]¶ Bases:
flask_restplus.resource.Resource
The base model view.
-
X
: pandas.core.frame.DataFrame = None¶
-
endpoint
= 'base_model_view'¶
-
property
frequency
¶ The frequency the model was trained with in the dataset
-
mediatypes
()¶
-
methods
= ['POST']¶
-
post
()[source]¶ Process a POST request by using provided user data
A typical response might look like this
{
    'data': [
        {
            'end': ['2016-01-01T00:10:00+00:00'],
            'model-output': [0.0005317790200933814, -0.0001525811239844188,
                             0.0008310950361192226, 0.0015755111817270517],
            'original-input': [0.9135588550070414, 0.3472517774179448,
                               0.8994921857179736, 0.11982773108991263],
            'start': ['2016-01-01T00:00:00+00:00'],
        },
        ...
    ],
    'tags': [
        {'asset': None, 'name': 'tag-0'},
        {'asset': None, 'name': 'tag-1'},
        {'asset': None, 'name': 'tag-2'},
        {'asset': None, 'name': 'tag-3'}
    ],
    'time-seconds': '0.1937'
}
The input tags for this model
- Returns
- Return type
typing.List[SensorTag]
The target tags for this model
- Returns
- Return type
typing.List[SensorTag]
-
y
: pandas.core.frame.DataFrame = None¶
-
-
class
gordo.server.views.base.
DownloadModel
(api=None, *args, **kwargs)[source]¶ Bases:
flask_restplus.resource.Resource
Download the trained model
suitable for reloading via
gordo.serializer.serializer.loads()
-
endpoint
= 'download_model'¶
-
get
()[source]¶ Responds with a serialized copy of the current model being served.
- Returns
Results from
gordo.serializer.dumps()
- Return type
bytes
-
mediatypes
()¶
-
methods
= {'GET'}¶
-
-
class
gordo.server.views.base.
ExpectedModels
(api=None, *args, **kwargs)[source]¶ Bases:
flask_restplus.resource.Resource
-
endpoint
= 'expected_models'¶
-
mediatypes
()¶
-
methods
= {'GET'}¶
-
-
class
gordo.server.views.base.
MetaDataView
(api=None, *args, **kwargs)[source]¶ Bases:
flask_restplus.resource.Resource
Serve model / server metadata
-
endpoint
= 'meta_data_view'¶
-
mediatypes
()¶
-
methods
= {'GET'}¶
-
Anomaly¶
The anomaly view into the model. Expects that the model being served
when accessing this route implements the anomaly()
method
in order to calculate the anomaly key(s) for the response.
-
class
gordo.server.views.anomaly.
AnomalyView
(api=None, *args, **kwargs)[source]¶ Bases:
gordo.server.views.base.BaseModelView
Serve model predictions via POST method.
Gives back predictions looking something like this (depending on the anomaly model being served):
{
    'data': [
        {
            'end': ['2016-01-01T00:10:00+00:00'],
            'tag-anomaly-scaled': [0.913027075986948, 0.3474043585419292,
                                   0.8986610906818544, 0.11825221990818557],
            'tag-anomaly-unscaled': [10.2335327305725986948, 4.2343439583923293,
                                     10.379394390232232, 3.32093438982743929],
            'model-output': [0.0005317790200933814, -0.0001525811239844188,
                             0.0008310950361192226, 0.0015755111817270517],
            'original-input': [0.9135588550070414, 0.3472517774179448,
                               0.8994921857179736, 0.11982773108991263],
            'start': ['2016-01-01T00:00:00+00:00'],
            'total-anomaly-unscaled': [1.3326228173185086],
            'total-anomaly-scaled': [0.3020328328002392],
        },
        ...
    ],
    'tags': [{'asset': None, 'name': 'tag-0'},
             {'asset': None, 'name': 'tag-1'},
             {'asset': None, 'name': 'tag-2'},
             {'asset': None, 'name': 'tag-3'}],
    'time-seconds': '0.1937'
}
-
endpoint
= 'anomaly_view'¶
-
mediatypes
()¶
-
methods
= ['POST']¶
-
post
()[source]¶ Process a POST request by using provided user data
A typical response might look like this
{
    'data': [
        {
            'end': ['2016-01-01T00:10:00+00:00'],
            'model-output': [0.0005317790200933814, -0.0001525811239844188,
                             0.0008310950361192226, 0.0015755111817270517],
            'original-input': [0.9135588550070414, 0.3472517774179448,
                               0.8994921857179736, 0.11982773108991263],
            'start': ['2016-01-01T00:00:00+00:00'],
        },
        ...
    ],
    'tags': [
        {'asset': None, 'name': 'tag-0'},
        {'asset': None, 'name': 'tag-1'},
        {'asset': None, 'name': 'tag-2'},
        {'asset': None, 'name': 'tag-3'}
    ],
    'time-seconds': '0.1937'
}
-
Utils¶
Shared utility functions and decorators which are used by the Views
-
gordo.server.utils.
dataframe_from_dict
(data: dict) → pandas.core.frame.DataFrame[source]¶ The inverse procedure of
multi_lvl_column_dataframe_from_dict()
: reconstructs a MultiIndex-column dataframe from a previously serialized one.
Expects
data
to be a nested dictionary where each top-level key has a value capable of being loaded from
pandas.core.DataFrame.from_dict()
- Parameters
data (dict) – Data to be loaded into a MultiIndex column dataframe
- Returns
MultiIndex column dataframe.
- Return type
pandas.core.DataFrame
Examples
>>> serialized = {
...     'feature0': {'sub-feature-0': {'2019-01-01': 0, '2019-02-01': 4},
...                  'sub-feature-1': {'2019-01-01': 1, '2019-02-01': 5}},
...     'feature1': {'sub-feature-0': {'2019-01-01': 2, '2019-02-01': 6},
...                  'sub-feature-1': {'2019-01-01': 3, '2019-02-01': 7}}
... }
>>> dataframe_from_dict(serialized)
                 feature0                    feature1
            sub-feature-0 sub-feature-1 sub-feature-0 sub-feature-1
2019-01-01              0             1             2             3
2019-02-01              4             5             6             7
-
gordo.server.utils.
dataframe_from_parquet_bytes
(buf: bytes) → pandas.core.frame.DataFrame[source]¶ Convert bytes representing a parquet table into a pandas dataframe.
- Parameters
buf (bytes) – Bytes representing a parquet table. Can be the direct result from func::gordo.server.utils.dataframe_into_parquet_bytes
- Returns
- Return type
pandas.DataFrame
-
gordo.server.utils.
dataframe_into_parquet_bytes
(df: pandas.core.frame.DataFrame, compression: str = 'snappy') → bytes[source]¶ Convert a dataframe into bytes representing a parquet table.
- Parameters
df (pd.DataFrame) – DataFrame to be compressed
compression (str) – Compression to use, passed to
pyarrow.parquet.write_table()
- Returns
- Return type
bytes
-
gordo.server.utils.
dataframe_to_dict
(df: pandas.core.frame.DataFrame) → dict[source]¶ Convert a dataframe which can have a
pandas.MultiIndex
as columns into a dict, where each key is the top-level column name and the value is the array of columns under that top-level name. If it’s a simple dataframe,
pandas.core.DataFrame.to_dict()
will be used.
This allows
json.dumps()
to be performed, where
pandas.DataFrame.to_dict()
would convert such a multi-level column dataframe into keys of
tuple
objects, which are not JSON serializable. The resulting dict does, however, still work with
pandas.DataFrame.from_dict()
- Parameters
df (pandas.DataFrame) – Dataframe expected to have columns of type
pandas.MultiIndex
2 levels deep.- Returns
List of records representing the dataframe in a ‘flattened’ form.
- Return type
List[dict]
Examples
>>> import pprint
>>> import pandas as pd
>>> import numpy as np
>>> columns = pd.MultiIndex.from_tuples((f"feature{i}", f"sub-feature-{ii}") for i in range(2) for ii in range(2))
>>> index = pd.date_range('2019-01-01', '2019-02-01', periods=2)
>>> df = pd.DataFrame(np.arange(8).reshape((2, 4)), columns=columns, index=index)
>>> df
                 feature0                    feature1
            sub-feature-0 sub-feature-1 sub-feature-0 sub-feature-1
2019-01-01              0             1             2             3
2019-02-01              4             5             6             7
>>> serialized = dataframe_to_dict(df)
>>> pprint.pprint(serialized)
{'feature0': {'sub-feature-0': {'2019-01-01': 0, '2019-02-01': 4},
              'sub-feature-1': {'2019-01-01': 1, '2019-02-01': 5}},
 'feature1': {'sub-feature-0': {'2019-01-01': 2, '2019-02-01': 6},
              'sub-feature-1': {'2019-01-01': 3, '2019-02-01': 7}}}
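Why the nesting matters: json.dumps rejects the tuple keys that a MultiIndex produces, while the nested form serializes cleanly. A stdlib-only illustration (no pandas involved):

```python
import json

# Tuple keys, like those DataFrame.to_dict() yields for MultiIndex
# columns, are not JSON serializable:
flat = {("feature0", "sub-feature-0"): {"2019-01-01": 0}}
try:
    json.dumps(flat)
except TypeError as err:
    print("tuple keys fail:", err)

# Nesting the first tuple element as an outer key -- as dataframe_to_dict
# does -- makes the structure JSON-friendly:
nested = {}
for (top, sub), values in flat.items():
    nested.setdefault(top, {})[sub] = values
print(json.dumps(nested))
```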
-
gordo.server.utils.
extract_X_y
(method)[source]¶ For a given flask view, attempts to extract an ‘X’ and ‘y’ from the request and assign them to flask’s ‘g’ global request context
If it fails to extract ‘X’ and (optionally) ‘y’ from the request, it will not run the function but return a
BadRequest
response notifying the client of the failure.
- Parameters
method (Callable) – The flask route to decorate; it will return its own response object and is expected to use
flask.g.X
and/orflask.g.y
- Returns
Will either run a
flask.Response
with status code 400 if it fails to extract the X and optionally the y. Otherwise will run the decoratedmethod
which is also expected to return some sort offlask.Response
object.- Return type
flask.Response
-
gordo.server.utils.
find_path_in_dict
(path: List[str], data: dict) → Any[source]¶ Find a path in dict recursively
Examples
>>> find_path_in_dict(["parent", "child"], {"parent": {"child": 42}})
42
- Parameters
path (List[str]) –
data (dict) –
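An iterative sketch of such a path lookup (equivalent behaviour, not Gordo’s implementation):

```python
from typing import Any, List

def find_path_in_dict(path: List[str], data: dict) -> Any:
    """Walk nested dicts key by key -- a minimal sketch of the helper above."""
    value = data
    for key in path:
        value = value[key]  # raises KeyError if the path is absent
    return value

print(find_path_in_dict(["parent", "child"], {"parent": {"child": 42}}))  # 42
```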
-
gordo.server.utils.
load_metadata
(directory: str, name: str) → dict[source]¶ Load metadata from a directory for a given model by name.
- Parameters
directory (str) – Directory to look for the model’s metadata
name (str) – Name of the model to load metadata for, this would be the sub directory within the directory parameter.
- Returns
- Return type
dict
-
gordo.server.utils.
load_model
[source]¶ Load a given model from the directory by name.
- Parameters
directory (str) – Directory to look for the model
name (str) – Name of the model to load, this would be the sub directory within the directory parameter.
- Returns
- Return type
BaseEstimator
-
gordo.server.utils.
metadata_required
(f)[source]¶ Decorate a view which has
gordo_name
as a url parameter and will setg.metadata
to that model’s metadata
Model IO¶
The general model input/output operations applied by the views
-
gordo.server.model_io.
get_model_output
(model: sklearn.pipeline.Pipeline, X: numpy.ndarray) → numpy.ndarray[source]¶ Get the raw output from the current model given X. Will try to predict and then transform, raising an error if both fail.
- Parameters
X (np.ndarray) – 2d array of sample(s)
- Returns
The raw output of the model in numpy array form.
- Return type
np.ndarray
CLI¶
gordo CLI¶
Available CLIs for Gordo:
gordo¶
The main entry point for the CLI interface
gordo [OPTIONS] COMMAND [ARGS]...
Options
-
--version
¶
Show the version and exit.
-
--log-level
<log_level>
¶ Run workflow with custom log-level.
Environment variables
-
GORDO_LOG_LEVEL
Provide a default for
--log-level
build¶
Build a model and deposit it into ‘output_dir’ given the appropriate config settings.
gordo.machine.Machine.from_config
gordo build [OPTIONS] MACHINE_CONFIG [OUTPUT_DIR]
Options
-
--model-register-dir
<model_register_dir>
¶
-
--print-cv-scores
¶
Prints CV scores to stdout
-
--model-parameter
<model_parameter>
¶ Key-Value pair for a model parameter and its value; may use this option multiple times. Separate key and value by a comma, e.g.: --model-parameter key,val --model-parameter some_key,some_value
-
--exceptions-reporter-file
<exceptions_reporter_file>
¶ JSON output file for exception information
-
--exceptions-report-level
<exceptions_report_level>
¶ Details level for exception reporting
- Options
EXIT_CODE | TYPE | MESSAGE | TRACEBACK
Arguments
-
MACHINE_CONFIG
¶
Required argument
-
OUTPUT_DIR
¶
Optional argument
Environment variables
-
MACHINE
Provide a default for
MACHINE_CONFIG
-
OUTPUT_DIR
Provide a default for
OUTPUT_DIR
-
MODEL_REGISTER_DIR
Provide a default for
--model-register-dir
-
EXCEPTIONS_REPORTER_FILE
Provide a default for
--exceptions-reporter-file
-
EXCEPTIONS_REPORT_LEVEL
Provide a default for
--exceptions-report-level
run-server¶
Run the gordo server app with Gunicorn
gordo run-server [OPTIONS]
Options
-
--host
<host>
¶ The host to run the server on.
- Default
0.0.0.0
-
--port
<port>
¶ The port to run the server on.
- Default
5555
-
--workers
<workers>
¶ The number of worker processes for handling requests.
- Default
2
-
--worker-connections
<worker_connections>
¶ The maximum number of simultaneous clients per worker process.
- Default
50
-
--threads
<threads>
¶ The number of worker threads for handling requests. This argument only has effect with --worker-class=gthread. Default value is 8 (4 x $(NUM_CORES))
-
--worker-class
<worker_class>
¶ The type of workers to use.
- Default
gthread
-
--log-level
<log_level>
¶ The log level for the server.
- Default
debug
- Options
critical | error | warning | info | debug
-
--server-app
<server_app>
¶ The application to run
- Default
gordo.server.server:build_app()
-
--with-prometheus-config
¶
Run with custom config for prometheus
Environment variables
-
GORDO_SERVER_HOST
Provide a default for
--host
-
GORDO_SERVER_PORT
Provide a default for
--port
-
GORDO_SERVER_WORKERS
Provide a default for
--workers
-
GORDO_SERVER_WORKER_CONNECTIONS
Provide a default for
--worker-connections
-
GORDO_SERVER_THREADS
Provide a default for
--threads
-
GORDO_SERVER_WORKER_CLASS
Provide a default for
--worker-class
-
GORDO_SERVER_LOG_LEVEL
Provide a default for
--log-level
-
GORDO_SERVER_APP
Provide a default for
--server-app
workflow¶
gordo workflow [OPTIONS] COMMAND [ARGS]...
Machine Configuration to Argo Workflow
gordo workflow generate [OPTIONS]
Options
-
--machine-config
<machine_config>
¶ Required Machine configuration file
-
--workflow-template
<workflow_template>
¶ Template to expand
-
--owner-references
<owner_references>
¶ Kubernetes owner references to inject into all created resources. Should be a nonempty yaml/json list of owner-references, each owner-reference a dict containing at least the keys ‘uid’, ‘name’, ‘kind’, and ‘apiVersion’
-
--gordo-version
<gordo_version>
¶ Version of gordo to use, if different than this one
-
--project-name
<project_name>
¶ Required Name of the project which owns the workflow.
-
--project-revision
<project_revision>
¶ Revision of the project which owns the workflow.
-
--output-file
<output_file>
¶ Optional file to render to
-
--namespace
<namespace>
¶ Which namespace to deploy services into
-
--split-workflows
<split_workflows>
¶ Split workflows containing more than this number of models into several workflows, where each workflow contains at most this number of models. The workflows are output sequentially with '---' in between, which allows kubectl to apply them all at once.
-
--n-servers
<n_servers>
¶ Max number of ML Servers to use, defaults to N machines * 10
-
--docker-repository
<docker_repository>
¶ The docker repo to use for pulling component images from
-
--docker-registry
<docker_registry>
¶ The docker registry to use for pulling component images from
-
--retry-backoff-duration
<retry_backoff_duration>
¶ retryStrategy.backoff.duration for workflow steps
-
--retry-backoff-factor
<retry_backoff_factor>
¶ retryStrategy.backoff.factor for workflow steps
-
--gordo-server-workers
<gordo_server_workers>
¶ The number of worker processes for handling Gordo server requests.
-
--gordo-server-threads
<gordo_server_threads>
¶ The number of worker threads for handling requests.
-
--gordo-server-probe-timeout
<gordo_server_probe_timeout>
¶ timeoutSeconds value for livenessProbe and readinessProbe of Gordo server Deployment
-
--without-prometheus
¶
Do not deploy Prometheus for Gordo server monitoring
-
--prometheus-metrics-server-workers
<prometheus_metrics_server_workers>
¶ Number of workers for Prometheus metrics servers
-
--image-pull-policy
<image_pull_policy>
¶ Default imagePullPolicy for all gordo’s images
-
--with-keda
¶
Enable support for the KEDA autoscaler
-
--ml-server-hpa-type
<ml_server_hpa_type>
¶ HPA type for the ML server
- Options
none | k8s_cpu | keda
-
--custom-model-builder-envs
<custom_model_builder_envs>
¶ List of custom environment variables in
-
--prometheus-server-address
<prometheus_server_address>
¶ Prometheus URL. Required for "--ml-server-hpa-type=keda"
-
--keda-prometheus-metric-name
<keda_prometheus_metric_name>
¶ metricName value for the KEDA prometheus scaler
-
--keda-prometheus-query
<keda_prometheus_query>
¶ query value for the KEDA prometheus scaler
-
--keda-prometheus-threshold
<keda_prometheus_threshold>
¶ threshold value for the KEDA prometheus scaler
-
--resources-labels
<resources_labels>
¶ Additional labels for resources. Must be an empty string or a dictionary in JSON format
-
--server-termination-grace-period
<server_termination_grace_period>
¶ terminationGracePeriodSeconds for the gordo server
-
--server-target-cpu-utilization-percentage
<server_target_cpu_utilization_percentage>
¶ targetCPUUtilizationPercentage for gordo-server’s HPA
Environment variables
-
WORKFLOW_GENERATOR_MACHINE_CONFIG
Provide a default for
--machine-config
-
WORKFLOW_GENERATOR_OWNER_REFERENCES
Provide a default for
--owner-references
-
WORKFLOW_GENERATOR_GORDO_VERSION
Provide a default for
--gordo-version
-
WORKFLOW_GENERATOR_PROJECT_NAME
Provide a default for
--project-name
-
WORKFLOW_GENERATOR_PROJECT_REVISION
Provide a default for
--project-revision
-
WORKFLOW_GENERATOR_OUTPUT_FILE
Provide a default for
--output-file
-
WORKFLOW_GENERATOR_NAMESPACE
Provide a default for
--namespace
-
WORKFLOW_GENERATOR_SPLIT_WORKFLOWS
Provide a default for
--split-workflows
-
WORKFLOW_GENERATOR_N_SERVERS
Provide a default for
--n-servers
-
WORKFLOW_GENERATOR_DOCKER_REPOSITORY
Provide a default for
--docker-repository
-
WORKFLOW_GENERATOR_DOCKER_REGISTRY
Provide a default for
--docker-registry
-
WORKFLOW_GENERATOR_RETRY_BACKOFF_DURATION
Provide a default for
--retry-backoff-duration
-
WORKFLOW_GENERATOR_RETRY_BACKOFF_FACTOR
Provide a default for
--retry-backoff-factor
-
WORKFLOW_GENERATOR_GORDO_SERVER_WORKERS
Provide a default for
--gordo-server-workers
-
WORKFLOW_GENERATOR_GORDO_SERVER_THREADS
Provide a default for
--gordo-server-threads
-
WORKFLOW_GENERATOR_GORDO_SERVER_PROBE_TIMEOUT
Provide a default for
--gordo-server-probe-timeout
-
WORKFLOW_GENERATOR_WITHOUT_PROMETHEUS
Provide a default for
--without-prometheus
-
WORKFLOW_GENERATOR_PROMETHEUS_METRICS_SERVER_WORKERS
Provide a default for
--prometheus-metrics-server-workers
-
WORKFLOW_GENERATOR_IMAGE_PULL_POLICY
Provide a default for
--image-pull-policy
-
WORKFLOW_GENERATOR_WITH_KEDA
Provide a default for
--with-keda
-
WORKFLOW_GENERATOR_ML_SERVER_HPA_TYPE
Provide a default for
--ml-server-hpa-type
-
WORKFLOW_GENERATOR_CUSTOM_MODEL_BUILDER_ENVS
Provide a default for
--custom-model-builder-envs
-
WORKFLOW_GENERATOR_PROMETHEUS_SERVER_ADDRESS
Provide a default for
--prometheus-server-address
-
WORKFLOW_GENERATOR_KEDA_PROMETHEUS_METRIC_NAME
Provide a default for
--keda-prometheus-metric-name
-
WORKFLOW_GENERATOR_KEDA_PROMETHEUS_QUERY
Provide a default for
--keda-prometheus-query
-
WORKFLOW_GENERATOR_KEDA_PROMETHEUS_THRESHOLD
Provide a default for
--keda-prometheus-threshold
-
WORKFLOW_GENERATOR_RESOURCE_LABELS
Provide a default for
--resources-labels
-
WORKFLOW_GENERATOR_SERVER_TERMINATION_GRACE_PERIOD
Provide a default for
--server-termination-grace-period
-
WORKFLOW_GENERATOR_SERVER_TARGET_CPU_UTILIZATION_PERCENTAGE
Provide a default for
--server-target-cpu-utilization-percentage
Workflow¶
The workflow component is responsible for converting a Gordo config into an Argo workflow which then runs the various components in order to build and serve the ML models.
Normalized Config¶
-
class
gordo.workflow.config_elements.normalized_config.
NormalizedConfig
(config: dict, project_name: str, gordo_version: Optional[str] = None, model_builder_env: Optional[dict] = None)[source]¶ Bases:
object
Handles the conversion of a single Machine representation in config format, filling in any keys which are ‘left out’ with values from the
globals
key or the default config globals held here.
-
DEFAULT_CONFIG_GLOBALS
: Dict[str, Any] = {'evaluation': {'cv_mode': 'full_build', 'metrics': ['explained_variance_score', 'r2_score', 'mean_squared_error', 'mean_absolute_error'], 'scoring_scaler': 'sklearn.preprocessing.MinMaxScaler'}, 'runtime': {'builder': {'remote_logging': {'enable': False}, 'resources': {'limits': {'cpu': 1001, 'memory': 31200}, 'requests': {'cpu': 1001, 'memory': 3900}}}, 'client': {'max_instances': 30, 'resources': {'limits': {'cpu': 2000, 'memory': 4000}, 'requests': {'cpu': 100, 'memory': 3500}}}, 'influx': {'enable': True}, 'prometheus_metrics_server': {'resources': {'limits': {'cpu': 200, 'memory': 1000}, 'requests': {'cpu': 100, 'memory': 200}}}, 'reporters': [], 'server': {'resources': {'limits': {'cpu': 2000, 'memory': 6000}, 'requests': {'cpu': 1000, 'memory': 3000}}}}}¶
-
SPLITED_DOCKER_IMAGES
: Dict[str, Any] = {'runtime': {'builder': {'image': 'gordo-model-builder'}, 'client': {'image': 'gordo-client'}, 'deployer': {'image': 'gordo-deploy'}, 'prometheus_metrics_server': {'image': 'gordo-model-server'}, 'server': {'image': 'gordo-model-server'}}}¶
-
UNIFIED_DOCKER_IMAGES
: Dict[str, Any] = {'runtime': {'builder': {'image': 'gordo-base'}, 'client': {'image': 'gordo-base'}, 'deployer': {'image': 'gordo-base'}, 'prometheus_metrics_server': {'image': 'gordo-base'}, 'server': {'image': 'gordo-base'}}}¶
-
UNIFYING_GORDO_VERSION
: str = '1.2.0'¶
-
Workflow Generator¶
Workflow loading/processing functionality to help the CLI ‘workflow’ sub-command.
-
gordo.workflow.workflow_generator.workflow_generator.
default_image_pull_policy
(gordo_version: gordo.util.version.Version) → str[source]¶
-
gordo.workflow.workflow_generator.workflow_generator.
get_dict_from_yaml
(config_file: Union[str, _io.StringIO]) → dict[source]¶ Read a YAML config file or file-like object into a dict
-
gordo.workflow.workflow_generator.workflow_generator.
load_workflow_template
(workflow_template: str) → jinja2.environment.Template[source]¶ Loads the Jinja2 Template from a specified path
- Parameters
workflow_template (str) – Path to a workflow template
- Returns
Loaded but non-rendered jinja2 template for the workflow
- Return type
jinja2.Template
Helpers¶
-
gordo.workflow.workflow_generator.helpers.
patch_dict
(original_dict: dict, patch_dictionary: dict) → dict[source]¶ Patches a dict with another. Patching means that any path defined in the patch is either added (if it does not exist) or replaces the existing value (if it exists). Nothing is removed from the original dict, only added/replaced.
- Parameters
original_dict (dict) – Base dictionary which will get paths added/changed
patch_dictionary (dict) – Dictionary which will be overlaid on top of original_dict
Examples
>>> patch_dict({"highKey":{"lowkey1":1, "lowkey2":2}}, {"highKey":{"lowkey1":10}})
{'highKey': {'lowkey1': 10, 'lowkey2': 2}}
>>> patch_dict({"highKey":{"lowkey1":1, "lowkey2":2}}, {"highKey":{"lowkey3":3}})
{'highKey': {'lowkey1': 1, 'lowkey2': 2, 'lowkey3': 3}}
>>> patch_dict({"highKey":{"lowkey1":1, "lowkey2":2}}, {"highKey2":4})
{'highKey': {'lowkey1': 1, 'lowkey2': 2}, 'highKey2': 4}
- Returns
A new dictionary which is the result of overlaying patch_dictionary on top of original_dict
- Return type
dict
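The overlay behaviour described above can be sketched as a small recursive merge. The following is an illustrative reimplementation, not the library's actual source, and the function name is hypothetical:

```python
from copy import deepcopy


def patch_dict_sketch(original: dict, patch: dict) -> dict:
    """Recursively overlay `patch` on top of `original` without mutating either."""
    result = deepcopy(original)
    for key, value in patch.items():
        # Recurse only when both sides hold dicts; otherwise the patch value wins.
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = patch_dict_sketch(result[key], value)
        else:
            result[key] = value
    return result
```

This matches the doctests above: sibling keys in the original survive, and patched leaves replace existing values.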
Util¶
Project helpers, and associated functionality which have no home.
Disk Registry¶
-
gordo.util.disk_registry.
delete_value
(registry_dir: Union[os.PathLike, str], key: str) → bool[source]¶ Deletes the value stored under key from the registry, and returns True if it existed.
- Parameters
registry_dir (Union[os.PathLike, str]) – Path to the registry. Does not need to exist
key (str) – Key to look up in the registry.
- Returns
True if the key existed, false otherwise
- Return type
bool
-
gordo.util.disk_registry.
get_value
(registry_dir: Union[os.PathLike, str], key: str) → Optional[AnyStr][source]¶ Retrieves the value stored under key from the registry, or None if it does not exist.
- Parameters
registry_dir (Union[os.PathLike, str]) – Path to the registry. If it does not exist we return None
key (str) – Key to look up in the registry.
- Returns
The value of key in the registry, None if no value is registered with that key in the registry.
- Return type
Optional[AnyStr]
-
gordo.util.disk_registry.
logger
= <Logger gordo.util.disk_registry (WARNING)>¶ A simple file-based key/value registry. Each key gets a file with filename = key, and the content of the file is the value. Nothing fancy. Why? It is simple, and there are no problems with concurrent writes to different keys. Concurrent writes to the same key will break stuff.
-
gordo.util.disk_registry.
write_key
(registry_dir: Union[os.PathLike, str], key: str, val: AnyStr)[source]¶ Registers a key/value pair in the registry. The key must be valid as a filename.
- Parameters
registry_dir (Union[os.PathLike, str]) – Path to the registry. If it does not exist it will be created, including any missing folders in the path.
key (str) – Key to use for the key/value. Must be valid as a filename.
val (AnyStr) – Value to write to the registry.
Examples
In the following example we use a temp directory as the registry
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     write_key(tmpdir, "akey", "aval")
...     get_value(tmpdir, "akey")
'aval'
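The one-file-per-key scheme described by the module docstring can be sketched in a few lines of Python. This is an illustrative reimplementation of the three functions, not the module's actual source:

```python
from pathlib import Path
from typing import Optional


def write_key(registry_dir, key: str, val: str) -> None:
    """Create the registry dir if needed and write `val` into a file named `key`."""
    path = Path(registry_dir)
    path.mkdir(parents=True, exist_ok=True)
    (path / key).write_text(val)


def get_value(registry_dir, key: str) -> Optional[str]:
    """Return the stored value, or None if the key (or registry) does not exist."""
    file = Path(registry_dir) / key
    return file.read_text() if file.is_file() else None


def delete_value(registry_dir, key: str) -> bool:
    """Delete the key's file, returning True if it existed."""
    file = Path(registry_dir) / key
    if file.is_file():
        file.unlink()
        return True
    return False
```

Because each key maps to its own file, concurrent writes to different keys are independent filesystem operations, which is the design rationale given above.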
Utils¶
-
gordo.util.utils.
capture_args
(method: Callable)[source]¶ Decorator that captures the args and kwargs passed to a given method. It assumes the decorated method takes self as its first parameter; the captured arguments are assigned to that object as a dict attribute named _params.
- Parameters
method (Callable) – Some method of an object, with ‘self’ as the first parameter.
- Returns
Returns whatever the original method would return
- Return type
Any
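A minimal sketch of such a decorator, assuming it binds the call's arguments to parameter names via the standard inspect module (the real implementation may differ):

```python
import functools
import inspect


def capture_args(method):
    """Store the call's arguments on `self._params`, then run the method."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        # Bind positional/keyword arguments to the method's parameter names,
        # fill in defaults, and drop `self` so only real parameters remain.
        bound = inspect.signature(method).bind(self, *args, **kwargs)
        bound.apply_defaults()
        params = dict(bound.arguments)
        params.pop("self", None)
        self._params = params
        return method(self, *args, **kwargs)
    return wrapper


# Hypothetical usage: capture constructor parameters for later serialization.
class Model:
    @capture_args
    def __init__(self, n_layers=3, activation="relu"):
        self.n_layers = n_layers
        self.activation = activation
```

With this sketch, `Model(n_layers=5)._params` would be `{"n_layers": 5, "activation": "relu"}`, which is useful when persisting model parameters as metadata.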