Ownership

In many organizations, data science and engineering are responsible for different parts of data science tooling and infrastructure. With a system like OpsML it is therefore reasonable to ask "who owns what?" and "where do data science and engineering fit into the lifecycle, use, and management of a system like this?". The goal of OpsML is to provide an interface to infrastructure that data scientists use and engineering owns and controls. The diagram below shows one example of this separation of concerns.

Architecture of OpsML Proxy Setup

General Setup

Note - This is an example setup to give an overview of the separation between data science and engineering

OpsML is packaged and deployed as a proxy system: a server is set up and exposed through a callable API, and data scientists point to it by setting an environment variable (OPSML_TRACKING_URI). When data scientists use OpsML, they interact with the server through a client API that is automatically configured when OpsML is loaded in Python.
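As a minimal sketch of the client-side configuration, the tracking URI can be validated and exported before OpsML is imported. Only the OPSML_TRACKING_URI variable name comes from the text above. The helper `configure_tracking_uri` and the URL are hypothetical and are not part of OpsML.

```python
import os
from urllib.parse import urlparse


def configure_tracking_uri(uri: str) -> str:
    """Validate an http(s) server URI and export it as OPSML_TRACKING_URI.

    Hypothetical helper for illustration; not part of the OpsML API.
    """
    parsed = urlparse(uri)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Expected an http(s) URL, got: {uri!r}")
    # Set the variable before OpsML is imported so the client can pick it up.
    os.environ["OPSML_TRACKING_URI"] = uri
    return uri


# Hypothetical internal address; replace with the URL engineering provides.
configure_tracking_uri("https://opsml.internal.example.com")
```

In practice the variable is more likely set once in the shell profile or compute environment than in code. The check here only guards against an obviously malformed value.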

Parts Owned by Engineering

  • OpsML server that is packaged into a Docker container and deployed through K8s
  • Storage system (local or cloud) that will be used to store ArtifactCard artifacts (models, data, figures, etc.)
  • Database that will be used to store ArtifactCard metadata. This will typically be a MySQL or PostgreSQL database
  • K8s and compute infrastructure for hosting applications
  • CI/CD build process

Other Considerations

  • In this scenario it is expected that the infrastructure hosting the OpsML server is also responsible for authentication and security. As an example, the host system may be placed on an internal network that is only accessible via authentication through a VPN. OpsML was built to be an ML tooling interface, not a security system. Thus, security should be configured on the host system.
  • Credentialing for external systems (storage, databases, etc.) should also be configured and embedded in the environment that hosts the OpsML server. This enables engineering to limit and control the credentials needed for OpsML. It also eliminates the need for data scientists to specify credentials when working with OpsML (apart from security authentication). Note - OpsML does support basic auth (single username and password), and we may expand this into an optional full authentication system in the future if there is a need.

Scenario 1: DS Workflows

This scenario covers the typical data science workflow and tasks that include exploratory analysis, model training and model evaluation. As part of this workflow, a data scientist will produce various ArtifactCards and store their attributes/metadata through client/server communication.

Scenario 2: Model Deployment

In this scenario, a data scientist or ML engineer creates the custom API logic for their model (FastAPI, for example) and specifies the resources to deploy in a custom configuration or specification file. For this example, assume the engineering repository has set up an automated process whereby changes to the configuration file and pushes/tags trigger a CI/CD process that builds and serves a new model API. Upon build kickoff, the model specified in the configuration file is downloaded from the OpsML server and packaged along with the API code into a Docker container. This Docker container is then deployed on K8s, where the API is served and ready for requests.
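The format of the configuration file is left to each team, so the following is only a sketch of what such a file might carry and how a CI/CD step could validate it before downloading the model. The `DeploySpec` class, its field names, and the example values are all hypothetical and are not defined by OpsML.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploySpec:
    """Hypothetical deployment spec; field names are illustrative only."""

    model_name: str
    model_version: str
    cpu: str
    memory: str
    replicas: int = 1


def load_spec(text: str) -> DeploySpec:
    """Parse a JSON spec and fail early if required fields are missing."""
    raw = json.loads(text)
    missing = {"model_name", "model_version", "cpu", "memory"} - raw.keys()
    if missing:
        raise ValueError(f"Missing required fields: {sorted(missing)}")
    return DeploySpec(**raw)


# Example spec a data scientist might commit alongside the API code.
spec = load_spec(
    '{"model_name": "churn-model", "model_version": "1.2.0",'
    ' "cpu": "500m", "memory": "1Gi"}'
)
```

Validating the spec at the start of the build means a missing model name or version fails the pipeline before any artifact is downloaded or any image is built.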

Environment

It is recommended to set up OpsML in each of your environments (dev/staging and prod). This differs slightly from other DS tooling packages and was done to follow best practices in systems/infra design. As a result, you will have separate registries across environments that are not linked. Versions across staging and prod may therefore be out of sync, which is not necessarily an issue, since prod should be the environment used to train and deploy prod model artifacts.

Limit write access in prod

By design, so long as a data scientist has an OPSML_TRACKING_URI, they can read and write objects to the OpsML server. However, we usually don't want anyone to write/update a prod artifact from a non-prod environment. As an added measure of security, only requests coming from a prod environment are allowed to write/update prod artifacts (anything can be read). This is accomplished through a verify_token dependency that checks for an OPSML_PROD_TOKEN token in your request and matches it against the OPSML_PROD_TOKEN in the prod environment. Note - This is only checked if APP_ENV is set to production.
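The check described above can be sketched in plain Python as follows. This is not OpsML's actual verify_token implementation. The function `write_allowed` and its signature are hypothetical, and only the OPSML_PROD_TOKEN and APP_ENV names come from the text.

```python
import hmac
import os
from typing import Mapping, Optional


def write_allowed(
    request_token: Optional[str],
    server_env: Mapping[str, str] = os.environ,
) -> bool:
    """Return True if a write/update request should be accepted.

    Hypothetical stand-in for the server-side token check.
    """
    # The token is only checked when the server runs in production.
    if server_env.get("APP_ENV") != "production":
        return True
    expected = server_env.get("OPSML_PROD_TOKEN")
    # Reject if the server has no token configured or the request sent none.
    if not expected or not request_token:
        return False
    # Constant-time comparison avoids leaking token contents through timing.
    return hmac.compare_digest(request_token.encode(), expected.encode())
```

Passing the environment as a mapping keeps the sketch easy to exercise with a plain dictionary, for example `write_allowed("abc", {"APP_ENV": "production", "OPSML_PROD_TOKEN": "abc"})`.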

For this functionality to work, you will need to set the OPSML_PROD_TOKEN env var in both the production compute environment that your data scientists use to train models and the production environment that hosts the OpsML server. Once these are set, OpsML will take care of the rest. It's also recommended that you use APP_ENV as the env var that specifies the current environment (dev, staging, production).