...
Security Service is a REST-based service for user authentication against LDAP, LDAP + AWS, or Azure OAuth2, depending on the module configuration and cloud provider. LDAP-only mode provides an authentication endpoint that verifies the authenticity of users against an LDAP instance. If you use the AWS cloud provider, LDAP + AWS authentication can be useful, as it combines LDAP authentication with a check that the user has at least one role in the AWS account.
DataLab provides an OAuth2 (client credentials and authorization code flow) security authorization mechanism for Azure users. This kind of authentication is required when you are going to use Data Lake. If Data Lake is not enabled, you have two options: LDAP or OAuth2. If OAuth2 is in use, Security Service validates the user's permissions against the configured permission scope (a resource in Azure). If Data Lake is enabled, the default permission scope (which can be reconfigured manually after DataLab is deployed) is the Data Lake Store account, so a user is allowed to log in only if he/she has at least one role in the scope of the Data Lake Store account resource. If Data Lake is disabled but Azure OAuth2 is in use, the default permission scope is the resource group where DataLab is created, and only users who have at least one role in that resource group are allowed to log in.
...
Create Windows server certificate
Pay attention that the last command has to be executed with administrative permissions, so the command line (cmd) should be run as administrator.
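The exact commands are environment specific; a minimal sketch with keytool (key alias, passwords and key size below are assumptions) could look like the following. The last command imports the certificate into the JRE trust store (the cacerts path depends on your JDK layout) and is the one that requires administrative permissions:

```
keytool -genkeypair -alias ssn -keyalg RSA -keysize 2048 -storepass KEYSTORE_PASSWORD -keypass KEYSTORE_PASSWORD -keystore <DRIVE_LETTER>:\home\%USERNAME%\keys\ssn.keystore.jks -dname "CN=localhost"
keytool -exportcert -alias ssn -storepass KEYSTORE_PASSWORD -file <DRIVE_LETTER>:\home\%USERNAME%\keys\ssn.crt -keystore <DRIVE_LETTER>:\home\%USERNAME%\keys\ssn.keystore.jks
keytool -importcert -trustcacerts -alias ssn -file <DRIVE_LETTER>:\home\%USERNAME%\keys\ssn.crt -noprompt -storepass changeit -keystore "%JAVA_HOME%\jre\lib\security\cacerts"
```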
Where <DRIVE_LETTER> must be the drive letter of the drive where you run DataLab.
How to run locally
...
It is possible to run Self-Service and Provisioning Service locally. All requests from Provisioning Service to Docker are mocked, and the instance creation status is persisted to Mongo (without any real impact on Docker and AWS). Security Service cannot be run on a local machine because of the complexity of mocking LDAP locally.
Both services, Self-Service and Provisioning Service, depend on the datalab/provisioning-infrastructure/ssn/templates/ssn.yml configuration file. Each service has a main class as its entry point: SelfServiceApplication for Self-Service and ProvisioningServiceApplication for Provisioning Service. The services can be started by running the main methods of these classes. Both main methods require two arguments:
- Run mode (“server”)
- Configuration file name (“self-service.yml” or “provisioning.yml”, depending on the service). Both files are located in the root directory of the corresponding service. These configuration files contain the service settings and are ready to use.
The service start-up order matters: since Self-Service depends on Provisioning Service, the latter should be started first and Self-Service afterwards. The services can be started using the “Run” functionality of a local IDE (Eclipse or IntelliJ IDEA).
The application run flow is the following:
- Create and run provisioning-service configuration:
- Create Application with name provisioning-service-application
- Main class:
com.epam.datalab.backendapi.ProvisioningServiceApplication
- VM options:
-Ddocker.dir=[PATH_TO_PROJECT_DIR]\infrastructure-provisioning\src\general\files\gcp
- Program arguments:
server [PATH_TO_PROJECT_DIR]\services\provisioning-service\provisioning.yml
- Working directory:
[PATH_TO_PROJECT_DIR]
- Use classpath of module:
provisioning-service
- PAY ATTENTION: the JRE should be the same JRE in which the server certificate was created
- Create and run self-service configuration:
- Create Application with name self-service-application
- Main class:
com.epam.datalab.backendapi.SelfServiceApplication
- Program arguments:
server [PATH_TO_PROJECT_DIR]/services/self-service/self-service.yml
- Working directory:
[PATH_TO_PROJECT_DIR]
- Use classpath of module:
self-service
- PAY ATTENTION: the JRE should be the same JRE in which the server certificate was created
- Try to access the Self-Service Web UI at https://localhost:8443
Infrastructure provisioning
...
DevOps components overview
The following list shows the common structure of the scripts for deploying DataLab.
Folder structure
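Roughly, the source tree looks like this (a sketch reconstructed from the directory list below):

```
infrastructure-provisioning/src/
├── base/
├── edge/
├── dataengine/
├── dataengine-service/
├── general/
├── ssn/
├── jupyter/
├── rstudio/
├── zeppelin/
├── tensor/
└── deeplearning/
```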
Each directory, except general, contains the Python scripts, Docker files, templates and other files for the corresponding Docker image.
- base – Main Docker image. It is a common/base image for other ones.
- edge – Docker image for Edge node.
- dataengine – Docker image for dataengine cluster.
- dataengine-service – Docker image for dataengine-service cluster.
- general – OS and CLOUD dependent common source.
- ssn – Docker image for Self-Service node (SSN).
- jupyter/rstudio/zeppelin/tensor/deeplearning – Docker images for Notebook nodes.
All Python scripts, Docker files and other files located in these directories are OS and CLOUD independent.
Scripts, functions and files that are OS or CLOUD dependent, or that are shared by several templates, are located in the general directory.
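A rough outline of the general directory (subdirectory names are taken from the table below; the exact layout may differ):

```
general/
├── api/        # API scripts executed by entrypoint.py
├── files/      # cloud specific Dockerfiles and description files
├── lib/        # libraries: aws/, os/, ... (cloud and OS dependent functions)
└── scripts/    # cloud and OS specific scripts: aws/, os/, ...
```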
These directories may be further divided by operating system (Debian/RedHat) or cloud provider (AWS).
Directories of templates (SSN, Edge etc.) contain only scripts, which are OS and CLOUD independent.
If script/function is OS or CLOUD dependent, it should be located in appropriate directory/library in general folder.
The following table describes the most frequently used scripts:
Script name/Path | Description |
---|---|
Dockerfile | Used for building Docker images; defines which Python scripts, templates and other files are needed. Required for each template. |
base/entrypoint.py | Executed by Docker. Responsible for setting the environment variables passed from Docker and for executing the appropriate action (a script in general/api/). |
base/scripts/*.py | Scripts that are OS independent and are used in every template. |
general/api/*.py | API scripts that execute the appropriate function from fabfile.py. |
template_name/fabfile.py | The main file of a template; contains all functions that can be used as template actions. |
template_name/scripts/*.py | Python scripts used by the template. They are OS and CLOUD independent. |
general/lib/aws/*.py | Contains all functions related to AWS. |
general/lib/os/ | This directory is divided by OS type. All OS dependent functions are located here. |
general/lib/os/fab.py | Contains OS independent functions used by multiple templates. |
general/scripts/ | Directory divided by cloud provider and OS type. |
general/scripts/aws/*.py | AWS-specific scripts executed from fabfiles. The first part of the file name defines which template the script relates to. For example: common_*.py – can be executed from more than one template; ssn_*.py – used for the SSN template; edge_*.py – used for the Edge template. |
general/scripts/os/*.py | Scripts that are OS independent and can be executed from more than one template. |
Docker actions overview
Available Docker images and their actions:
Docker image | Actions |
---|---|
ssn | create, terminate |
edge | create, terminate, status, start, stop, recreate |
jupyter/rstudio/zeppelin/tensor/deeplearning | create, terminate, start, stop, configure, list_libs, install_libs, git_creds |
dataengine/dataengine-service | create, terminate |
Docker and Python execution workflow, using the SSN node as an example
- Docker commands for building the docker.datalab-base and docker.datalab-ssn images:
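A sketch of these commands, assuming they are run from infrastructure-provisioning/src/ for the AWS provider (exact flags may differ):

```
docker build --file general/files/aws/base_Dockerfile -t docker.datalab-base .
docker build --file general/files/aws/ssn_Dockerfile -t docker.datalab-ssn .
```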
Example of SSN Docker file:
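An illustrative sketch of such a Dockerfile (the real file is general/files/<cloud_provider>/ssn_Dockerfile; the contents below are assumed):

```
FROM docker.datalab-base:latest

COPY ssn/ /root/
COPY general/scripts/aws/ssn_* /root/scripts/
COPY general/templates/aws/ /root/templates/

RUN chmod a+x /root/fabfile.py && \
    chmod a+x /root/scripts/*
```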
Using this Docker file, all required scripts and files are copied to the Docker container.
- Docker command for building SSN:
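Presumably this is a docker run invocation that launches the SSN image with the create action; a minimal sketch (volume mounts and parameters are assumptions; a fuller example is given in "Example of Docker commands" below):

```
docker run -i -v /home/ubuntu/keys:/root/keys -v /opt/datalab/tmp/result:/response -e "conf_service_base_name=datalab" docker.datalab-ssn --action create
```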
- Docker executes the entrypoint.py script with the create action. entrypoint.py sets the environment variables provided by Docker and executes the general/api/create.py script:
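A self-contained sketch of the entrypoint logic (the argument handling below is an assumption, not the actual DataLab source):

```python
#!/usr/bin/env python3
# Sketch: export the key=value variables passed in by Docker to the environment
# and run the script from general/api/ that matches the requested action.
import argparse
import os
import subprocess

parser = argparse.ArgumentParser()
parser.add_argument('--action', required=True)        # e.g. "create"
parser.add_argument('--vars', nargs='*', default=[])  # e.g. "conf_os_family=debian"
args = parser.parse_args()

for pair in args.vars:
    key, value = pair.split('=', 1)
    os.environ[key] = value                            # make the variable visible to child scripts

# execute the script that implements the requested action, e.g. /root/api/create.py
subprocess.check_call(['python3', '/root/api/{}.py'.format(args.action)])
```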
- general/api/create.py executes a Fabric command with the run action:
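A minimal sketch of what general/api/create.py amounts to, assuming it simply triggers the run task of the template's fabfile (the actual implementation may call Fabric's API directly):

```python
#!/usr/bin/env python3
# Sketch: invoke the "run" task of the fabfile that was copied to /root/
import subprocess
import sys

if __name__ == '__main__':
    result = subprocess.run('cd /root; fab run', shell=True)
    sys.exit(result.returncode)
```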
- The run() function in ssn/fabfile.py is executed. It runs two scripts: general/scripts/aws/ssn_prepare.py and general/scripts/aws/ssn_configure.py:
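A simplified sketch of run() (paths and error handling are assumptions; the real function passes more parameters):

```python
# Sketch of run() in ssn/fabfile.py
import logging
import subprocess
import sys

def run():
    # prepare: create the cloud resources needed for the SSN node
    if subprocess.call(['python3', '/root/scripts/ssn_prepare.py']) != 0:
        logging.error('Failed to prepare SSN node')
        sys.exit(1)
    # configure: install prerequisites, packages, Docker and the DataLab Web UI
    if subprocess.call(['python3', '/root/scripts/ssn_configure.py']) != 0:
        logging.error('Failed to configure SSN node')
        sys.exit(1)
```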
- The scripts general/scripts/<cloud_provider>/ssn_prepare.py and general/scripts/<cloud_provider>/ssn_configure.py execute other Python scripts/functions for:
- ssn_prepare.py: 1. Creating the configuration file (for AWS) 2. Creating cloud resources.
- ssn_configure.py: 1. Installing prerequisites 2. Installing required packages 3. Configuring Docker 4. Configuring DataLab Web UI
- If all scripts/functions are executed successfully, the Docker container stops and the SSN node is created.
Example of Docker commands
SSN:
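An illustrative invocation (the set of -e parameters below is a partial assumption; the authoritative list is in the section referenced beneath):

```
docker run -i \
    -v /home/ubuntu/keys:/root/keys \
    -v /opt/datalab/tmp/result:/response \
    -e "conf_os_family=debian" \
    -e "conf_cloud_provider=aws" \
    -e "conf_service_base_name=datalab" \
    -e "conf_key_name=KEYNAME" \
    -e "aws_region=us-west-2" \
    docker.datalab-ssn --action create
```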
All parameters are listed in the "Self-Service Node" section.
Other images:
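A similar, illustrative invocation for a notebook image (parameters are assumptions):

```
docker run -i \
    -v /home/ubuntu/keys:/root/keys \
    -v /opt/datalab/tmp/result:/response \
    -e "conf_os_family=debian" \
    -e "conf_service_base_name=datalab" \
    -e "notebook_instance_name=datalab-jupyter" \
    docker.datalab-jupyter --action create
```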
How to add a new template
First of all, a new directory should be created in infrastructure-provisioning/src/.
For example: infrastructure-provisioning/src/my-tool/
The following scripts/directories are required to be created in the template directory:
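At a minimum the layout looks like this (shown for the hypothetical my-tool template):

```
infrastructure-provisioning/src/my-tool/
├── fabfile.py      # main template actions: run, stop, terminate, ...
└── scripts/        # configuration scripts for this template
```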
fabfile.py – the main script, which contains main functions for this template such as run, stop, terminate, etc.
Here is example of run() function for Jupyter Notebook node:
Path: infrastructure-provisioning/src/jupyter/fabfile.py
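A simplified sketch of this function (script names and parameters are assumptions based on the description below; the real function adds logging and error handling):

```python
# Sketch of run() in jupyter/fabfile.py
import subprocess
import sys
import uuid

def run():
    notebook_id = str(uuid.uuid4())[:5]
    # prepare: create cloud resources, common for all notebook templates
    if subprocess.call(['python3', '/root/scripts/common_prepare_notebook.py',
                        '--uuid', notebook_id]) != 0:
        sys.exit(1)
    # configure: install and configure Jupyter on the created instance
    if subprocess.call(['python3', '/root/scripts/jupyter_configure.py',
                        '--uuid', notebook_id]) != 0:
        sys.exit(1)
```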
This function describes the process of creating a Jupyter node. It is divided into two parts – prepare and configure. The prepare part is common for all notebook templates and is responsible for creating the necessary cloud resources, such as EC2 instances. The configure part describes how the appropriate services are installed.
To configure the Jupyter node, the script jupyter_configure.py is executed. It describes the steps for configuring the Jupyter node; in each step, the appropriate Python script is executed.
For example:
Path: infrastructure-provisioning/src/general/scripts/aws/jupyter_configure.py
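An illustrative sketch of one such step (hostnames, key paths and the OS user below are placeholders, not the actual DataLab source):

```python
# Sketch of a single step in jupyter_configure.py
import logging
import subprocess
import sys

instance_hostname = '10.0.1.15'        # placeholder: private IP of the notebook instance
keyfile = '/root/keys/KEYNAME.pem'     # placeholder: SSH key used for configuration
os_user = 'datalab-user'               # placeholder: OS user on the notebook instance

try:
    params = '--hostname {} --keyfile {} --os_user {}'.format(instance_hostname, keyfile, os_user)
    subprocess.run('python3 ~/scripts/configure_jupyter_node.py ' + params, shell=True, check=True)
except Exception as err:
    logging.error('Failed to configure Jupyter node: {}'.format(err))
    sys.exit(1)
```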
In this step, the script infrastructure-provisioning/src/jupyter/scripts/configure_jupyter_node.py will be executed.
Example of script infrastructure-provisioning/src/jupyter/scripts/configure_jupyter_node.py:
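A heavily simplified sketch (the real script calls functions from the libraries under general/lib/; here a local stub stands in for such a function to keep the example self-contained):

```python
# Sketch of configure_jupyter_node.py
import argparse
import subprocess

def ensure_jupyter(os_user):
    """Stand-in for the OS dependent library function that installs Jupyter."""
    subprocess.run(['pip3', 'install', 'jupyter'], check=True)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--hostname', required=True)
    parser.add_argument('--keyfile', required=True)
    parser.add_argument('--os_user', required=True)
    args = parser.parse_args()
    ensure_jupyter(args.os_user)
```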
This script calls functions for configuring the Jupyter node. If a function is OS dependent, it is placed in infrastructure-provisioning/src/general/lib/os/debian/notebook_lib.py (for the Debian OS family).
All functions in template directory (e.g. infrastructure-provisioning/src/my-tool/) should be OS and cloud independent.
All OS or cloud dependent functions should be placed in infrastructure-provisioning/src/general/lib/ directory.
The following steps are required for each Notebook node:
- Configuring the proxy on the Notebook instance – the script infrastructure-provisioning/src/general/scripts/os/notebook_configure_proxy.py
- Installing the user’s key – the script infrastructure-provisioning/src/base/scripts/install_user_key.py
Other scripts responsible for configuring the Jupyter node are placed in infrastructure-provisioning/src/jupyter/scripts/
scripts directory – contains all required configuration scripts.
infrastructure-provisioning/src/general/files/<cloud_provider>/my-tool_Dockerfile – used for building the template Docker image; it describes which files, scripts and templates are required and will be copied to the template Docker image.
infrastructure-provisioning/src/general/files/<cloud_provider>/my-tool_description.json – JSON file for the DataLab Web UI. In this file you can specify:
- exploratory_environment_shapes – list of EC2 shapes
- exploratory_environment_versions – description of template
Example of this file for Jupyter node for AWS cloud:
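An illustrative sketch of such a file (shape names, sizes and version strings below are assumptions):

```json
{
  "exploratory_environment_shapes": {
    "For testing": [
      {"Size": "S", "Description": "Standard", "Type": "t2.medium", "Ram": "4 GB", "Cpu": "2"}
    ],
    "Memory optimized": [
      {"Size": "L", "Description": "Memory optimized", "Type": "r4.xlarge", "Ram": "30.5 GB", "Cpu": "4"}
    ]
  },
  "exploratory_environment_versions": [
    {
      "template_name": "Jupyter notebook",
      "description": "Default version of Jupyter notebook",
      "environment_type": "exploratory",
      "version": "jupyter_notebook",
      "vendor": "AWS"
    }
  ]
}
```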
Additionally, the following directories can be created:
templates – directory for new templates;
files – directory for files used by newly added templates only;
All Docker images are built while the SSN node is being created. To add a newly created template, add it to the list of images in the following script:
Path: infrastructure-provisioning/src/general/scripts/aws/ssn_configure.py
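The relevant fragment is, roughly, a list of the images that are built on the SSN node; a sketch (structure and entries assumed):

```python
additional_config = [{"name": "base", "tag": "latest"},
                     {"name": "edge", "tag": "latest"},
                     {"name": "jupyter", "tag": "latest"},
                     {"name": "rstudio", "tag": "latest"},
                     {"name": "zeppelin", "tag": "latest"},
                     {"name": "tensor", "tag": "latest"},
                     {"name": "deeplearning", "tag": "latest"},
                     {"name": "dataengine", "tag": "latest"}]
```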
For example:
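With the new my-tool template added, the same list would gain one more entry (sketch):

```python
additional_config = [{"name": "base", "tag": "latest"},
                     {"name": "edge", "tag": "latest"},
                     {"name": "jupyter", "tag": "latest"},
                     # ... other existing images ...
                     {"name": "my-tool", "tag": "latest"}]  # newly added template
```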
Azure OAuth2 Authentication
...
DataLab supports OAuth2 authentication, which is configured automatically in Security Service and Self-Service after DataLab deployment. Please see the explanation of the configuration parameters for Self-Service and Security Service below. DataLab supports the client credentials (username + password) and authorization code flows for authentication.
Azure OAuth2 Self Service configuration
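A sketch of the relevant configuration block (the enclosing key name and the placeholder values are assumptions; the parameters themselves are described below):

```yaml
azureLoginConfiguration:
  tenant: xxxx-xxxx-xxxx-xxxx
  authority: https://login.microsoftonline.com/
  clientId: xxxx-xxxx-xxxx-xxxx
  redirectUrl: https://datalab.example.com/
  responseMode: query
  prompt: consent
  silent: true
  loginPage: https://datalab.example.com/
  maxSessionDurabilityMilliseconds: 28800000
```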
where:
- tenant - tenant id of your company
- authority - Microsoft login endpoint
- clientId - id of the application that users log in through
- redirectUrl - URL to which the user is redirected back to the DataLab application after trying to log in to Azure using OAuth2
- responseMode - defines how Azure sends the authorization code or error information to DataLab during the login procedure
- prompt - defines the kind of prompt during OAuth2 login
- silent - defines whether DataLab tries to log the user in without interaction (true/false); if false, DataLab logs the user in with the configured prompt
- loginPage - start page of the DataLab application
- maxSessionDurabilityMilliseconds - maximum user session duration in milliseconds. After this period the user is asked to log in again when he/she creates or starts a notebook/cluster. This is needed to update the refresh_token that notebooks use to access Data Lake Store
For more information about the responseMode and prompt parameters, please visit Authorize access to web applications using OAuth 2.0 and Azure Active Directory
Azure OAuth2 Security Service configuration
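A sketch of the corresponding Security Service block (the enclosing key name and the placeholder values are assumptions):

```yaml
azureLoginConfiguration:
  tenant: xxxx-xxxx-xxxx-xxxx
  authority: https://login.microsoftonline.com/
  clientId: xxxx-xxxx-xxxx-xxxx
  redirectUrl: https://datalab.example.com/
  validatePermissionScope: true
  permissionScope: subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.DataLakeStore/accounts/<account-name>/
  managementApiAuthFile: /datalab/keys/azure_authentication.json
```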
where:
- tenant - tenant id of your company
- authority - Microsoft login endpoint
- clientId - id of the application that users log in through
- redirectUrl - URL to which the user is redirected back to the DataLab application after trying to log in to Azure using OAuth2
- validatePermissionScope - defines (true/false) whether the user's permissions should be validated against the resource provided in the permissionScope parameter. The user is logged in only if he/she has at least one role in the IAM of the resource described by the permissionScope parameter
- permissionScope - describes the Azure resource in which the user must have at least one role to pass authentication. If the user has no role in the resource IAM, he/she will not be logged in
- managementApiAuthFile - authentication file used to query the Microsoft Graph API to check the user's roles in the resource described by permissionScope