DataLab services can be run in development mode. This mode emulates real work and does not create any resources in the cloud provider environment.
Folder structure
- infrastructure-provisioning – code of infrastructure-provisioning module;
- services – back-end services source code;
- billing – billing module for AWS cloud provider only;
- common – reusable code for all services;
- provisioning-service – Provisioning Service;
- security-service – Security Service;
- self-service – Self-Service and UI;
- settings – global settings that are stored in the Mongo database in development mode;
Pre-requisites
To start development of the DataLab front-end Web UI, clone the Git repository and install the following packages:
- Git 1.7 or higher
- Python 2.7 with the Fabric v1.14.0 library
- Docker 1.12 (for infrastructure provisioning)
Java back-end services
Java components description
Common
Common is a module that wraps a set of code reused across services. Commonly reused functionality includes:
- Models
- REST client
- Mongo persistence DAO
- Security models and DAO
Self-Service
Self-Service provides REST-based APIs. It interacts tightly with the Provisioning Service and Security Service and delegates most of the user's requests to them for execution.
API class name | Supported actions | Description |
---|---|---|
BillingResource | Get billing invoice, export billing invoice as a CSV file | Provides billing information. |
ComputationalResource | Configuration limits, create, terminate | Used for computational resources management. |
EdgeResource | Start, stop, status | Manages the EDGE node. |
ExploratoryResource | Create, status, start, stop, terminate | Used for exploratory environment management. |
GitCredsResource | Update credentials, get credentials | Used for managing the user's Git credentials. |
InfrastructureInfoResource | Get info of environment, get status of environment | Used for obtaining statuses and additional information about provisioned resources. |
InfrastructureTemplatesResource | Get computational resource templates, get exploratory environment templates | Used for getting exploratory/computational templates. |
KeyUploaderResource | Check key, upload key, recover | Used for Gateway/EDGE node public key upload and further storing of this information in Mongo DB. |
LibExploratoryResource | Lib groups, lib list, lib search, lib install | Used for managing libraries of exploratory environments. |
SecurityResource | Login, authorize, logout | User authentication. |
UserSettingsResource | Get settings, save settings | User preferences. |
Some class names may have suffixes such as Aws or Azure (e.g. ComputationalResourceAws, ComputationalResourceAzure, etc.). This means the class is cloud specific and exposes the corresponding API.
Provisioning Service
The Provisioning Service is a key REST-based service for managing cloud-specific or Docker-based environment resources such as computational, exploratory, and edge resources.
API class name | Supported actions | Description |
---|---|---|
ComputationalResource | Create, terminate | Docker actions for computational resources management. |
DockerResource | Get Docker image, run Docker image | Requests and describes Docker images and templates. |
EdgeResource | Create, start, stop | Provides Docker actions for EDGE node management. |
ExploratoryResource | Create, start, stop, terminate | Provides Docker actions for exploratory environment management. |
GitExploratoryResource | Update Git creds | Docker actions to provision Git credentials to running notebooks. |
InfrastructureResource | Status | Docker action for obtaining the status of DataLab infrastructure instances. |
LibExploratoryResource | Lib list, install lib | Docker actions to install libraries on notebooks. |
Some class names may have suffixes such as Aws or Azure (e.g. ComputationalResourceAws, ComputationalResourceAzure, etc.). This means the class is cloud specific and exposes the corresponding API.
Security service
Security Service is a REST-based service for user authentication against AWS/Azure OAuth2, depending on the module configuration and cloud provider.
DataLab provides an OAuth2 (client credentials and authorization code flow) security authorization mechanism for Azure users. This kind of authentication is required when you are going to use Data Lake. If Data Lake is enabled, the default permission scope (which can be configured manually after DataLab is deployed) is the Data Lake Store account, so a user is allowed to log in only if he/she has any role in scope of the Data Lake Store account resource. If Data Lake is disabled but Azure OAuth2 is in use, the default permission scope is the resource group where DataLab is created, and only users who have any role in that resource group are allowed to log in.
Front-end
Front-end components description
Web UI sources are part of Self-Service.
Sources are located in datalab/services/self-service/src/main/resources/webapp
Main pages | Components and Services |
---|---|
Login page | LoginComponent; applicationSecurityService handles HTTP calls, stores authentication tokens on the client, and attaches the token to authenticated calls; healthStatusService and appRoutingService check instance states and redirect to the appropriate page. |
Home page (list of resources) | HomeComponent nests several main components, such as ResourcesGrid for notebook data rendering and filtering, using the custom MultiSelectDropdown component; multiple modal dialog components are used for new instance creation, displaying detailed info, and action confirmation. |
Health Status page | HealthStatusComponent; HealthStatusGridComponent displays the list of instances, their types, statuses, and IDs, and uses healthStatusService for handling the main actions. |
Help pages | Static pages that contain information and instructions on how to access the Notebook Server and generate an SSH key pair. Include only NavbarComponent. |
Error page | Simple static page letting users know that the opened page does not exist. Includes only NavbarComponent. |
Reporting page | ReportingComponent; ReportingGridComponent displays detailed billing info with built-in filtering and a DateRangePicker component for custom range filtering; uses BillingReportService for handling the main actions and exports report data to a .csv file. |
How to setup local development environment
The development environment setup description is written with the assumption that the user has already installed Java 8 (JDK) and Maven 3 and has set the environment variables (JAVA_HOME, M2_HOME). The description covers Mongo installation, Mongo user creation, loading initial data into Mongo, and Node.js installation.
Install Mongo database
- Download MongoDB from https://www.mongodb.com/download-center
- Install database based on MongoDB instructions
- Start DB server and create accounts
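For example, in the Mongo shell (a minimal sketch; the database name, user name, and password are placeholders to be replaced with your own values):

```javascript
use datalabdb
db.createUser(
  {
    user: "admin",
    pwd: "<PASSWORD>",
    roles: [ { role: "dbOwner", db: "datalabdb" } ]
  }
)
```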
- Load collections from the file datalab/services/settings/(aws|azure)/mongo_settings.json
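A sketch using mongoimport (the collection name settings is an assumption; check your Mongo setup):

```bash
# replace "aws" with "azure" for the Azure settings file;
# add --jsonArray if the file contains a JSON array
mongoimport --db <DB_NAME> --collection settings \
  --file datalab/services/settings/aws/mongo_settings.json
```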
- Load collections from the file datalab/infrastructure-provisioning/src/ssn/files/mongo_roles.json
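Similarly for the roles file (the collection name roles is an assumption):

```bash
# drop --jsonArray if the file contains individual documents
mongoimport --db <DB_NAME> --collection roles --jsonArray \
  --file datalab/infrastructure-provisioning/src/ssn/files/mongo_roles.json
```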
If this command doesn't work for you, check https://docs.mongodb.com/v4.2/reference/program/mongoimport/ or use a UI client (e.g. MongoDB Compass).
Setting up environment options
- Set the CLOUD_TYPE option to aws/azure, DEV_MODE to true, and the Mongo database name and password in the configuration file datalab/infrastructure-provisioning/src/ssn/templates/ssn.yml
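A sketch of the relevant fragment (the key names are illustrative; check ssn.yml itself for the exact structure):

```yaml
# ssn.yml (fragment, illustrative)
cloudType: aws          # CLOUD_TYPE: aws or azure
devMode: true           # DEV_MODE: emulate provisioning without cloud resources
mongo:
  database: datalabdb   # Mongo database name
  password: <PASSWORD>  # Mongo database password
```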
- Add the system environment variable DATALAB_CONF_DIR=<datalab_root_folder>/datalab/infrastructure-provisioning/src/ssn/templates, or create two symlinks to the file datalab/infrastructure-provisioning/src/ssn/templates/ssn.yml in the datalab/services/provisioning-service and datalab/services/self-service folders, as shown below.
Unix
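For example (<datalab_root_folder> is a placeholder):

```bash
ln -s <datalab_root_folder>/datalab/infrastructure-provisioning/src/ssn/templates/ssn.yml \
      <datalab_root_folder>/datalab/services/provisioning-service/ssn.yml
ln -s <datalab_root_folder>/datalab/infrastructure-provisioning/src/ssn/templates/ssn.yml \
      <datalab_root_folder>/datalab/services/self-service/ssn.yml
```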
Windows
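For example (mklink requires an elevated command prompt):

```
mklink "<datalab_root_folder>\datalab\services\provisioning-service\ssn.yml" "<datalab_root_folder>\datalab\infrastructure-provisioning\src\ssn\templates\ssn.yml"
mklink "<datalab_root_folder>\datalab\services\self-service\ssn.yml" "<datalab_root_folder>\datalab\infrastructure-provisioning\src\ssn\templates\ssn.yml"
```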
- On Unix systems, create two folders and grant write permissions:
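For example (the paths below are hypothetical placeholders; take the real log and key locations from ssn.yml):

```bash
# hypothetical paths — check ssn.yml for the real ones
sudo mkdir -p /var/opt/datalab/log /opt/datalab/tmp
sudo chmod -R 777 /var/opt/datalab/log /opt/datalab/tmp
```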
Install Node.js
- Download Node.js from https://nodejs.org/en
- Install Node.js
- Make sure that the installation folder of Node.js has been added to the system environment variable PATH
- Install latest packages
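Assuming this refers to updating npm itself to the latest version:

```bash
npm install npm@latest -g
```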
Build Web UI components
- Change directory to datalab/services/self-service/src/main/resources/webapp and install the dependencies from the package.json manifest
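For example:

```bash
cd datalab/services/self-service/src/main/resources/webapp
npm install
```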
- Replace the CLOUD_PROVIDER option with aws or azure in the dictionary file datalab/services/self-service/src/main/resources/webapp/src/dictionary/global.dictionary.ts
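A sketch, assuming the dictionary file selects a cloud-specific naming convention via an import (check the actual file contents):

```typescript
// global.dictionary.ts — point the import at the chosen cloud provider
import { NAMING_CONVENTION } from './aws.dictionary'; // or './azure.dictionary'
```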
- Build web application
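For example (the exact script name is defined in the webapp's package.json):

```bash
npm run build
```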
Prepare HTTPS prerequisites
To enable an SSL connection, the web server must have a digital certificate. To create a server certificate, follow these steps:
1. Create the keystore.
2. Export the certificate from the keystore.
3. Sign the certificate.
4. Import the certificate into a truststore: a repository of certificates used for verifying certificates. A truststore typically contains more than one certificate.
Please find below the sets of commands to create a certificate, depending on the OS.
Create Unix/Ubuntu server certificate
Note that the last command has to be executed with administrative permissions.
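A sketch using the JDK keytool, assuming the alias ssn, placeholder passwords, and the default JRE truststore password changeit:

```bash
mkdir -p ~/keys
# 1. Create the keystore with a self-signed key pair
keytool -genkeypair -alias ssn -keyalg RSA -keysize 2048 -dname "CN=localhost" \
        -storepass <KEYSTORE_PASSWORD> -keypass <KEYSTORE_PASSWORD> \
        -keystore ~/keys/ssn.keystore.jks
# 2. Export the certificate from the keystore
keytool -exportcert -alias ssn -storepass <KEYSTORE_PASSWORD> \
        -file ~/keys/ssn.crt -keystore ~/keys/ssn.keystore.jks
# 3. Import the certificate into the JRE truststore (administrative permissions)
sudo keytool -importcert -trustcacerts -alias ssn -noprompt \
        -file ~/keys/ssn.crt -storepass changeit \
        -keystore "$JAVA_HOME/jre/lib/security/cacerts"
```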
Create Windows server certificate
Note that the last command has to be executed with administrative permissions; to achieve this, run the command line (cmd) as administrator.
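A sketch of the same sequence on Windows (the paths under <DRIVE_LETTER>:\home\keys are placeholders):

```
keytool -genkeypair -alias ssn -keyalg RSA -keysize 2048 -dname "CN=localhost" -storepass <KEYSTORE_PASSWORD> -keypass <KEYSTORE_PASSWORD> -keystore <DRIVE_LETTER>:\home\keys\ssn.keystore.jks
keytool -exportcert -alias ssn -storepass <KEYSTORE_PASSWORD> -file <DRIVE_LETTER>:\home\keys\ssn.crt -keystore <DRIVE_LETTER>:\home\keys\ssn.keystore.jks
keytool -importcert -trustcacerts -alias ssn -noprompt -file <DRIVE_LETTER>:\home\keys\ssn.crt -storepass changeit -keystore "%JAVA_HOME%\jre\lib\security\cacerts"
```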
where <DRIVE_LETTER> is the letter of the drive on which you run DataLab.
How to run locally
Self-Service and the Provisioning Service can be run locally. All requests from the Provisioning Service to Docker are mocked, and instance creation statuses are persisted to Mongo (without any real impact on Docker and AWS).
Both services, Self-Service and Provisioning Service, depend on the datalab/provisioning-infrastructure/ssn/templates/ssn.yml configuration file. Both services have main functions as entry points: SelfServiceApplication for Self-Service and ProvisioningServiceApplication for the Provisioning Service. The services can be started by running the main methods of these classes. Both main functions require two arguments:
- Run mode (“server”)
- Configuration file name (“self-service.yml” or “provisioning.yml”, depending on the service). Both files are located in the root service directory. These configuration files contain service settings and are ready to use.
The service start-up order matters: since Self-Service depends on the Provisioning Service, the latter should be started first and Self-Service afterwards. The services can be started from an IDE (Eclipse or IntelliJ IDEA) using its “Run” functionality.
The application run flow is as follows:
- Create and run provisioning-service configuration:
Create an Application with the name provisioning-service-application
- Main class:
com.epam.datalab.backendapi.ProvisioningServiceApplication
- VM options:
-Ddocker.dir=[PATH_TO_PROJECT_DIR]\infrastructure-provisioning\src\general\files\gcp
- Program arguments :
server [PATH_TO_PROJECT_DIR]\services\provisioning-service\provisioning.yml
- Working directory:
[PATH_TO_PROJECT_DIR]
- Use classpath of module:
provisioning-service
- PAY ATTENTION: the JRE must be the same JRE into whose truststore the server certificate was imported
- Create and run self-service configuration:
- Create an Application with the name self-service-application
- Main class:
com.epam.datalab.backendapi.SelfServiceApplication
- Program arguments :
server [PATH_TO_PROJECT_DIR]/services/self-service/self-service.yml
- Working directory:
[PATH_TO_PROJECT_DIR]
- Use classpath of module:
self-service
- PAY ATTENTION: the JRE must be the same JRE into whose truststore the server certificate was imported
- Try to access the Self-Service Web UI at https://localhost:8443
Infrastructure provisioning
DevOps components overview
The following list shows the common structure of the scripts for deploying DataLab.
Folder structure
Each directory except general contains Python scripts, Docker files, templates, and other files for the appropriate Docker image.
- base – Main Docker image. It is a common/base image for other ones.
- edge – Docker image for Edge node.
- dataengine – Docker image for dataengine cluster.
- dataengine-service – Docker image for dataengine-service cluster.
- general – OS and CLOUD dependent common source.
- ssn – Docker image for Self-Service node (SSN).
- jupyter/rstudio/zeppelin/tensor/deeplearning – Docker images for Notebook nodes.
All Python scripts, Docker files and other files, which are located in these directories, are OS and CLOUD independent.
Scripts, functions, and files that are OS or CLOUD dependent, or common to several templates, are located in the general directory.
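Based on the table below, the general directory is organized roughly as follows (a sketch, not an exhaustive listing):

```
general/
├── api/        # API scripts executed by entrypoint.py
├── lib/        # cloud- and OS-specific function libraries
├── files/      # Docker files and other cloud-specific resources
├── scripts/    # scripts split by cloud provider and OS
└── templates/  # configuration templates
```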
These directories may contain differentiation by operating system (Debian/RedHat) or cloud provider (AWS).
Directories of templates (SSN, Edge, etc.) contain only scripts which are OS and CLOUD independent.
If a script/function is OS or CLOUD dependent, it should be located in the appropriate directory/library in the general folder.
The following table describes the most used scripts:
Script name/Path | Description |
---|---|
Dockerfile | Used for building Docker images and represents which Python scripts, templates and other files are needed. Required for each template. |
base/entrypoint.py | This file is executed by Docker. It is responsible for setting environment variables, which are passed from Docker and for executing appropriate actions (script in general/api/). |
base/scripts/*.py | Scripts, which are OS independent and are used in each template. |
general/api/*.py | API scripts, which execute appropriate function from fabfile.py. |
template_name/fabfile.py | The main file for a template; contains all functions which can be used as template actions. |
template_name/scripts/*.py | Python scripts, which are used for the template. They are OS and CLOUD independent. |
general/lib/aws/*.py | Contains all functions related to AWS. |
general/lib/os/ | This directory is divided by type of OS. All OS dependent functions are located here. |
general/lib/os/fab.py | Contains OS independent functions used for multiple templates. |
general/scripts/ | Directory is divided by type of Cloud provider and OS. |
general/scripts/aws/*.py | AWS-specific scripts, which are executed from fabfiles. The first part of the file name defines which template the script relates to. For example: common_*.py – can be executed from more than one template; ssn_*.py – used for the SSN template; edge_*.py – used for the Edge template. |
general/scripts/os/*.py | Scripts, which are OS independent and can be executed from more than one template. |
Docker actions overview
Available Docker images and their actions:
Docker image | Actions |
---|---|
ssn | create, terminate |
edge | create, terminate, status, start, stop, recreate |
jupyter/rstudio/zeppelin/tensor/deeplearning | create, terminate, start, stop, configure, list_libs, install_libs, git_creds |
dataengine/dataengine-service | create, terminate |
Docker and Python execution workflow, using the SSN node as an example
- Docker command for building images docker.datalab-base and docker.datalab-ssn:
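For example, for AWS (a sketch, assuming the commands are run from infrastructure-provisioning/src/):

```bash
sudo docker build --file general/files/aws/base_Dockerfile -t docker.datalab-base .
sudo docker build --file general/files/aws/ssn_Dockerfile -t docker.datalab-ssn .
```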
Example of SSN Docker file:
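A sketch of what such a Docker file typically contains (the real file is general/files/<cloud_provider>/ssn_Dockerfile; the copied paths below are illustrative):

```dockerfile
FROM docker.datalab-base:latest

# Copy the template-specific scripts, fabfile, and templates into the image
COPY ssn/ /root/
COPY general/scripts/aws/ssn_* /root/scripts/
COPY general/templates/aws/ /root/templates/

RUN chmod a+x /root/fabfile.py
```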
Using this Docker file, all required scripts and files will be copied to Docker container.
- Docker command for creating the SSN node:
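For example (a sketch; only a few of the environment parameters are shown, and the names/values are illustrative):

```bash
sudo docker run -i \
  -v /root/KEYNAME.pem:/root/keys/KEYNAME.pem \
  -e "conf_service_base_name=datalab" \
  -e "conf_os_family=debian" \
  docker.datalab-ssn --action create
```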
- Docker executes the entrypoint.py script with the action create. entrypoint.py sets the environment variables that were provided by Docker and executes the general/api/create.py script:
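Conceptually, entrypoint.py does something like the following (a simplified sketch, not the actual source):

```python
import subprocess
import sys

# entrypoint.py sketch: variables passed with -e are already in the
# environment; only the requested action is parsed from the arguments.
action = sys.argv[sys.argv.index('--action') + 1]  # e.g. "create"

# Dispatch to the matching API script, e.g. general/api/create.py
subprocess.call(['python', '/root/api/{}.py'.format(action)])
```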
- general/api/create.py executes a Fabric command with the run action:
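A sketch of general/api/create.py, assuming it simply shells out to Fabric:

```python
import subprocess

# Execute the "run" task from the template's fabfile.py
subprocess.call(['fab', 'run'], cwd='/root')
```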
- The function run() in the file ssn/fabfile.py will be executed. It runs two scripts, general/scripts/aws/ssn_prepare.py and general/scripts/aws/ssn_configure.py:
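A sketch of the run() task (Fabric v1 style; the script paths inside the container are illustrative):

```python
from fabric.api import local

def run():
    # Phase 1: create the cloud resources (general/scripts/aws/ssn_prepare.py)
    local('python /root/scripts/ssn_prepare.py')
    # Phase 2: install and configure software on the node
    # (general/scripts/aws/ssn_configure.py)
    local('python /root/scripts/ssn_configure.py')
```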
- The scripts general/scripts/<cloud_provider>/ssn_prepare.py and general/scripts/<cloud_provider>/ssn_configure.py execute other Python scripts/functions for:
- ssn_prepare.py: 1. Creating the configuration file (for AWS) 2. Creating the cloud resources.
- ssn_configure.py: 1. Installing prerequisites 2. Installing required packages 3. Configuring Docker 4. Configuring the DataLab Web UI
- If all scripts/functions are executed successfully, the Docker container will stop and the SSN node will be created.
Example of Docker commands
SSN:
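An illustrative command (the parameter names and values are examples only):

```bash
sudo docker run -i \
  -v /root/KEYNAME.pem:/root/keys/KEYNAME.pem \
  -e "conf_service_base_name=datalab-test" \
  -e "conf_os_family=debian" \
  -e "aws_region=us-west-2" \
  docker.datalab-ssn --action create
```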
All parameters are listed in the "Self-ServiceNode" section.
Other images:
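For example, for a notebook image (illustrative parameters):

```bash
sudo docker run -i \
  -v /root/KEYNAME.pem:/root/keys/KEYNAME.pem \
  -e "conf_service_base_name=datalab-test" \
  -e "conf_os_family=debian" \
  docker.datalab-jupyter --action create
```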
How to add a new template
First of all, a new directory should be created in infrastructure-provisioning/src/.
For example: infrastructure-provisioning/src/my-tool/
The following scripts/directories are required to be created in the template directory:
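A sketch of the minimal layout, based on the descriptions that follow:

```
infrastructure-provisioning/src/my-tool/
├── fabfile.py   # main template actions: run, stop, terminate, ...
└── scripts/     # all required configuration scripts
```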
fabfile.py – the main script, which contains main functions for this template such as run, stop, terminate, etc.
Here is an example of the run() function for the Jupyter Notebook node:
Path: infrastructure-provisioning/src/jupyter/fabfile.py
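A simplified sketch of the run() function (Fabric v1 style; the script names are illustrative):

```python
from fabric.api import local

def run():
    # Prepare: create the cloud resources — common for all notebook templates
    local('python /root/scripts/common_notebook_prepare.py')
    # Configure: install and configure the Jupyter-specific services
    local('python /root/scripts/jupyter_configure.py')
```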
This function describes the process of creating a Jupyter node. It is divided into two parts, prepare and configure. The prepare part is common for all notebook templates and is responsible for creating the necessary cloud resources, such as EC2 instances. The configure part describes how the appropriate services will be installed.
To configure the Jupyter node, the script jupyter_configure.py is executed. This script describes the steps for configuring the Jupyter node. In each step, the appropriate Python script is executed.
For example:
Path: infrastructure-provisioning/src/general/scripts/aws/jupyter_configure.py
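A sketch of one such step (the real script wraps each step in logging and error handling; the variable values are illustrative):

```python
import subprocess

# Illustrative values; in the real script these come from the environment
instance_hostname = '10.0.0.10'
keyfile = '/root/keys/KEYNAME.pem'
os_user = 'ubuntu'

# Step: execute the template-local configuration script with its parameters
params = '--hostname {} --keyfile {} --os_user {}'.format(
    instance_hostname, keyfile, os_user)
subprocess.call('python /root/scripts/configure_jupyter_node.py ' + params,
                shell=True)
```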
In this step, the script infrastructure-provisioning/src/jupyter/scripts/configure_jupyter_node.py will be executed.
Example of script infrastructure-provisioning/src/jupyter/scripts/configure_jupyter_node.py:
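A sketch of such a script; the imported helper functions are hypothetical stand-ins for the OS-dependent functions in notebook_lib.py:

```python
# configure_jupyter_node.py sketch — the imported helpers are hypothetical
# stand-ins for the OS-dependent functions in general/lib/.../notebook_lib.py
from datalab.notebook_lib import ensure_jupyter, configure_jupyter_service

if __name__ == "__main__":
    ensure_jupyter()              # OS-dependent: install Jupyter and kernels
    configure_jupyter_service()   # set up and start the Jupyter service
```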
This script calls functions for configuring the Jupyter node. If a function is OS dependent, it is placed in infrastructure-provisioning/src/general/lib/<OS_family>/debian/notebook_lib.py
All functions in template directory (e.g. infrastructure-provisioning/src/my-tool/) should be OS and cloud independent.
All OS or cloud dependent functions should be placed in infrastructure-provisioning/src/general/lib/ directory.
The following steps are required for each Notebook node:
- Configure proxy on Notebook instance – the script infrastructure-provisioning/src/general/scripts/os/notebook_configure_proxy.py
- Installing user’s key – the script infrastructure-provisioning/src/base/scripts/install_user_key.py
Other scripts responsible for configuring the Jupyter node are placed in infrastructure-provisioning/src/jupyter/scripts/
scripts directory – contains all required configuration scripts.
infrastructure-provisioning/src/general/files/<cloud_provider>/my-tool_Dockerfile – used for building template Docker image and describes which files, scripts, templates are required and will be copied to template Docker image.
infrastructure-provisioning/src/general/files/<cloud_provider>/my-tool_description.json – JSON file for the DataLab Web UI. In this file you can specify:
- exploratory_environment_shapes – list of EC2 shapes
- exploratory_environment_versions – description of template
Example of this file for Jupyter node for AWS cloud:
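An illustrative fragment (the shapes and versions shown are examples only):

```json
{
  "exploratory_environment_shapes": {
    "For testing": [
      {"Size": "S", "Description": "Standard 2 CPU / 4 GB RAM", "Type": "t2.medium"}
    ]
  },
  "exploratory_environment_versions": [
    {
      "template_name": "Jupyter notebook",
      "description": "Base image with Jupyter",
      "version": "jupyter_notebook",
      "vendor": "AWS"
    }
  ]
}
```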
Additionally, the following directories can be created:
templates – directory for new templates;
files – directory for files used by newly added templates only;
All Docker images are built while creating the SSN node. To add a newly created template, add it to the list of images in the following script:
Path: infrastructure-provisioning/src/general/scripts/aws/ssn_configure.py
For example:
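A sketch, assuming the template images are assembled in a simple list (the real code in ssn_configure.py may differ):

```python
# Add the new template ("my-tool") to the list of Docker images
# built on the SSN node
images = ['base', 'edge', 'jupyter', 'rstudio', 'zeppelin',
          'tensor', 'deeplearning', 'dataengine', 'my-tool']
```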
Azure OAuth2 Authentication
DataLab supports OAuth2 authentication that is configured automatically in the Security Service and Self-Service after DataLab deployment. Please see the explanation of the configuration parameters for Self-Service and the Security Service below. DataLab supports client credentials (username + password) and authorization code flow for authentication.
Azure OAuth2 Self Service configuration
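A sketch of the configuration block, built from the parameters described below (the top-level key name and all values are illustrative):

```yaml
azureLoginConfiguration:
  tenant: xxxx-xxxx-xxxx-xxxx
  authority: https://login.microsoftonline.com/
  clientId: xxxx-xxxx-xxxx-xxxx
  redirectUrl: https://datalab.example.com/
  responseMode: query
  prompt: consent
  silent: true
  loginPage: https://datalab.example.com/
  maxSessionDurabilityMilliseconds: 288000000
```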
where:
- tenant - tenant id of your company
- authority - Microsoft login endpoint
- clientId - id of the application that users log in through
- redirectUrl - redirect URL back to the DataLab application after an attempt to log in to Azure using OAuth2
- responseMode - defines how Azure sends the authorization code or error information to DataLab during the login procedure
- prompt - defines the kind of prompt during OAuth2 login
- silent - defines whether DataLab tries to log the user in without interaction (true/false); if false, DataLab tries to log the user in with the configured prompt
- loginPage - start page of the DataLab application
- maxSessionDurabilityMilliseconds - max user session durability; the user will be asked to log in again after this period of time and when he/she creates or starts a notebook/cluster. This is needed to update the refresh_token that is used by notebooks to access Data Lake Store
To get more info about the responseMode and prompt parameters, please see "Authorize access to web applications using OAuth 2.0 and Azure Active Directory".
Azure OAuth2 Security Service configuration
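A sketch, analogous to the Self-Service block (the key name and all values are illustrative):

```yaml
azureLoginConfiguration:
  tenant: xxxx-xxxx-xxxx-xxxx
  authority: https://login.microsoftonline.com/
  clientId: xxxx-xxxx-xxxx-xxxx
  redirectUrl: https://datalab.example.com/
  validatePermissionScope: true
  permissionScope: subscriptions/xxxx/resourceGroups/xxxx/providers/Microsoft.DataLakeStore/accounts/xxxx
  managementApiAuthFile: /datalab/keys/azure_auth.json
```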
where:
- tenant - tenant id of your company
- authority - Microsoft login endpoint
- clientId - id of the application that users log in through
- redirectUrl - redirect URL back to the DataLab application after an attempt to log in to Azure using OAuth2
- validatePermissionScope - defines (true/false) whether the user's permissions should be validated against the resource that is provided in the permissionScope parameter. The user will be logged in only if he/she has any role in the IAM of the resource described by the permissionScope parameter
- permissionScope - describes the Azure resource where the user should have any role in order to pass authentication. If the user has no role in the resource IAM, he/she will not be logged in
- managementApiAuthFile - authentication file that is used to query the Microsoft Graph API to check user roles in the resource described in permissionScope