DSS Technical Architecture


DSS Technical Architecture

In Production

©2020 dataiku, Inc. | dataiku.com | [email protected] | @dataiku


Table of Contents

Overview
Scalability / High Availability
Instance Architecture
Product Architecture
DSS Processes
Global Architecture
IT Integration
DSS Security
DSS Code Environments
Cloud Design Patterns

©2020 dataiku, Inc. | dataiku.com | [email protected] | @dataiku


Common Challenges

1. “My data analysts need a platform that empowers them to develop a wide array of analytics”

2. “Data prep is taking way too much time away from developing advanced analytics that drive our business”

3. “I don’t have a central analytics platform that allows me to put in place data access and user controls”

4. “We are having trouble connecting to all of our different data stores.”

5. “Our data analysts have no way of collaborating with each other and our data science team”

6. “We are using multiple platforms and tools for data prep, model development and deployment”

7. “Putting models into production can be time consuming and when we want to make changes the process starts all over again”
© Dataiku – 2020 – Confidential and proprietary information
DSS Architecture Overview

©2020 dataiku, Inc. | dataiku.com | [email protected] | @dataiku 4


Dataiku DSS Architecture
Ready For Production

DESIGN Env: All the features to collaboratively prepare, analyze, and model data.
Personas served: Data Scientist, Business Analyst

AUTOMATION Env (Operationalization): Orchestrate data workflows in production: advanced workflow automations, performance monitoring…
Personas served: System Administrator, Database Administrator

API DEPLOYER Service: Visual interface for managing the API Scoring infrastructure.
Personas served: System Administrator, Web Developer

API SCORING Service: API services for integration in operational systems: execute real-time predictions, access key services and data.
Personas served: Web Developer
© Dataiku – 2020 – Confidential and proprietary information 5


Dataiku DSS at the Core of Your Data Platform

Dataiku DSS is the central platform where the analytical workflows and machine learning models supporting your data projects can be created and run from end to end. These projects benefit both from the built-in features of DSS and from its ability to be extended using custom code (R, Python, SQL, Spark…) to address any specific need (connectivity, business logic, complex modelling…).

Diagram overview:
SOURCE DATA: CRM, finance databases, transactions, log files / event logs, customer touch points, external data
DSS DESIGN NODE (collaboration, for coders and for clickers): data acquisition, data preparation, exploration / analytics, ML model building, model assessment — with in-memory processing or calculations pushed down
DSS AUTOMATION NODE: project bundling and deployment, advanced automation scenarios, reporting and monitoring, management APIs
DSS API NODE: deploy models through REST APIs, model versioning, HA & load balancing, logging
CONSUMPTION: reporting & dashboards, operational systems, real-time applications, APIs
LARGE-SCALE DATA STORAGE & PROCESSING SYSTEMS: Big Data / distributed (HDFS), analytical databases / DWH (MPP), dev. or prod. environments
© Dataiku – 2020 – Confidential and proprietary information 6


Integrating DSS with Your Infrastructure
DSS is an “on-premises” deployment (illustration with the Design Node)

Diagram overview: users interact with DSS via their web browser. DSS runs on a Linux server and reads/writes/executes against your data storage and processing systems through built-in DSS connectors, Hadoop client libraries and configuration, JDBC connections, and custom Python connectors, exporters, and plugins. Target systems include Hadoop / HDFS and distributed processing, operational SQL databases, analytical SQL databases / DWH, and APIs, 3rd-party apps and data, and custom data sources.

A FEW NOTES:
- DSS is installed on your own infrastructure (whether your own data center or in the cloud).
- DSS does not ingest the data; instead it connects to your infrastructure and pushes the calculations down to it to avoid data movement. But…
- … DSS can do local processing, including in Python or R, hence the hosting server may need enough memory/CPU.
- When integrated with a Hadoop cluster, DSS is usually installed on an edge node / gateway node.
© Dataiku – 2020 – Confidential and proprietary information 7


Scalability and High Availability

©2020 dataiku, Inc. | dataiku.com | [email protected] | @dataiku 8


Scalability and High Availability

Scalability for the Design Node / Automation Node is done by splitting based on:
- Project / Team
- Security
- Maturity (early stage vs production oriented)
- Dedicated environments for resource-consuming projects or for developing / maintaining plugins

High Availability for the Design Node / Automation Node:
- Active/Passive high availability (active and passive DSS instances behind a load balancer)
- Use of a shared (or replicated with sync) file system between the nodes
© Dataiku – 2020 – Confidential and proprietary information 9


Scalability and High Availability
API Scoring Nodes

API nodes are stateless and support Active/Active high availability behind a load balancer.

API Performance
The number of API nodes required depends on the target QPS (queries per second):
- For optimized models (Java, Spark, or SQL engines; see documentation), estimate 100 to 2,000 QPS per node
- For non-optimized models, estimate 5-50 QPS per node
- If an external RDBMS is used for enrichment, it must also be highly available
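As a rough illustration of this sizing rule only, here is a minimal Python sketch; the per-node QPS figures are the conservative ends of the estimates above, not measured values:

import math

def estimate_api_nodes(target_qps: float, optimized: bool) -> int:
    """Rough node-count estimate from the per-node QPS ranges above."""
    # Conservative (lower-bound) per-node throughput from the slide:
    # ~100 QPS for optimized models, ~5 QPS for non-optimized ones.
    per_node_qps = 100.0 if optimized else 5.0
    nodes = math.ceil(target_qps / per_node_qps)
    # Keep at least 2 nodes so the Active/Active setup survives a node loss.
    return max(nodes, 2)

print(estimate_api_nodes(target_qps=300, optimized=True))   # -> 3
print(estimate_api_nodes(target_qps=300, optimized=False))  # -> 60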

© Dataiku – 2020 – Confidential and proprietary information 10


Instance Architecture

©2020 dataiku, Inc. | dataiku.com | [email protected] | @dataiku 11


Leverage Your Infrastructure

By default, DSS automatically chooses the most effective execution engine amongst the engines available next to the input data of each computation.

Execution engines:
- Run in memory: Python, R, …
- Run in database: enterprise SQL, analytic SQL
- Run in cluster: Spark, Impala, Hive, … and distributed ML (MLlib, H2O, …)
- ML in memory: Python scikit-learn, R, …

Data locations:
- Data lake: Cassandra, HDFS, S3, …
- Database data: Vertica, Greenplum, Redshift, PostgreSQL, …
- File system data: host file system, remote file system, …
© Dataiku – 2020 – Confidential and proprietary information 12
Where Can Processing Occur

Task | Local Server | In SQL Database | In Hadoop / Spark / AWS EMR / … | In Kubernetes & Docker
Data Preparation (interactive / recipe in workflow) | YES | YES | YES (Spark, Hive, Impala) | YES (with Spark over K8s)
Coding: Python, R, Scala (notebook / recipe in workflow) | YES | YES (custom code with the DSS API) | YES | YES
SQL Analytics (notebook / recipe in workflow) | N/A | YES | YES (Hive, Impala, Pig, SparkSQL) | YES (with Spark over K8s)
Visualization (charts) | YES | YES (most charts) | YES (most charts) | YES (most charts with Spark)
Machine Learning: Training | YES (scikit-learn, XGBoost, Keras/TensorFlow) | YES (Vertica ML) | YES (MLlib, Sparkling Water) | YES (scikit-learn, XGBoost, Keras/TensorFlow)
Machine Learning: Inference | YES (depending on algorithms) | YES (depending on algorithms) | YES (depending on algorithms) | YES (depending on algorithms)
© Dataiku – 2020 – Confidential and proprietary information 13


Backup/Disaster Recovery
Backup and Recovery Discussion
Periodic backup of the DATADIR (contains all config / DSS state)
Consistent live backup requires snapshots (disk-level for cloud and NAS/SAN, or OS-level with LVM)
Industry-standard backup procedures apply
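As a hedged illustration only (not Dataiku's official backup tooling; the paths below are assumptions), a cold or snapshot-based backup of the data directory can be as simple as archiving it:

import tarfile
import time
from pathlib import Path

DATADIR = Path("/data/dataiku/dss_data")   # assumed DSS data directory
BACKUP_DIR = Path("/backups/dss")          # assumed backup destination

def backup_datadir() -> Path:
    """Archive the whole DATADIR (config, logs, jobs, managed datasets)."""
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    target = BACKUP_DIR / f"dss-datadir-{time.strftime('%Y%m%d-%H%M%S')}.tar.gz"
    # For a consistent live backup, run this against a filesystem/LVM snapshot
    # of DATADIR (or stop DSS first), as noted above.
    with tarfile.open(target, "w:gz") as tar:
        tar.add(DATADIR, arcname=DATADIR.name)
    return target

if __name__ == "__main__":
    print(backup_datadir())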

© Dataiku – 2020 – Confidential and proprietary information 14


Enterprise Scale Sizing
Recommendations

Design env (128-256 GB): Design environments generally consume more memory than the other environments because they are the collaborative environments for design, prototyping, and experiments.

Automation env (64-128 GB, + 64 GB in pre-prod): The automation node runs, maintains, and monitors project workflows and models in production. Since most actions are batch jobs, you can spread the activity across the 24 hours of the day and optimize resource consumption. You can also use a non-production automation node to validate your projects before going to production.

Scoring env (4+ GB per node, for a fleet of n nodes): Scoring nodes are real-time production nodes for scoring or labeling with prediction models. A single node doesn't require a lot of memory, but these nodes are generally deployed on dedicated clusters of containers or virtual machines.

Memory usage on the DSS server side can be controlled at the Unix level leveraging Linux cgroup capabilities.
Database resource management can be done on the DB side at the user level when per-user credentials mode is activated.
© Dataiku – 2020 – Confidential and proprietary information 15
Single Server Installation

Diagram overview: users connect over HTTP(S) from their browser to the DSS Core on the Design/Automation node. The core spawns jobs that read and write datasets on the managed file system, with data streaming through DSS.
© Dataiku – 2020 – Confidential and proprietary information 16


DSS Instance + Hadoop

Diagram overview: users connect over HTTP(S) from their browser to the DSS Core on the Design/Automation node. Jobs are submitted (RPC) to the Hadoop cluster, where workers process the datasets stored in HDFS.
© Dataiku – 2020 – Confidential and proprietary information 17


Hybrid Model: DSS Instance + Hadoop + SQL

Diagram overview: users connect over HTTP(S) from their browser to the DSS Core on the Design/Automation node. Jobs are submitted to the Hadoop cluster (RPC), where workers process datasets stored in HDFS, and to SQL databases (JDBC), where tables are processed in-database.

*Fast Path connections are available where provided (e.g. Redshift + S3, Spark + Cloud, Spark + HDFS, Teradata + Hadoop…). Otherwise, data is streamed through DSS using abstracted representations to ensure compatibility across natively compatible storage systems.
© Dataiku – 2020 – Confidential and proprietary information 18


Hybrid Model: DSS Instance + Hadoop + Spark

Diagram overview: users connect over HTTP(S) from their browser to the DSS Core on the Design/Automation node. Spark jobs are submitted to the cluster manager, which distributes executors and tasks across the workers; datasets are read from and written to HDFS.
© Dataiku – 2020 – Confidential and proprietary information 19


Product Architecture

©2020 dataiku, Inc. | dataiku.com | [email protected] | @dataiku 20


DSS Design

Diagram overview: client applications and users reach DSS over HTTP(S), through the public API and the UI. The DSS Core manages configuration, logs, job kernels, and notebook kernels, and connects to Hadoop (via a proxy), to the data, and to external services over HTTP(S), LDAP, JDBC, RMI, and RPC.

Main design environment with multiple concurrent users:
● RAM (if no Hadoop/Spark cluster)
○ small: 2-3 people on <5 GB of data: 32-64 GB
○ medium: 2-3 people with heavy machine learning: 64-128 GB
○ large: 5+ people on large data, with ML: 256+ GB
● 8 to 32 cores (if no Hadoop/Spark cluster; 1 core per simultaneous active user)
● 128-256 GB of disk for DSS
● Storage for your data if stored locally: ~10x the size of the raw data
● SSD recommended
© Dataiku – 2020 – Confidential and proprietary information 21
DSS Automation

Diagram overview: client applications, viewers, and admins reach DSS over HTTP(S), through the public API and the UI. The DSS Core manages configuration, logs, job kernels, and the orchestrator, and connects to Hadoop (via a proxy), to the data, and to external services over HTTP(S), LDAP, JDBC, RMI, and RPC.

Batch production environment:
- 64-128 GB of RAM (if no Hadoop/Spark cluster)
- 8 to 32 cores (if no Hadoop/Spark cluster)
- 128-256 GB of disk for DSS (+ storage for your data if stored locally)
© Dataiku – 2020 – Confidential and proprietary information 22


DSS Scoring

Diagram overview: client applications call the scoring service over HTTP(S) through the public API. The DSS Core hosts the models, application code, and configuration, and can connect to external systems over HTTP(S), JDBC, RMI, and RPC.

Real-time production environment:
- 4-8 GB of RAM (per instance)
- 2 to 4 cores (per instance)
- 32 GB of disk (per instance)
© Dataiku – 2020 – Confidential and proprietary information 23


Open Ports
Base Installations
Design: user's choice of base TCP port (default 11200) + the next 9 consecutive ports
Automation: user's choice of base TCP port (default 12200) + the next 9 consecutive ports
API: user's choice of base TCP port (default 13200) + the next 9 consecutive ports
In each case, only the first of these ports needs to be opened out of the machine; it is highly recommended to firewall the other ports.

Supporting Installations
Data sources: JDBC entry point; network connectivity
Hadoop: ports + workers required by the specific distribution; network connectivity
Spark: executor + callback (two-way connection) to DSS

Privileged Ports
DSS itself cannot run on ports 80 or 443 because it does not run as root and cannot bind to these privileged ports.
The recommended setup to make DSS available on ports 80 or 443 is to run a reverse proxy (nginx or Apache) on the same machine, forwarding traffic from ports 80 / 443 to the DSS port.
(https://doc.dataiku.com/dss/latest/installation/proxies.html)
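For instance, a quick way to sanity-check which ports in the DSS range are reachable from another host is a plain TCP probe (a minimal sketch; the hostname and base port below are assumptions):

import socket

DSS_HOST = "dss.example.internal"   # assumed DSS server hostname
BASE_PORT = 11200                   # assumed Design node base port

def is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Only the base port should be reachable from outside the machine;
# the next 9 ports should be firewalled.
for port in range(BASE_PORT, BASE_PORT + 10):
    print(port, "open" if is_open(DSS_HOST, port) else "closed/filtered")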

© Dataiku – 2020 – Confidential and proprietary information 24


Data Directory
The data directory contains:

- The configuration of Data Science Studio, including all user-generated configuration (datasets, recipes, insights, models, ...)
- Log files for the server components
- Log files of job executions
- Various caches and temporary files
- A Python virtual environment dedicated to running the Python components of Data Science Studio, including any user-installed supplementary packages
- Data Science Studio startup and shutdown scripts and command-line tools

Depending on your configuration, the data directory can also contain some managed datasets. Managed datasets can also be created outside of the data directory with some additional configuration.

It is highly recommended that you reserve at least 100 GB of space for the data directory.
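As a hedged illustration (the path is an assumption), a simple stdlib check of the space available to the data directory against the 100 GB recommendation:

import shutil
from pathlib import Path

DATADIR = Path("/data/dataiku/dss_data")   # assumed DSS data directory
MIN_FREE_GB = 100

usage = shutil.disk_usage(DATADIR)
free_gb = usage.free / 1024**3
print(f"{DATADIR}: {free_gb:.1f} GB free of {usage.total / 1024**3:.1f} GB")
if free_gb < MIN_FREE_GB:
    print(f"WARNING: less than {MIN_FREE_GB} GB free for the data directory")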

© Dataiku – 2020 – Confidential and proprietary information 25


Data Directory

- DSS binaries: binary files from the DSS installation package — dataiku-dss-x.x.x (adding to a dedicated backup is optional)
- Workspace: the DSS workspace, initialized during installation and evolving over time with projects and configuration — $DSS_HOME
- Configuration and metadata (critical) — $DSS_HOME/config
- Logs: DSS core component activity logs ($DSS_HOME/run) and user-triggered jobs ($DSS_HOME/jobs)
© Dataiku – 2020 – Confidential and proprietary information 26


DSS Processes

©2020 dataiku, Inc. | dataiku.com | [email protected] | @dataiku


DSS Components and Processes
Starting the DSS Design/Automation Node

4 processes are spawned:
- Supervisor: process manager
- Nginx server, listening on the installation port
- Backend server, listening on the installation port + 1
- IPython (Jupyter) server, listening on the installation port + 2

The next slides detail the role of each server and where they sit in the overall DSS architecture.
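The port layout can be summarized as offsets from the base installation port; the sketch below simply encodes that mapping (the base port value is an assumption):

BASE_PORT = 11200  # assumed Design node base installation port

# Offsets from the base port, as described above.
DSS_PORTS = {
    "nginx (user-facing HTTP proxy)": BASE_PORT,
    "backend (metadata server)":      BASE_PORT + 1,
    "ipython / jupyter (notebooks)":  BASE_PORT + 2,
}

for component, port in DSS_PORTS.items():
    print(f"{component}: {port}")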

© Dataiku – 2020 – Confidential and proprietary information 28


DSS Core (Design and Automation)

© Dataiku – 2020 – Confidential and proprietary information 29


DSS Components and Processes
NGINX

Handles all interactions with the end users through their web browser. It acts as an HTTP proxy, forwarding requests to all other DSS components. It binds to the DSS port number specified at install time.

Protocol: HTTP(S) and WebSockets.
© Dataiku – 2020 – Confidential and proprietary information 30


DSS Components and Processes
BACKEND

The metadata server of DSS:
● Interacts with the config folder
● Prepares previews
● Explore (e.g. chart aggregations)
● git
● Public API
● Scheduling
● Scenarios

It binds to the DSS port number specified at install time + 1.

The backend is a single point of failure: if it goes down, DSS goes down with it. Hence it is supposed to handle as little actual processing as possible. The backend can spawn child processes: custom scenario steps/triggers, Scala validation, the API node DevServer, macros, etc.
© Dataiku – 2020 – Confidential and proprietary information 31
DSS Components and Processes
IPYTHON (JUPYTER)

Handles interactions with the R, Python, and Scala notebook kernels using the ZMQ protocol.

It binds to the DSS port number specified at install time + 2.
© Dataiku – 2020 – Confidential and proprietary information 32


DSS Components and Processes
JOB EXECUTION KERNEL (JEK)

Handles dependency computation and recipes running on the DSS engine. For other engines and code recipes, it launches child processes: Python, R, Spark, SQL, etc.
© Dataiku – 2020 – Confidential and proprietary information 33


DSS Components and Processes
FUTURE EXECUTION KERNEL (FEK)

Handles non-job-related background tasks that may be dangerous, such as:
● Metrics computation (it can launch child Python processes for custom Python metrics)
● Sample building for machine learning and charts
● Machine learning preparation steps
© Dataiku – 2020 – Confidential and proprietary information 34


DSS Components and Processes
ANALYSIS SERVER

Handles Python-based machine learning training, as well as data preprocessing.

WEBAPP BACKEND

Handles the backends of user-created webapps (Python Flask, Python Bokeh, and R Shiny).
© Dataiku – 2020 – Confidential and proprietary information 35


Global Architecture

©2020 dataiku, Inc. | dataiku.com | [email protected] | @dataiku 36


Global Architecture Design Considerations

Type of production integration:

Batch-oriented workflows: data preparation, feature engineering, model training, and batch scoring with trained models
- Requires an automation node
- Interaction with other components, such as ETL flows, can be managed with an enterprise orchestrator

Real-time scoring: deployment of machine learning models behind an API
- Requires a scoring node, and generally an automation node to implement automatic retraining and updates
- Sizing is mainly driven by the number of queries per second you need to handle and the number of services
© Dataiku – 2020 – Confidential and proprietary information 37


Topics to Consider During Architecture Design

Model life cycle management:
- Define the process for going from design to the automation and API nodes
- The design can be influenced by different teams owning parts of the process
- The complexity and number of staging environments can vary depending on the criticality of the job

Advanced deployment recommendations:
- Build a separate dev environment containing design, automation, and API nodes to test both the projects and the deployment
- Identify the appropriate number of test/staging environments for production
- The complete dev environment can be on the same physical layer
- For the production environment:
  - API nodes need to be deployed on at least 2 separate servers
  - The automation node should run on its own server
© Dataiku – 2020 – Confidential and proprietary information 38


Scaling the Design Node

We recommend splitting projects across multiple DSS instances based on several criteria:
- Team and project organization
- Expected collaboration
- Security and data access management
- Type and maturity of the project: early-stage experimentation vs. production-ready projects

We recommend a separate environment:
- For heavily resource-consuming ML experiments, to avoid slowing down the work of the entire team with a single job (note: this can be avoided with the use of Docker/K8s)
- To develop, maintain, and test plugins
© Dataiku – 2020 – Confidential and proprietary information 39


Centralized Skills
The “Center of Excellence” Model

Environment | Nodes | Users
Integration | Design, API | IT, Selected Users
UAT | Design, Automation, API | IT
Center of Excellence | Design, API | Advanced Analytics Teams
Pre-Prod | Automation, API | Dev Ops
Production | Automation, API | Dev Ops, Selected Users
© Dataiku – 2020 – Confidential and proprietary information 40


Data Sources

©2020 dataiku, Inc. | dataiku.com | [email protected] | @dataiku 41


What is the Catalog within DSS?

DSS Catalog goal: allow fast, explained, repeatable, governed access to assets created by or available to the DSS user.

Power of default sharing
- All assets created within DSS are indexed into ElasticSearch for immediate use
Power of default documentation and tagging
- All assets created within DSS can have documentation and tags associated with them, allowing for easy search and reuse of existing assets
Capability to visually define a common data model
- A common model can be exposed through projects, and there is some ability to expose it via a DSS data API
© Dataiku – 2020 – Confidential and proprietary information 42


Metadata Management within DSS

For DSS-created assets:
- Information capture at the point of artifact creation
- Tagging available for all artifacts
- Ongoing development: parent-child tagging and reporting (e.g. tag all the projects using a given connection)
- Capture of summary, description, and to-do list information for each artifact
- Creation / last-modification information
- Capture of analytic asset lineage
- Ability to push in and pull out metadata via the public REST API
- Meanings: business-defined terms and quality-measuring rules for datasets

For externally created assets:
- Indexing of existing SQL sources (name and connection only)
- Ability to push in and pull out metadata via the public REST API

Future release:
- Custom fields within the Data Catalog for all artifacts
© Dataiku – 2020 – Confidential and proprietary information 43


Data Quality within DSS

For dataset assets:
- Define Meanings: business-defined terms and quality-measuring rules for datasets
- UI-based embedded graphs and statistics
- Dataset metrics: set up simple and/or complex probes and checks of data to ascertain changes over time
- Statistically sound sampling techniques to detect issues over large datasets
- Access to simple yet powerful graphing capabilities to look for anomalies within datasets

Future release:
- New statistical exploration of datasets

© Dataiku – 2020 – Confidential and proprietary information 44


File-based Data Sources

Type | Read | Write
Upload your files | yes | yes
Server filesystem | yes | yes
HDFS | yes | yes
Amazon S3 | yes | yes
Google Cloud Storage | yes | yes
Azure Blob Storage | yes | yes
FTP | yes | yes
SSH (SCP and SFTP) | yes | yes
HTTP | yes | no

File-based Data Formats

Format | Read | Write
Delimited values (CSV, TSV, …) | yes | yes
Fixed width | yes | no
Excel (from Excel 97) | yes | only via export
Avro | yes | yes
Custom format using regular expression | yes | no
XML | yes | no
JSON | yes | no
ESRI Shapefiles | yes | no
MySQL dump | yes | no
Apache Combined log format | yes | no

Hadoop-specific Data Formats

Format | Read | Write
Parquet | yes | yes
Hive SequenceFile | yes | yes
Hive RCFile | yes | yes
Hive ORCFile | yes | yes
© Dataiku – 2020 – Confidential and proprietary information 45
SQL and NoSQL-based Data Sources

SQL Data Sources

Type | Read | Write
MySQL | yes | yes
PostgreSQL | yes | yes
Vertica | yes | yes
Amazon Redshift | yes | yes
Greenplum | yes | yes
Teradata | yes | yes
IBM Netezza | yes | yes
SAP HANA | yes | yes
Oracle | yes | yes
Exadata | yes | yes
Microsoft SQL Server | yes | yes
Google BigQuery | yes | yes
Snowflake | yes | yes

NoSQL Data Sources

Type | Read | Write
MongoDB | yes | yes
Cassandra | yes | yes
ElasticSearch | yes | yes

Beta and Best-Effort Data Sources

Type | Read | Write
IBM DB2 | yes (Beta) | yes (Beta)
Exasol | yes (Beta) | yes (Beta)
MemSQL | yes (Beta) | yes (Beta)
Other SQL databases (JDBC driver, best effort) | yes | not generally guaranteed
Twitter (Streaming API) | yes | no
Generic APIs | custom Python or R code, plugins | custom Python or R code, plugins
© Dataiku – 2020 – Confidential and proprietary information 46


IT Integration

©2020 dataiku, Inc. | dataiku.com | [email protected] | @dataiku 47


Single Sign-On

DSS supports the following SSO protocols:
- SAMLv2
- SPNEGO / Kerberos

SP-Initiated SSO
DSS acts as a SAML Service Provider (SP), which can authenticate against an Identity Provider (IdP).
Supported IdPs include:
- Okta
- Ping Identity (PingFederate)
- Azure Active Directory
- Google G Suite
© Dataiku – 2020 – Confidential and proprietary information 48


Key Integration Points

Key points of security integration:
- Configures easily for use with HTTPS
- Supports LDAP, LDAPS, and LDAP/TLS
- Supports SSO (SAMLv2 and SPNEGO)
- Relies on impersonation where applicable:
  - sudo on Unix
  - proxy user on Hadoop / Oracle
  - constrained delegation for SQL Server
  - otherwise, personal credentials for other databases
- Complete audit trail, exportable to an external system
- Permissions and multi-level authorization dashboard
© Dataiku – 2020 – Confidential and proprietary information 49


Proxies

Why proxies with DSS?
- To expose DSS on a different host/port than its native installation (reverse proxy)
- When DSS is installed on a server without direct outgoing internet access (forward proxy)

WebSocket
DSS uses the WebSocket protocol for part of its user interface.
Forward or reverse proxies configured between DSS and its users must support WebSocket, for example:
- Nginx 1.3.13 and above
- Apache 2.4.5 and above
- Amazon Web Services
© Dataiku – 2020 – Confidential and proprietary information 50


Audit Trail

Audit trail details:
- DSS includes an audit trail that logs all actions performed by users
- Log files are available on the server
- DSS emits audit data using the log4j library: all log4j appenders can be used (Kafka, …)
© Dataiku – 2020 – Confidential and proprietary information 51


DSS Security

©2020 dataiku, Inc. | dataiku.com | [email protected] | @dataiku 52


DSS Security: Concepts

- Security is based mainly on projects and groups
- Users belong to groups
- Groups are managed only by global administrators
- The two main security atoms of DSS are the project and the connection
- Groups are granted permissions:
  - On connections
  - On projects
  - Global (instance-wide) permissions
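Purely as an illustration of this model (a toy sketch, not Dataiku's internal data structures; all names are hypothetical):

from dataclasses import dataclass, field

@dataclass
class Group:
    name: str
    project_permissions: dict = field(default_factory=dict)    # project key -> permission level
    connection_permissions: set = field(default_factory=set)   # connections the group may use
    global_permissions: set = field(default_factory=set)       # instance-wide permissions

@dataclass
class User:
    login: str
    groups: list  # users belong to groups; the groups carry the permissions

analysts = Group(
    name="analysts",
    project_permissions={"CHURN": "read-write"},
    connection_permissions={"warehouse_pg"},
)
admins = Group(name="admins", global_permissions={"admin"})

alice = User(login="alice", groups=[analysts])

def can_use_connection(user: User, connection: str) -> bool:
    """A user may use a connection if any of their groups is granted it."""
    return any(connection in g.connection_permissions for g in user.groups)

print(can_use_connection(alice, "warehouse_pg"))  # True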

© Dataiku – 2020 – Confidential and proprietary information 53


Motivation and Concept

REGULAR SECURITY
• Data access is partitioned by connection security rules
• Project access is partitioned by project security rules
• User groups are split, with specific permissions for executing code
• All code is executed with the DSS service account
• Therefore, in this model a hostile coder can write arbitrary code which runs as the DSS service account

USER ISOLATION FRAMEWORK SECURITY
• In addition to the previous capabilities, it is now possible to prevent hostile code from overriding access rules
• The identity of the logged-in DSS user is propagated to the execution of local and remote (Hadoop and container) code
• Allows traceability and reliance on the internal access controls of the clusters
• Also permits more per-user resource control
© Dataiku – 2020 – Confidential and proprietary information 54


Security Models

Scope | Basic Security | User Isolation Framework Security
Dataiku objects | Dataiku enforces security per project and connection | Dataiku enforces security per project and connection
Authentication | Password-based or LDAP-based authentication | Password-based or LDAP-based authentication
DSS host (Python, R, bash, …) | All user actions run using a service account | Each user runs processes using their own account
Databases (data) | Single database account for all users | Each user connects to each database using their own account
Hadoop (Spark, Impala, Hive, …) | A single service account authenticated with Kerberos | Each user launches Hadoop jobs as their own account using Kerberos
Audit trail | Dataiku native audit trail | Dataiku native audit trail + Hadoop audit trail
© Dataiku – 2020 – Confidential and proprietary information 55


Comparing Security Models

Feature | Regular Security | User Isolation Framework Security
Access control on projects | Yes | Yes
Access control on connections | Yes | Yes
Enforcement of permissions to execute code | Yes | Yes
Execution of local code | Single service user | End users
Execution of Hadoop and Spark code | Single service user | End users
Connecting to secure Hadoop clusters (Kerberos) | Yes | Yes
HDFS ACLs to enforce permissions even against code execution | No | Yes
Traceability of all actions, including code execution | Yes | Yes
Non-repudiable audit log | Yes | Yes
Hadoop-level traceability of actions (Cloudera Navigator, Atlas, ...) | Single service user | End users
© Dataiku – 2020 – Confidential and proprietary information 56


Project-level Security

© Dataiku – 2020 – Confidential and proprietary information 57


Security with Hadoop
Without Security

Diagram overview: users U1, U2, and U3 all connect to DSS, which runs their jobs under the single DSS system user; on the Hadoop side, those jobs all run as a single Hadoop user against the data, so audit tools cannot distinguish which end user or group ran which job.
© Dataiku – 2020 – Confidential and proprietary information 58


Standard Hadoop Security

© Dataiku – 2020 – Confidential and proprietary information 59


User Isolation Framework Security

© Dataiku – 2020 – Confidential and proprietary information 60


Requirements for the User Isolation Framework
To impersonate, you need privileges

User Isolation Framework security requires:
- Root access on the machine where DSS runs
  - Required for impersonation of local processes
- The ‘proxyuser’ privilege on the Hadoop cluster for the DSS service account
  - A cluster-wide privilege
  - Effectively gives the DSS account control over the end-user accounts on the cluster
© Dataiku – 2020 – Confidential and proprietary information 61


Guidelines

©2020 dataiku, Inc. | dataiku.com | [email protected] | @dataiku 62


Full Lifecycle of a Project - Example

IT Sandbox (Design, Automation/API) — IT Team:
1. Test solution updates
2. Develop add-ons

Design (Design, Automation/API) — Data Team:
1. Design the project
2. Define the automation process
3. Create the project’s bundle
4. Test the project’s automation

Pre-Production (Automation/API) — IT Team:
1. Validate the project’s automation

Production (Automation/API) — IT Team and end users:
1. Run the project in production
2. Monitor performance and results

The API Deployer manages API movement across the environments.
© Dataiku – 2020 – Confidential and proprietary information 73


DSS Code Environments

©2020 dataiku, Inc. | dataiku.com | [email protected] | @dataiku


Code Environments in DSS

DSS allows you to create an arbitrary number of code environments!
→ A code environment is a standalone and self-contained environment to run Python or R code
→ Each code environment has its own set of packages
→ In the case of Python environments, each environment may also use its own version of Python

→ You can set a default code env per project
→ You can choose a specific code env for any Python/R recipe
→ You can choose a specific code env for the visual ML
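As a small, generic illustration of what "self-contained" means in practice (plain Python, no DSS-specific API), code running inside a given environment sees only that environment's interpreter and packages:

import sys
import platform
import importlib.metadata as metadata  # Python 3.8+

# The interpreter and package set reported below differ from one code env to another.
print("Interpreter:", sys.executable)
print("Python version:", platform.python_version())

for dist in sorted(metadata.distributions(), key=lambda d: d.metadata["Name"].lower()):
    print(f"{dist.metadata['Name']}=={dist.version}")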

© Dataiku – 2020 – Confidential and proprietary information


Code Environments in DSS: Intro
➢ DSS allows Data Scientists to create and manage their own Python and R coding environments, if given permission to do so by an admin (group permissions)
➢ These envs can be activated and deactivated for different pieces of code / levels in DSS, including:
○ Projects, web apps, notebooks, and plugins
➢ To create / manage code envs: click the gear -> Administration -> Code Envs
© Dataiku – 2020 – Confidential and proprietary information


Code Environments in DSS:
Installing Packages in Your Env

To install packages in your env:
▪ Click on your env in the list of code envs
▪ Go to the ‘Packages to Install’ section
▪ Type in the packages you wish to install, line by line, as you would in a requirements.txt file
▪ Click Save and Update

Standard pip syntax applies here, i.e. -e /local/path/to/lib will install a local Python package not available on PyPI.
Review installed packages in “Installed Packages”.
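For example (the package names and versions below are purely illustrative; the local path is the one used above), the ‘Packages to Install’ field could contain:

pandas==1.0.5
scikit-learn>=0.22,<0.24
xgboost
requests
-e /local/path/to/lib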
© Dataiku – 2020 – Confidential and proprietary information
Code Environments in DSS:
Other Options

Permissions
- Allow groups to use the code env and define their level of use: e.g. use only, or can manage/update
Container Exec
- Build Docker images that include the libraries of your code env
- Build for specific container configs or for all configs
Logs
- Review any errors encountered while installing the code env
© Dataiku – 2020 – Confidential and proprietary information


Code Environments in DSS:
Non-Standard PyPI/CRAN Servers

By default, DSS connects to public repositories (PyPI/Conda/CRAN) in order to download libraries for code envs.
This is undesirable in some customer deployments:
- air-gapped installations
- customers with restrictions on library use

Admins can set up specific mirrors for use in code environments:
ADMIN > SETTINGS > MISC > Code env extra options
Set the CRAN mirror URL and extra options for pip/conda as needed, following the standard pip/conda documentation (for example, --index-url for pip).
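As an illustration (the mirror URL is an assumption, but the flags are standard pip options), the extra pip options for a private mirror could look like:

--index-url https://pypi.mirror.example.com/simple
--trusted-host pypi.mirror.example.com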

© Dataiku – 2020 – Confidential and proprietary information


©2020 dataiku, Inc. | dataiku.com | [email protected] | @dataiku
