
Cloud computing + containerization for scientists

William Trimble
2022 BioDS workshop
Who has had the “grumpy sysadmin”
problem?
• Student: Hey, look there’s this piece of software
that claims to process exactly the kind of data I
need to process.
• Sysadmin: You’re not installing those unholy
packages from the University of We-Don’t-
Sanitize-Our-Inputs on my server.

• Install on your laptop?

• Get some private environment on the server?

• Get your own server?


Meta-tools
Two meta-tools: containers (Docker, Singularity) and cloud computing (AWS, Azure).

Both classes of meta-tools have considerable commercial support (big companies are happy to have you learn what they can do), so tutorials on Docker and AWS are very easy to find.
Containerization:

Run code, but
• Limit the resources the code can touch
• Package the code with an environment (file system, operating system) that it works in
• Allow multiple containers to talk to each other
What can you do with this?
• Applications & servers. Provide the data & you can have a blog
server, database server, Jenkins testing robot, git server, storage
node…
• Complex or fragile applications with elaborate dependencies (get
someone else to take care of dependency hell)
• Reproducible computing environment: make the program less like a
program and more like an instrument.
• Portability advantage: debug on laptop, run in monster computing
environment.
Use docker to do something simple but useful
• https://www.codegrepper.com/code-examples/shell/how+to+convert+pem+to+ppk+in+ubuntu

• I ran ssh-keygen to make a public/private key pair, but it is in OpenSSH format, and it won’t work with PuTTY, the Windows SSH client.
• There is a Linux package that will change the format in one line:
puttygen id_biods.pem -o id_biods.ppk
Start a container…
mkdir ~/testdir && cd ~/testdir && cp ~/Desktop/id_biods.pem .
docker run --rm -v "$PWD":/work -it ubuntu

docker run
--rm             clean up when done
-v "$PWD":/work  allow access to current directory
-it ubuntu       run the ubuntu image interactively
Converting id_biods.pem to id_biods.ppk in a docker container
Once inside my docker container, I need to install the Linux package putty.

apt-get update
apt-get install -y putty
cd /work
ls
puttygen id_biods.pem -o id_biods.ppk
ls
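The interactive steps above can be collapsed into a single throwaway invocation. This is a sketch, assuming id_biods.pem sits in the current directory; on Debian/Ubuntu the puttygen binary ships in the putty-tools package.

```shell
# One-shot conversion: mount the current directory, install putty-tools,
# convert, and let --rm discard the container afterwards.
docker run --rm -v "$PWD":/work ubuntu bash -c \
  'apt-get update -qq && apt-get install -qq -y putty-tools && \
   puttygen /work/id_biods.pem -o /work/id_biods.ppk'
```

Because the container is deleted on exit, only the converted id_biods.ppk in the mounted directory survives.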
What is the catch?
• Consumes a lot of hard drive space, potentially heavy on network I/O
if you are building images from scratch all the time.
• Containers are weird. It was hard enough learning how to use a
command-line tool, now you are saying I have to wrap it into an
eggroll with a copy of an entire linux operating system?
• Need servers to host docker machine images.
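When the hard-drive-space problem bites, Docker can report and reclaim the space it is using; a quick sketch:

```shell
# Show how much disk images, containers, and volumes occupy:
docker system df

# Delete stopped containers, dangling images, and unused networks
# (prompts for confirmation; add -a to remove all unused images too):
docker system prune
```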
Step-by-step docker for bioinformatics
Melbourne Bioinformatics has an excellent step-by-step docker tutorial,
• https://www.melbournebioinformatics.org.au/tutorials/tutorials/doc
ker/docker/
Docker engine is a little paranoid
• We need to explicitly authorize containers to access the file system and system ports. (Why?)
• Don’t need to install on host OS.
• Containers with single applications (+ dependencies) and more elaborate computing environments can be found.
So.. let’s run a jupyter server in docker…
• https://jupyter-docker-stacks.readthedocs.io/en/latest/

# Download the “blank jupyter server” image:
docker pull jupyter/datascience-notebook

docker run -it --rm -p 8888:8888 -v "${PWD}":/home/jovyan/work jupyter/datascience-notebook

# This should start a jupyter server on localhost:8888
docker run [docker options] <IMAGE NAME>
[image arguments…]
docker run
-it # interactive
--rm # delete on exit
-p 8888:8888 # allow port 8888
-v "${PWD}":/home/jovyan/work # allow $PWD
jupyter/datascience-notebook # name of image

NOTE: because image arguments is a variable-length field, it must follow the image name at the end of the line.
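One practical wrinkle: the notebook image prints a login token on startup. If the container was started detached (-d) instead of with -it, the token is still retrievable from the container logs; a sketch, where the container ID is a placeholder you read off docker ps:

```shell
# Find the running container, then fish the token URL out of its logs:
docker ps
docker logs <container-id> 2>&1 | grep 'token='   # <container-id> is a placeholder
```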
Use cases
• Web server in a container (mediawiki, blogging platform..)
• Compute environment (jupyter server)
• Database in a container (empty or pre-loaded)
• Utility in a container (single-purpose docker container)

• portability advantage
Bad stuff?
• Since the images are like tarballs, you don’t necessarily know what is in them (and no one is inclined to inspect Linux hard drives for safety!)
• If you could see the commands used to build the image, or could build the image yourself, you would download & install 500 MB from the Linux distribution sources instead of a 500 MB image from dockerhub.
• This is what building an image from a Dockerfile does.
• It is safer to download a Dockerfile and build the image than to run a pre-built image…
• (But we all install things from github without vetting them.) Whether you would be forgiven for running something from dockerhub without vetting it depends on the severity of the damage done.
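You can partially inspect a pre-built image without running it; for example, docker history lists the command that created each layer (though it cannot show the contents of files that were added or downloaded during the build):

```shell
# Print the build command behind each layer, untruncated:
docker history --no-trunc jupyter/datascience-notebook
```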
Expiration date
• Dockerfiles need maintenance every 2-3 years.
• The APIs of some of the dependencies may have shifted, the repositories may have changed the names of the packages you need, and the OS version will go out of support, so you need to dust off your Dockerfile so it still works in 2024.
Inside the docker container, you get to be king
• Docker by default logs you in as ‘root’ of this miniature linux machine
• AWS by default has a non-root account (ubuntu, ec2-user..)
• Some commands require sudo prefix
• Potential damage is limited to the explicit permissions
What resources do I need?
• Memory
• Storage space
• Network connection

• Which of these things are we already paying for?
• How recently have we run into “out of storage space” or “out of memory” problems?
What resources do I need?
• Memory
• Storage space
• Network connection

• Computing is not free, and these resources charge nonzero rent once they exceed a certain scale.

• Code (and tutorials) is usually small and cheap, and can be hosted for free.
• Training data is either privacy-encumbered or easy to download again.
• Final-results type data: if it is valuable enough to retain indefinitely, you need to pay rent to have it preserved.
Computing has been commodified

Amazon EC2
What is cloud computing?
• https://aws.amazon.com/ec2/pricing/

• You give them a public key, a credit card, and an order for a part of a server
• They give you an IP address and control of a server
• “EC2 usage are billed on one second increments, with a minimum of 60 seconds.”
https://www.theregister.com/2022/05/02/cloud_market_share_q1_2022/
EC2 pricing

Instance name  On-demand hourly rate  vCPU  Memory  Storage   Network performance
a1.medium      $0.0255                1     2 GiB   EBS Only  Up to 10 Gigabit
a1.large       $0.051                 2     4 GiB   EBS Only  Up to 10 Gigabit
a1.xlarge      $0.102                 4     8 GiB   EBS Only  Up to 10 Gigabit
a1.2xlarge     $0.204                 8     16 GiB  EBS Only  Up to 10 Gigabit
a1.4xlarge     $0.408                 16    32 GiB  EBS Only  Up to 10 Gigabit

https://aws.amazon.com/ec2/pricing/on-demand/
Yeah, but what does it mean for me, the
scientist?
• When you run out of some essential resource you need for analysis,
you ask your sponsor for funding to rent computers to solve your
problem.
• You have to learn to steward ephemeral compute resources.
• Renting computers means paying by the hour in exchange for not
owning the depreciation and maintenance.
• Tending cloud servers has a different feedback cycle / pace from other
tasks. (Billing by the minute/by the hour will do that to you)
Ephemeral resources?
• Do you have a customized environment? Prompt, ssh keys,
shortcuts?
• The cloud computing paradigm insists on
• A separation between application environment & configuration and data
• Generic computing environments
• That are created and destroyed
So.. let’s go.
• If you want AWS nodes yourself, you must start by giving Jeff Bezos
your credit card number.
• Then you must generate SSH keys. When AWS creates instances, it
installs your public keys on all your nodes. This is how you are going
to control your (linux) nodes.
• The first year you sign up, AWS will give you some free compute:
“free tier eligible” (warning: these servers have so few resources
that sometimes things don’t work)

• So let me place an order for some servers you can use this afternoon.
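Placing that order can also be done from the AWS CLI. This is a hedged sketch: the AMI ID and security group ID are hypothetical placeholders to substitute with values from your own EC2 console, and id_biods is the key-pair name used elsewhere in these slides.

```shell
# All IDs below are placeholders -- substitute an AMI for your region,
# your registered key pair, and your own security group.
aws ec2 run-instances \
  --image-id ami-xxxxxxxx \
  --instance-type t2.micro \
  --key-name id_biods \
  --security-group-ids sg-xxxxxxxx \
  --count 1
```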
Setting up a jupyter notebook server on EC2
• There are some breadcrumbs at https://dataschool.com/data-modeling-101/running-jupyter-notebook-on-an-ec2-server/ and at https://docs.aws.amazon.com/dlami/latest/devguide/setup-jupyter.html

• I’ll create EC2 nodes with ports 22, 443, and 8888 open
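Opening those ports corresponds to security-group ingress rules; as a sketch (the group ID is a hypothetical placeholder), each port can be authorized from the CLI:

```shell
# Allow ssh (22), https (443), and jupyter (8888) from anywhere;
# sg-xxxxxxxx is a placeholder for your security group ID.
for port in 22 443 8888; do
  aws ec2 authorize-security-group-ingress \
    --group-id sg-xxxxxxxx --protocol tcp --port "$port" --cidr 0.0.0.0/0
done
```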
You will need the private key
-----BEGIN OPENSSH PRIVATE KEY-----
b3BlbnNzaC1rZXktdjEAAAAACmFlczI1Ni1jdHIAAAAGYmNyeXB0AAAAGAAAABAt7M+E4o
BTU1x4fRkNB7UmAAAAEAAAAAEAAAGXAAAAB3NzaC1yc2EAAAADAQABAAABgQDmcsE+9lAA
jj3hFALiAQX7Ae5Bx/4RMp2fNDwxucSWeHFgRpw+nD4RZItqqbrwBVBIyH5AwlK27W/AIG
0STvplFu/unHKXV+e0BJd7+jYZ0oblKjGKUXnaovMWEn1qRNan11t6GOaxZrw4cAfGxhF5
u+tP6PL8Ou7OKaGU1NhBat/RyOQGCSgkbr7D8D+Q1z4I8CcQ0n8SAzKPi4UwVGLR1f5XaC
R3Kis+m6C0F4ViHG1wWQ6Muio1OWW4hVdCXq5ZC3DmFGmPmOqO5VSjxxN/L7RfmXLvpPGp
Ph7mR8dpvUlamebpfac9ZQ5G8Kv8tO+88psCdNL4pAxnf858hq+wiUfdZ/ezr/tZAqCJir
Tp3yvmv+y/tOhkAS99aL7H4IpnFGaeDNmQiyxM97FOYzCByxIUlMrsAkAhPGepC9kmytKP
xWvfSCqxNpEZMm/4mBe7AZudUrOejzYkTl2DRp+S+eotKtKsuiaLNvVLd67rHsPZF9kuDv
FoUdDcVC085K0AAAWQDLP98qaueFhLjnrutwxg1abLVHE1e4O/c8i1HVGv98Sizp1Akk0u
hoOBclOC5ayi4YyxpZhghFVueme7fKQ+oZhxe3/h/4t2YifydD7ZkFRrBlrBjSqQORg94v
ZqtB/6pPjVWWmT5YEAtMnFjvSoRu+vdz9oom4RRvgPohH+kXIpHVKuTAvPR3MjOj39ugt2
gGzBy4M6BPh3PuU5R1LlNh7VKvFPknvK7YuRR7Jk7NFr5lCrtvckrTbFGi6mqwxrmddeFU
FZSfKORRjXkqt52R0YFbXozweyTLmCnHBoHboYay3rz6YL0r+nLjAZbmRySxhwgGPWG2qJ
otkXSoAR3Sv/dQ6OXHeSrvo3q6ek3Wn/frl3e7vroWAxBByosQvNYAfdbu70dE8LdXr2Tg
69b2BdGreb8vfczIoSGdZhQFRsF4yFHKg0DtiAQixIPlawK2DKQOysizCCYwL/FcQiRCvO
uDsWtb8PNqjsbGCIdmQX6AetMY1SxhCiAHCpoLYYbqse2aeVfQMWHTSJrjDiLH68Wp0UkP
XRyhiVK65rNJHQZJKsbZg5o0TjXTd7qcaqR6kRqeNtYKpfMsLL3PHOHK3zmDbEIS9iuJwD
z1Ht1cHHssCpl2iKstOJIh1GNwSWWqxAmL6wVUlLn83lPG9+uT/yhoj9K/9eat1NytXpUN
+efsc7hgL9CzQFR5MT1HpWI+WAU05TEr2LD++D6T1GfeoC28PoYQJtd7FnVJQrcvEGv41J
vOWARNDidfuG+ShGZ679vqQc1W26LqQAPRghLTp7QXAXwV2tW80GRyRjIBVeFnn9hiB+Mh
uUfCwRZ5bz5OCC3XKfEAtfDABz7FTOelXFHb7fHO+2TWKuH7F1+Ot+gLdVa7wFWyhA6Kyx
1z5oGNV6yijoGOFEIvPhZ9h5+E4ejEdXnDhdsjxqqcUbFZxVu4QpvdhOLr6zUAnvyx+cc2
FDWtJK7mbmgyP8k4uBPOv7+On+XRPJY/DeyL/xMccSLw9Vw7xorV92zllB4ZX+DfRagwZK
8Jo5XIjuvUt29rHYQp+L2PjxPP1f1hL2GeOOmQQMddksYlkFqxbJDj5raq5J2QFoKuTOld
Y60lKW9t4HyxG+EnrzyRlgsHw4AqlH4vKWRkzuwyODhPAYeboNltJegHA440LM0ZvUPgKl
Getting into our nodes: half the battle

ssh -i /Users/username/.ssh/id_biods [email protected]

ssh
-i /Users/username/.ssh/id_biods  path to local private key
ubuntu@                           username on remote
18.205.151.213                    remote hostname
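The same key also moves files on and off the node, for instance with scp (mydata.csv here is a hypothetical example file):

```shell
# Copy a local file into the remote user's home directory:
scp -i /Users/username/.ssh/id_biods mydata.csv [email protected]:~/
```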
Install jupyter on our (blank) node
sudo apt-get update
sudo apt-get install -y jupyter jupyter-core
Change a few bits to make jupyter behave:

mkdir -p ~/.jupyter
cat >> ~/.jupyter/jupyter_notebook_config.py <<EOF
conf = get_config()
conf.NotebookApp.ip = '0.0.0.0'
conf.NotebookApp.port = 8888
EOF

jupyter notebook password


Create requirements for SSL
# following https://docs.aws.amazon.com/dlami/latest/devguide/setup-jupyter-config.html
# create self-signed certificate

cd ~
mkdir ssl
cd ssl
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout mykey.key -out mycert.pem
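One way to sanity-check the freshly minted certificate before starting the server, as a quick sketch:

```shell
# Print the certificate's subject and validity window:
openssl x509 -in mycert.pem -noout -subject -dates
```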
Start server…
jupyter notebook --certfile=~/ssl/mycert.pem --keyfile ~/ssl/mykey.key

navigate to
https://18.205.151.xxx:8888
Customize $HOME/.ssh/config

Host P2
    Hostname 18.205.151.213
    User ubuntu
    IdentityFile /Users/wltrimbl/.ssh/id_biods

# Replaces
ssh -i /Users/wltrimbl/.ssh/id_biods [email protected]
# with
ssh P2

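With the Host alias in place, an alternative to opening port 8888 to the whole internet is to tunnel it over ssh and browse https://localhost:8888 locally; a sketch:

```shell
# Forward local port 8888 to port 8888 on the remote node:
ssh -L 8888:localhost:8888 P2
```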