Knit launches YARN applications from Python.
Knit provides Python methods to quickly launch, monitor, and destroy distributed programs running on a YARN cluster, such as is found in traditional Hadoop environments.
Knit was originally designed to deploy Dask applications on YARN, but can deploy more general, non-Dask, applications as well.
Using conda, Knit can also deploy fully-featured Python environments within YARN containers, sending along useful libraries like NumPy, Pandas, and Scikit-Learn to all of the containers in the YARN cluster.
conda to install:
conda install knit -c conda-forge # or pip install knit
Launch Generic Applications¶
knit with valid ResourceManager/Namenode IP/Ports and create a command string to run
in all YARN containers
from knit import Knit k = Knit(autodetect=True) # autodetect IP/Ports for YARN/HADOOP
Create a software environment with necessary packages
env = k.create_env('my-environment', packages=['python=3.5', 'scikit-learn','pandas'], channels=['conda-forge']) # specify anaconda.org channels
Run a command within that environment on multiple containers
cmd = 'python -c "import sys; print(sys.version_info);"' app_id = k.start(cmd, num_containers=2, env=env) # start application
start method also takes parameters to define the resource requirements
of the application like
Knit makes it easy to launch Dask on Yarn:
from knit.dask_yarn import DaskYARNCluster env = k.create_env('my-environment', packages=['python=3.6', 'scikit-learn', 'pandas', 'dask'], channels=['conda-forge']) # specify anaconda.org channels cluster = DaskYARNCluster(env=env) cluster.start(nworkers=10, memory=4096, cpus=2) from dask.distributed import Client client = Client(cluster) # Connect local Dask client to cluster
If you want to connect to the remote Dask cluster from your local computer as is done in the last line then your local and remote environments should be similar.
Yarn clusters typically lack strong Python environments with common libraries like NumPy, Pandas, and Scikit Learn. To resolve this, Knit creates redeployable conda environments that can be shipped along with your Yarn job, effectively bringing a fully-featured Python software environment to your Yarn cluster.
To achieve this Knit uses redeployable conda environments. Every time you create a new environment Knit will use conda locally to manage and download dependencies, and then will wrap those packages into a self-contained zip file that can be shipped to Yarn applications.
Knit is not a full featured YARN solution. Knit focuses on the common case in computational workloads of starting a distributed process on many workers for a relatively short period of time. It does not provide fine-grained access to all Yarn functionality.