Frameworks & Templates


ImageBind learns a joint embedding across six modalities: images, text, audio, depth, thermal, and IMU data. It enables emergent applications out of the box, including cross-modal retrieval, composing modalities with arithmetic, and cross-modal detection and generation.
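To make cross-modal retrieval and modality arithmetic concrete, here is a minimal sketch. It assumes each modality has already been embedded into the same vector space. The 3-dimensional vectors, the file names, and the helper functions `cosine`, `retrieve`, and `compose` are hypothetical and only for illustration; they are not part of ImageBind or modelz.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve(query, gallery):
    """Return the gallery key whose embedding is most similar to the query."""
    return max(gallery, key=lambda name: cosine(query, gallery[name]))


def compose(a, b):
    """Add two embeddings element-wise and L2-normalize the sum."""
    summed = [x + y for x, y in zip(a, b)]
    norm = math.sqrt(sum(x * x for x in summed))
    return [x / norm for x in summed]


# Toy vectors standing in for real embeddings (assumed values, not model output).
image_gallery = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "beach.jpg": [0.0, 0.1, 0.9],
    "dog_on_beach.jpg": [0.6, 0.1, 0.6],
}
text_dog = [1.0, 0.0, 0.0]      # stands in for a text embedding of "A dog"
audio_waves = [0.0, 0.0, 1.0]   # stands in for an audio embedding of waves

# Cross-modal retrieval: a text query searches an image gallery.
best_for_text = retrieve(text_dog, image_gallery)

# Modality arithmetic: text + audio queries the same gallery.
best_for_mix = retrieve(compose(text_dog, audio_waves), image_gallery)

print(best_for_text)  # dog.jpg
print(best_for_mix)   # dog_on_beach.jpg
```

Because all modalities share one space, the same similarity function serves every query type; only the source of the query vector changes.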


We provide the mosec inference service template.

  • CLI
# CLI doesn't support multiple words now
export MODELZ_API_KEY=mzi-abcdefg...
modelz inference --deployment imagebind-XXX --serde msgpack model=imagebind-text input="A dog"
  • Python Client
import modelz

APIKey = "mzi-abcdefg..."
cli = modelz.ModelzClient(deployment="imagebind-XXX", key=APIKey)
# Request payload: the model name and the list of texts to embed
data = {"model": "imagebind-text", "input": ["A dog", "doggery", "puppy"]}
resp = cli.inference(params=data, serde="msgpack")
embeddings = resp["data"]
for emb in embeddings:
    print(emb)

The template supports embedding image, audio, and video data. Please refer to our example for more embedding use cases, or to the type hints for the modelz Request and Response.