Reducing the cold start time of your inference server is a key part of optimizing its performance. Here are some steps you can take.

Keep the model in the image

Cold start time directly affects inference performance and user experience. Two factors contribute most to it:

  • Image pulling: When you deploy a model, the image is pulled from the Docker registry, which can take some time.
  • Model pulling: The model must also be pulled from Hugging Face Hub.

Let's take Stable Diffusion v1.5 as an example. The inference server image is about 4GB, while the full runwayml/stable-diffusion-v1-5 model on Hugging Face Hub is approximately 20GB.
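To get a feel for how the sizes above translate into download time, a rough estimate can be computed as below. This is only a back-of-the-envelope sketch: the 100 MB/s effective bandwidth is an assumed figure, not a measurement, and pull_seconds is a hypothetical helper.

```python
# Rough download-time estimate, assuming a constant effective bandwidth.


def pull_seconds(size_gb: float, bandwidth_mb_per_s: float) -> float:
    """Return the seconds needed to download size_gb at the given bandwidth."""
    if bandwidth_mb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb * 1000 / bandwidth_mb_per_s  # 1 GB = 1000 MB


IMAGE_GB = 4     # inference server image size from the text
MODEL_GB = 20    # full model repository size from the text
BANDWIDTH = 100  # MB/s, an assumed figure for illustration only

image_time = pull_seconds(IMAGE_GB, BANDWIDTH)
model_time = pull_seconds(MODEL_GB, BANDWIDTH)
print(f"image pull: {image_time:.0f}s, model pull: {model_time:.0f}s")
```

Under these assumptions the model download takes five times as long as the image pull, so it is the first thing worth addressing.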

To limit the impact of model size and related dependencies on cold start, we recommend keeping the model in the image. This removes the model download from server startup and reduces the overall cold start time. Mosec can help you achieve this with its --dry-run option.

WORKDIR workspace
# keep the model in the image
RUN python main.py --dry-run
# disable huggingface update check to accelerate the cold start
# (assumed setting: HF_HUB_OFFLINE turns on huggingface_hub's offline mode)
ENV HF_HUB_OFFLINE=1
ENTRYPOINT [ "python", "main.py" ]
CMD [ "--timeout", "20000" ]

Adding RUN python main.py --dry-run to your Dockerfile, as in the snippet above, makes the model download happen while the image is built.
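One way to confirm that the build step really left the weights in the image is to look for the model in the Hugging Face cache directory. The sketch below assumes the standard hub cache layout (models--<org>--<name>/snapshots/<revision>/). is_model_cached is a hypothetical helper, not part of Mosec or huggingface_hub, and the cache path to pass in depends on how your image is configured.

```python
from pathlib import Path


def is_model_cached(cache_dir: str, repo_id: str) -> bool:
    """Return True if cache_dir holds at least one non-empty snapshot of repo_id."""
    # The hub cache stores each model repo under "models--<org>--<name>".
    repo_dir = Path(cache_dir) / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    if not snapshots.is_dir():
        return False
    # A snapshot counts only if its revision directory contains something.
    return any(rev.is_dir() and any(rev.iterdir()) for rev in snapshots.iterdir())


if __name__ == "__main__":
    # Default cache location when HF_HOME and HF_HUB_CACHE are not set.
    default_cache = Path.home() / ".cache" / "huggingface" / "hub"
    print(is_model_cached(str(default_cache), "runwayml/stable-diffusion-v1-5"))
```

Running this inside the built image (for example in a later build step) gives a quick signal before you deploy.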

Reduce the size of your image

Even with the model in the image, it is still beneficial to keep the image as small as possible.

There are several strategies you could employ to reduce the size of your image, including:

  • Use a lightweight base image: Choose a base image that is lightweight, so that you are not carrying excess weight in your image.
  • Try envd instead of Docker: If you're not familiar with Docker or building Docker images, then envd could be a great tool to use. It provides a simplified interface for building images and can be used to build lightweight model serving images optimized for your specific use case.
  • Remove unnecessary files: Make sure that your image only includes the files and dependencies required to run your model - this can help reduce the overall size of your image.
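To decide which files are worth removing, it helps to see where the bytes actually go. The following is a minimal sketch using only the standard library. largest_files is a hypothetical helper, and the directory to scan is up to you.

```python
import os
from typing import List, Tuple


def largest_files(root: str, top_n: int = 10) -> List[Tuple[int, str]]:
    """Return up to top_n (size_in_bytes, path) pairs under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so their targets are not counted twice
            sizes.append((os.path.getsize(path), path))
    return sorted(sizes, reverse=True)[:top_n]


if __name__ == "__main__":
    for size, path in largest_files(".", top_n=5):
        print(f"{size / 1e6:10.1f} MB  {path}")
```

Running it against the working directory of the image usually surfaces build caches, test data, or duplicate weights that the server does not need at runtime.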

Pre-loading or Pre-warming

One technique that can help reduce load times and improve the overall efficiency of your model serving is to load your models when initializing the server. This approach is commonly known as pre-loading or pre-warming the server.

Mosec does this by default: the worker initializes the pipeline from a pre-trained model in its __init__ method, and the model is downloaded during initialization if it is not already available locally.

You can also pre-warm the server by setting self.example. Mosec runs the example prompts through the worker to warm it up before it serves real requests.

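from io import BytesIO
from typing import List

import torch
from diffusers import StableDiffusionPipeline
from mosec import Worker, get_logger
from mosec.mixin import MsgpackMixin

logger = get_logger()

# MsgpackMixin lets the worker return raw bytes, which JSON cannot encode
class StableDiffusion(MsgpackMixin, Worker):
    def __init__(self):
        self.pipe = StableDiffusionPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
        )
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.pipe = self.pipe.to(device)
        self.example = ["useless example prompt"] * 4  # warmup (bs=4)

    def forward(self, data: List[str]) -> List[memoryview]:
        logger.debug("generate images for %s", data)
        res = self.pipe(data)
        logger.debug("NSFW: %s", res[1])
        images = []
        for img in res[0]:
            dummy_file = BytesIO()
            img.save(dummy_file, format="JPEG")
            images.append(dummy_file.getbuffer())  # keep the encoded JPEG bytes
        return images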
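
The warm-up idea itself does not depend on Mosec and can be sketched with a toy worker. The code below is an illustration only, not Mosec's implementation. ToyWorker and start are hypothetical names, and the "model" is a small dictionary standing in for real weights.

```python
from typing import List


class ToyWorker:
    """Stand-in for a model worker; not Mosec's actual Worker class."""

    def __init__(self) -> None:
        # Pre-loading: build the (fake) model at startup, not on first request.
        self.vocab = {"a": 1, "b": 2}
        self.example = ["a b", "b"]  # sample batch used only for warm-up
        self.calls = 0

    def forward(self, data: List[str]) -> List[int]:
        self.calls += 1
        return [sum(self.vocab.get(tok, 0) for tok in text.split()) for text in data]


def start(worker_cls) -> ToyWorker:
    """Create the worker and run its example batch once before serving."""
    worker = worker_cls()
    example = getattr(worker, "example", None)
    if example is not None:
        worker.forward(example)  # warm-up call; the result is discarded
    return worker


worker = start(ToyWorker)
print(worker.calls)             # 1: the warm-up call already happened
print(worker.forward(["a a"]))  # [2]
```

With a real model, the first forward call often pays one-time costs such as GPU kernel setup or memory allocation, so running it on example data at startup keeps that cost away from the first user request.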