Next-Level Model Investigation: Midjourney, Disco Diffusion, DALL-E Flow
Comparing higher-end consumer models.
The Story So Far
Hi there! If you aren’t up to speed on the project, this is a continuation of my work in illustrating a children’s book about my late puppy. I’ve evaluated simpler consumer-grade models for this purpose. I’ve also released a first draft of the book illustrated by using GPT-3 to generate text prompts and then plugging them into Craiyon.
Disco Diffusion: Complexity overload; power at a price
Using Disco Diffusion is daunting. It feels a little like sitting down to pilot a plane; it even comes with a 30-page manual. Here’s a Colab notebook with the current version of the model, 5.4:
Disco Diffusion 5.4 Colab Notebook
This notebook contains code for videos, 3D images, even VR output! Suffice to say, it’s stuffed to the gills with bells and whistles. The base notebook is decently good at hiding all the extra stuff if you don’t need it. However, when adapting the model for a redshoes.ai client, editing the code became extremely frustrating due to all the extra clutter.
Prompt
"An adorable watercolor painting of a bernese mountain dog puppy sitting next to a fat australian cattle dog, in a field of grass as children and other dogs play in the distance behind them, Trending on artstation"
"two dogs that are best friends", "a sunny day in the park"
"children's book art"
“Trending on artstation” is a bit of a silly addition that comes with the out-of-the-box example prompt. Here is what my first pass looked like, with 50 steps for each of 5 batches:
So, obviously not optimal. I’m on a time crunch today and will come back to Disco Diffusion in the future, but I think there’s a good reason most of the examples you see for it are landscapes: it seems to struggle a lot with drawing figures.
As the manual states repeatedly, and as can be seen on the subreddit, prompt and settings fine-tuning are extremely important to getting good results out of Disco Diffusion. Currently, it’s more art than science: you need a lot of practice before you start getting good output, which is painful with a model that takes so long to run.
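For a sense of what “settings tuning” means in practice, here’s a sketch of the core knobs from the notebook’s settings cells. The names follow the 5.4 notebook I used, but treat this as illustrative rather than exact — versions differ. The multi-prompt structure mirrors the prompt I listed above:

```python
# Sketch of Disco Diffusion-style settings (names follow the 5.4 notebook;
# illustrative, not exact).
text_prompts = {
    0: [  # prompts applied from frame 0 onward
        "An adorable watercolor painting of a bernese mountain dog puppy...",
        "two dogs that are best friends",
        "a sunny day in the park",
        "children's book art",
    ],
}

width_height = [1280, 768]  # any resolution, unlike Craiyon's fixed 256x256
steps = 50                  # diffusion steps per image; more = slower, cleaner
n_batches = 5               # number of images to generate
batch_name = "PuppyBook"    # outputs land in a folder of this name in Drive
```

Between the step count, batch count, and resolution, it’s easy to see why run times balloon — and why dialing these in by trial and error gets expensive.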
Pros
The biggest improvement of Disco Diffusion for my use case is the ability to set any image resolution, rather than being stuck with the 256x256 nightmare of Craiyon (Dall-e Mega).
Disco Diffusion saves its outputs right into Google Drive, which is the simplest and easiest filesystem implementation I’ve used to date.
Despite everything going on in the Colab, the model itself isn’t that hard to fine-tune. We’ve seen decent results with even small (<1K) datasets at redshoes.
Extremely knowledgeable and active community on both Reddit and Discord.
Cons
The model is extremely feature-rich, which gets overwhelming quickly when you need to edit the code and don’t care about the extra features. I did find a simplified notebook online, but I haven’t had the time to try it yet.
The Google Drive feature I mentioned earlier is frustrating if you just want to copy-paste images into a chat to show your friends, since you need to download them first.
Prompt creation and settings tuning are much more important than in other consumer-grade models.
Takes a long time to run, even for reduced step and batch sizes.
Dall-e Flow: Best for learning. Craiyon’s big brother, still a little janky.
DFlow has the most similar workflow to my previous experiments, which is comforting. Unfortunately, it also means that the starting model powering everything is Craiyon, which has its limitations. Garbage in, garbage out, as they say (no offense, Craiyon! I still love you ❤️). To be fair to Craiyon, it is still the only model that manages to produce images that are actually recognizable as specific dog breeds.
Dall-e Flow Colab Notebook
DFlow is “human in the loop,” so you shouldn’t just hit “run all.” To get the best results, you select your favorite image from each stage so the next stage can iterate on and improve it.
Prompt
“An adorable watercolor painting of a bernese mountain dog puppy sitting next to a fat australian cattle dog, in a field of grass as children and other dogs play in the distance behind them"
First, you run the Craiyon model to get a set of initial images from which to choose.
Then, you enter the number of the image you liked best. In this case, I told the model to improve image 0 (it’s 0-indexed). What’s nice here is that it will fill in the background if the image has a lot of whitespace.
I unfortunately couldn’t get the final step of upscaling to work due to the server being down. It works the same as the middle step where you choose your favorite. I’ll be updating the article later to show the final results of a run.
DFlow also comes with instructions on how to self-deploy the model, which is obviously very attractive for a lot of reasons. I haven’t tried it yet, but I’ll be dockerizing and hosting it or another model soon and will likely post the results. I’m not sure whether a local model would still have the same uptime issues I experienced in the final step, hopefully not.
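The select-and-refine loop above can be sketched as follows. To be clear, this is not DFlow’s actual client code: `generate`, `refine`, and `upscale` are hypothetical stand-ins for the notebook’s stages, just to show the human-in-the-loop shape of the pipeline:

```python
# Hypothetical sketch of a human-in-the-loop pipeline like DFlow's:
# each stage produces candidates, a human picks one, and the next
# stage refines the pick. All three stage functions are stand-ins.

def generate(prompt: str, n: int = 8) -> list[str]:
    """Stand-in for the initial Craiyon stage: n candidate images."""
    return [f"{prompt} [candidate {i}]" for i in range(n)]

def refine(image: str, n: int = 4) -> list[str]:
    """Stand-in for the diffusion stage that fills in backgrounds."""
    return [f"{image} -> refined {i}" for i in range(n)]

def upscale(image: str) -> str:
    """Stand-in for the final (server-dependent) upscaling stage."""
    return f"{image} -> upscaled"

def run_pipeline(prompt: str, pick_initial: int, pick_refined: int) -> str:
    candidates = generate(prompt)
    favorite = candidates[pick_initial]    # human choice, 0-indexed
    refined = refine(favorite)
    return upscale(refined[pick_refined])  # human choice again

result = run_pipeline("two dogs in a park", pick_initial=0, pick_refined=2)
print(result)
# -> two dogs in a park [candidate 0] -> refined 2 -> upscaled
```

The key point is that each stage depends on a human decision, which is why “run all” short-circuits the quality you’d otherwise get.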
Pros
Best in-notebook walkthrough. If you’re looking to learn about slotting different models together to eventually build something of your own, I would start here.
Solves the “white room” problem (characters but no backgrounds) I experienced during my initial experiments with Craiyon.
Still better at creating recognizable dog breeds, even if the art is a little more rough.
Also contains instructions for dockerizing/running an instance locally.
Cons
Less artistically impressive than the other models.
DFlow needs to talk to a server to do the final upscaling, a task that can fail due to uptime, memory, and disk space issues.
Midjourney: The smoothest user flow yet, at the price of customization.
Disclaimer (Beta): Midjourney is still in beta and I’m using it for free. Don’t take this as an in-depth review of their services; this article is me comparing multiple ML models and seeing how they size up for my current purpose of illustrating a children’s book.
Midjourney gives you images in runs of four and then lets you either produce variations or upscale the individual images. It’s like DFlow but with a smoother interface (and better results).
I really want to hate Midjourney because it doesn’t leave its code open to the public for us to tinker with, but the user experience of selecting results you like from a run and refining them is clearly superior to the scattershot approach of most of the diffusion models.
Prompt
watercolor painting of two dogs: a fat australian cattledog sitting far away from a big bernese mountain dog puppy at the park during a sunny day
Variations on #2:
Variations on #1:
Upscale of #1:
Once upscaled, you have the option to make more variations, upscale to the max, or do a “light upscale redo.”
Let’s try a light upscale redo:
I took a variation on this and tested max upscale. For my purposes this level of resolution isn’t necessary, but I could see it being a silver bullet for many use cases.
Pros
Extremely straightforward and simple user flow.
High quality output.
Midjourney’s negative prompting flag (you can tell it “--no trees”) will probably be quickly copied by any competitors. There are a number of other commands in the User Manual that offer limited customization.
The /prefer suffix command, which automatically appends text to every prompt you send, is very useful for cases like mine; it’s essentially the same basic solution I was using for the persistence problem previously.
Cons
Lack of access to the code means you can’t fine-tune the model, though this won’t be a major issue for 90% of users.
Still confined to beta as of July 2022. The only way in, short of waiting on the beta list, is to have a friend who already got an invite sign up for the paid version and then invite you. This is really roundabout and I’m not sure of the team’s reasons for doing things this way; it seems like a straightforward store page would be easier.
Using Discord as a user interface and putting the results in public channels makes it obnoxious to find your image: there’s lots of scrolling and having your post bumped around. I was frustrated having a bunch of other people’s cool images pop into my workspace; I want to focus on the thing I’m making, and it’s very easy for me to get distracted by other people working directly on top of me.
However, this is only a “free beta” problem. If you subscribe you can message their bot directly.
Like OpenAI, Midjourney is very obviously gearing up to go pay-to-play after it concludes the free trial period. Not only that, but for unlimited use the monthly subscription will start at $30.
Conclusions
If I were looking for a plug-and-play tool to hit the ground running with, Midjourney would be the hands-down pick. However, its difficulty of access will currently keep many prospective users out. And if you want to learn about AI and get your hands on some actual code, this isn’t the model for you.
If you’re just starting out and want experience with a well-explained and simple model, Dall-e Flow is for you. It’s powered by Craiyon, which most people are now familiar with, and is relatively fast compared to Disco Diffusion. It’s a really good example of slotting different models together in a flow, and it also has good support for running outside of Colab.
If you’re a bit more grizzled and have a lot of time on your hands, I would recommend Disco Diffusion. It’s powerful and feature-rich at the expense of a steep learning curve. Despite that, it’s still open enough to fine-tune!