If you have been on Twitter or LinkedIn in the last month, you have probably seen the insane hype surrounding generative AI. Models and tools such as ChatGPT, MidJourney, and Stable Diffusion are all the rage, and everyone is trying to figure out how to leverage them and their output.
One of our clients had a major pain point: how to generate additional art from an existing comic IP for use in a TCG. We realized that generative AI art could be the perfect tool to solve this problem.
TL;DR - A summary of the lessons we learned:
- MidJourney doesn’t give you enough control to hit a consistent style.
- You need to be extremely specific with the dataset for training to get good results.
- LoRAs are easier to tune than full model checkpoints, and give you a faster path to high quality output.
- Pick the right base checkpoint. Starting from base Stable Diffusion 1.5 or 2.1 is unlikely to yield the results you want without a ton of training.
- With ControlNet and a bit of tuning, the sky is really the limit in terms of style, composition, and consistency.
The Why
Deckforge is a platform for building digital TCGs and, like many platforms, it struggles to get enough good quality artwork to start things moving. In Deckforge’s case the specific challenge was creating enough art in a consistent style for each game. Every game has at least 120 cards, and each card needs high quality, unique art that conforms to the style of that game. Many IP owners balked at producing extra art for a TCG, but AI art looked like a possible solution.
Deckforge was working on a new game based on the Voyce.Me manga “The Mad Gate”. Luckily, since the game was based on an existing comic, there was already a lot of art to draw from. Unfortunately, most of it featured a few main characters, with not nearly enough of the minions and monsters needed to flesh out a TCG. This seemed like the perfect use case for generative AI art.
Lesson #1 - MidJourney isn’t the right fit
The current frontrunner in the AI art game is MidJourney, a Discord bot that generates images based on the (mostly) plain English prompts you give it. MidJourney can do some pretty amazing things. How amazing? See our blog post here. Every image in that post was generated with MidJourney, so it made sense for us to start there.
Getting MidJourney to hit a solid, consistent anime style was hard, to say nothing of getting it to match the existing art...
The first attempts were not very inspiring. Getting it to hit a solid, consistent anime style was hard, to say nothing of getting it to match the existing art we had for the game. Here are some examples; while they are definitely interesting as art, the style is all over the place:
These don’t really look like they are from the same game, and the second and third ones don't even look like anime. Even including a reference style image in the prompt didn’t help much. Back to the drawing board we go.
Lesson #2 - Tag your training data verbosely when training on Stable Diffusion
It was pretty clear at this point that MidJourney wasn’t giving us the control and style consistency we needed. We tried some other configurations, but didn’t get much further, so we needed to take a more DIY approach.
The current cutting edge of text-to-image models you can download and modify yourself is Stable Diffusion. While Stable Diffusion isn’t quite as capable as MidJourney out of the box, it’s extremely customizable, so much so that there is a really steep learning curve to get it working. There is a hosted version from the parent company, but it isn’t customizable, so we had to spin up our own instance on AWS and get to work.
There are a wide variety of options for training (or fine-tuning) a model like Stable Diffusion. The current best practice is DreamBooth, a fine-tuning technique published by Google Research that now has solid open source implementations. There are many possible training configurations and a few different high level approaches. To start, we took the existing Stable Diffusion 1.5 base model (called a checkpoint) and a folder of about 30 existing images from the comic series and got to work.
There are a wide variety of options for training a model like Stable Diffusion, but the current best practice is DreamBooth
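As a rough illustration, here is a minimal sketch of how a DreamBooth run like this might be launched using the open source example script from Hugging Face’s diffusers library. The paths, the instance prompt, and the hyperparameter values are illustrative placeholders, not the exact settings we used.

```python
# A minimal sketch of launching a DreamBooth fine-tune with the diffusers
# example script. Paths, prompt, and hyperparameters are illustrative only.
import subprocess

subprocess.run([
    "accelerate", "launch", "train_dreambooth.py",            # diffusers example script
    "--pretrained_model_name_or_path", "runwayml/stable-diffusion-v1-5",
    "--instance_data_dir", "./mad_gate_training_images",       # ~30 images from the comic (hypothetical path)
    "--instance_prompt", "madgate style",                      # token/phrase the new style is bound to
    "--output_dir", "./mad_gate_checkpoint",
    "--resolution", "512",
    "--train_batch_size", "1",
    "--learning_rate", "2e-6",
    "--max_train_steps", "2000",
], check=True)
```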
First we had to make sure our data was in good shape. To do this, we tagged all the existing images with “caption files”, basically text files describing the content of the images. The goal is to be as specific as possible since anything not described in the captions would become part of the style we trained in. If all your training images have a person with blue hair and you don’t mention “blue hair” in your captions, all the people your model produces are going to have blue hair. Not ideal.
Another key tip we learned early is that for anime style images it’s best to use the tagging system rather than longer form narrative prompts. Having a caption of “1boy, grass, sky, holding sword, blue hair, backpack” will yield better results than “A man with blue hair standing in a field holding a sword wearing a backpack”.
For anime style images it’s best to use the tagging system rather than longer form narrative prompts
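For the curious, here is a minimal sketch of what that caption-file convention looks like in practice: one text file per training image containing comma-separated tags. The folder path, file names, and tags below are hypothetical examples rather than our actual data.

```python
# A minimal sketch of the caption-file convention: one .txt file per training
# image, containing comma-separated tags. Paths and tags are hypothetical.
from pathlib import Path

captions = {
    "panel_001.png": "1boy, grass, sky, holding sword, blue hair, backpack",
    "panel_002.png": "1girl, night, full moon, glowing eyes, standing, ocean",
}

data_dir = Path("./mad_gate_training_images")
for image_name, tags in captions.items():
    # The caption file shares the image's name, e.g. panel_001.png -> panel_001.txt
    (data_dir / image_name).with_suffix(".txt").write_text(tags)
```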
With our data in good shape, we were ready to start training. After a few false starts with parameters and a frustrating UI, we had a usable model. As you can see from the examples below, the style was solid but the composition and general quality were a mess:
People hovering in the air, vague anatomy even by anime standards, and whatever the heck that “tandem dragon” is. Encouraging as a start, but not production ready by any means.
We took a step back and updated our training data to be even more specific and verbose in our tagging. We’re talking this verbose: “1boy, male focus, moon, solo, white hair, colored skin, full moon, glowing, water, blue skin, glowing eyes, standing, night, ocean, feet out of frame”. This made a big improvement in things like messed-up anatomy and confused composition.
Lesson #3 - A good base checkpoint goes a long way
Then we took a look at our base model. This was our next big “aha” moment. We were starting from the extremely general base Stable Diffusion model, but there were other models out there that already had some understanding of anime, if not The Mad Gate’s specific style.
We grabbed a better-suited checkpoint and let the GPUs rip. About an hour later we had a new model. This time we chose to train a “LoRA”, which is a bit different from a full checkpoint: it’s a small set of add-on weights that sits on top of a base model, and notably it lets you dial up and down how much of the style you want to apply. We did some systematic testing to find the right options for our new LoRA, and the new results were an order of magnitude better!
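To make that “dial it up and down” idea concrete, here is a minimal sketch of how a trained LoRA can be applied on top of a base checkpoint with the diffusers library and scaled at inference time. The base model ID, LoRA path, prompt, and scale value are illustrative assumptions, not our production setup.

```python
# A minimal sketch of applying a trained LoRA on top of an anime-leaning base
# checkpoint and dialing its strength up or down at inference time.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "andite/anything-v4.0",               # hypothetical anime-leaning base checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# Load the LoRA weights trained on The Mad Gate images (hypothetical path).
pipe.load_lora_weights("./mad_gate_lora")

image = pipe(
    "1boy, white hair, glowing eyes, full moon, ocean, night, madgate style",
    negative_prompt="lowres, bad anatomy, extra limbs",
    cross_attention_kwargs={"scale": 0.8},  # 0.0 = ignore the LoRA, 1.0 = full strength
).images[0]
image.save("mad_gate_test.png")
```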
For reference here are some images from the actual The Mad Gate Comic as a baseline:
And here are some examples of AI generated ones we produced with our new LoRA model:
Actually, I am lying: those are reversed. The first set is the LoRA images. Bet you didn’t guess that. We knew we had a model that worked and could generate the art we needed to add more cards to the game, but could we take it further?
Lesson #4 - The sky's the limit with the right combination of tools
At this point we had things pretty dialed in. All that was left was to see if we could get some consistency out of the model. A classic issue with these kinds of generated images is that it’s very hard to keep everything looking like it belongs together. Folks trying to generate whole comics run into this constantly: keeping outfits, settings, and even characters consistent from panel to panel.
We sprinkled in some ControlNet and things really started to get mindblowing
To test this idea we generated a set of underwater, ruin-inhabiting lizard people with a bit of ancient Egyptian flair. We sprinkled in some ControlNet (a tool for controlling the composition of generated images) and a few other bits of secret sauce. Here are some of the results:
As you can see, we were able to produce extremely consistent results while still maintaining The Mad Gate style and getting a good diversity of poses. Remember, these alligator people didn’t exist anywhere in the training data. They are entirely new output from the LoRA that just happens to look like it belongs in the universe of The Mad Gate.
Remember these alligator people didn’t exist anywhere in the training data, yet they look like they belong in the world of "The Mad Gate"
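For readers who want a sense of how these pieces fit together, here is a minimal sketch of combining a ControlNet with a base checkpoint and a LoRA using the diffusers library. The ControlNet model ID, pose reference image, file paths, and prompt are illustrative assumptions rather than our exact pipeline.

```python
# A minimal sketch of using ControlNet to lock down composition while a LoRA
# handles style. A pose-conditioned ControlNet guides each generation from a
# reference image; model IDs, paths, and the prompt are illustrative.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",      # swap in an anime base checkpoint in practice
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_lora_weights("./mad_gate_lora")   # hypothetical LoRA path from the previous step

# A pose skeleton (e.g. a source image run through an openpose preprocessor)
# pins the character's stance for every card in the set.
pose_image = load_image("./poses/guardian_stance.png")

image = pipe(
    "1boy, lizardman, ancient egyptian armor, underwater ruins, madgate style",
    image=pose_image,
    cross_attention_kwargs={"scale": 0.8},
).images[0]
image.save("lizard_guardian.png")
```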
So you want to generate some art?
We are excited to work with other clients in this emerging field and push the limits of what is possible with generative AI. Feel free to reach out; my email is max@pitonlabs.com. This whole field is changing rapidly, and what’s possible is growing every day. We are especially excited about bundling some of these cutting-edge tools into easy-to-use packages for our clients and making this amazing technology more accessible.