Stable Diffusion
Developer(s): Stability AI
Initial release: August 22, 2022
Stable release: 1.5 (model)[1][2][unreliable source?] / August 31, 2022
Repository: github.com/CompVis/stable-diffusion
Written in: Python
Operating system: Any that support CUDA kernels
Type: Text-to-image model
License: Creative ML OpenRAIL-M
Website: stability.ai

Stable Diffusion is a machine learning text-to-image model that generates digital images from natural language descriptions. The underlying approach[3] was developed at LMU Munich and later extended by a collaboration of Stability AI, LMU, and Runway, with support from EleutherAI and LAION.[4][5][6] The model can also be used for other tasks, such as generating image-to-image translations guided by a text prompt.[7]

Stable Diffusion's code and model weights have been released publicly, and the model can run on most consumer hardware equipped with a modest GPU. This marked a departure from previous proprietary text-to-image models such as DALL-E and Midjourney, which were accessible only via cloud services.[8]

As of September 2022, Stability AI, the company behind Stable Diffusion, is in talks to raise capital at a valuation of up to one billion dollars.[9]

Usage

The Stable Diffusion model supports generating new images from scratch through the use of a text prompt describing elements to be included or omitted from the output,[5] as well as redrawing existing images to incorporate new elements described within a text prompt (a process commonly known as guided image synthesis[10]) through the model's diffusion-denoising mechanism.[5] In addition, the model allows the use of prompts to partially alter existing images via inpainting and outpainting, when used with an appropriate user interface that supports such features, of which numerous open source implementations exist.[11]

Stable Diffusion is recommended to be run with 10 GB or more of VRAM; however, users with less VRAM may opt to load the weights in float16 precision instead of the default float32 to reduce VRAM usage.[12]
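
As an illustrative sketch only, the weights can be loaded in half precision through the Hugging Face Diffusers library (a third-party interface, not the original CompVis scripts); the model identifier and argument names below follow the Diffusers documentation and may differ between library versions.

    import torch
    from diffusers import StableDiffusionPipeline

    # Loading the weights in half precision (float16) roughly halves VRAM usage
    # compared with the default float32.
    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        torch_dtype=torch.float16,
    )
    pipe = pipe.to("cuda")  # move the model to the GPU

    image = pipe("a photograph of an astronaut riding a horse").images[0]
    image.save("astronaut.png")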

Text to image generation

The text-to-image sampling script within Stable Diffusion, known as "txt2img", consumes a text prompt along with assorted option parameters covering sampling types, output image dimensions, and seed values, and outputs an image file based on the model's interpretation of the prompt.[5] Generated images are tagged with an invisible digital watermark to allow users to identify an image as generated by Stable Diffusion,[5] although this watermark loses its effectiveness if the image is resized or rotated.[13] The Stable Diffusion model is trained on a dataset of 512×512 resolution images,[5] so txt2img output images are best generated at 512×512 resolution as well; deviating from this size can result in poor-quality outputs.[12]

Each txt2img generation involves a specific seed value which affects the output image; users may opt to randomise the seed in order to explore different generated outputs, or use the same seed to reproduce a previously generated image.[12] Users are also able to adjust the number of inference steps for the sampler; a higher value takes longer, while a smaller value may result in visual defects.[12] Another configurable option, the classifier-free guidance scale value, allows the user to adjust how closely the output image adheres to the prompt;[14] more experimentative or creative use cases may opt for a lower value, while use cases aiming for more specific outputs may use a higher value.[12]
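
These parameters can be illustrated with a minimal Diffusers-based sketch (again assuming the third-party Hugging Face Diffusers pipeline rather than the original txt2img script; argument names may differ between library versions):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")

    # A fixed seed makes the output reproducible; changing it explores new outputs.
    generator = torch.Generator("cuda").manual_seed(42)

    image = pipe(
        "a castle on a hill at sunset",
        height=512, width=512,       # the resolution the model was trained on
        num_inference_steps=50,      # more steps: slower, but fewer visual defects
        guidance_scale=7.5,          # higher: adheres more closely to the prompt
        generator=generator,
    ).images[0]
    image.save("castle.png")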

Negative prompts are a feature included in some user interface implementations of Stable Diffusion that allow the user to specify prompts which the model should avoid during image generation. They are useful where undesirable image features would otherwise appear in the output due to the positive prompts provided by the user, or due to how the model was originally trained.[5]
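
Conceptually, implementations that support negative prompts typically reuse the classifier-free guidance step, substituting the embedding of the negative prompt for the usual empty "unconditional" embedding. The following schematic sketch is not taken from any particular implementation; predict_noise is a hypothetical stand-in for the model's denoising network.

    def guided_noise(predict_noise, latents, t, positive_embed, negative_embed, guidance_scale):
        # Predict the noise conditioned on each prompt embedding.
        noise_positive = predict_noise(latents, t, positive_embed)
        noise_negative = predict_noise(latents, t, negative_embed)
        # Steer the prediction away from the negative prompt and toward the positive one.
        return noise_negative + guidance_scale * (noise_positive - noise_negative)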

File:Algorithmically-generated AI artwork of Hakurei Reimu (part 1 of 4).png
File:Algorithmically-generated AI artwork of Hakurei Reimu (part 2 of 4).png
File:Algorithmically-generated AI artwork of Hakurei Reimu (part 3 of 4).png
File:Algorithmically-generated AI artwork of Hakurei Reimu (part 4 of 4).png
Demonstration of how different positive and negative prompts affect the output of images generated by the Stable Diffusion model. Each individual row represents a different prompt fed into the model, and the variation of art style between rows directly correlates with the presence or absence of certain phrases and keywords. Partial snippets of prompts are as follows:
  • First row: art style of artgerm and greg rutkowski
  • Second row: art style of makoto shinkai and akihiko yoshida and hidari and wlop
  • Third row: art style of Michael Garmash
  • Fourth row: Charlie Bowater and Lilia Alvarado and Sophie Gengembre Anderson and Franz Xaver Winterhalter, by Konstantin Razumov, by Jessica Rossier, by Albert Lynch
  • Fifth row: art style of Jordan Grimmer, Charlie Bowater and Artgerm
  • Sixth row: art style of ROSSDRAWS, very detailed deep eyes by ilya kuvshinov
  • Seventh row: game cg japanese anime Jock Sturges Kyoto Animation Alexandre Cabanel Granblue Fantasy light novel pixiv
  • Eighth row: art style of Sophie Anderson, and greg rutkowski, and albert lynch
  • Ninth row: art style of Konstantin Razumov, and Jessica Rossier, and Albert Lynch
  • Tenth row: hyper realistic anime painting sophie anderson Atelier meruru josei isekai by Krenz cushart by Kyoto Animation official art
  • Eleventh row: art style of wlop and michael garmash
  • Twelfth row: art style of greg rutkowski and alphonse mucha
  • Thirteenth row: by Donato Giancola, Sophie Anderson, Albert Lynch

Image modification

Stable Diffusion includes another sampling script, "img2img", which consumes a text prompt, a path to an existing image, and a strength value between 0.0 and 1.0, and outputs a new image based on the original image that also features elements provided in the text prompt. The strength value denotes the amount of noise added to the output image; a higher value produces images with more variation, but they may not be semantically consistent with the prompt provided.[5] Image upscaling is one potential use case of img2img, among others.[5]
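
A minimal sketch of this interface, assuming the Hugging Face Diffusers image-to-image pipeline rather than the original img2img script (older Diffusers releases name the image argument init_image instead of image):

    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")

    init_image = Image.open("sketch.png").convert("RGB").resize((512, 512))

    # strength controls how much noise is added to the input image: higher values
    # allow more variation but may drift further from the original.
    image = pipe(
        prompt="a detailed oil painting of a mountain lake",
        image=init_image,
        strength=0.75,
        guidance_scale=7.5,
    ).images[0]
    image.save("painting.png")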

Inpainting and outpainting

Additional use cases for image modification via img2img are offered by numerous front-end implementations of the Stable Diffusion model. Inpainting selectively modifies a portion of an existing image delineated by a user-provided mask, filling the masked space with newly generated content based on the provided prompt.[11] Conversely, outpainting extends an image beyond its original dimensions, filling the previously empty space with content generated based on the provided prompt.[11]

Step 1: An image is generated from scratch using txt2img.
Step 2: Via outpainting, the bottom of the image is extended by 512 pixels and filled with AI-generated content.
Step 3: In preparation for inpainting, a makeshift arm is drawn using the paintbrush in GIMP.
Step 4: An inpainting mask is applied over the makeshift arm, and img2img generates a new arm while leaving the remainder of the image untouched.
Demonstration of inpainting and outpainting techniques using img2img within Stable Diffusion.
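
As an illustration, inpainting can be performed through the Hugging Face Diffusers inpainting pipeline, one of several front ends exposing the feature; the checkpoint and argument names below are assumptions that may differ between library versions.

    from PIL import Image
    from diffusers import StableDiffusionInpaintPipeline

    # The checkpoint name is an assumption; any inpainting-capable checkpoint works.
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting"
    ).to("cuda")

    init_image = Image.open("photo.png").convert("RGB").resize((512, 512))
    # White areas of the mask are regenerated from the prompt; black areas are kept.
    mask_image = Image.open("mask.png").convert("RGB").resize((512, 512))

    image = pipe(
        prompt="a wooden bench in a park",
        image=init_image,
        mask_image=mask_image,
    ).images[0]
    image.save("inpainted.png")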

License

Unlike models like DALL-E, Stable Diffusion makes its source code available,[15][5] along with pre-trained weights. Its license prohibits certain use cases, including crime, libel, harassment, doxxing, "exploiting ... minors", giving medical advice, automatically creating legal obligations, producing legal evidence, and "discriminating against or harming individuals or groups based on ... social behavior or ... personal or personality characteristics ... [or] legally protected characteristics or categories".[16][17]

Training

Stable Diffusion was trained on pairs of images and captions taken from LAION-5B, a publicly available dataset derived from Common Crawl data scraped from the web. The dataset was created by LAION, a German non-profit which receives funding from Stability AI.[18][19] The model was initially trained on a large subset of LAION-5B, with the final rounds of training done on "LAION-Aesthetics v2 5+", a subset of 600 million captioned images which an AI predicted humans would rate at least 5 out of 10 when asked how much they liked them.[18][20] This final subset also excluded low-resolution images and images which an AI identified as carrying a watermark.[18]

The model was trained using 256 Nvidia A100 GPUs on Amazon Web Services for a total of 150,000 GPU-hours, at a cost of $600,000.[21][22][23]

References

  1. ^ Mostaque, Emad (2022-06-06). "Stable Diffusion 1.5 beta now available to try via API and #DreamStudio, let me know what you think. Much more tomorrow…". Twitter. Archived from the original on 2022-09-27.
  2. ^ "Side-by-side comparison of Stable Diffusion 1.5 (Test update of 31/08/2022) and Dall-E 2". 2 September 2022.
  3. ^ Rombach; Blattmann; Lorenz; Esser; Ommer (June 2022). High-Resolution Image Synthesis with Latent Diffusion Models (PDF). International Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA. pp. 10684–10695. arXiv:2112.10752.
  4. ^ "Stable Diffusion Launch Announcement". Stability.Ai. Archived from the original on 2022-09-05. Retrieved 2022-09-06.
  5. ^ a b c d e f g h i j "Stable Diffusion Repository on GitHub". CompVis - Machine Vision and Learning Research Group, LMU Munich. 17 September 2022. Retrieved 17 September 2022.
  6. ^ "Revolutionizing image generation by AI: Turning text into images". LMU Munich. Retrieved 17 September 2022.
  7. ^ "Diffuse The Rest - a Hugging Face Space by huggingface". huggingface.co. Archived from the original on 2022-09-05. Retrieved 2022-09-05.
  8. ^ "The new killer app: Creating AI art will absolutely crush your PC". PCWorld. Archived from the original on 2022-08-31. Retrieved 2022-08-31.
  9. ^ Cai, Kenrick. "Startup Behind AI Image Generator Stable Diffusion Is In Talks To Raise At A Valuation Up To $1 Billion". Forbes. Retrieved 2022-09-10.
  10. ^ Meng, Chenlin; He, Yutong; Song, Yang; Song, Jiaming; Wu, Jiajun; Zhu, Jun-Yan; Ermon, Stefano (August 2, 2021). "SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations". arXiv:2108.01073. doi:10.48550/arXiv.2108.01073.
  11. ^ a b c "Stable Diffusion web UI". GitHub.
  12. ^ a b c d e "Stable Diffusion with 🧨 Diffusers". Hugging Face official blog. August 22, 2022.
  13. ^ "invisible-watermark README.md". GitHub.
  14. ^ Ho, Jonathan; Salimans, Tim (July 26, 2022). "Classifier-Free Diffusion Guidance". arXiv:2207.12598. doi:10.48550/arXiv.2207.12598.
  15. ^ "Stable Diffusion Public Release". Stability.Ai. Archived from the original on 2022-08-30. Retrieved 2022-08-31.
  16. ^ "Ready or not, mass video deepfakes are coming". The Washington Post. Archived from the original on 2022-08-31. Retrieved 2022-08-31.
  17. ^ "License - a Hugging Face Space by CompVis". huggingface.co. Archived from the original on 2022-09-04. Retrieved 2022-09-05.
  18. ^ a b c Baio, Andy (30 August 2022). "Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion's Image Generator". Waxy.org.
  19. ^ Heikkilä, Melissa (16 September 2022). "This artist is dominating AI-generated art. And he's not happy about it". MIT Technology Review.
  20. ^ "LAION-Aesthetics | LAION". laion.ai. Archived from the original on 2022-08-26. Retrieved 2022-09-02.
  21. ^ Mostaque, Emad (August 28, 2022). "Cost of construction". Twitter. Archived from the original on 2022-09-06. Retrieved 2022-09-06.
  22. ^ "Stable Diffusion v1-4 Model Card". huggingface.co. Retrieved 2022-09-20.{{cite web}}: CS1 maint: url-status (link)
  23. ^ "This startup is setting a DALL-E 2-like AI free, consequences be damned". TechCrunch. Retrieved 2022-09-20.{{cite web}}: CS1 maint: url-status (link)
