Summary
Our study focuses on text-to-image systems such as DALL-E, Midjourney, or Stable Diffusion. Their principle is to generate a visual result from a textual prompt written in a language close to the user's natural language (echoing Gaildrat's definition of declarative modelling [3]).
Different studies have questioned the degree of human creativity that may exist, or be lacking, in the use of text-to-image systems [4, 5], since by submitting a request the user delegates part of the illustration work to the generative engine. In this context, the text-to-image system only considers the semantic value of the prompt to proceed with the content generation.
Objective
We aim to extend the text-to-scene framework to consider new aspects in the input text: the form of the text (its layout, intrinsic content, etc.) and the experience while typing. We consider that the text proposed to the machine is based on four components: ideation (not studied here), writing conditions, text form, and text meaning. Regarding the writing conditions, we distinguish between the context (external to the author, such as the location, time period, elements present, etc.) and the experience of typing (emotions, keyboard interaction, duration of typing, etc.).
In an artistic approach, we aim to use the collected data to alter the result of a first text-to-image generation stage. This alteration should be slight enough not to break the link between the sense of the prompt and the observable final result. Our proposal, called "Typing-to-Image", is still under development in the context of my PhD thesis in digital arts. The description presented here covers its intentions and the studies that led to it, within a creative and illustrative approach.
Method
The typing-to-scene system aims to highlight the text triptych "Sense-Form-Experience". The sense is already handled by existing "text-to" approaches. We define the form of the text through several types of elements (a minimal extraction sketch follows the list):
- Text layout: line breaks, indents, spaces…
- Statistics: counts of words, characters, lines, paragraphs…
- Hidden elements (a steganographic approach such as the Equidistant Letter Sequences [1]), recalling Gadamer's research on interpretation in terms of what the text tells about itself [2].
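As an illustration of how such form elements could be extracted, here is a minimal Python sketch; the function name, the chosen features, and the fixed skip of 3 for the ELS-style reading are assumptions made for this example, not the system's actual implementation.

```python
import re

def text_form_features(prompt: str) -> dict:
    """Illustrative extraction of 'form' features from the raw prompt text."""
    lines = prompt.split("\n")
    paragraphs = [p for p in re.split(r"\n\s*\n", prompt) if p.strip()]
    letters = [c.lower() for c in prompt if c.isalpha()]
    return {
        # Layout: line breaks, indents, spaces
        "line_breaks": prompt.count("\n"),
        "indented_lines": sum(1 for line in lines if line.startswith((" ", "\t"))),
        "spaces": prompt.count(" "),
        # Statistics: words, characters, lines, paragraphs
        "words": len(prompt.split()),
        "characters": len(prompt),
        "lines": len(lines),
        "paragraphs": len(paragraphs),
        # Hidden elements: letters read at a fixed skip (ELS-style, skip of 3)
        "els_skip_3": "".join(letters[::3]),
    }
```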
Regarding the experience of typing the text, we focus on different types of captured data (see the sketch after this list):
- Time-related elements: date/time, typing duration…
- Physiological-emotional elements: video detection of facial emotions, heart rate…
- Keyboard events: text corrections, typing activity...
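A minimal sketch of how these captured data could be summarised, assuming keystroke events have already been logged as (timestamp, key) pairs; the event format, the 2-second pause threshold, and the feature names are hypothetical choices for illustration.

```python
from dataclasses import dataclass

@dataclass
class KeyEvent:
    timestamp: float  # seconds since the prompt field received focus
    key: str          # e.g. "a", "space", "backspace"

def typing_experience_features(events: list[KeyEvent]) -> dict:
    """Illustrative summary of the typing experience from logged key events."""
    if not events:
        return {"typing_duration": 0.0, "corrections": 0, "pauses_over_2s": 0, "keystrokes": 0}
    gaps = [b.timestamp - a.timestamp for a, b in zip(events, events[1:])]
    return {
        # Time-related elements
        "typing_duration": events[-1].timestamp - events[0].timestamp,
        "pauses_over_2s": sum(1 for g in gaps if g > 2.0),
        # Keyboard events: corrections, overall activity
        "corrections": sum(1 for e in events if e.key == "backspace"),
        "keystrokes": len(events),
    }
```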
From these data, we define conditional cases leading to alterations of the image (glitches, pixel color and brightness modifications, watermark-like marks, etc.).
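A minimal sketch of such conditional alterations, assuming the use of Pillow and NumPy; the rules, thresholds, and the seeding of the glitch by text length are illustrative choices, not the actual alteration engine.

```python
from PIL import Image, ImageEnhance
import numpy as np

def alter_image(img: Image.Image, form: dict, experience: dict) -> Image.Image:
    """Illustrative conditional alterations driven by the captured data."""
    out = img.convert("RGB")

    # Example rule: many corrections -> slightly darker image
    if experience.get("corrections", 0) > 10:
        out = ImageEnhance.Brightness(out).enhance(0.85)

    # Example rule: long pauses while typing -> horizontal glitch bands (row shifts)
    pauses = experience.get("pauses_over_2s", 0)
    if pauses > 0:
        pixels = np.array(out)
        height = pixels.shape[0]
        rng = np.random.default_rng(form.get("characters", 0))  # seeded by text length
        for _ in range(min(pauses, 5)):
            row = int(rng.integers(0, max(1, height - 10)))
            pixels[row:row + 10] = np.roll(pixels[row:row + 10], 25, axis=1)
        out = Image.fromarray(pixels)

    return out
```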
Results and observations
The visually altered image carries within it the trace of the particularities of the input text used to produce it. Usual evaluations of text-to-image/scene systems are mostly based on the semantic match between the prompt and the image content, suggesting that some texts have characteristics less suitable for the generation engine [6, 7]. For example, long and complex texts, descriptions with multiple scenes (multiple locations or temporalities in the same text), and non-representable texts (conceptual questions, opinions, etc.) present difficulties. The integration of the text's form and typing experience provides additional elements better suited to these types of text.
Conclusion
Typing-to-image proposes to incorporate human involvement in the typing experience, highlighting a part of the sensitivity that distinguishes imported text (e.g. copy-pasted) from spontaneously typed text. Typing-to-image also explores information about what the text says about itself, marked by intentional or unintentional elements left by the author. The workflow no longer starts with the validation of the prompt, but as soon as the user is faced with it.
[1] Bar-Natan, D. and McKay, B. 1999. Equidistant Letter Sequences in Tolstoy’s “War and Peace.” (1999).
[2] Gadamer, H.-G. 1977. Philosophical hermeneutics. Univ of California Press.
[3] Gaildrat, V. 2007. Declarative Modelling of Virtual Environments, Overview of issues and Applications. Proceedings of the International Conference on …. (Jan. 2007).
[4] Oppenlaender, J. 2022. The Creativity of Text-to-Image Generation. Proceedings of the 25th International Academic Mindtrek Conference (New York, NY, USA, Nov. 2022), 192–202.
[5] Russo, I. 2022. Creative Text-to-Image Generation: Suggestions for a Benchmark. Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities (Taipei, Taiwan, Nov. 2022), 145–154.
[6] Ulinski, M. et al. 2018. Evaluating the WordsEye text-to-scene system: imaginative and realistic sentences. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (2018).
[7] Yashaswini, S. and Shylaja, S.S. 2021. Metrics for Automatic Evaluation of Text from NLP Models for Text to Scene Generation. European Journal of Electrical Engineering and Computer Science. 5, 4 (Jul. 2021), 20–25. DOI:https://doi.org/10.24018/ejece.2021.5.4.341.