5th International Conference

Digital Culture & AudioVisual Challenges

Interdisciplinary Creativity in Arts and Technology

Hybrid - Corfu/Online, May 12-13, 2023

From text-to-image to typing-to-image: Emphasizing User Involvement
Date and Time: 12/05/2023 (09:30-11:15)
Location: Online
Lionel Laloum
Keywords: Text-to-Image, Typing-to-Image, Declarative modelling, Text conception, Digital Art, Generative Art


Our study focuses on text-to-image systems such as DALL-E, Midjourney, or Stable Diffusion. Their principle is to generate a visual result from a textual prompt written in a language close to the user's natural language, recalling Gaildrat's definition of declarative modelling [3].
Several studies have questioned the degree of human creativity that may exist, or be lacking, in the use of text-to-image systems [4, 5], as the user delegates part of the illustration work to the generative engine by submitting a request. In this context, the text-to-image system only considers the semantic value of the prompt to proceed with content generation.


We aim to extend the text-to-scene framework to consider new aspects of the input text: the form of the text (its layout, intrinsic content, etc.) and the experience while typing. We consider that the text proposed to the machine is based on four components: ideation (not studied here), writing conditions, text form, and text meaning. Regarding the writing conditions, we distinguish between the context (external to the author, such as the location, time period, elements present, etc.) and the experience of typing (emotions, keyboard interaction, duration of typing, etc.).

In an artistic approach, we aim to use the collected data to alter the result of a first text-to-image generation stage. This alteration should be slight enough not to break the link between the sense of the prompt and the observable final result. Our proposal, called "Typing-to-Image", is still under development in the context of my PhD thesis in digital arts. What we present here are its intentions and the studies that led to it, in a creative and illustrative approach.


The typing-to-image system aims to highlight the text triptych "Sense-Form-Experience". The sense is already addressed by "text-to" approaches. We define the form of the text through several types of elements:
- Text layout: Line breaks, indents, spaces...
- Statistics: counting words, characters, lines, paragraphs…
- Hidden elements (a steganographic approach such as Equidistant Letter Sequences [1]), recalling Gadamer's research on interpretation in terms of what the text tells about itself [2].
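As a minimal sketch of how such form-level elements could be collected, the following Python function computes the counts and layout indicators listed above. The function name and the exact set of statistics are our own illustrative assumptions, not part of the Typing-to-Image system itself:

```python
import re

def text_form_stats(prompt: str) -> dict:
    """Collect simple form-level statistics about a prompt
    (counts and layout elements, independent of its meaning)."""
    lines = prompt.split("\n")
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", prompt) if p.strip()]
    return {
        "characters": len(prompt),
        "words": len(prompt.split()),
        "lines": len(lines),
        "paragraphs": len(paragraphs),
        "line_breaks": prompt.count("\n"),
        "indented_lines": sum(1 for ln in lines if ln.startswith((" ", "\t"))),
    }

# Example: a two-paragraph prompt with one indented line.
stats = text_form_stats("A quiet harbour\n\n  two boats at dawn")
```

Hidden elements such as Equidistant Letter Sequences would require a separate decoding step and are not covered by this sketch.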

Regarding the experience of typing the text, we focus on different types of captured data:
- Time-related elements: date/time, typing duration…
- Physiological-emotional elements: video detection of facial emotions, heart rate…
- Keyboard events: text corrections, typing activity...
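The capture of typing-experience data above can be illustrated with a toy recorder that timestamps keystrokes and derives duration, activity, and correction counts. This is a simplified stand-in under our own assumptions (real capture would hook into actual keyboard events, and physiological signals are out of scope here):

```python
import time

class TypingRecorder:
    """Record keystrokes with timestamps to derive typing-experience
    data (duration, activity, corrections); a toy stand-in for a
    real keyboard hook."""

    def __init__(self):
        self.events = []  # list of (timestamp, key) pairs

    def key(self, key, timestamp=None):
        """Log one keystroke; timestamp defaults to the current time."""
        t = timestamp if timestamp is not None else time.time()
        self.events.append((t, key))

    def summary(self):
        """Aggregate the captured events into typing-experience data."""
        if not self.events:
            return {"duration_s": 0.0, "keystrokes": 0, "corrections": 0}
        times = [t for t, _ in self.events]
        return {
            "duration_s": max(times) - min(times),
            "keystrokes": len(self.events),
            "corrections": sum(1 for _, k in self.events if k == "Backspace"),
        }

# Example: typing "hi", deleting the "i", then retyping it.
rec = TypingRecorder()
for t, k in [(0.0, "h"), (0.4, "i"), (1.1, "Backspace"), (1.5, "i")]:
    rec.key(k, timestamp=t)
```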

From these data, we define conditional cases leading to alterations of the image (glitches, pixel color and brightness modifications, watermark-like marks...).
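Such conditional alterations can be sketched as follows, under our own simplified assumptions: a grayscale image represented as a list of pixel rows, typing duration and correction count as the captured data, and illustrative thresholds. The specific rules (darkening after long sessions, a small glitch per correction) are hypothetical examples, not the system's actual mapping:

```python
import random

def alter_image(pixels, typing_duration_s, corrections, seed=0):
    """Apply slight, data-driven alterations to a grayscale image
    (a list of rows of 0-255 values) without destroying its content."""
    rng = random.Random(seed)  # seeded for reproducible glitch positions
    out = [row[:] for row in pixels]  # copy; leave the input untouched
    h, w = len(out), len(out[0])
    # Conditional case 1: long typing sessions slightly darken the image.
    if typing_duration_s > 60:
        out = [[max(0, v - 16) for v in row] for row in out]
    # Conditional case 2: each correction leaves a small glitch,
    # a short horizontal streak of inverted pixels.
    for _ in range(corrections):
        y = rng.randrange(h)
        x = rng.randrange(max(1, w - 3))
        for dx in range(3):
            out[y][x + dx] = 255 - out[y][x + dx]
    return out

# Example: a uniform 8x8 mid-gray image, a 90-second typing session,
# and two corrections.
img = [[128] * 8 for _ in range(8)]
altered = alter_image(img, typing_duration_s=90, corrections=2)
```

The alterations stay slight by design: the darkened base value and the few inverted pixels keep the generated content recognizable, matching the intention that the prompt's sense remains readable in the final result.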

Results and observations

The visually altered image carries within it the trace of the particularities of the input text used to produce it. Usual evaluations of text-to-image/scene systems are mostly based on the semantic match between the prompt and the image content, suggesting that some texts have characteristics less suitable for the generation engine [6, 7]. For example, long and complex texts, descriptions with multiple scenes (multiple locations or temporalities in the same text), and non-representable texts (conceptual questions, opinions, etc.) present difficulties. Integrating the text's form and the typing experience provides additional elements better suited to these types of text.


Typing-to-image proposes to incorporate human involvement in the typing experience, highlighting part of the sensitivity that distinguishes imported text (such as a copy-paste) from spontaneous text. Typing-to-image also explores what the text says about itself, marked by intentional or unintentional elements of the author. The workflow no longer starts with the validation of the prompt, but as soon as the user is faced with the prompt.

[1] Bar-Natan, D. and McKay, B. 1999. Equidistant Letter Sequences in Tolstoy’s “War and Peace.”
[2] Gadamer, H.-G. 1977. Philosophical hermeneutics. Univ of California Press.
[3] Gaildrat, V. 2007. Declarative Modelling of Virtual Environments, Overview of issues and Applications. Proceedings of the International Conference on …. (Jan. 2007).
[4] Oppenlaender, J. 2022. The Creativity of Text-to-Image Generation. Proceedings of the 25th International Academic Mindtrek Conference (New York, NY, USA, Nov. 2022), 192–202.
[5] Russo, I. 2022. Creative Text-to-Image Generation: Suggestions for a Benchmark. Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities (Taipei, Taiwan, Nov. 2022), 145–154.
[6] Ulinski, M. et al. 2018. Evaluating the WordsEye text-to-scene system: imaginative and realistic sentences. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (2018).
[7] Yashaswini, S. and Shylaja, S.S. 2021. Metrics for Automatic Evaluation of Text from NLP Models for Text to Scene Generation. European Journal of Electrical Engineering and Computer Science. 5, 4 (Jul. 2021), 20–25. DOI:https://doi.org/10.24018/ejece.2021.5.4.341.

Lionel Laloum
Lionel J. Laloum is a PhD student in digital arts at Université Paris 8, focusing on 3D scene generation from textual input. He holds a Master’s degree in artificial intelligence systems from Université Paris Dauphine and a degree in game development from ESGI – École Supérieure de Génie Informatique (graduate school of computer engineering) in Paris. He previously worked on topics such as symbolic data analysis applied to the French medical administrative sector and the application of AI to trading card games.
