Considering the Nature of Multimodal Language from a Crosslinguistic Perspective

Language in its primary face-to-face context is multimodal (e.g., Holler and Levinson, 2019; Perniss, 2018). Thus, understanding how expressions in the vocal and visual modalities together contribute to our notions of language structure, use, processing, and transmission (i.e., acquisition, evolution, emergence) in different languages and cultures should be a fundamental goal of the language sciences. This requires a new framework of language that brings together how the arbitrary and the non-arbitrary (motivated) semiotic resources of language relate to each other. The current commentary evaluates one such proposal, by Murgiano et al. (2021), from a crosslinguistic perspective, taking variation as well as systematicity in multimodal utterances into account.

There is growing consensus in language studies that language in its primary face-to-face context (both phylogenetically and ontogenetically) is multimodal (e.g., Holler and Levinson, 2019; Perniss, 2018). That is, expressions in the visual modality, such as the visible communicative movements (i.e., gestures) that universally accompany spoken languages and the sign languages of Deaf communities, are as intrinsic to the nature of language as expressions in the vocal modality are (e.g., Goldin-Meadow & Brentari, 2015; Özyürek & Woll, 2019; Perniss et al., 2015). It is possible that the multimodal nature of language, and thus its flexible use through the visual and/or vocal modalities, enabled the human language faculty to spread around the world so successfully and to adapt to many biologically and socially driven individual, environmental and cultural differences and circumstances. Thus, understanding how expressions not only in the vocal but also in the visual modality contribute to our notions of language structure, use, processing, and transmission (i.e., acquisition, evolution, emergence) in different languages and cultures should be a fundamental goal of the language sciences. Murgiano et al.'s (2021) paper provides an important step in this direction, compiling comprehensive evidence and promoting a "language as situated" view as a way to study language from a multimodal perspective and in face-to-face contextual use. Furthermore, the paper opposes this to a "language as system" view defined at the population level and characterized by the use of "arbitrary symbols governed by rules". In this commentary I will argue that, even though I agree with the general premises of the "language as situated" view, the multimodal contextual uses mentioned in the paper can also be systematic and governed by the specific typological characteristics of different languages as well as by culturally defined and grounded communicative conventions.
Thus, a strong dichotomy between "systematic" and "situated" views of language might not be a good characterization of multimodal language if we consider the crosslinguistic variation (and systematicity) in the non-arbitrary and indexical uses of multimodal language in children or in adults (see also Lupyan and Winter, 2018, for systematicity in iconic spoken words).
As also mentioned in the target paper, most models, theories and definitions of language have traditionally taken specific characteristics of expressions in speech or written text as primary and have emphasized the arbitrary, categorical, linear, combinatorial and unichannel nature of language as its core and unique features (e.g., Hockett, 1960). These assumptions have also shaped many of our psycholinguistic, neurolinguistic and computational models of and approaches to language. Importantly, however, the visible iconic and indexical (i.e., pointing), simultaneous and multichannel aspects of language that are specific to expressions in the visual modality have mostly been considered not to be core defining features, or have sometimes even been treated as "fossils" of language evolution (see Jackendoff, 1999, for such a view of gestures). Even though in the field of sign linguistics many aspects of sign languages, such as their arbitrary, categorical, sequential and hierarchical patterning, have been considered on a par with spoken languages (Sandler and Lillo-Martin, 2006), those aspects that do not easily fit these definitions have been debated in terms of their linguistic nature (e.g., Cormier, Fenlon, & Schembri, 2015).
As Murgiano et al. (2021) point out, there is growing evidence that the visible iconic and indexical aspects of both spoken and sign languages in particular are frequent when we investigate language use in face-to-face contexts, and that they play a fundamental role in language acquisition and processing. This is attributed to their non-systematic nature, which gives easy and "visible cues" to their meaning. Based on this evidence, the authors claim that we should distinguish between a "language as a system view", defined by arbitrariness and a rule-governed system, and a "language as a situated system view", characterized by the use of non-arbitrary symbols that have a function in the acquisition and processing of language. They further claim that adopting the latter, which embeds the former, would provide a more inclusive view of language.
However, many findings in the review are reported from a general and "universal" perspective and do not take crosslinguistic variation in multimodal uses into account (e.g., Kita and Özyürek, 2003; Özyürek et al., 2005, 2008; Özyürek, 2017, 2018a; Azar et al., 2020). Based on the latter evidence, I would first of all caution against making a strong dichotomy between the two views, even though the authors emphasize that the situated view embeds the systematic view (albeit without specifying the relations between them). Secondly, I would argue that the paper's characterization of iconic and indexical uses as "directly" mapping form to referent through a "general and universal cognitive mapping mechanism", and thus as facilitating language use, acquisition and processing, needs to be nuanced.
Even though iconic gestures can reflect aspects of the referent in a motivated way, they do not do so in the same way across languages. Crosslinguistic research has shown that speakers' iconic gestures can be shaped differently in line with typological differences in how information is packaged across languages (Kita and Özyürek, 2003). For example, while English speakers mostly depict a ball rolling down the stairs in a single gesture expressing both the manner and the path of the event, Turkish and Japanese speakers usually depict the manner of rolling in one gesture and the path alone in another. These differences reflect how events are typologically expressed in the different languages: in one clause in English and in two clauses in Japanese and Turkish. Furthermore, these differences in iconic gestural depictions are not available to children immediately but are learned as systematic language-specific patterns over time, indicating that iconicity is not a completely universally accessible feature outside of the specific language in which it is embedded or integrated (Özyürek et al., 2008). It has also been found that Turkish-speaking children use iconic gestures much earlier than has been reported for English-speaking children because they use verbs earlier, in line with Turkish being a verb-framed language (Furman, Küntay, & Özyürek, 2014). There is also growing evidence that pointing gestures are embedded within language-specific demonstrative systems (see Cooperrider et al., 2021, for how pointing is integrated in different ways within signed and spoken language), which vary in terms of whether they encode spatial characteristics and/or joint attention between speaker and addressee. For example, Turkish speakers are reported to point to referents in context more often with one demonstrative (şu), used when the addressee's attention is not on the referent, than with the others (bu, o), which encode distance. Children have also been found to learn pointing with this demonstrative (şu) later than with the others (bu, o). Thus, pointing and demonstratives can be coupled in conventional and systematic ways in different languages, with different learning trajectories (Peeters and Özyürek, 2016; Küntay and Özyürek, 2006; see also Azar, Backus, & Özyürek, 2019, for language-specific points to space accompanying pronouns in reference tracking in Turkish discourse). There is also growing research on multimodal grammars more generally, in which both visible indexical and iconic ways of communicating are embedded in the grammars of different spoken languages (e.g., Floyd, 2016). Furthermore, within sign languages, it has been shown that access to sign iconicity is a subjective, culture-specific process tightly linked to signers' experience with their own sign language (Occhino, Anible, Wilkinson, & Morford, 2017). Finally, there is also crosslinguistic variation in the way iconic structures are recruited for spatial language in different sign languages (Perniss, Zwitserlood, & Özyürek, 2015). New research is needed to establish whether visible iconicity and indexicality universally facilitate acquisition and processing across languages.
Thus, as we start widening our view of language as a multimodal system and think of ways to integrate iconic and indexical uses within the core language system, we should consider crosslinguistic variation in this domain from the beginning. Rather than assuming that non-arbitrary uses are less systematic and less rule-governed, we should take a crosslinguistic approach to understanding systematicity and variation in human language structure, use, processing and transmission, based on the integration and flexible use of different multimodal semiotic resources, which possibly makes language the adaptive system it is across the globe.

ETHICS AND CONSENT
No ethical approval and/or consent was required.

COMPETING INTERESTS
The author has no competing interests to declare.