Syntactic Complexity of Learning Content in Italian for COVID-19 Frontline Responders: A Study on WHO’s Emergency Learning Platform

Giuseppe Samo
Department of Linguistics, Beijing Language and Culture University,
Xue Yuan Road, 15, 10083 Beijing (People’s Republic of China)
Email: samo@blcu.edu.cn
ORCID iD: https://orcid.org/0000-0003-3449-8006
Research Interests: Syntax, quantitative methods, computational models.

Ursula Yu Zhao
Department of Digital Health and Innovation, World Health Organization
Avenue Appia, 20, 1211 Geneva (Switzerland)
Email: zhaoy@who.int
ORCID iD: https://orcid.org/0000-0002-9655-4379
Research Interests: Public Health Policy, Digital Health, Knowledge Transfer in Multilingualism.

Gaya Gamhewage
WHO emergencies programme, World Health Organization
Avenue Appia, 20, 1211 Geneva (Switzerland)
Email: gamhewageg@who.int
ORCID iD: https://orcid.org/0000-0003-2536-9173
Research Interests: Capacity building, Learning, Training.

Abstract. The goal of this paper is to offer a model to quantify the level of complexity of the linguistic content of a corpus in Italian extracted from OpenWHO, WHO’s health emergency learning platform (Rohloff et al. 2018; Zhao et al. 2019). We investigate the nature of the computational ranking costs of a typology of relativization strategies. To this end, the results for the corpus are compared with three other syntactically annotated corpora of Italian belonging to different genres (news, social media, encyclopedic entries, legal). The results show that online learning content in public health reduces the use of syntactically complex structures. The case study presented here provides a methodology to quantify syntactic and computational complexity in corpus studies.

Key words: Syntax, Italian, Relatives, Covid-19, Learning Platforms

JEL Code: G35

Copyright © 2019 Name Surname. Published by Vilnius University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Pateikta / Submitted on 13.12.20

1. The complexity of the knowledge content on health emergency learning platforms: focus on a syntactic strategy

The recent COVID-19 pandemic has increased “the demand for trusted information to help frontline personnel and communities respond to the outbreak” (Gamhewage et al. 2020, p. 257). While the cost of replicating and disseminating information is negligible (David & Foray 2002), it takes much more effort to reproduce knowledge through learning (Foray 2001). Specifically, effective knowledge reproduction requires that information be accessible to learners, where accessibility is understood as the computational cost of retrieving content. Characteristics such as simplicity, exhaustiveness and low computational costs for the addressees, or users in the context of online learning, are key to reproducing actionable knowledge.

Linguistic barriers have been observed to reduce accessibility for users, thus halting the knowledge transfer process, given that knowledge is created and reproduced through human language. The complexity of the knowledge content often determines the effort required and the potential challenges encountered during translation. This is particularly true in the context of health emergencies, in which medical information needs to be accessed by an affected population that speaks a local substandard variety (Translators without Borders, 2017). Being able to evaluate the linguistic complexity of the source content and the linguistic variability of the target population could potentially facilitate the knowledge transfer process during a public health or societal emergency.

A domain of linguistics that offers quantifiable variables for corpus-based studies, while retaining a crosslinguistic dimension, is syntax. Formal linguistics can provide a definition of a quantifiable computational (syntactic) complexity, measured in terms of complex syntactic structures and operations (see Rizzi & Cinque 2016 for an overview), to be investigated in various genres targeting different addressees.

Are we able to quantify such dimensions of variation? What would be possible measurements to linguistically quantify the levels of complexity in the health learning contents?

This paper explores these questions by investigating the level of complexity of the linguistic content extracted from the open learning platform OpenWHO, hosted by the World Health Organization (Rohloff et al. 2018; Zhao et al. 2019). It also serves as a case study in assessing the computational complexity of a specific syntactic construction in Italian, measured through the frequency of a typology of relative clauses.

To accomplish our goals, we first analyse the nature of the computational ranking costs of a typology of relativization strategies, narrowing the analysis to Italian in order to provide both qualitative and quantitative insights.

In this work, we quantify a single case of syntactic complexity discussed in psycholinguistics and cognitive studies (Friedmann et al. 2009). We focus on a set of structures that have been shown to display clear asymmetries in parsing in studies of language development and language pathology (Martini et al. 2019, for an overview). We adopt the computational point of view stating that “differentials in frequency are hallmarks of underlying grammatical properties” (Merlo 2016; Samo & Merlo 2019, p. 3), and postulate that the more complex a syntactic configuration is in terms of cognitive impact, the less frequently it will occur in a given dataset.

We first determine a quantifiable notion of complexity by selecting one specific type of construction as a grammatical (in the sense of accepted and naturally occurring) “complex” structure, namely relativization by embedding on the subject (e.g., “the authors that write this sentence are linguists”) and on the object (e.g., “this sentence that the authors are writing is an example”).

We then determine a quantifiable dimension of complexity, represented by the frequency of relative clauses, both on subjects and objects, in restrictive and non-restrictive cases. Relative clauses convey two different layers of complexity. A first level is associated with the fact that in every relative clause, one of the components (the relativizer) logically represents a variable that needs to be parsed and indexed. A second level concerns the asymmetry between subject and object relatives. Leaving aside technicalities that are beyond the scope of this paper, studies in psycholinguistics (Frauenfelder et al. 1980) have shown that subject relatives are computationally easier to parse than object relatives. Object relatives, though grammatical for adult speakers, are parsed more slowly and are partially ungrammatical in early grammars (Friedmann et al. 2009) and in populations with specific language impairments and atypical development (on aphasia, Grillo 2008; on Autism Spectrum Disorder, Durrleman et al. 2015). In a similar vein, these structures challenge neural-network-based techniques in Natural Language Processing (Merlo & Ackermann 2018). The theoretical account developed in Rizzi (1990, 2013) attributes this to the fact that the subject acts as an intervener in the extraction of the object from its locus of generation (to the “right” of the verb in a Subject-Verb-Object language) to the very beginning of the relative clause (which displays an Object-Subject-Verb order). Subject relatives, by contrast, are better mastered since no intervention is at play (Friedmann et al. 2009).

In this paper, we conduct a study on Italian. In Italian (as in the English examples given above), object relative clauses involve the displacement of syntactic constituents. The peculiarity of Italian is that the construction of object relatives modifies the canonical order (Subject-Verb-Object) in three possible patterns: (i) Object-relativizer-Subject-Verb (la frase che l’autore sta scrivendo ‘the sentence that the author is writing’), (ii) Object-relativizer-Verb-Subject (la frase che sta scrivendo l’autore, displaying a post-verbal subject), or (iii) Object-relativizer-Verb (la frase che sta scrivendo, with a null subject). In all cases, we observe a reordering of the canonical syntactic architecture of Italian.

Our linking hypothesis posits a direct relation between the complexity of a structure and its frequency in a specific corpus, here one extracted from the OpenWHO emergency learning platform, whose content addresses COVID-19 frontline responders.

In this work, we analyse syntactic complexity at both an intratextual level and a cross-textual level. Two research questions are investigated.

The first research question focuses on the frequencies of subject and object relatives in the emergency platform’s content. We expect that more complex structures (object relatives) should result in fewer occurrences with respect to less complex configurations (subject relatives). This hypothesis can be stated in H₁.

H₁: The raw counts of subject relatives are expected to be greater than the raw counts of object relatives in the content of an emergency platform.

As for the cross-textual analysis, a comparison with other text genres conveying information (e.g., news, encyclopedic entries, social media) is required. The second research question targets the typology of text genres. Compared with other genres, a corpus for health emergencies should ideally reduce complexity. To test this, we build a simple computational model of expected and observed counts. We can state the second hypothesis in H₂.

H₂: The textual content extracted from OpenWHO should be less complex than other genres (e.g., news, encyclopedic entries, social media).

In section 2, we provide a description of the materials and methods used for the study. In section 3, we restate our hypotheses according to the presented materials and methods. In section 4, we demonstrate how the syntactic complexity found in the OpenWHO corpus in Italian is reduced with respect to corpora conveying information content via news, encyclopedic entries and social media. Finally, section 5 concludes.

2. Materials & Methods

2.1 Materials

The experimental corpus under investigation is entirely extracted from a subset of online learning materials published on WHO’s health emergency learning platform OpenWHO (Rohloff et al. 2018). A definition of the platform is given in Zhao et al. (2019, p. 261), described as the “open online platform, introducing massive online learning into health emergency response”.

All the materials are open and freely available. Our dataset (6,974 tokens) consists of the technical content tailored for health professionals responding to Covid-19 (Infection Prevention and Control for COVID-19 Virus) in Italian.^[1] Since the same content is available in twenty other varieties^[2], a manual analysis of grammaticality and readability was carried out by the first author. We refer henceforth to this corpus as who.

From a lexical point of view, technical terms from the medical domain occur naturally, as expected. A first interesting observation concerns the lexical entries used to denote the virus that triggered the pandemic. In terms of frequency, we detect a similar distribution in the usage of the terms Covid-19 (occurrences 29, 0.0109/tokens) and Coronavirus (occurrences 19, 0.0064/tokens). On the other hand, we do not retrieve any occurrences of Virus Corona / Viruscorona, which might be the expected morphological realization in Italian, following prescriptive rules of word formation.

From a morphological/morphosyntactic perspective, we observe a rich usage of infinitive forms (attivare ‘to activate’, iniziare ‘to start’) and modals. By contrast, only a small number of imperatives (lavati le mani ‘wash your hands’) is used in the text. As for syntax, the text is generally simple: we observe an extremely reduced frequency of local variables such as anaphoric pronouns and of embedding structures.

We compare the experimental corpus results with three different “control” groups: (i) an Italian text derived from a multilingual text conveying knowledge, (ii) a large-scale corpus of Italian news, encyclopedic entries and legal texts and, finally, (iii) a corpus representative of social media in Italian. We selected a set of syntactically annotated corpora (treebanks) to automatically retrieve the elements under investigation. The first treebank, or control group (i), is the Italian version of the parallel treebank pud^[3] v.2.5 (23,731 tokens; genre: news, encyclopedic entries), containing strings of text extracted from news and encyclopedic entries and syntactically annotated following the guidelines of Universal Dependencies (Nivre 2015; Zeman et al. 2020). This treebank, like our corpus, belongs to a class of texts with the same content translated into different varieties (21 languages). The second treebank, or control group (ii), is isdt^[4] (298,343 tokens; genre: legal, news, wiki; Bosco et al. 2013), a large-scale treebank comprising Italian legal texts, news and encyclopedic entries. Finally, the third treebank, or control group (iii), is based on textual material extracted from social media in Italian, twittiro (henceforth, twit; 29,605 tokens; genre: social media; Cignarella et al. 2018). The first author manually cleaned the relevant data.

2.2 Methods

As for methods, a two-step analysis is conducted on the who corpus. We first process the corpus in R (R Development Core Team, 2016), creating functions to count the occurrences of the relevant relativizer, and then carry out a manual investigation. The results provide subject relatives of the type illustrated by the naturally occurring example i sistemi che dovrebbero essere disponibili per permettere una risposta rapida ‘the systems that should be in place to enable a rapid response’, and object relatives of the type Chi si prende cura del paziente e i membri della famiglia dovrebbero (se possibile) essere informati sul tipo di cure che dovrebbero fornire ‘Caregivers and family members should (if possible) be advised on the type of care they are supposed to be providing’.

Automatic retrieval of the elements is then performed on the three control datasets pud, isdt and twit with the tool Grewmatch (http://match.grew.fr/)^[5], followed by a qualitative analysis of the relativizer elements. The queries are designed to search for a relativizer item and its argumental nature.^[6]
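To make the logic of the retrieval concrete, the following Python sketch mimics on a single sentence what the pattern in footnote 6 looks for: a node attached as acl:relcl whose dependents include an nsubj or obj. It is an illustration only, not the Grewmatch implementation. The CoNLL-U fragment is hand-annotated for this example, and the function names parse_conllu and relativized_roles, as well as the extra filter on PronType=Rel (used here to keep only the relative pronoun itself), are assumptions that do not come from the original query.

```python
# Hand-annotated CoNLL-U rows for illustration only (not taken from any treebank):
# "la frase che l'autore scrive" ('the sentence that the author writes').
ROWS = [
    ("1", "la", "il", "DET", "_", "_", "2", "det", "_", "_"),
    ("2", "frase", "frase", "NOUN", "_", "_", "0", "root", "_", "_"),
    ("3", "che", "che", "PRON", "_", "PronType=Rel", "6", "obj", "_", "_"),
    ("4", "l'", "il", "DET", "_", "_", "5", "det", "_", "_"),
    ("5", "autore", "autore", "NOUN", "_", "_", "6", "nsubj", "_", "_"),
    ("6", "scrive", "scrivere", "VERB", "_", "_", "2", "acl:relcl", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)


def parse_conllu(text):
    """Return one list of token dicts per sentence (hypothetical helper)."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):  # skip sentence-level comments
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():  # skip multiword ranges and empty nodes
            continue
        current.append({
            "id": int(cols[0]),
            "form": cols[1],
            "feats": cols[5],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    if current:
        sentences.append(current)
    return sentences


def relativized_roles(sentence):
    """List (head noun, role of the relative pronoun) for each relative clause."""
    by_id = {token["id"]: token for token in sentence}
    found = []
    for clause in sentence:
        if clause["deprel"] != "acl:relcl":
            continue
        head_noun = by_id.get(clause["head"])
        if head_noun is None:
            continue
        for dep in sentence:
            is_argument = dep["head"] == clause["id"] and dep["deprel"] in ("nsubj", "obj")
            # Assumption: keep only the relative pronoun, identified by PronType=Rel.
            if is_argument and "PronType=Rel" in dep["feats"]:
                found.append((head_noun["form"], dep["deprel"]))
    return found


roles = relativized_roles(parse_conllu(SAMPLE)[0])
print(roles)
```

On the sample sentence the sketch reports the head noun frase together with the role obj, that is, an object relative.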

The results are then described in terms of the increase/decrease ratio D of an element, calculated on the basis of expected and observed counts and inspired by Samo & Merlo (2019). Take, for example, the difference between the who and the pud corpora. Let E^{WHO/PUD} be the expected counts in who on the basis of pud, and O^{WHO} the observed counts in the who corpus. Let C^{PUD} be the total number of observations in pud scaled by the size of the pud corpus (its number of tokens), and let T^{WHO} be the total number of tokens of who. The expected counts of the relative-clause feature value are then calculated as E^{WHO/PUD} = C^{PUD} · T^{WHO}. Finally, D^{WHO/PUD} is given by the formula D^{WHO/PUD} = (O^{WHO} − E^{WHO/PUD}) / E^{WHO/PUD}, so that a negative value indicates fewer occurrences than expected.
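As a worked illustration, the computation can be sketched in Python for object relatives in the who/pud comparison, using the counts reported in Table 1 (109 object relatives in 23,731 tokens for pud; 1 object relative in 6,744 tokens for who). The function names are hypothetical, and D is taken here as (observed − expected) / expected, the sign convention under which the values in Table 2 come out negative.

```python
def expected_count(control_count: float, control_tokens: int, target_tokens: int) -> float:
    """E = C * T, where C is the relative frequency in the control corpus."""
    relative_frequency = control_count / control_tokens
    return relative_frequency * target_tokens


def d_coefficient(observed: float, expected: float) -> float:
    """Relative deviation from the expected count; negative means fewer than expected."""
    return (observed - expected) / expected


# Object relatives, who vs. pud, with the figures reported in Table 1.
e_who_pud = expected_count(control_count=109, control_tokens=23731, target_tokens=6744)
d_who_pud = d_coefficient(observed=1, expected=e_who_pud)
print(round(e_who_pud, 1), round(d_who_pud, 4))
```

Under these assumptions the sketch yields an expected count of about 31 and a D of about -0.9677, in line with the corresponding row of Table 2.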

In section 3, we restate our hypotheses presented in section 1 on the basis of the relevant materials, while results are discussed in section 4.

3. Quantifying Hypotheses

The asymmetry between subject relatives and object relatives should quantitatively result in different frequencies. Therefore, we expect that subject relatives should occur at a higher frequency than non-subject relatives. According to our materials, we can therefore restate H₁.

H₁: The raw counts of subject relatives are expected to be greater than the raw counts of object relatives in all the corpora under investigation. A ratio given by the number of the object relatives divided by the number of subject relatives should be minimized for the who corpus.

As for the second hypothesis, the predictions are formulated in terms of differences in numerical observations. In this case, we study the difference between the expected counts and the observed counts. We expect a quantitative analysis of the who corpus to show reduced complexity compared with the other treebanks of texts delivering knowledge. We thus expect to detect a lower frequency of relative clauses (both subject and object relatives) in the who corpus than in the other treebanks. We can thus restate H₂.

H₂: The coefficients of the complex structures (D^{WHO/PUD}, D^{WHO/ISDT}, D^{WHO/TWIT}) for both subject and object relatives should be negative.

Both hypotheses, H₁ and H₂, are to be contrasted with a null hypothesis predicting that the nature of a knowledge transfer platform does not correlate with the imputed values of the relation between expected and observed counts.

4. Results and discussion

Table 1 confirms H₁, since the raw counts of subject relatives are greater than the raw counts of object relatives in all the corpora under investigation. Focusing on the asymmetry between subject and object relative clauses, we confirm the observation that different corpora show very large fluctuations in the distribution of grammatical relative clauses (in line with trends observed in Roland et al., 2007), as can be seen from the proportions of subject and object relatives in Table 1. A more important piece of evidence lies in the ratio of object to subject relatives. Independently of the raw numbers of relative clauses, the ratio is extremely similar across the three “control groups” (pud 0.58, isdt 0.59, twit 0.62). As predicted, this ratio drops dramatically in the who corpus (0.04), indicating a reduction of complexity in this specific corpus. The findings confirm the prediction that complex structures such as object relatives are avoided where possible in the context of learning in emergencies.

Table-1: Distribution of relatives

	pud	isdt	twit	who
Tokens	23731	298343	29605	6744
Subject Relatives	186	2454.59	117	24
Subject Relatives / Tokens	0.00784	0.00823	0.00395	0.00356
Object Relatives	109	1455.6	71	1
Object Relatives / Tokens	0.00459	0.00488	0.00240	0.00015
Ratio (OR/SR)	0.58	0.59	0.62	0.04

Source: Authors’ calculation. Subject and object relatives in the treebanks under investigation.
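The derived rows of Table 1 can be recomputed from the raw counts. The following Python sketch does so, assuming that each proportion is a raw count divided by the number of tokens and that the ratio is object relatives divided by subject relatives; the names TABLE_1 and summarize are hypothetical, and recomputed values may differ from the printed ones in the last digit because of rounding.

```python
# Raw counts as reported in Table 1 (the isdt counts are imputed, hence non-integer).
TABLE_1 = {
    "pud": {"tokens": 23731, "subject": 186, "object": 109},
    "isdt": {"tokens": 298343, "subject": 2454.59, "object": 1455.6},
    "twit": {"tokens": 29605, "subject": 117, "object": 71},
    "who": {"tokens": 6744, "subject": 24, "object": 1},
}


def summarize(row):
    """Proportions per token and the object/subject (OR/SR) ratio for one corpus."""
    return {
        "subject_proportion": row["subject"] / row["tokens"],
        "object_proportion": row["object"] / row["tokens"],
        "or_sr_ratio": row["object"] / row["subject"],
    }


summary = {name: summarize(row) for name, row in TABLE_1.items()}
for name, values in summary.items():
    print(
        f"{name}: SR/tokens={values['subject_proportion']:.5f} "
        f"OR/tokens={values['object_proportion']:.5f} "
        f"OR/SR={values['or_sr_ratio']:.2f}"
    )
```

In every corpus the ratio stays below 1, as H₁ requires, and the who corpus has the lowest ratio of the four.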

Table 2 confirms H₂. All the Ds under investigation are negative. While the reduction of subject relatives varies among the corpora, the decrease of object relatives with respect to the expected counts is higher than 93% in all three comparisons.^[7]

Table-2: Observed vs. expected counts.

OpenWHO / News, Wiki
	O^{WHO}	E^{WHO/PUD}	D^{WHO/PUD}	Binomial p	z-p
Subject Rel.	24	52.9	-0.4037	0.00000207	0.000028
Object Rel.	1	31	-0.9677	< 0.00000001	< 0.000001
OpenWHO / Legal, News, Wiki
	O^{WHO}	E^{WHO/ISDT}	D^{WHO/ISDT}	Binomial p	z-p
Subject Rel.	24	55.5	-0.5676	0.00000004	0.000002
Object Rel.	1	32.9	-0.9696	< 0.00000001	< 0.000001
OpenWHO / Twitter
	O^{WHO}	E^{WHO/TWIT}	D^{WHO/TWIT}	Binomial p	z-p
Subject Rel.	24	26.7	-0.1011	0.06850393	0.316441
Object Rel.	1	16.2	-0.9395	0.00000003	0.000015

Source: Authors’ calculation. Observed counts, expected counts, D coefficients and binomial test for subject and object relatives. Results confirming the hypothesis are given in bold.
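The kind of binomial computation described in footnote 7 can be sketched in Python with the standard library alone. The sketch takes the subject relatives of the who/twit comparison and assumes, for illustration, that the number of trials is the token count of who (6,744) and that the base probability is the twit proportion (117 / 29,605). Footnote 7 speaks of utterances as trials, so these inputs are assumptions and the output is not meant to reproduce the figures of Table 2 exactly; the function names are hypothetical.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_lower_tail(k: int, n: int, p: float) -> float:
    """One-tailed probability of observing k or fewer successes."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


# Assumed inputs: tokens of who as trials, twit proportion as base probability.
n_trials = 6744
base_p = 117 / 29605
observed = 24

p_exact = binomial_pmf(observed, n_trials, base_p)
p_at_most = binomial_lower_tail(observed, n_trials, base_p)
print(f"P(X = {observed}) = {p_exact:.4f}")
print(f"P(X <= {observed}) = {p_at_most:.4f}")
```

With these inputs neither probability falls below conventional significance thresholds, which is compatible with the comparatively weak reduction of subject relatives with respect to twit.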

Both hypotheses are therefore confirmed. We hereby conclude that the emergency platform presented in this study reduces its ‘syntactic’ complexity to possibly transfer knowledge in a fast and efficient way.

5. Conclusion

To sum up, reduced computational costs in terms of syntactic complexity (focusing on relative clauses) are clearly observed in the source data studied in this paper, compared with those of corpora conveying information in other genres (news, encyclopedic entries, social media material, etc.). Moreover, theoretical syntax offers a quantifiable measure to detect layers of complexity in texts, which possibly leads to an improved speed of knowledge transfer processes in emergency contexts.

The simple computational model presented in this paper provides a straightforward methodology for detecting layers of complexity in specific genres such as learning platforms. We envision potential applications that adopt syntactic complexity as a measure in classification tasks.

In future work, a multilingual dimension could be included to investigate syntactic strategies and to observe whether syntactic complexity is in fact reduced crosslinguistically. A comparison with a variety of health emergencies or with types of corpora made for medical guidelines would also be valuable.

The findings discussed in this paper highlight the role that corpus studies could play in the knowledge transfer process for a specialised domain such as public health. We conclude that a digital learning platform such as OpenWHO delivers the maximum amount of information in less complex language. Last but not least, this study offers opportunities for further crosslinguistic analysis in the domain of formal linguistics and provides an entry point for learning experts to determine the required digital learning standard in the context of multilingualism.

References

BOSCO, C. et al., 2013. Converting Italian Treebanks: Towards an Italian Stanford Dependency Treebank. In: 7th Linguistic Annotation Workshop and Interoperability with Discourse, The Association for Computational Linguistics, 61–69. Available from: https://www.aclweb.org/anthology/W13-2308 (accessed on June 24th, 2020).

CIGNARELLA, A. et al., 2018. Application and analysis of a multi-layered scheme for irony on the Italian Twitter Corpus TWITTIRO. In: LREC 2018, Eleventh International Conference on Language Resources and Evaluation, European Language Resources Association (ELRA), 4204–4211.

DAVID, P., FORAY, D., 2002. An Introduction to the Economy of the Knowledge Society. International Social Science Journal, 54: 9–23. doi: 10.1111/1468-2451.00355.

DURRLEMAN, S. et al., 2015. Complex syntax in autism spectrum disorders: a study of relative clauses. International Journal of Language & Communication Disorders, 50(2), 260–267.

FORAY, D., 2004. Economics of Knowledge. Cambridge MA: MIT Press.

FRAUENFELDER, U.H. et al., 1980. Monitoring around the relative clause. Journal of Verbal Learning and Verbal Behavior, 19(3), 328–337.

FRIEDMANN, N. et al., 2009. Relativised relatives: Types of intervention in the acquisition of A-bar dependencies. Lingua, 119, 67–88.

GAMHEWAGE, G. et al., 2020. Fast-tracking WHO’s COVID-19 technical guidance to training for the frontline. Weekly Epidemiological Record, 23/24, 264. Available from: https://www.who.int/wer/2020/wer9523-24/en/ (accessed on June 24th, 2020).

GRILLO, N., 2008. Generalized minimality: syntactic underspecification in Broca’s aphasia, PhD Dissertation, University of Utrecht.

MARTINI, K. et al., 2019. Syntactic complexity in the presence of an intervener: the case of an Italian speaker with anomia. Aphasiology. doi: 10.1080/02687038.2019.1686744.

MERLO, P., 2016. Quantitative computational syntax: some initial results. Italian Journal of Computational Linguistics, 2(1), 11–29.

MERLO, P., ACKERMANN, F., 2018. Vectorial semantic spaces do not encode human judgments of intervention similarity. In: Proceedings of the 22nd Conference on Computational Natural Language Learning, The Association for Computational Linguistics, 392–401. Available from: https://www.aclweb.org/anthology/K18-1038 (accessed on June 24th, 2020).

NIVRE, J., 2015. Towards a Universal Grammar for Natural Language Processing. In: A. GELBUKH. Computational Linguistics and Intelligent Text Processing: CICLing 2015, Lecture Notes in Computer Science. Cham: Springer, 3–16.

R DEVELOPMENT CORE TEAM, 2016. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.

RIZZI, L., 1990. Relativized Minimality. Cambridge MA: MIT Press.

RIZZI, L., 2013. Introduction: Core computational principles in natural language syntax. Lingua, 130, 1–13.

RIZZI, L., CINQUE, G., 2016. Syntactic categories and functional categories. Annual Review of Linguistics, 2, 139–163.

ROHLOFF, T. et al., 2018. OpenWHO: Integrating Online Knowledge Transfer into Health Emergency Response. In: EC-TEL (Practitioner Proceedings). CEUR workshop proceedings (ISSN 1613-0073). Available from: http://ceur-ws.org/Vol-2193/ (accessed on June 24th, 2020).

ROLAND, D. et al., 2007. Frequency of basic English grammatical structures: A corpus analysis. Journal of Memory and Language, 57, 348–379.

SAMO, G., MERLO, P., 2019. Intervention effects in object relatives in English and Italian: a study in quantitative computational syntax. In: Proceedings of Quasy (Syntaxfest, Paris 26-28 August 2019), The Association for Computational Linguistics, 46–56. Available from: https://www.aclweb.org/anthology/W19-7906/ (accessed on June 24th, 2020).

TRANSLATORS WITHOUT BORDERS, 2017. Does Translated Health-Related Information Lead to Higher Comprehension? A Study of Rural and Urban Kenyans. Available from: https://translatorswithoutborders.org/wp-content/uploads/2016/08/TWB_WoR_ImpactStudy_FINAL.pdf (retrieved on September 26th, 2020).

ZHAO, Y. et al., 2019. OpenWHO traffic analysis: Can we predict non-profit course reach by dissemination channel? In: EMOOCs-WIP. CEUR workshop proceedings (ISSN 1613-0073), 261–266. Available from: http://ceur-ws.org/Vol-2356/ (retrieved on June 24th, 2020).

ZEMAN, D. et al., 2020. Universal Dependencies 2.6. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL). Faculty of Mathematics and Physics - Charles University. Available from: http://hdl.handle.net/11234/1-3226 (retrieved on June 24th, 2020).

Giuseppe Samo: Associate Professor, Department of Linguistics, Beijing Language and Culture University, Xue Yuan Road 15, 10083 Beijing (People’s Republic of China), Email: samo@blcu.edu.cn

Ursula Yu Zhao: Technical officer, Department of Digital Health and Innovation, World Health Organization, Avenue Appia 20, 1211 Geneva (Switzerland), Email: zhaoy@who.int

Gaya Gamhewage: Unit Head, Learning and capacity development unit, WHO Emergencies Programme, World Health Organization, Avenue Appia 20, 1211 Geneva (Switzerland), Email: gamhewageg@who.int

^[1]The materials are freely downloadable at the following link: https://openwho.org/courses/COVID-19-PCI-IT (accessed and downloaded on May 24th, 2020). The existence of the same text in other varieties leads us to define the material under investigation as extracted from a parallel corpus.

^[2]This datum is retrieved on January 9th, 2021.

^[3]The relevant documents on the treebank can be found here: https://github.com/UniversalDependencies/UD_Italian-PUD/blob/master/README.md (accessed January 09th, 2021).

^[4]The documents of the treebank are to be found here: https://universaldependencies.org/treebanks/it_isdt/index.html (accessed January 09th, 2020).

^[5]Since the tool adopted in the investigation (accessed May 28th, 2020) provided only the first 1000 occurrences of a query, an imputed count was calculated on the basis of the occurrences retrieved. This coefficient is calculated to provide a better understanding of a predictive tool. The trees are used as coefficient instead of subjects to keep an analysis in terms. With I an imputed count, F the frequency of the result, and C the percentage of the corpus exploited, the imputed count is derived from the formula I = (F · (1 − C))/C + F.
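A minimal Python sketch of the imputation in footnote 5 follows, reading the formula as I = (F · (1 − C))/C + F, which is algebraically equal to F / C. It assumes that C is expressed as a fraction between 0 and 1 rather than as a percentage; the function name and the example numbers are hypothetical and are not taken from the study.

```python
def imputed_count(found: float, coverage: float) -> float:
    """Scale the F occurrences found in a fraction C of the corpus to the whole corpus."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be a fraction in the interval (0, 1]")
    # I = (F * (1 - C)) / C + F, which simplifies to F / C.
    return found * (1 - coverage) / coverage + found


# Hypothetical example: 1000 hits returned after exploring half of a corpus.
example = imputed_count(found=1000, coverage=0.5)
print(example)
```

With full coverage (C = 1) the imputed count equals the observed frequency, as expected.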

^[6]The query [pattern { a -[acl:relcl]-> b; b -[nsubj | obj ]-> c }] retrieves a node a governing an element b through the relative-clause relation (acl:relcl), where b in turn governs an element c bearing a subject (nsubj) or object (obj) relation.

^[7]Whether the expected and observed counts presented here support or reject the hypothesis is established by a binomial test. The binomial test gives us the probability of k successes (relative clauses) in N independent trials (utterances), given a base probability p (the probability given by the distribution of relatives) of an event. Binomial p indicates the probability of the observed counts under a binomial distribution (the binomial test). z-p gives us the (one-tailed) probability of exactly the observed counts.