The thing with languages

Guy Shababo
Jun 7, 2023

Sitting right now at the 75th International Communication Association conference in Toronto, surrounded by experts, seems like a good time to say something about the kind of work I do here. The paper I will be presenting tomorrow focuses on the intersection of language and technology. In short, we argue that linguistic differences matter more than people assume, and that, particularly when we rely on technology, we tend to miss important things. This kind of knowledge is intuitive but tends to fall between the cracks:

Linguists rarely deal with the kind of generalizations we make here and are generally more interested in concrete, specific issues; the computation people tend to assume that with greater computing power and better algorithms one can solve any existing problem; and the end users, the researchers themselves, tend to see technology as a means to an end.

This is where Korean studies comes in.

In short, most tools for text analysis are designed by and for English speakers. English is quite an unusual language, as far as languages go. For example, it largely ignores grammatical gender, which survives only in a handful of pronouns (whereas many European languages assign a gender to every noun, with three genders in some of them, as in Latin). It also has very little morphology: it does not convey much additional information by changing the form of words.

Chosun Ilbo, June 28, 1950. Some Korean newspapers present a unique problem by mixing Korean and Classical Chinese in the same text. Simply replacing the Chinese with Korean means losing an important layer of the text.
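One way to avoid flattening a mixed-script text is to tag each run of characters by script instead of replacing anything. The snippet below is only a minimal sketch of that idea: the function names are made up for illustration, the sample phrase is invented rather than taken from the newspaper, and the two Unicode ranges cover only precomposed Hangul syllables and the basic block of Chinese characters.

```python
from itertools import groupby


def script_of(ch: str) -> str:
    """Classify one character by script (simplified, illustrative ranges)."""
    code = ord(ch)
    if 0xAC00 <= code <= 0xD7A3:   # precomposed Hangul syllables
        return "hangul"
    if 0x4E00 <= code <= 0x9FFF:   # basic CJK Unified Ideographs block
        return "han"
    return "other"                 # spaces, punctuation, digits, Latin, ...


def segment_by_script(text: str) -> list[tuple[str, str]]:
    """Split text into consecutive runs that share the same script label."""
    return [
        (script, "".join(chars))
        for script, chars in groupby(text, key=script_of)
    ]


if __name__ == "__main__":
    # Invented sample mixing Chinese characters and Hangul.
    for script, run in segment_by_script("大韓民國 국군"):
        print(script, repr(run))
```

Keeping the script label next to each run means a later step can still decide how to treat the Chinese characters, rather than having that decision made silently during cleaning.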

Korean studies scholars look at two unusual languages, both very different from English. Korean is morphologically rich and encodes some cool information, such as the register: how much respect we show the person we address and how formally we speak. Classical Chinese, on the other hand, is extremely analytic, providing close to no morphological information. Classical Chinese, or Literary Sinitic, is also flexible in its word order.

Of all the manipulations we perform on text, the most sensitive is probably what we call "preprocessing": the process of cleaning the text. This deserves a post of its own, but in a nutshell, when we clean a text we also remove information, and the information removed is not the same across languages. For example, when we change the verb "went" to "go" we remove the tense. However, when we change 앉았습니다 to 앉다 we remove not only the tense but also the fact that the speaker is talking in a formal, polite way.
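To make the point concrete, here is a toy sketch of the two examples above. It is not a real lemmatizer: the two-entry lookup table, the feature names, and the function names are all assumptions made up for illustration. It only shows that reducing a word to its dictionary form discards a different amount of information in each language.

```python
# Hypothetical hand-written lexicon: surface form -> (lemma, features that
# the surface form carries and the lemma does not).
TOY_LEXICON = {
    "went": ("go", {"tense": "past"}),
    "앉았습니다": ("앉다", {"tense": "past", "speech_level": "formal polite"}),
}


def lemmatize(token: str) -> str:
    """Return the dictionary form only, as a typical cleaning step would."""
    lemma, _features = TOY_LEXICON.get(token, (token, {}))
    return lemma


def lost_information(token: str) -> dict:
    """Return the features that are thrown away when the token is lemmatized."""
    _lemma, features = TOY_LEXICON.get(token, (token, {}))
    return dict(features)


if __name__ == "__main__":
    for token in ("went", "앉았습니다"):
        print(token, "->", lemmatize(token), "| lost:", lost_information(token))
```

In this sketch the English verb loses one feature and the Korean verb loses two. A pipeline that stored the output of something like lost_information alongside the lemma would at least keep the register available for later analysis.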

How do we solve this? I have no simple recipe but it seems that the first step is, as always, awareness.
