The dominance of the English language in Computational Linguistics and Automatic Text Processing has taught us that preprocessing is generally a good idea. In particular, lemmatization (the transformation of a word to its canonical form, usually the least marked form) is good. It reduces the dimensionality of the problem and the size of the data, and allows better generalization. That is, when we process English text, it is more useful to collapse "breaking", "broke" and "breaks" into the single form "break". This is particularly true when dealing with bag-of-words algorithms such as Topic Modeling.
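To make the dimensionality point concrete, here is a minimal sketch. The lemma table is a hand-written toy (an assumption for illustration, not a real lemmatizer), and the sample sentence is made up.

```python
from collections import Counter

# Toy lemma table: an assumption for illustration only.
# A real pipeline would use a proper lemmatizer instead.
LEMMAS = {"breaking": "break", "broke": "break", "breaks": "break"}

def lemmatize(tokens):
    """Replace each token with its lemma if we know one, else keep it."""
    return [LEMMAS.get(t, t) for t in tokens]

tokens = "the wave breaks as the glass broke and ice keeps breaking".split()

raw_bag = Counter(tokens)               # bag of surface forms
lemma_bag = Counter(lemmatize(tokens))  # bag of lemmas

print(len(raw_bag), len(lemma_bag))
print(lemma_bag["break"])
```

The three inflected forms occupy three separate dimensions in the raw bag and a single one after lemmatization, which is exactly the kind of merging a topic model benefits from.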
But what about Korean?
Korean raises a particular set of issues that are worth considering. One is the use of honorifics. Consider, for example, the title here. The address form "대통령님" carries a double honorific, which can be removed so that the word groups with the plain "대통령". In both cases it is equivalent to the American "Mr. President". However, it conveys an extra level of politeness that we need to be aware of when analyzing political texts. The same is true for any honorific or polite marker. There is a big difference between 말씀하다 and 말하다, and it has to do with what Alan Dundes called the 'texture' in his seminal "Texture, Text, and Context of the Folklore Text vs. Indexing".
Another great example is the tendency of some news outlets to write 美 or 美국 instead of 미국 (for example here). It is literally the same as 미국 (here). Both are pronounced miguk and mean America. For all practical purposes we can preprocess these and put all three forms in the same bag. But is it always the right thing to do? In this case, the hanja form evokes a sense of traditionalism, and is associated with newspapers from the early days of journalism in Korea.
In this picture we can see the old form "米國", which is less used now. It has an interesting history of its own. This special issue was printed on the 28th of June, 1950, announcing that the forces of North Korea had liberated Seoul. It might have been printed by North Korean sympathizers or even by the North Korean forces themselves. Clearly, it does not appear in the official archives.
Back to our business: should we lemmatize Korean and lose important subtext? The answer, as always, depends on what you are trying to achieve. In some cases yes, and in others perhaps not. We might want to retain some of the additional information. One way of doing that is augmentation: keeping the additional information alongside the data, for example in the form of tags.
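A rough sketch of what such augmentation could look like follows. Everything in it is an assumption for illustration: the `VARIANTS` table is hand-built from the examples in this post, the tag names (`honorific`, `hanja`, `archaic`) are made up, and in practice the lemma and tags would come from a Korean morphological analyzer rather than a lookup table.

```python
from collections import Counter
from typing import NamedTuple

class Token(NamedTuple):
    surface: str      # the form as it appeared in the text
    lemma: str        # the normalized form used for bagging
    tags: frozenset   # the subtext we do not want to lose

# Hypothetical hand-built table: surface form -> (lemma, tags).
VARIANTS = {
    "대통령님": ("대통령", {"honorific"}),
    "말씀하다": ("말하다", {"honorific"}),
    "美": ("미국", {"hanja"}),
    "美국": ("미국", {"hanja"}),
    "米國": ("미국", {"hanja", "archaic"}),
}

def augment(surface):
    """Normalize a surface form but keep what was removed as tags."""
    lemma, tags = VARIANTS.get(surface, (surface, set()))
    return Token(surface, lemma, frozenset(tags))

tokens = [augment(w) for w in ["대통령님", "대통령", "美", "미국", "美국"]]

# The lemma bag gets the benefit of normalization...
lemma_bag = Counter(t.lemma for t in tokens)
# ...while the tags remain available as separate features.
tag_bag = Counter(tag for t in tokens for tag in t.tags)

print(lemma_bag)
print(tag_bag)
```

With this layout a topic model can work on the lemmas, while a later analysis of politeness or traditionalism can still ask how often a source uses the honorific or hanja forms.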