About: On the art of taming and exploiting parallel tags in a multilingual corpus



An Entity of Type : http://linked.opendata.cz/ontology/domain/vavai/Vysledek, within Data Space : linked.opendata.cz associated with source document(s)

Attributes	Values
rdf:type	skos:Concept http://linked.opendata.cz/ontology/domain/vavai/Vysledek
rdfs:seeAlso	http://utkl.ff.cuni.cz/~rosen/public/2010_unitags_slavicorp.pdf
Description	Multilingual parallel corpora can be annotated with monolingual tools, such as morphosyntactic taggers. However, even taggers for typologically similar languages often use incompatible tagsets, which results in conceptual and formal variety of tags within a single corpus. Retraining taggers on data annotated with a common tagset is not a realistic option. Differences between tagsets are often rooted in different linguistic perspectives rather than in real distinctions between the languages, which means there is a good chance of finding common ground. Moreover, a different perspective may provide additional information missing in one tagset but present in another. Our first goal is to delegate the task of dealing with multiple tagsets to an abstract interlingual representation of linguistic categories. Ideally, each tag in every language-specific tagset used in the corpus is linked to a position in a tangled hierarchy of concepts. To accommodate the different perspectives, the hierarchy takes three views of word class. The Czech tag for a relative pronoun is decoded as a category with the properties of inflectional adjective, syntactic noun, and semantic pronoun, each with its appropriate morphological characteristics. Comparison of different tagsets reveals mismatches, where tags are seen as ambiguous with respect to concepts. Such mismatches are properly represented, which allows for a principled mapping strategy between language-specific tagsets, and for intuitive and underspecified queries. The hierarchy can be built and the mismatches partially resolved using Formal Concept Analysis (Ganter & Wille, 1999). Our second goal is to refine existing morphosyntactic annotation by projecting distinctions in one tagset onto a conceptually different tagset. The hierarchy and automatic word-to-word alignment are used to learn from word tokens in another language. We show results of an experiment for different languages and tagsets, including untagged texts. (en)
Title	On the art of taming and exploiting parallel tags in a multilingual corpus (en)
skos:prefLabel	On the art of taming and exploiting parallel tags in a multilingual corpus (en)
skos:notation	RIV/00216208:11210/12:10132260!RIV13-MSM-11210___
http://linked.open...avai/riv/aktivita	Z
http://linked.open...avai/riv/aktivity	Z(MSM0021620823)
http://linked.open...iv/cisloPeriodika	0
http://linked.open...vai/riv/dodaniDat	2013
http://linked.open...aciTvurceVysledku	Rosen, Alexandr
http://linked.open.../riv/druhVysledku	J - Article in a scholarly periodical
http://linked.open...iv/duvernostUdaju	S - Complete and true data not subject to protection under special legal regulations
http://linked.open...titaPredkladatele	Univerzita Karlova v Praze / Filozofická fakulta
http://linked.open...dnocenehoVysledku	156355
http://linked.open...ai/riv/idVysledku	RIV/00216208:11210/12:10132260
http://linked.open...riv/jazykVysledku	eng - English
http://linked.open.../riv/klicovaSlova	multilinguality; formal concept analysis; linguistic ontology; morphosyntactic tags; parallel corpus (en)
http://linked.open.../riv/klicoveSlovo	parallel corpus; formal concept analysis; linguistic ontology; morphosyntactic tags; multilinguality
http://linked.open...odStatuVydavatele	PL - Republic of Poland
http://linked.open...ontrolniKodProRIV	[E6ACE25E005B]
http://linked.open...i/riv/nazevZdroje	Prace Filologiczne
http://linked.open...in/vavai/riv/obor	AI
http://linked.open...ichTvurcuVysledku	1 (xsd:int)
http://linked.open...cetTvurcuVysledku	1 (xsd:int)
http://linked.open...UplatneniVysledku	2012
http://linked.open...v/svazekPeriodika	63
http://linked.open...iv/tvurceVysledku	Rosen, Alexandr
http://linked.open...n/vavai/riv/zamer	Czech National Corpus and Corpora of Other Languages
issn	0138-0567
number of pages	16 (xsd:int)
http://localhost/t...ganizacniJednotka	11210
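
The Description above mentions linking tags to a hierarchy of concepts built with Formal Concept Analysis. The following is a minimal sketch of that idea, not the author's implementation. Only the decomposition of the Czech relative pronoun (inflectional adjective, syntactic noun, semantic pronoun) comes from the abstract; the tag names, the property labels, the other three tags, and the function names are hypothetical, and the brute-force enumeration is only practical for toy contexts.

```python
from itertools import combinations

# Hypothetical toy formal context: tag -> properties under three views of
# word class (inflectional, syntactic, semantic). Only the first entry
# follows the abstract; the rest are invented for illustration.
CONTEXT = {
    "cs:relative_pronoun": {"infl:adjective", "syn:noun", "sem:pronoun"},
    "cs:adjective": {"infl:adjective", "syn:adjective", "sem:adjective"},
    "cs:noun": {"infl:noun", "syn:noun", "sem:noun"},
    "xx:pronoun": {"sem:pronoun"},  # a tagset that only marks the semantic view
}


def extent(attrs, context):
    """Return all tags that carry every property in attrs."""
    return frozenset(t for t, props in context.items() if set(attrs) <= props)


def intent(tags, context):
    """Return the properties shared by every tag in tags."""
    shared = set(frozenset().union(*context.values()))
    for tag in tags:
        shared &= context[tag]
    return frozenset(shared)


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every subset of tags."""
    concepts = set()
    tags = list(context)
    for size in range(len(tags) + 1):
        for subset in combinations(tags, size):
            shared = intent(subset, context)
            concepts.add((extent(shared, context), shared))
    return concepts


CONCEPTS = formal_concepts(CONTEXT)
```

In this sketch an underspecified query is just a call to `extent`: asking for `{"sem:pronoun"}` returns both the Czech relative pronoun tag and the hypothetical `xx:pronoun` tag, even though the two tagsets describe pronouns from different perspectives.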
