About: On the art of taming and exploiting parallel tags in a multilingual corpus

An Entity of Type : http://linked.opendata.cz/ontology/domain/vavai/Vysledek, within Data Space : linked.opendata.cz associated with source document(s)

Attributes    Values
rdf:type
rdfs:seeAlso
Description
  • Multilingual parallel corpora can be annotated with monolingual tools, such as morphosyntactic taggers. However, even taggers for typologically similar languages often use incompatible tagsets, which leaves a single corpus with conceptually and formally heterogeneous tags. Retraining taggers on data annotated with a common tagset is not a realistic option. Differences between tagsets are often rooted in different linguistic perspectives rather than in real distinctions between the languages, so there is a good chance of finding common ground. Moreover, a different perspective may provide additional information that is missing in one tagset but present in another. Our first goal is to delegate the task of dealing with multiple tagsets to an abstract interlingual representation of linguistic categories. Ideally, each tag in every language-specific tagset used in the corpus is linked to a position in a tangled hierarchy of concepts. To accommodate the different perspectives, the hierarchy takes three views of word class: inflectional, syntactic, and semantic. For example, the Czech tag for a relative pronoun is decoded as a category with the properties of an inflectional adjective, a syntactic noun, and a semantic pronoun, each with its appropriate morphological characteristics. Comparison of different tagsets reveals mismatches, where tags are seen as ambiguous with respect to concepts. Such mismatches are properly represented, which allows for a principled mapping strategy between language-specific tagsets, and for intuitive and underspecified queries. The hierarchy can be built and the mismatches partially resolved using Formal Concept Analysis (Ganter & Wille, 1999). Our second goal is to refine existing morphosyntactic annotation by projecting distinctions in one tagset onto a conceptually different tagset. The hierarchy and automatic word-to-word alignment are used to learn from word tokens in another language. We show results of an experiment for different languages and tagsets, including untagged texts.
Title
  • On the art of taming and exploiting parallel tags in a multilingual corpus
skos:prefLabel
  • On the art of taming and exploiting parallel tags in a multilingual corpus
skos:notation
  • RIV/00216208:11210/12:10132260!RIV13-MSM-11210___
http://linked.open...avai/riv/aktivita
http://linked.open...avai/riv/aktivity
  • Z(MSM0021620823)
http://linked.open...iv/cisloPeriodika
  • 0
http://linked.open...vai/riv/dodaniDat
http://linked.open...aciTvurceVysledku
http://linked.open.../riv/druhVysledku
http://linked.open...iv/duvernostUdaju
http://linked.open...titaPredkladatele
http://linked.open...dnocenehoVysledku
  • 156355
http://linked.open...ai/riv/idVysledku
  • RIV/00216208:11210/12:10132260
http://linked.open...riv/jazykVysledku
http://linked.open.../riv/klicovaSlova
  • multilinguality; formal concept analysis; linguistic ontology; morphosyntactic tags; parallel corpus (en)
http://linked.open.../riv/klicoveSlovo
http://linked.open...odStatuVydavatele
  • PL - Republic of Poland
http://linked.open...ontrolniKodProRIV
  • [E6ACE25E005B]
http://linked.open...i/riv/nazevZdroje
  • Prace Filologiczne
http://linked.open...in/vavai/riv/obor
http://linked.open...ichTvurcuVysledku
http://linked.open...cetTvurcuVysledku
http://linked.open...UplatneniVysledku
http://linked.open...v/svazekPeriodika
  • 63
http://linked.open...iv/tvurceVysledku
  • Rosen, Alexandr
http://linked.open...n/vavai/riv/zamer
issn
  • 0138-0567
number of pages
http://localhost/t...ganizacniJednotka
  • 11210
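
The Description above says that the concept hierarchy can be built using Formal Concept Analysis. As a rough illustration only, and not the author's implementation, the sketch below computes the formal concepts of a toy context in which tags are objects and word-class properties are attributes. The function name `formal_concepts`, the tag names, and the attribute labels (`infl:`, `syn:`, `sem:`) are all hypothetical. Only the decomposition of the Czech relative pronoun into inflectional adjective, syntactic noun, and semantic pronoun is taken from the record; the `xx:` tags stand for an invented second tagset.

```python
def formal_concepts(context):
    """Return all formal concepts (extent, intent) of a small formal context.

    `context` maps each object (here: a tag) to a set of attributes
    (here: word-class properties). Naive closure, suitable for toy data only.
    """
    # The full attribute set is the intent of the bottom concept.
    all_attrs = frozenset().union(*context.values())
    intents = {all_attrs}
    # Close the set of intents under intersection with every object's attributes.
    for attrs in context.values():
        intents |= {intent & attrs for intent in intents}
    concepts = []
    for intent in intents:
        # The extent is every object that carries all attributes of the intent.
        extent = frozenset(obj for obj, attrs in context.items() if intent <= attrs)
        concepts.append((extent, intent))
    # Sort from general (few attributes) to specific (many attributes).
    return sorted(concepts, key=lambda c: (len(c[1]), sorted(c[1])))


# Hypothetical toy context; only the first entry follows the record's example.
TOY_CONTEXT = {
    "cs:relative_pronoun": frozenset({"infl:adjective", "syn:noun", "sem:pronoun"}),
    "xx:pronoun": frozenset({"sem:pronoun"}),
    "xx:noun": frozenset({"infl:noun", "syn:noun", "sem:noun"}),
    "xx:adjective": frozenset({"infl:adjective", "syn:adjective", "sem:adjective"}),
}

if __name__ == "__main__":
    for extent, intent in formal_concepts(TOY_CONTEXT):
        print(sorted(intent), "<-", sorted(extent))
```

In this toy setting, a concept whose intent is a single property (for instance `syn:noun`) groups tags from both tagsets that share that view of word class, which is the kind of common ground an underspecified query could target.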