About: Victor: the Web-Page Cleaning Tool     Goto   Sponge   NotDistinct   Permalink

An Entity of Type : http://linked.opendata.cz/ontology/domain/vavai/Vysledek, within Data Space : linked.opendata.cz associated with source document(s)

AttributesValues
rdf:type
Description
  • In this paper we present a complete solution for automatic cleaning of arbitrary HTML pages with a goal of using web data as a corpus in the area of natural language processing and computational linguistics. We employ a sequence-labeling approach based on Conditional Random Fields (CRF). Every block of text in analyzed web page is assigned a set of features extracted from the textual content and HTML structure of the page. The blocks are automatically labeled either as content segments containing main web page content, which should be preserved, or as noisy segments not suitable for further linguistic processing, which should be eliminated. Our solution is based on the tool introduced at the CLEANEVAL 2007 shared task workshop. In this paper, we present new CRF features, a handy annotation tool, and new evaluation metrics. Evaluation itself is performed on a random sample of web pages automatically downloaded from the Czech web domain.
  • In this paper we present a complete solution for automatic cleaning of arbitrary HTML pages with a goal of using web data as a corpus in the area of natural language processing and computational linguistics. We employ a sequence-labeling approach based on Conditional Random Fields (CRF). Every block of text in analyzed web page is assigned a set of features extracted from the textual content and HTML structure of the page. The blocks are automatically labeled either as content segments containing main web page content, which should be preserved, or as noisy segments not suitable for further linguistic processing, which should be eliminated. Our solution is based on the tool introduced at the CLEANEVAL 2007 shared task workshop. In this paper, we present new CRF features, a handy annotation tool, and new evaluation metrics. Evaluation itself is performed on a random sample of web pages automatically downloaded from the Czech web domain. (en)
Title
  • Victor: the Web-Page Cleaning Tool
  • Victor: the Web-Page Cleaning Tool (en)
skos:prefLabel
  • Victor: the Web-Page Cleaning Tool
  • Victor: the Web-Page Cleaning Tool (en)
skos:notation
  • RIV/00216208:11320/08:10077973!RIV11-GA0-11320___
http://linked.open...avai/riv/aktivita
http://linked.open...avai/riv/aktivity
  • P(GD201/05/H014), Z(MSM0021620838)
http://linked.open...vai/riv/dodaniDat
http://linked.open...aciTvurceVysledku
http://linked.open.../riv/druhVysledku
http://linked.open...iv/duvernostUdaju
http://linked.open...titaPredkladatele
http://linked.open...dnocenehoVysledku
  • 402655
http://linked.open...ai/riv/idVysledku
  • RIV/00216208:11320/08:10077973
http://linked.open...riv/jazykVysledku
http://linked.open.../riv/klicovaSlova
  • tool; cleaning; page; victor (en)
http://linked.open.../riv/klicoveSlovo
http://linked.open...ontrolniKodProRIV
  • [381F01CF2CA4]
http://linked.open...v/mistoKonaniAkce
  • Marrakech, Morocco
http://linked.open...i/riv/mistoVydani
  • Marrakech, Morocco
http://linked.open...i/riv/nazevZdroje
  • Proceedings of the 4th Web as Corpus Workshop
http://linked.open...in/vavai/riv/obor
http://linked.open...ichTvurcuVysledku
http://linked.open...cetTvurcuVysledku
http://linked.open...vavai/riv/projekt
http://linked.open...UplatneniVysledku
http://linked.open...iv/tvurceVysledku
  • Pecina, Pavel
  • Spousta, Miroslav
  • Marek, Michal
http://linked.open...vavai/riv/typAkce
http://linked.open.../riv/zahajeniAkce
http://linked.open...n/vavai/riv/zamer
number of pages
http://purl.org/ne...btex#hasPublisher
  • ACL SIGWAC
https://schema.org/isbn
  • 2-9517408-4-0
http://localhost/t...ganizacniJednotka
  • 11320
Faceted Search & Find service v1.16.118 as of Jun 21 2024


Alternative Linked Data Documents: ODE     Content Formats:   [cxml] [csv]     RDF   [text] [turtle] [ld+json] [rdf+json] [rdf+xml]     ODATA   [atom+xml] [odata+json]     Microdata   [microdata+json] [html]    About   
This material is Open Knowledge   W3C Semantic Web Technology [RDF Data] Valid XHTML + RDFa
OpenLink Virtuoso version 07.20.3240 as of Jun 21 2024, on Linux (x86_64-pc-linux-gnu), Single-Server Edition (126 GB total memory, 58 GB memory in use)
Data on this page belongs to its respective rights holders.
Virtuoso Faceted Browser Copyright © 2009-2024 OpenLink Software