About: Building a 70 billion word corpus of English from ClueWeb

An Entity of Type: http://linked.opendata.cz/ontology/domain/vavai/Vysledek, within Data Space: linked.opendata.cz, associated with source document(s)

Attributes    Values
rdf:type
rdfs:seeAlso
Description
  • This work describes the creation of a 70 billion word text corpus of English. We used an existing language resource, the ClueWeb09 dataset, as the source of the corpus data. Processing such a vast amount of data presented several challenges, mainly in the pre-processing steps (boilerplate cleaning, text de-duplication) and the post-processing steps (indexing for efficient corpus querying using the Corpus Query Language, CQL). In this paper we explain how we tackled them: we describe the tools used for boilerplate cleaning (jusText) and for de-duplication (onion), which was performed not only on full (document-level) duplicates but also on near-duplicate texts. Moreover, we show the impact of each pre-processing step on the final corpus size. (en)
Title
  • Building a 70 billion word corpus of English from ClueWeb (en)
skos:prefLabel
  • Building a 70 billion word corpus of English from ClueWeb (en)
skos:notation
  • RIV/00216224:14330/12:00057572!RIV13-GA0-14330___
http://linked.open...avai/riv/aktivita
http://linked.open...avai/riv/aktivity
  • P(GAP401/10/0792), P(LM2010013)
http://linked.open...vai/riv/dodaniDat
http://linked.open...aciTvurceVysledku
http://linked.open.../riv/druhVysledku
http://linked.open...iv/duvernostUdaju
http://linked.open...titaPredkladatele
http://linked.open...dnocenehoVysledku
  • 125661
http://linked.open...ai/riv/idVysledku
  • RIV/00216224:14330/12:00057572
http://linked.open...riv/jazykVysledku
http://linked.open.../riv/klicovaSlova
  • corpus; clueweb; English; encoding; word sketch (en)
http://linked.open.../riv/klicoveSlovo
http://linked.open...ontrolniKodProRIV
  • [8B31293499DE]
http://linked.open...v/mistoKonaniAkce
  • Istanbul, Turkey
http://linked.open...i/riv/mistoVydani
  • Istanbul, Turkey
http://linked.open...i/riv/nazevZdroje
  • Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
http://linked.open...in/vavai/riv/obor
http://linked.open...ichTvurcuVysledku
http://linked.open...cetTvurcuVysledku
http://linked.open...vavai/riv/projekt
http://linked.open...UplatneniVysledku
http://linked.open...iv/tvurceVysledku
  • Jakubíček, Miloš
  • Pomikálek, Jan
  • Rychlý, Pavel
http://linked.open...vavai/riv/typAkce
http://linked.open.../riv/zahajeniAkce
number of pages
http://purl.org/ne...btex#hasPublisher
  • European Language Resources Association (ELRA)
https://schema.org/isbn
  • 9782951740877
http://localhost/t...ganizacniJednotka
  • 14330