About: Building a 70 billion word corpus of English from ClueWeb


An Entity of Type : http://linked.opendata.cz/ontology/domain/vavai/Vysledek, within Data Space : linked.opendata.cz associated with source document(s)

Attributes	Values
rdf:type	skos:Concept http://linked.opendata.cz/ontology/domain/vavai/Vysledek
rdfs:seeAlso	http://nlp.fi.muni.cz/publications/lrec2012_xpomikal_pary_xjakub/lrec2012.pdf
Description	This work describes the process of creation of a 70 billion word text corpus of English. We used an existing language resource, namely the ClueWeb09 dataset, as source for the corpus data. Processing such a vast amount of data presented several challenges, mainly associated with pre-processing (boilerplate cleaning, text de-duplication) and post-processing (indexing for efficient corpus querying using the CQL – Corpus Query Language) steps. In this paper we explain how we tackled them: we describe the tools used for boilerplate cleaning (jusText) and for de-duplication (onion) that was performed not only on full (document-level) duplicates but also on the level of near-duplicate texts. Moreover we show the impact of each of the performed pre-processing steps on the final corpus size. (en)
Title	Building a 70 billion word corpus of English from ClueWeb (en)
skos:prefLabel	Building a 70 billion word corpus of English from ClueWeb (en)
skos:notation	RIV/00216224:14330/12:00057572!RIV13-GA0-14330___
http://linked.open...avai/riv/aktivita	P
http://linked.open...avai/riv/aktivity	P(GAP401/10/0792), P(LM2010013)
http://linked.open...vai/riv/dodaniDat	2013
http://linked.open...aciTvurceVysledku	Jakubíček, Miloš; Pomikálek, Jan; Rychlý, Pavel
http://linked.open.../riv/druhVysledku	D - Article in proceedings
http://linked.open...iv/duvernostUdaju	S - Complete and truthful data not subject to protection under special legal regulations
http://linked.open...titaPredkladatele	Masarykova univerzita / Fakulta informatiky
http://linked.open...dnocenehoVysledku	125661
http://linked.open...ai/riv/idVysledku	RIV/00216224:14330/12:00057572
http://linked.open...riv/jazykVysledku	eng - English
http://linked.open.../riv/klicovaSlova	corpus; clueweb; English; encoding; word sketch (en)
http://linked.open.../riv/klicoveSlovo	clueweb; encoding; word sketch; English; corpus
http://linked.open...ontrolniKodProRIV	[8B31293499DE]
http://linked.open...v/mistoKonaniAkce	Istanbul, Turkey
http://linked.open...i/riv/mistoVydani	Istanbul, Turkey
http://linked.open...i/riv/nazevZdroje	Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
http://linked.open...in/vavai/riv/obor	IN
http://linked.open...ichTvurcuVysledku	3 (xsd:int)
http://linked.open...cetTvurcuVysledku	3 (xsd:int)
http://linked.open...vavai/riv/projekt	Temporal aspects of knowledge and information; LINDAT-CLARIN: Institute for analysis, processing and distribution of linguistic data
http://linked.open...UplatneniVysledku	2012
http://linked.open...iv/tvurceVysledku	Jakubíček, Miloš; Pomikálek, Jan; Rychlý, Pavel
http://linked.open...vavai/riv/typAkce	WRD - Worldwide
http://linked.open.../riv/zahajeniAkce	2012-01-01 (xsd:date)
number of pages	5 (xsd:int)
http://purl.org/ne...btex#hasPublisher	European Language Resources Association (ELRA)
https://schema.org/isbn	9782951740877
http://localhost/t...ganizacniJednotka	14330
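
The description above mentions de-duplication at the level of near-duplicate texts. As a rough illustration of that general idea only, the following sketch drops a paragraph when most of its word n-grams have already been seen earlier in the stream. It is not the onion tool's implementation: the function names `shingles` and `deduplicate`, the n-gram length of 5, and the 0.5 threshold are all assumptions chosen for the example.

```python
from typing import Iterable, List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, n: int = 5) -> Set[Shingle]:
    """Return the set of lowercased word n-grams (shingles) in `text`."""
    words = text.lower().split()
    if len(words) < n:
        # Texts shorter than n words yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def deduplicate(paragraphs: Iterable[str], n: int = 5,
                threshold: float = 0.5) -> List[str]:
    """Keep a paragraph unless more than `threshold` of its shingles
    already occurred in earlier paragraphs (hypothetical parameters)."""
    seen: Set[Shingle] = set()
    kept: List[str] = []
    for paragraph in paragraphs:
        grams = shingles(paragraph, n)
        if not grams:
            continue  # skip empty paragraphs
        seen_ratio = len(grams & seen) / len(grams)
        if seen_ratio <= threshold:
            kept.append(paragraph)
        # Record shingles of every paragraph, kept or dropped.
        seen |= grams
    return kept
```

Under these assumptions an exact repeat has a seen ratio of 1.0 and is dropped, and a paragraph that differs from an earlier one only in its last few words is dropped as well, while unrelated text is kept.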
