Attributes | Values |
---|
rdf:type
| |
rdfs:seeAlso
| |
Description
| - This work describes the process of creation of a 70 billion word text corpus of English. We used an existing language resource, namely the ClueWeb09 dataset, as source for the corpus data. Processing such a vast amount of data presented several challenges, mainly associated with pre-processing (boilerplate cleaning, text de-duplication) and post-processing (indexing for efficient corpus querying using the CQL – Corpus Query Language) steps. In this paper we explain how we tackled them: we describe the tools used for boilerplate cleaning (jusText) and for de-duplication (onion) that was performed not only on full (document-level) duplicates but also on the level of near-duplicate texts. Moreover we show the impact of each of the performed pre-processing steps on the final corpus size.
- This work describes the process of creation of a 70 billion word text corpus of English. We used an existing language resource, namely the ClueWeb09 dataset, as source for the corpus data. Processing such a vast amount of data presented several challenges, mainly associated with pre-processing (boilerplate cleaning, text de-duplication) and post-processing (indexing for efficient corpus querying using the CQL – Corpus Query Language) steps. In this paper we explain how we tackled them: we describe the tools used for boilerplate cleaning (jusText) and for de-duplication (onion) that was performed not only on full (document-level) duplicates but also on the level of near-duplicate texts. Moreover we show the impact of each of the performed pre-processing steps on the final corpus size. (en)
|
Title
| - Building a 70 billion word corpus of English from ClueWeb
- Building a 70 billion word corpus of English from ClueWeb (en)
|
skos:prefLabel
| - Building a 70 billion word corpus of English from ClueWeb
- Building a 70 billion word corpus of English from ClueWeb (en)
|
skos:notation
| - RIV/00216224:14330/12:00057572!RIV13-GA0-14330___
|
http://linked.open...avai/riv/aktivita
| |
http://linked.open...avai/riv/aktivity
| - P(GAP401/10/0792), P(LM2010013)
|
http://linked.open...vai/riv/dodaniDat
| |
http://linked.open...aciTvurceVysledku
| |
http://linked.open.../riv/druhVysledku
| |
http://linked.open...iv/duvernostUdaju
| |
http://linked.open...titaPredkladatele
| |
http://linked.open...dnocenehoVysledku
| |
http://linked.open...ai/riv/idVysledku
| - RIV/00216224:14330/12:00057572
|
http://linked.open...riv/jazykVysledku
| |
http://linked.open.../riv/klicovaSlova
| - corpus; clueweb; English; encoding; word sketch (en)
|
http://linked.open.../riv/klicoveSlovo
| |
http://linked.open...ontrolniKodProRIV
| |
http://linked.open...v/mistoKonaniAkce
| |
http://linked.open...i/riv/mistoVydani
| |
http://linked.open...i/riv/nazevZdroje
| - Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)
|
http://linked.open...in/vavai/riv/obor
| |
http://linked.open...ichTvurcuVysledku
| |
http://linked.open...cetTvurcuVysledku
| |
http://linked.open...vavai/riv/projekt
| |
http://linked.open...UplatneniVysledku
| |
http://linked.open...iv/tvurceVysledku
| - Jakubíček, Miloš
- Pomikálek, Jan
- Rychlý, Pavel
|
http://linked.open...vavai/riv/typAkce
| |
http://linked.open.../riv/zahajeniAkce
| |
number of pages
| |
http://purl.org/ne...btex#hasPublisher
| - European Language Resources Association (ELRA)
|
https://schema.org/isbn
| |
http://localhost/t...ganizacniJednotka
| |