Building and Sharing Corpus Resources Together: Progress on the Construction of the CROWN/CLOB Corpora

Xu Jiajin, Beijing Foreign Studies University

Abstract: Our collection of English texts aims at a balanced corpus, in a modest sense of the word, of contemporary written English, modelled on the sampling strategies of the Brown family corpora. The name CROWN fuses the initial letter of "China" with the latter part of "Brown", and stands for a new Brown family corpus created by Chinese scholars. We hope to follow as closely as possible the sampling frame that Kučera and Francis devised for the Brown corpus in the 1960s. The CROWN corpus is intended, on the one hand, to serve as a reference corpus for contrastive research of various kinds and, on the other, to match its Brown family predecessors (such as Brown, LOB, Frown, FLOB, BLOB and BE06) in size and composition.

 

Specifications of data collection:

The first standard release of the CROWN corpus will contain two million words, covering 15 categories of texts (see Table 1) published in 2009 or within one year before or after it. The two-million-word corpus will consist of an American English component, CROWN-A (a.k.a. CROWN), and a British English component, CROWN-B (a.k.a. CLOB).

 

 

Table 1. Text categories and number of texts

Code  Text category                             No. of texts
A     Press: Reportage                          44
B     Press: Editorial                          27
C     Press: Reviews                            17
D     Religion                                  17
E     Skills and hobbies                        36
F     Popular lore                              48
G     Belles-lettres                            75
H     Miscellaneous: Government & house organs  30
J     Learned                                   80
K     Fiction: General                          29
L     Fiction: Mystery                          24
M     Fiction: Science                          6
N     Fiction: Adventure                        29
P     Fiction: Romance                          29
R     Humour                                    9
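
As a quick arithmetic check on Table 1, the category counts can be tallied in a few lines of Python. This is only an illustrative sketch: the dictionary transcribes the table, the variable names are made up for the example, and the figure of 2000 words per text is the minimum sample length given in the collection guidelines, so the product is a rough lower bound rather than an exact corpus size.

```python
# Number of texts per category, transcribed from Table 1.
TEXTS_PER_CATEGORY = {
    "A": 44, "B": 27, "C": 17, "D": 17, "E": 36,
    "F": 48, "G": 75, "H": 30, "J": 80, "K": 29,
    "L": 24, "M": 6, "N": 29, "P": 29, "R": 9,
}

# Minimum length of one text sample, as stated in the guidelines.
WORDS_PER_TEXT = 2000

total_texts = sum(TEXTS_PER_CATEGORY.values())
approx_words = total_texts * WORDS_PER_TEXT

print(len(TEXTS_PER_CATEGORY))  # 15 categories
print(total_texts)              # 500 texts per component
print(approx_words)             # 1000000 words per component
```

Two components of about one million words each give the two million words planned for the first release.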

 

•           Sources of publication: Each contributor is expected to submit at least three texts published in the U.S. and three texts published in the U.K.

•           Nationality of authors: Only writings by U.S. or U.K. citizens or permanent residents are selected. The URLs or the publishers, and the author profile pages, should be recorded. In the case of multiple authors, the primary (first) author should hold U.S. or U.K. citizenship or permanent residence.

•           Length of texts: Each text should be no shorter than 2000 words, according to the word count in MS Word. For some text types, a single article is unlikely to run to 2000 words. In that case, please put together two or more articles of the same nature and on similar topics. If a sample is to be taken from a very long text, say, a novel, please take approximately 700 words each from the beginning, the middle and the end of the book.

•           Advertisements, tables, figures, sidebar links, etc. are deleted.

•           Format of texts: Plain text files named A01A.txt, A01B.txt, A02A.txt, A02B.txt, B01A.txt, B01B.txt, etc. (the fourth character is A for American or B for British). Short texts, which are especially common in news reports, are saved separately, so there are 1500 texts instead of 500. The advantage of saving short texts separately is that each text stands on its own. Related texts can easily be merged into a single 2000-word sample if needed. However, once short texts have been put together in a single 2000-word file without explicit section markers, it is almost impossible to recover them as the individual texts they originally were.

•           Reprinted works are not accepted. A work must have been first published in 2009 (± 1 year). For example, reprints of the Sherlock Holmes detective stories should not be considered.

•           Metadata must be encoded when you submit your texts. Please use the attached Metadata Encoder to add descriptive information to the texts.
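
The sampling and merging guidelines above can be sketched in Python as follows. This is a minimal illustration rather than the project's actual tooling: the function names and the section marker string are hypothetical, and words are counted by splitting on whitespace, which will not always agree with the MS Word count used in the guidelines.

```python
from typing import List

# Hypothetical marker; the guidelines do not prescribe a marker format.
SECTION_MARKER = "\n\n<<<TEXT BREAK>>>\n\n"


def sample_long_text(text: str, chunk_words: int = 700) -> str:
    """Take about `chunk_words` words each from the beginning, middle and end."""
    words = text.split()
    n = len(words)
    if n <= 3 * chunk_words:
        # Too short to sample three separate excerpts: keep the whole text.
        return " ".join(words)
    mid_start = (n - chunk_words) // 2
    excerpts = [
        words[:chunk_words],                       # beginning
        words[mid_start:mid_start + chunk_words],  # middle
        words[n - chunk_words:],                   # end
    ]
    return "\n\n".join(" ".join(excerpt) for excerpt in excerpts)


def merge_short_texts(texts: List[str]) -> str:
    """Join short texts into one sample, keeping an explicit marker between them."""
    return SECTION_MARKER.join(t.strip() for t in texts)


def split_merged_text(merged: str) -> List[str]:
    """Recover the individual texts from a sample built by merge_short_texts."""
    return merged.split(SECTION_MARKER)
```

With an explicit marker the merge is reversible, which is the point made above about keeping short texts recoverable; without one, the boundaries between the original texts are lost.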

Metadata template:

Field             Example
Author            John Smith
Country           UK/US
Publication year  2009
Publisher         New York Times
URL(s)            http://
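
For contributors who want to check a submission before sending it, the file naming scheme and the metadata fields could be validated roughly as follows. This is a sketch under stated assumptions, not a description of the Metadata Encoder: the class and function names are hypothetical, the pattern assumes two-digit text numbers as in the example file names, and matching the Country field against the component letter in the file name is an assumption made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple

# Category letter, two-digit text number, component letter (A = American, B = British).
FILENAME_PATTERN = re.compile(r"([A-HJ-NPR])(\d{2})([AB])\.txt")
COMPONENT_TO_COUNTRY = {"A": "US", "B": "UK"}  # assumed mapping


@dataclass
class TextMetadata:
    author: str
    country: str            # "US" or "UK"
    publication_year: int
    publisher: str
    urls: List[str] = field(default_factory=list)


def parse_filename(filename: str) -> Tuple[str, int, str]:
    """Split a name such as 'A01A.txt' into (category, number, component)."""
    match = FILENAME_PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename!r}")
    category, number, component = match.groups()
    return category, int(number), component


def check_submission(filename: str, meta: TextMetadata) -> List[str]:
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = []
    _, _, component = parse_filename(filename)
    if meta.country != COMPONENT_TO_COUNTRY[component]:
        problems.append("country does not match the component letter in the file name")
    if not 2008 <= meta.publication_year <= 2010:
        problems.append("publication year is outside 2009 (± 1 year)")
    if not meta.author:
        problems.append("author is missing")
    if not meta.publisher and not meta.urls:
        problems.append("neither a publisher nor a URL is recorded")
    return problems


meta = TextMetadata("John Smith", "US", 2009, "New York Times")
print(parse_filename("A01A.txt"))          # ('A', 1, 'A')
print(check_submission("A01A.txt", meta))  # []
```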

 

Copyright and dissemination:

We intend to distribute the texts under GNU-style terms, provided that they are not used for commercial purposes in any manner. Everyone who makes a positive contribution to the data collection will receive a copy of the CROWN and CLOB corpora.

 

Please cite the corpora as:

Xu, Jiajin & Maocheng Liang. 2011. CROWN Corpus. National Research Centre for Foreign Language Education, Beijing Foreign Studies University.

Xu, Jiajin & Maocheng Liang. 2011. CLOB Corpus. National Research Centre for Foreign Language Education, Beijing Foreign Studies University.

 

Preliminary statistics of CROWN and CLOB

 

Corpus sizes of three generations of Brown family corpora (in tokens, by register)

Corpus  Year  Fiction  General  Academic  News    Total
Brown   1961  259467   423160   163309    181085  1027021
LOB     1961  258722   418137   162322    179604  1018785
Frown   1992  260414   421933   163228    181748  1027323
FLOB    1991  260664   419990   163286    180703  1024643
CROWN   2009  259250   422799   163197    180980  1026226
CLOB    2009  259484   421163   163139    179680  1023466

Token definition: [a-zA-Z0-9-]+
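
To make the token definition concrete, the following sketch applies the same regular expression with Python's `re` module. The function name and the sample sentence are invented for illustration, and the script used to produce the figures above is not shown in this document, so this only demonstrates how the pattern behaves.

```python
import re

# The token definition given above: runs of letters, digits and hyphens.
TOKEN_PATTERN = re.compile(r"[a-zA-Z0-9-]+")


def count_tokens(text: str) -> int:
    """Count the tokens in `text` under the definition above."""
    return len(TOKEN_PATTERN.findall(text))


sample = "The well-known corpus wasn't built in 1 day."
print(TOKEN_PATTERN.findall(sample))
# ['The', 'well-known', 'corpus', 'wasn', 't', 'built', 'in', '1', 'day']
print(count_tokens(sample))  # 9
```

Under this definition a hyphenated word counts as one token, while a contraction such as "wasn't" is split at the apostrophe into two tokens, which is worth bearing in mind when comparing these counts with figures from other tools.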