Xu Jiajin, Beijing Foreign Studies University
Abstract: This collection of English texts aims to be a balanced corpus, in a modest sense, of contemporary written English, modelled on the sampling strategies of the Brown family corpora. The name CROWN fuses the initial of China with the latter part of Brown, signalling a new Brown family corpus created by Chinese scholars. We aim to follow, as closely as possible, the sampling frame that Kučera and Francis established for the Brown corpus in the 1960s. On the one hand, the CROWN corpus is intended to serve as a reference corpus for contrastive research of various kinds; on the other, it matches its Brown family predecessors (such as Brown, LOB, Frown, FLOB, BLOB and BE06) in size and composition.
Specifications of data collection:
The first standard release of the CROWN corpus will contain two million words, covering 15 categories of texts (see Table 1) published in 2009 (± 1 year). It will consist of an American English component called CROWN-A (a.k.a. CROWN) and a British English component called CROWN-B (a.k.a. CLOB).
| Code | Text categories | No. of texts |
|---|---|---|
| A | Press: Reportage | 44 |
| B | Press: Editorial | 27 |
| C | Press: Reviews | 17 |
| D | Religion | 17 |
| E | Skill and hobbies | 36 |
| F | Popular lore | 48 |
| G | Belles-lettres | 75 |
| H | Miscellaneous: Government & house organs | 30 |
| J | Learned | 80 |
| K | Fiction: General | 29 |
| L | Fiction: Mystery | 24 |
| M | Fiction: Science | 6 |
| N | Fiction: Adventure | 29 |
| P | Fiction: Romance | 29 |
| R | Humour | 9 |
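As a quick sanity check on the sampling frame, the per-category counts in Table 1 can be tallied. The sketch below simply copies the counts into a dictionary; the names `SAMPLING_FRAME`, `total_texts` and `total_words` are illustrative, and the 2000-word figure is the minimum text length stated in the collection guidelines, so the word total is only a nominal estimate.

```python
# Number of texts per category, copied from Table 1.
SAMPLING_FRAME = {
    "A": 44, "B": 27, "C": 17, "D": 17, "E": 36,
    "F": 48, "G": 75, "H": 30, "J": 80, "K": 29,
    "L": 24, "M": 6, "N": 29, "P": 29, "R": 9,
}


def total_texts(frame):
    """Return the number of texts across all categories."""
    return sum(frame.values())


def total_words(frame, words_per_text=2000):
    """Return the nominal size of one component, assuming 2000 words per text."""
    return total_texts(frame) * words_per_text


print(total_texts(SAMPLING_FRAME))  # 500
print(total_words(SAMPLING_FRAME))  # 1000000
```

Under these assumptions each component comes to roughly one million words, which is consistent with a two-million-word release made up of an American and a British component.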
- Sources of publication: Every contributor is expected to submit at least 3 U.S.-based texts and 3 U.K.-based texts.
- Nationality of authors: Only writings by U.S. or U.K. citizens or permanent residents are selected. The URLs or publishers, and the author profile pages, should be recorded. In the case of multiple authors, the primary/first author should hold U.S. or U.K. citizenship or permanent residence.
- Length of texts: Each text should be no shorter than 2000 words, as counted by MS Word. For some text types, a single article is unlikely to reach 2000 words; in that case, please combine two or more articles of the same nature and on similar topics. If a sample is taken from a very long text, such as a novel, please extract approximately 700 words each from the beginning, the middle and the end of the book.
- Advertisements, tables, figures, sidebar links, etc. should be deleted.
- Format of texts: Plain text files named A01A.txt, A01B.txt, A02A.txt, A02B.txt, B01A.txt, B01B.txt, etc. (the fourth character is A for American and B for British). Short texts, which are especially common in news reports, are saved as separate files, so there are 1500 texts instead of 500. The advantage of saving short texts separately is that each text stands on its own. It is easy to merge related texts into one 2000-word file if needed. However, once short texts have been put together in a single 2000-word file without explicit section markers, it is almost impossible to restore them as the individual texts they originally were.
- Reprinted works are not accepted. A work must have been first published in 2009 (± 1 year); for example, reprinted Sherlock Holmes detective novels should not be considered.
- Metadata must be encoded when you submit your texts. Please use the attached Metadata Encoder to add descriptive information to the texts.
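Two of the guidelines above lend themselves to a small illustration: the file naming scheme and the three-part sampling of long texts. The following is a minimal sketch, not the project's actual tooling. The helper names `parse_filename` and `sample_long_text` are hypothetical, the file name pattern is inferred only from the examples given (A01A.txt etc.), and words are counted by splitting on whitespace, which only approximates the MS Word count.

```python
import re

# Category letter, two-digit text number, then A (American) or B (British).
# The pattern is inferred from the example names and is an assumption.
FILENAME_RE = re.compile(r"^([A-HJ-NPR])(\d{2})([AB])\.txt$")
VARIETIES = {"A": "American", "B": "British"}


def parse_filename(name):
    """Split a name such as 'A01B.txt' into (category, number, variety)."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    category, number, variety = match.groups()
    return category, int(number), VARIETIES[variety]


def sample_long_text(text, chunk_words=700):
    """Take about chunk_words words from the start, middle and end of text.

    Texts too short to yield three non-overlapping chunks are returned
    whole, as a single-item list.
    """
    words = text.split()
    n = len(words)
    if n <= 3 * chunk_words:
        return [" ".join(words)]
    mid_start = (n - chunk_words) // 2
    return [
        " ".join(words[:chunk_words]),
        " ".join(words[mid_start:mid_start + chunk_words]),
        " ".join(words[-chunk_words:]),
    ]


print(parse_filename("A01B.txt"))  # ('A', 1, 'British')
```

Returning the three chunks as separate strings, rather than one merged string, follows the same reasoning as saving short texts separately: merging later is trivial, while splitting a merged sample is not.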
Metadata template:
| Field | Value |
|---|---|
| Author | e.g. John Smith |
| Country | e.g. UK/US |
| Publication year | e.g. 2009 |
| Publisher | e.g. New York Times |
| URL(s) | e.g. http:// |
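The template fields can also be represented as a small record with basic checks against the collection rules (U.S. or U.K. authors, publication in 2009 ± 1 year, and a recorded publisher or URL). This is a sketch only and does not reproduce the Metadata Encoder mentioned above; the class `TextMetadata`, its `problems` method and the constant names are hypothetical.

```python
from dataclasses import dataclass, field

ALLOWED_COUNTRIES = {"UK", "US"}
ALLOWED_YEARS = {2008, 2009, 2010}  # 2009 plus or minus one year


@dataclass
class TextMetadata:
    """One metadata record, with the five fields of the template."""
    author: str
    country: str
    publication_year: int
    publisher: str
    urls: list = field(default_factory=list)

    def problems(self):
        """Return a list of human-readable issues; empty if none are found."""
        issues = []
        if not self.author.strip():
            issues.append("author is empty")
        if self.country not in ALLOWED_COUNTRIES:
            issues.append(f"country must be UK or US, got {self.country!r}")
        if self.publication_year not in ALLOWED_YEARS:
            issues.append(f"year {self.publication_year} is outside 2008-2010")
        if not self.publisher.strip() and not self.urls:
            issues.append("record a publisher or at least one URL")
        return issues


record = TextMetadata("John Smith", "UK", 2009, "New York Times")
print(record.problems())  # []
```

Collecting all issues in a list, instead of raising on the first one, lets a contributor fix a record in a single pass.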
Copyright and dissemination:
We intend to distribute the texts under GNU-style terms, provided that they are not used for commercial purposes in any manner. Everyone who makes a positive contribution to the data collection will receive a copy of the CROWN and CLOB corpora.
Please cite the corpora as:
Xu, Jiajin & Maocheng Liang. 2011. CROWN Corpus. National Research Centre for Foreign Language Education, Beijing Foreign Studies University.
Xu, Jiajin & Maocheng Liang. 2011. CLOB Corpus. National Research Centre for Foreign Language Education, Beijing Foreign Studies University.
Preliminary statistics of CROWN and CLOB
Size of the three generations of Brown family corpora

| Corpus | Genre | Subcorpus size | Total size | Corpus | Genre | Subcorpus size | Total size |
|---|---|---|---|---|---|---|---|
| Brown 1961 | Fiction | 259467 | 1027021 | LOB 1961 | Fiction | 258722 | 1018785 |
| | General | 423160 | | | General | 418137 | |
| | Academic | 163309 | | | Academic | 162322 | |
| | News | 181085 | | | News | 179604 | |
| Frown 1992 | Fiction | 260414 | 1027323 | FLOB 1991 | Fiction | 260664 | 1024643 |
| | General | 421933 | | | General | 419990 | |
| | Academic | 163228 | | | Academic | 163286 | |
| | News | 181748 | | | News | 180703 | |
| Crown 2009 | Fiction | 259250 | 1026226 | CLOB 2009 | Fiction | 259484 | 1023466 |
| | General | 422799 | | | General | 421163 | |
| | Academic | 163197 | | | Academic | 163139 | |
| | News | 180980 | | | News | 179680 | |
Token definition: [a-zA-Z0-9-]+
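To make the token definition concrete, the pattern can be applied with a regular expression search. The sketch below assumes the counts were obtained by matching the pattern over the raw text; the function names `tokenize` and `count_tokens` and the sample sentence are illustrative only.

```python
import re

# Token pattern as given above: runs of ASCII letters, digits and hyphens.
TOKEN_RE = re.compile(r"[a-zA-Z0-9-]+")


def tokenize(text):
    """Return all substrings of text that match the token pattern."""
    return TOKEN_RE.findall(text)


def count_tokens(text):
    """Return the number of tokens in text under this definition."""
    return len(tokenize(text))


sample = "A well-known corpus, built in 2009, isn't large."
print(tokenize(sample))
# ['A', 'well-known', 'corpus', 'built', 'in', '2009', 'isn', 't', 'large']
print(count_tokens(sample))  # 9
```

Under this definition hyphenated words count as one token, contractions such as "isn't" split into two tokens at the apostrophe, and a free-standing hyphen counts as a token in its own right, so the figures may differ from counts produced by other tokenizers.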