PanLex Swadesh Corpus
January 2017 version
Jonathan Pool, editor

URL: http://dev.panlex.org/db/panLex_swadesh.zip
PanLex documentation: http://dev.panlex.org/snapshots/
NLTK documentation: http://www.nltk.org/book/ch02.html#comparative-wordlists

Published by:
PanLex, a project of The Long Now Foundation
Berkeley, California, U.S.A.
http://panlex.org

Summary
=======

This corpus, derived from the PanLex database (http://panlex.org), contains multiple files, in two subdirectories. The swadesh110 directory contains about 2,000 files, and the swadesh207 directory contains about 800 files. Each file in a directory contains words and phrases in one language variety, expressing various concepts. There are 110 concepts expressed in the swadesh110 files, and 207 concepts in the swadesh207 files.

The concepts in the swadesh110 directory are those in the Swadesh-Yakhontov 110 concepticon (http://starling.rinet.ru/new100/Swadesh.htm) and are identified in this corpus with numbers 001 to 110. The concepts in the swadesh207 directory are those in the Swadesh 207 concepticon (https://en.wiktionary.org/wiki/Appendix:Swadesh_lists) and are identified in this corpus with numbers 001 to 207. These two concepticons are among several commonly referred to as “Swadesh lists”. A “concepticon” is a list of concepts, each with an identifier. We may call the concepticon that generated the files in one of these directories its “base concepticon”.

Each file contains 1 line per concept. The expressions on the Nth line express the Nth concept. If multiple expressions express the same concept, the synonymous expressions are delimited with tabs. If no expression expressing the Nth concept is known, the Nth line is empty.
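
The line layout described above can be read with a few lines of Python. This is only a minimal sketch, not part of the corpus tooling: the function name parse_swadesh_lines and the three sample lines are invented for illustration and are not taken from any corpus file.

```python
def parse_swadesh_lines(lines):
    """Return a list whose item N-1 holds the expressions for concept N."""
    concepts = []
    for line in lines:
        line = line.rstrip("\r\n")
        # An empty line means no expression is known for this concept;
        # otherwise synonymous expressions are separated by tabs.
        concepts.append(line.split("\t") if line else [])
    return concepts


# Invented sample: concept 1 has one expression, concept 2 has two
# synonyms, and concept 3 has none.
sample = ["I\n", "you\tthou\n", "\n"]
parsed = parse_swadesh_lines(sample)
```

Keeping an empty list for an empty line preserves the correspondence between list position and concept number.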

The base of the name of each file is the PanLex uniform identifier (“UID”) of the language variety of the expressions in the file. Information about each of the language varieties is in the langs110.txt or langs207.txt file. Each of these files provides, for each language variety, the following facts:
(1) PanLex UID
(2) ISO 639 language code (see http://www-01.sil.org/iso639-3/default.asp)
(3) ISO 639 language type
    i = 639-3 individual language
    m = 639-3 macrolanguage
    c = 639-2 collective language
    f = 639-5 language family or group
    o = other
(4) normal scripts of expressions
(5) PanLex default name
(6) UID of the language variety in which the default name is an expression
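
The six facts above could be held in a small record type, as in the sketch below. It rests on an assumption that this document does not confirm: that each line of langs110.txt or langs207.txt holds the six facts in the listed order, separated by tabs. The names LanguageVariety and parse_langs_line and the sample line are hypothetical.

```python
from typing import NamedTuple

# Language type codes, as listed under fact (3).
ISO_TYPES = {
    "i": "639-3 individual language",
    "m": "639-3 macrolanguage",
    "c": "639-2 collective language",
    "f": "639-5 language family or group",
    "o": "other",
}


class LanguageVariety(NamedTuple):
    uid: str           # (1) PanLex UID
    iso_code: str      # (2) ISO 639 language code
    iso_type: str      # (3) ISO 639 language type
    scripts: str       # (4) normal scripts of expressions
    default_name: str  # (5) PanLex default name
    name_uid: str      # (6) UID of the variety the default name is in


def parse_langs_line(line):
    # ASSUMPTION: six tab-separated fields in the documented order.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    return LanguageVariety(*fields)


# Hypothetical sample line, not copied from the corpus.
record = parse_langs_line("eng-000\teng\ti\tLatn\tEnglish\teng-000\n")
type_label = ISO_TYPES[record.iso_type]
named_in_itself = record.uid == record.name_uid
```

If the real files use a different delimiter or field order, only parse_langs_line needs to change.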

Example
=======

The Swadesh-Yakhontov 110 concepticon includes a concept identified with the number 060. The PanLex database tells us that the English expressions “night” and “nighttime”, the Russian expression “ночь”, the Ghanongga expression “bongi”, and other expressions in various language varieties express that concept. The 60th line of each file (eng-000.txt, rus-000.txt, xpm-000.txt, etc.) in the swadesh110 directory contains 0 or more expressions expressing concept 060.
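
The lookup in this example can be sketched as follows. The function name expressions_for is hypothetical, UTF-8 encoding is an assumption, and the file written here is a toy stand-in built in a temporary directory (59 empty lines followed by the two English expressions cited above), not the real eng-000.txt.

```python
import tempfile
from pathlib import Path


def expressions_for(path, concept_id):
    """Return the expressions on the line numbered by concept_id, e.g. "060"."""
    # ASSUMPTION: the files are UTF-8 encoded.
    lines = Path(path).read_text(encoding="utf-8").split("\n")
    index = int(concept_id) - 1  # concept 060 sits on the 60th line
    if index < 0 or index >= len(lines):
        return []
    line = lines[index]
    return line.split("\t") if line else []


# Toy stand-in for eng-000.txt: lines 1-59 are empty, line 60 is filled.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "eng-000.txt"
    demo.write_text("\n" * 59 + "night\tnighttime\n", encoding="utf-8")
    night = expressions_for(demo, "060")
    missing = expressions_for(demo, "001")
```

Against the real corpus, the same function would be called with a path such as swadesh110/eng-000.txt.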

Details
=======

The PanLex database documents translations among “expressions” in “language varieties”. We identify the language varieties with, among other things, a “default name”, which is usually an autoglossonym (a name in the language variety itself). When it is, facts 1 and 6 in the langs110.txt or langs207.txt file are the same. Each documented translation is attributed to one or more “sources”. A more detailed description of the design of the database is available at http://dev.panlex.org/db-design/.

The PanLex database treats concepticons as artificial language varieties, with the concept identifiers as their expressions. For example, the Swadesh-Yakhontov 110 concepticon is a language variety containing 110 expressions, from “001” through “110”.

One of the sources of PanLex data, labeled “art:Colowick”, is Susan M. Colowick and Jonathan Pool, “A Union of Concepticons” (2016). This source translates concepticon expressions into one another. For example, it translates “060” in Swadesh-Yakhontov 110 into “033” in ALCAM 120. By mapping concepticons to one another, it permits equivalence to be inferred between expressions that have been documented as expressing the concepts of different concepticons. For example, Swadesh-Yakhontov 110’s expression “060” has been translated into “pˈɨste” in Jicaque, and ALCAM 120’s expression “033” has been translated into “òtú” in Kendem; because art:Colowick has translated these two concepticon expressions into one another, we can infer that these two natural-language expressions express the same concept.

The concepticon language varieties intertranslated by art:Colowick (shown with their UIDs and names) include:

art-000	PanLem
art-012	Swadesh 207
art-245	Swadesh 100
art-257	LWT Code
art-260	Swadesh 200
art-261	SILCAWL
art-266	Swadesh-Gudschinsky 200
art-267	ALCAM 120
art-268	ABVD 210
art-269	ℤ
art-270	LEGO Concepticon
art-273	PanLex Union Concepticon
art-277	Swadesh-Yakhontov 110

Let C be a concept, and let E1 be that concept’s expression in some concepticon. In addition to E1, we consider any other expression E2 to express C if either of the following conditions is true:

(1) E2 is a translation of E1 attested by at least 1 source.

(2) For some expression EC in some concepticon, EC is a translation of E1 attested by “art:Colowick”, and E2 is a translation of EC attested by at least 1 source.
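
The two conditions can be illustrated on a toy set of translations. This is a sketch under stated assumptions, not the procedure actually used to build the corpus. Translations are treated as symmetric, expressions are modeled as (language variety, text) pairs, the labels “Jicaque” and “Kendem” stand in for those varieties' real UIDs, and the two “src:hypothetical” source labels are invented. Only “art:Colowick” and the concepticon UIDs come from this document.

```python
from collections import defaultdict

COLOWICK = "art:Colowick"
CONCEPTICONS = {"art-277", "art-267"}  # Swadesh-Yakhontov 110, ALCAM 120

# Each entry: (expression, expression, attesting source).
TRANSLATIONS = [
    (("art-277", "060"), ("art-267", "033"), COLOWICK),
    (("art-277", "060"), ("Jicaque", "pˈɨste"), "src:hypothetical-1"),
    (("art-267", "033"), ("Kendem", "òtú"), "src:hypothetical-2"),
]


def build_index(translations):
    """Map each expression to its translations and their attesting sources."""
    index = defaultdict(lambda: defaultdict(set))
    for a, b, source in translations:
        # ASSUMPTION: a translation holds in both directions.
        index[a][b].add(source)
        index[b][a].add(source)
    return index


def expressions_of(e1, index, concepticons):
    """Expressions taken to express the same concept as e1."""
    direct = dict(index[e1])
    # Condition (1): any translation of E1 attested by at least 1 source.
    found = set(direct)
    # Condition (2): translations of a concepticon expression EC that
    # art:Colowick attests as a translation of E1.
    for ec, sources in direct.items():
        if ec[0] in concepticons and COLOWICK in sources:
            found.update(index[ec])
    found.discard(e1)
    return found


index = build_index(TRANSLATIONS)
result = expressions_of(("art-277", "060"), index, CONCEPTICONS)
```

Here the Jicaque expression is found through condition (1) and the Kendem expression through condition (2), matching the inference described for art:Colowick above.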

The PanLex database contains expressions in about 11,000 language varieties. For each of those language varieties, and for each base concepticon, one can determine, for any concept C in the concepticon, whether at least one expression in the language variety satisfies either of the above conditions. If this is the case for at least 75% of the concepts in the concepticon, then we include a list for that language variety in the concepticon’s directory. This is why there are only about 2,000 lists in the swadesh110 directory and about 800 in the swadesh207 directory, rather than about 11,000 in each. Since concepticons, too, are language varieties, there are some lists for concepticons. The data for additional language varieties (those with fewer expressed concepts) are available from the PanLex database via its API (http://dev.panlex.org/api/) or its table snapshots (http://dev.panlex.org/snapshots/). If you wish to know which sources attest the translations, how redundantly attested or credible the translations are, or various other facts about these data, you can likewise consult the database.
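
The 75% inclusion rule can be expressed as a short check. The sketch below assumes a list in the shape of one parsed file, with one list of expressions per concept and an empty list where none is known. The function names are hypothetical, and integer arithmetic is used to avoid rounding issues.

```python
def coverage_qualifies(concepts, percent=75):
    """True if at least `percent` percent of the concepts are expressed."""
    total = len(concepts)
    expressed = sum(1 for expressions in concepts if expressions)
    return total > 0 and expressed * 100 >= percent * total


def minimum_expressed(total, percent=75):
    """Smallest number of expressed concepts that meets the threshold."""
    return -(-percent * total // 100)  # ceiling division on integers


need_110 = minimum_expressed(110)
need_207 = minimum_expressed(207)

# Toy lists: 83 and 82 expressed concepts out of 110.
passing = [["x"]] * 83 + [[]] * 27
failing = [["x"]] * 82 + [[]] * 28
passes = coverage_qualifies(passing)
fails = coverage_qualifies(failing)
```

Under this reading of the rule, a language variety needs at least 83 of the 110 concepts, or at least 156 of the 207 concepts, to be included.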

The PanLex database is undergoing expansion and quality control, so it is to be expected that future versions of these corpora will contain increasingly many lists and some corrections of errors.

Permissions
===========

This corpus is released under CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/legalcode).

Users of this corpus are invited to cite the following work:

David Kamholz, Jonathan Pool, and Susan M. Colowick (2014). PanLex: Building a Resource for Panlingual Lexical Translation. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC), pp. 3145–50. http://www.lrec-conf.org/proceedings/lrec2014/pdf/1029_Paper.pdf.

