Catafolk is a project that indexes cross-cultural music datasets and serves metadata about these collections in a consistent, easy-to-use format. We hope this project will help diversify computational music research and contribute to the creation of large, cross-cultural musical datasets.
The project is still in an early stage.
Catafolk consists of three components. First, there is the registry, a repository that contains all the metadata about all corpora and the songs therein. Second, there is the Python package, which is used to generate the indices in the registry. Third, there is the website (which you're looking at), which serves the registry via a graphical interface. Central to all three components is the Catafolk schema: the list of metadata fields used by Catafolk. You can find the full list of fields below.
The registry is a GitHub repository that contains all metadata about all corpora. For every corpus it contains at least two files: a `corpus.yml` file with metadata about the entire corpus, and an `index.csv` file that contains metadata about every song in the corpus using the Catafolk schema. Moreover, it can include a `README.md` file and a `src` directory containing Python code and whatever else is needed to generate the index. All corpora are versioned.
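The layout described above can be checked mechanically. The following is a minimal sketch, not part of the Catafolk package: the function name `check_corpus_dir` and the example directory name are hypothetical, and only the four file and directory names come from the description above.

```python
import tempfile
from pathlib import Path

# Names taken from the registry description; everything else is illustrative.
REQUIRED_FILES = ("corpus.yml", "index.csv")
OPTIONAL_ENTRIES = ("README.md", "src")


def check_corpus_dir(corpus_dir):
    """Return (missing required files, optional entries that are present)."""
    corpus_dir = Path(corpus_dir)
    missing = [name for name in REQUIRED_FILES
               if not (corpus_dir / name).is_file()]
    optional = [name for name in OPTIONAL_ENTRIES
                if (corpus_dir / name).exists()]
    return missing, optional


# Demo on a throwaway directory that has corpus.yml and src, but no index.csv.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "example-corpus"  # hypothetical corpus name
    corpus.mkdir()
    (corpus / "corpus.yml").write_text("# corpus-level metadata\n")
    (corpus / "src").mkdir()
    missing, optional = check_corpus_dir(corpus)

print("missing:", missing)
print("optional:", optional)
```

A check like this could be run before opening a pull request, to confirm that a corpus directory has the two required files.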
The Python package is primarily used to generate indices of corpora and to ensure that they respect the schema. Indices are generated by combining metadata from various sources: the source files, perhaps some additional metadata source, some constants, and automatically determined fields (checksums, file paths, etc.). These are combined for every song in the corpus, and the result is exported to a CSV file.
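The combination step can be pictured roughly as follows. This is a sketch under stated assumptions and does not show the Catafolk package's actual API: the helpers `build_entry` and `write_index`, the column names `checksum` and `path`, the choice of SHA-256, the precedence order between sources, and all example values are hypothetical.

```python
import csv
import hashlib
import io

# Illustrative column selection; "checksum" and "path" are assumed names.
FIELDNAMES = ["id", "dataset_id", "title", "location", "checksum", "path"]


def build_entry(source_meta, extra_meta, constants, file_bytes, path):
    """Merge the metadata sources for one song into a single row."""
    # Assumed precedence: constants < additional metadata < source file.
    entry = {**constants, **extra_meta, **source_meta}
    # Automatically determined fields (SHA-256 is an assumption).
    entry["checksum"] = hashlib.sha256(file_bytes).hexdigest()
    entry["path"] = path
    return entry


def write_index(entries, out):
    """Write the rows as CSV to any file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(entries)


# Made-up example values, for illustration only.
entry = build_entry(
    source_meta={"id": "pueblo04", "title": "Example title"},
    extra_meta={"location": "Example location"},
    constants={"dataset_id": "example-dataset"},
    file_bytes=b"placeholder file contents",
    path="data/pueblo04.txt",
)
buffer = io.StringIO()
write_index([entry], buffer)
index_csv = buffer.getvalue()
print(index_csv)
```

Writing to an in-memory buffer keeps the sketch self-contained; in practice the same function could be given a file opened with `newline=""`, as the `csv` module recommends.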
To contribute corpora, you can submit a pull request to the registry. At a bare minimum, it needs to include a `corpus.yml` and an `index.csv` file. Including a `README.md` and a `src` directory with the code used to generate the index is also welcome.
You can also contribute code, or simply play around with the website and let us know what you think. Editing READMEs or reporting issues is also very valuable.
If you want to contribute in whatever way, feel free to get in touch!
Field | Group | Required | Description | Data type | Details |
---|---|---|---|---|---|
id | general | yes | unique identifier of the entry | string | If the dataset uses some form of id, for example in the filename, this is generally used. Otherwise an appropriate id is generated, usually something like `pueblo04`. |
dataset_id | general | yes | id of the dataset | string | |
title | lyrics | no | title of the song | string | |
title_translation | lyrics | no | translation of the title | string | This is used if two versions of the title are given: in the original language, and a translation. The translation will typically be English. |
location | location | no | location of the song | string | This is roughly the place where the song originated. Most of the time, the place of collection is used as a proxy. However, it could be that a song was recorded elsewhere, and in that case the, say, birthplace of the performer might be used as the location instead. |
latitude | location | no | geographic latitude coordinate of the location | float | |
longitude | location | no | geographic longitude coordinate of the location | float | |
auto_geocoded | location | no | whether the coordinates were automatically determined | boolean | This is required whenever location information is present. |
language | lyrics | no | the language of the lyrics | string | If no lyrics are given, the language of the performer can be used. |
glottolog_id | lyrics | no | Glottolog id of the language | string | |
culture | culture | no | culture or nationality of the original performer | string | |
culture_dplace_id | culture | no | D-Place identifier of the culture/society | string | |
culture_hraf_id | culture | no | HRAF identifier of the culture | string | |
genres | general | no | the genre of the piece | string-list | The genres used are specific to a dataset. |
performers | performance | no | names of the performers | string-list | |
performer_genders | performance | no | gender of the performers | string-list | Should have the same length as performers. |
instrumentation | performance | no | the instrumentation | string-list | |
instrument_use | performance | no | whether the piece uses (non-vocal) instruments | boolean | |
percussion_use | performance | no | whether the piece uses percussive instruments | boolean | Can only be true if instrument_use is true. |
voice_use | performance | no | whether the piece uses the voice | boolean | |
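Several constraints in the table span more than one field. The sketch below shows how they might be checked for a single entry. It is not the Catafolk package's validator: the function name `validate_entry` is hypothetical, and it assumes an entry has already been parsed into a dict, with string-lists as Python lists and booleans as `bool`. The example values are made up.

```python
def validate_entry(entry):
    """Return a list of problems found in one parsed index entry."""
    problems = []

    # Fields marked "yes" in the Required column.
    for field in ("id", "dataset_id"):
        if not entry.get(field):
            problems.append(f"missing required field: {field}")

    # auto_geocoded is required whenever location information is present.
    location_fields = ("location", "latitude", "longitude")
    has_location = any(entry.get(f) is not None for f in location_fields)
    if has_location and entry.get("auto_geocoded") is None:
        problems.append("auto_geocoded is required with location information")

    # performer_genders should have the same length as performers.
    genders = entry.get("performer_genders")
    if genders is not None:
        performers = entry.get("performers") or []
        if len(genders) != len(performers):
            problems.append("performer_genders and performers differ in length")

    # percussion_use can only be true if instrument_use is true.
    if entry.get("percussion_use") is True and entry.get("instrument_use") is not True:
        problems.append("percussion_use is true but instrument_use is not")

    return problems


# Made-up entries, for illustration only.
good = {"id": "pueblo04", "dataset_id": "example-dataset",
        "instrument_use": True, "percussion_use": True}
bad = {"id": "pueblo04", "latitude": 35.0,
       "performers": ["A", "B"], "performer_genders": ["female"],
       "percussion_use": True}

print(validate_entry(good))
for problem in validate_entry(bad):
    print(problem)
```

Returning a list of problems rather than raising on the first one makes it possible to report every issue in a row at once, which is convenient when checking a whole `index.csv`.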
Catafolk
A catalogue of folk music datasets for computational ethnomusicology
Copyright Bas Cornelissen
Music Cognition Group & clclab
ILLC, University of Amsterdam
This project is in its infancy: many things are still likely to change. If you are interested in the project, or want to contribute, please get in touch. Your comments and suggestions are also very welcome.