About

Catafolk is a project that indexes cross-cultural music datasets and serves metadata about these collections in a consistent, easy-to-use format. We hope this project will help diversify computational music research and contribute to the creation of large, cross-cultural musical datasets.

The project is still in an early stage.

Architecture

Catafolk consists of three components. First, there is the registry, a repository that contains the metadata about all corpora and the songs therein. Second, there is the Python package, which is used to generate the indices in the registry. Third, there is the website (which you're looking at), which serves the registry via a graphical interface. Central to all three components is the Catafolk schema: the list of metadata fields used by Catafolk. The fields are listed in the index below.

Registry

The registry is a GitHub repository that contains all metadata about all corpora. For every corpus it contains at least two files: a corpus.yml file with metadata about the entire corpus, and an index.csv file with metadata about every song in the corpus, following the Catafolk schema. It can also include a README.md file and a src directory containing Python code and whatever else is needed to generate the index. All corpora are versioned.
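As a rough illustration of the layout described above, the following sketch checks whether a corpus directory holds the two required files and reports which optional entries are present. The file names come from the description; the helper `check_corpus_dir` is hypothetical and is not part of the Catafolk Python package.

```python
from pathlib import Path

# File names taken from the registry description; the helper is hypothetical.
REQUIRED_FILES = ("corpus.yml", "index.csv")
OPTIONAL_ENTRIES = ("README.md", "src")


def check_corpus_dir(corpus_dir):
    """Return (missing required files, optional entries that are present)."""
    corpus_dir = Path(corpus_dir)
    missing = [name for name in REQUIRED_FILES if not (corpus_dir / name).exists()]
    present = [name for name in OPTIONAL_ENTRIES if (corpus_dir / name).exists()]
    return missing, present
```

A corpus directory is complete in this sense when the first returned list is empty.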

Registry GitHub repository

Python package

The Python package is primarily used to generate the indices of corpora and to ensure that they respect the schema. Indices are generated by combining metadata from various sources: the source files, perhaps some additional metadata source, some constants, and automatically determined fields (checksums, file paths, etc.). These are combined for every song in the corpus, and the result is exported to a CSV file.
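The combination step could look roughly like the sketch below. It is not the API of the Catafolk package: the function names, the `path` and `checksum` field names, the choice of SHA-1, and the rule that later sources override earlier ones are all assumptions made for illustration.

```python
import csv
import hashlib
import io


def build_entry(song_id, source_meta, extra_meta, constants, file_bytes, path):
    """Merge the metadata sources for one song into a single index row."""
    entry = {}
    entry.update(constants)    # values shared by every song, e.g. dataset_id
    entry.update(source_meta)  # metadata read from the source file
    entry.update(extra_meta)   # additional metadata source, if any (assumed to win)
    # Automatically determined fields; these field names are hypothetical.
    entry["id"] = song_id
    entry["path"] = path
    entry["checksum"] = hashlib.sha1(file_bytes).hexdigest()
    return entry


def write_index(entries, fieldnames, handle):
    """Write the entries as CSV, keeping only the listed columns."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(entries)


# Usage with made-up values; the file path is a placeholder.
entry = build_entry(
    "pueblo04", {"title": "Example"}, {}, {"dataset_id": "demo"},
    b"file contents", "data/pueblo04.txt",
)
buffer = io.StringIO()
write_index([entry], ["id", "dataset_id", "title", "checksum", "path"], buffer)
```

Fields that are missing from an entry are written as empty cells, which matches the fact that most schema fields are optional.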

Python package repository

Contributing

To contribute corpora, you can submit a pull request to the registry. As a bare minimum, it needs to include a corpus.yml and an index.csv file. Including a README.md and a src directory with the code used to generate the index is also encouraged.

You can also contribute code, or simply play around with the website and let us know what you think. Editing READMEs or reporting issues is also very valuable.

If you want to contribute in whatever way, feel free to get in touch!

Get in touch

Index

| Field | Group | Required | Description | Data type | Details |
|---|---|---|---|---|---|
| id | general | yes | unique identifier of the entry | string | If the dataset uses some form of id, for example in the filename, this is generally used. Otherwise an appropriate id is generated, usually something like `pueblo04`. |
| dataset_id | general | yes | id of the dataset | string | |
| title | lyrics | no | title of the song | string | |
| title_translation | lyrics | no | translation of the title | string | This is used if two versions of the title are given: in the original language, and a translation. The translation will typically be English. |
| location | location | no | location of the song | string | This is roughly the place where the song originated. Most of the time, the place of collection is used as a proxy. However, a song may have been recorded elsewhere, and in that case the birthplace of the performer, say, might be used as the location instead. |
| latitude | location | no | geographic latitude coordinate of the location | float | |
| longitude | location | no | geographic longitude coordinate of the location | float | |
| auto_geocoded | location | no | whether the coordinates were automatically determined | boolean | This is required whenever location information is present. |
| language | lyrics | no | the language of the lyrics | string | If no lyrics are given, the language of the performer can be used. |
| glottolog_id | lyrics | no | Glottolog id of the language | string | |
| culture | culture | no | culture or nationality of the original performer | string | |
| culture_dplace_id | culture | no | D-Place identifier of the culture/society | string | |
| culture_hraf_id | culture | no | HRAF identifier of the culture | string | |
| genres | general | no | the genre of the piece | string-list | The genres used are specific to a dataset. |
| performers | performance | no | names of the performers | string-list | |
| performer_genders | performance | no | gender of the performers | string-list | Should have the same length as performers. |
| instrumentation | performance | no | the instrumentation | string-list | |
| instrument_use | performance | no | whether the piece uses (non-vocal) instruments | boolean | |
| percussion_use | performance | no | whether the piece uses percussive instruments | boolean | Can only be true if instrument_use is true. |
| voice_use | performance | no | whether the piece uses the voice | boolean | |
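The constraints stated in the schema can be checked mechanically. The sketch below is one possible way to do so, not the validation performed by the Catafolk package. It assumes that an entry has already been parsed into a Python dict (lists for string-list fields, booleans for boolean fields, absent fields missing or `None`), and it interprets "location information" as any of location, latitude or longitude being set. The function name `validate_entry` is hypothetical.

```python
def validate_entry(entry):
    """Return a list of problems found in one parsed index entry (a dict)."""
    problems = []

    # Fields marked as required in the schema.
    for field in ("id", "dataset_id"):
        if not entry.get(field):
            problems.append(f"missing required field: {field}")

    # auto_geocoded is required whenever location information is present.
    has_location = any(
        entry.get(field) is not None
        for field in ("location", "latitude", "longitude")
    )
    if has_location and entry.get("auto_geocoded") is None:
        problems.append("auto_geocoded is required when location information is present")

    # performer_genders should have the same length as performers.
    performers = entry.get("performers")
    genders = entry.get("performer_genders")
    if genders is not None and len(genders) != len(performers or []):
        problems.append("performer_genders and performers differ in length")

    # percussion_use can only be true if instrument_use is true.
    if entry.get("percussion_use") is True and entry.get("instrument_use") is not True:
        problems.append("percussion_use is true but instrument_use is not")

    return problems
```

An entry with only an id and a dataset_id passes this check, since every other field is optional.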

Catafolk
A catalogue of folk music datasets for computational ethnomusicology


Copyright Bas Cornelissen
Music Cognition Group & clclab
ILLC, University of Amsterdam

Project in infancy

This project is in its infancy: many things are still likely to change. If you are interested in the project, or want to contribute, please get in touch. Your comments and suggestions are also very welcome.
