About

Catafolk is a project that indexes cross-cultural music datasets and serves metadata about these collections in a consistent, easy-to-use format. We hope this project will help diversify computational music research and contribute to the creation of large, cross-cultural musical datasets.

The project is still in an early stage.

Architecture

Catafolk consists of three components. First, there is the registry, a repository that contains the metadata about all corpora and the songs therein. Second, there is the Python package, which is used to generate the indices in the registry. Third, there is the website (which you're looking at), which serves the registry via a graphical interface. Central to all three components is the Catafolk schema: the list of metadata fields used by Catafolk. You can find it in the Index section below.

Registry

The registry is a GitHub repository that contains all metadata about all corpora. For every corpus it contains at least two files: a corpus.yml file with metadata about the entire corpus, and an index.csv file that contains metadata about every song in the corpus using the Catafolk schema. It can also include a README.md file and a src directory containing Python code and whatever else is needed to generate the index. All corpora are versioned.
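
The file requirements above can be checked with a few lines of Python. This is a minimal sketch, not part of the Catafolk package: the function name `check_corpus_dir` is hypothetical, and it assumes you pass the directory that directly holds the files of one corpus version (the exact directory layout of the registry is not described here).

```python
from pathlib import Path

# Required and optional entries, as described for the registry above.
REQUIRED_FILES = ("corpus.yml", "index.csv")
OPTIONAL_ENTRIES = ("README.md", "src")


def check_corpus_dir(corpus_dir):
    """Return (missing_required, present_optional) for one corpus directory.

    Hypothetical helper: it only looks at file names, not at file contents.
    """
    root = Path(corpus_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    optional = [name for name in OPTIONAL_ENTRIES if (root / name).exists()]
    return missing, optional
```

A corpus directory is complete in this sense when the first returned list is empty.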

Registry GitHub repository

Python package

The Python package is primarily used to generate the indices of corpora and to ensure that they respect the schema. Indices are generated by combining metadata from various sources: the source files, perhaps some additional metadata source, some constants, and automatically determined fields (checksums, file paths, etc.). These are combined for every song in the corpus, and the result is exported to a CSV file.
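
The combination step described above might look roughly like the following. This is an illustrative sketch using only the standard library and does not reproduce the actual Catafolk package: the function names, the `path` and `checksum` column names, the use of SHA-256, and the order in which sources override each other are all assumptions.

```python
import csv
import hashlib
from pathlib import Path


def file_checksum(path):
    """SHA-256 hex digest of a file (the choice of hash is an assumption)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_entry(source_path, source_metadata, extra_metadata, constants):
    """Merge the metadata sources for one song into a single row.

    Later sources override earlier ones; this precedence is an assumption.
    """
    entry = {}
    entry.update(constants)        # values shared by every song in the corpus
    entry.update(source_metadata)  # metadata read from the source file
    entry.update(extra_metadata)   # metadata from an additional source
    # Automatically determined fields (hypothetical column names).
    entry["path"] = str(source_path)
    entry["checksum"] = file_checksum(source_path)
    return entry


def write_index(entries, index_path, fieldnames):
    """Write one row per song to a CSV file, leaving absent fields empty.

    A row containing a field that is not in fieldnames raises ValueError,
    so unexpected columns are noticed rather than silently dropped.
    """
    with open(index_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, restval="", extrasaction="raise"
        )
        writer.writeheader()
        writer.writerows(entries)
```

Passing the list of schema fields as `fieldnames` keeps the column order of every index identical, which is one way to make the exported files consistent across corpora.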

Python package repository

Contributing

To contribute corpora, you can submit a pull request to the registry. As a bare minimum, it needs to include a corpus.yml and an index.csv file. You can also include a README.md and a src directory with the code used to generate the index.

You can also contribute code, or simply play around with the website and let us know what you think. Editing READMEs and reporting issues are also very valuable.

If you want to contribute in whatever way, feel free to get in touch!

Get in touch

Index

Field	Group	Required	Description	Data type	Details

id	general	yes	unique identifier of the entry	string	If the dataset uses some form of id, for example in the filename, this is generally used. Otherwise an appropriate id is generated, usually something like `pueblo04`.
dataset_id	general	yes	id of the dataset	string
title	lyrics	no	title of the song	string
title_translation	lyrics	no	translation of the title	string	This is used if two versions of the title are given: in the original language, and a translation. The translation will typically be English.
location	location	no	location of the song	string	This is roughly the place where the song originated. Most of the time, the place of collection is used as a proxy. However, it could be that a song was recorded elsewhere, and in that case the, say, birthplace of the performer might be used as the location instead.
latitude	location	no	geographic latitude coordinate of the location	float
longitude	location	no	geographic longitude coordinate of the location	float
auto_geocoded	location	no	whether the coordinates were automatically determined	boolean	This is required whenever location information is present.
language	lyrics	no	the language of the lyrics	string	If no lyrics are given, the language of the performer can be used.
glottolog_id	lyrics	no	Glottolog id of the language	string
culture	culture	no	culture or nationality of the original performer	string
culture_dplace_id	culture	no	D-Place identifier of the culture/society	string
culture_hraf_id	culture	no	HRAF identifier of the culture	string
genres	general	no	the genre of the piece	string-list	The genres used are specific to a dataset.
performers	performance	no	names of the performers	string-list
performer_genders	performance	no	gender of the performers	string-list	Should have the same length as performers.
instrumentation	performance	no	the instrumentation	string-list
instrument_use	performance	no	whether the piece uses (non-vocal) instruments	boolean
percussion_use	performance	no	whether the piece uses percussive instruments	boolean	Can only be true if instrument_use is true
voice_use	performance	no	whether the piece uses the voice	boolean
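
Several rows in the table above state constraints that span more than one field. The sketch below shows how they could be checked on a single entry. It is not the validation code of the Catafolk package: `validate_entry` is a hypothetical name, the entry is assumed to be already parsed into Python values (booleans as `bool`, string-lists as `list`), and "location information" is interpreted here as any of location, latitude or longitude being present.

```python
REQUIRED_FIELDS = ("id", "dataset_id")  # fields marked "yes" in the table


def is_missing(value):
    """Treat None, empty strings and empty lists as absent values."""
    return value is None or value == "" or value == []


def validate_entry(entry):
    """Return a list of human-readable problems for one parsed index entry."""
    problems = []

    for field in REQUIRED_FIELDS:
        if is_missing(entry.get(field)):
            problems.append(f"missing required field: {field}")

    # auto_geocoded is required whenever location information is present
    # (which fields count as location information is an assumption).
    has_location = any(
        not is_missing(entry.get(name))
        for name in ("location", "latitude", "longitude")
    )
    if has_location and entry.get("auto_geocoded") is None:
        problems.append("auto_geocoded is required when location is present")

    # performer_genders should have the same length as performers.
    performers = entry.get("performers") or []
    genders = entry.get("performer_genders") or []
    if genders and len(genders) != len(performers):
        problems.append("performer_genders and performers differ in length")

    # percussion_use can only be true if instrument_use is true.
    if entry.get("percussion_use") is True and entry.get("instrument_use") is not True:
        problems.append("percussion_use is true but instrument_use is not")

    return problems
```

An entry that only sets `id` and `dataset_id` passes with an empty list, since every other field is optional.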

Catafolk
A catalogue of folk music datasets for computational ethnomusicology

Project in infancy

This project is in its infancy: many things are still likely to change. If you are interested in the project, or want to contribute, please get in touch. Your comments and suggestions are also very welcome.