Corpus of Spoken Yiddish in Europe

User Guide

The Corpus of Spoken Yiddish in Europe (CSYE) provides free access to Holocaust survivor testimonies in Yiddish with high-quality transcripts and research tools. This guide explains how to navigate the corpus website, search through its materials, and understand our transcription system. The guide ends with instructions on submitting corrections, our Terms of Use documents, and citation information.

What’s in the CSYE?

The corpus consists of video-recorded testimony interviews in Yiddish from the USC Shoah Foundation Visual History Archive (VHA), along with accurate transcripts produced by our expert team of transcribers and reviewers. Each interview is divided into “tapes,” corresponding to digitized video cassettes of at most 30 minutes, and featured on a standalone testimony page. These pages contain:

  • Metadata including biographical information and interview details
  • Streaming video with our transcripts embedded as subtitles
  • Downloadable audio (.m4a files)
  • Searchable and downloadable transcripts (in multiple file formats; available in Latin transliteration and, for reviewed tapes, in the Hebrew-based Yiddish alphabet)

The interviews in the CSYE are listed on the Testimony Index page.

In addition to these hundreds of testimony pages, the corpus website also contains other tools and resources accessible from the top navigation menu:

  • An interactive Map showing survivors’ birthplaces, color-coded by dialect, with optional historical and other layers
  • A Search tool for querying words and phrases across all transcripts
  • A Word Maps tool for visualizing the geographic distribution of words
  • CSYE Glosses, a curated collection of pedagogical guides for Yiddish teachers and learners, survivor profiles, and linguistic discoveries and insights from the corpus

Finding and Browsing Testimonies

There are several ways to find testimonies in the CSYE: the Testimony Index, the interactive Map, and the Search tools. Each is described below.

Testimony Index

The Testimony Index lists all interviews currently available in the CSYE. Each row displays the survivor’s name, VHA interview code, city of birth (the customary English/official place name and Yiddish place name), gender, age, and broad Yiddish dialect (Northeastern Yiddish, Central Yiddish, or Southeastern Yiddish). Metadata is drawn from the VHA, with corrections by the CSYE team where needed. You can filter the list by name, city, or interview code using the search box, or use the dropdown menus to filter by speaker dialect or gender. The columns are also sortable.

Clicking on a survivor’s name opens their individual testimony page, which includes a photo, metadata (date of birth, date of interview and age at time of interview, interviewer name, interview duration, and transcript status), and maps showing the survivor’s place of birth and place of interview. Below the metadata, each transcribed tape is available for streaming with embedded subtitles. A searchable transcript table allows you to search within that testimony and jump directly to the relevant moment in the appropriate video tape. Audio and transcript files are available for download, and a ready-made citation is provided at the bottom of the page, in case you use the testimony in your research or teaching.

Using the Map

The Map displays the birthplaces of all survivors in the CSYE, with each point color-coded by dialect. Clicking on a point opens a link to that survivor’s testimony page. (A separate map of interview locations is also available.)

The map includes several optional layers that can be enabled or disabled using the layers button in the upper-right corner:

  • Historical state borders: Political boundaries from 1886–1950, adapted from the CShapes 2.0 dataset; use the timeline at the bottom of the map to change the date
  • Dialect boundaries: Approximate boundaries of the major Yiddish dialects
  • LCAAJ informants: Birthplaces of informants from the Language and Culture Atlas of Ashkenazic Jewry whose audio recordings are available online, useful for comparing CSYE and LCAAJ coverage
  • Other historical maps: A georectified Yiddish map from 1922, and Polish administrative district maps from 1897 and 1931

For full details on map features and data sources, see the Notes section on the Map page.

Searching the Corpus

The Search tool allows you to query words and phrases across all CSYE transcripts in Latin transliteration. Results are displayed by testimony, and clicking on a result jumps directly to the relevant moment in the video. (A separate Hebrew-alphabet search is available for reviewed transcripts.)

By default, searches match complete words only (e.g., searching milkh finds only milkh ‘milk,’ not milkhome ‘war’). Several options are available to refine your search:

  • Regular expressions: Enable this option for partial word matching and more complex patterns. For example, the pattern khoydesh|khadoshim will match both singular and plural forms of the word for ‘month.’ A tutorial on regular expression syntax is available here.
  • Reviewed transcripts only: Limit results to transcripts that have been reviewed by a second team member.
  • Include interviewers: By default, results show only survivor speech; uncheck the “Exclude interviewers” box to include the interviewer’s utterances as well.
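The difference between the default whole-word matching and the regular expression option can be illustrated offline with a short Python sketch. This is not the website's own implementation, which is not documented here. The function name search and the sample utterances are invented for illustration.

```python
import re

def search(utterances, query, use_regex=False):
    """Return the utterances that match the query.

    By default the query is treated as a literal whole word;
    with use_regex=True it is treated as a regular expression.
    """
    if use_regex:
        pattern = re.compile(query)
    else:
        # Literal query that may not be preceded or followed by a word character
        pattern = re.compile(r"(?<!\w)" + re.escape(query) + r"(?!\w)")
    return [u for u in utterances if pattern.search(u)]

# Invented sample utterances in Latin transliteration
utterances = [
    "gib mir a bisl milkh",
    "far der milkhome",
    "a khoydesh shpeter",
    "dray khadoshim",
]

print(search(utterances, "milkh"))
# ['gib mir a bisl milkh']
print(search(utterances, "khoydesh|khadoshim", use_regex=True))
# ['a khoydesh shpeter', 'dray khadoshim']
```

With use_regex=True, the bare pattern milkh also matches milkhome, which is the partial-word behavior described above.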

The Word Maps tool extends search into the geographic dimension: enter a word to see which survivors use it and where they are from, or enter two words to compare their regional distribution. Regular expressions are supported here, too.

Understanding the Transcripts

Transcripts in the CSYE are produced in Latin characters by a team of expert transcribers and reviewers using a consistent set of conventions. Here is a guide to reading them.

Orthography

All transcripts are available in Latin transliteration. A subset of transcripts have undergone additional review and are also available in the Hebrew-based Yiddish alphabet. Where both orthographies are available, users can toggle between them on testimony pages using the subtitles button (“CC”) in the video player or using the tabs above the transcript table.

Importantly, Latin spellings reflect standard YIVO transliteration rather than a fully faithful transcription of the speaker’s regional dialect. For example, just as all Yiddish publishers would write קומען און גײן ‘come and go’ regardless of the writer’s dialect, so too do we transliterate that phrase as kumen un geyn regardless of the regional pronunciation (Northeastern Yiddish: “kumen un geyn”; Central Yiddish: “kimen in gayn”; Southeastern Yiddish: “kimen in geyn”). This makes it easier to search consistently across dialects. However, we do faithfully transcribe distinctive local vocabulary, local pronunciations that are not predictable from broader systemic sound changes (e.g., gevelt ‘wanted’ in Northeastern Yiddish where other dialects have gevolt), and other forms that seem to be idiosyncratic to the speaker. We never “correct” a speaker’s grammar, e.g., to reflect standard case and gender markings.

We have adopted a few modifications to standard YIVO transliteration (and to standard Yiddish spelling, in reviewed transcripts). For example, we include hyphens inside all compound nouns (mes-les; tsind-bombe) and we use apostrophes in possessives (mame’s; Zalmen’s). We also spell out all numbers as words (fir un tsvantsik, not “24”).

Finally, note that overlapping speech — when the survivor and interviewer are speaking simultaneously — is represented by overlapping time-aligned segments in the transcripts, rather than by any special notation within the text itself.
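Because overlap is encoded only in the timestamps, it has to be recovered by comparing segment times. The following Python sketch shows one way to do this. It assumes each segment has already been loaded as a (speaker, start, end, text) tuple with times in seconds; the function name find_overlaps and the sample segments are invented for illustration.

```python
def find_overlaps(segments):
    """Return pairs of segments by different speakers whose time spans overlap.

    Each segment is a (speaker, start, end, text) tuple, times in seconds.
    """
    ordered = sorted(segments, key=lambda seg: seg[1])  # sort by start time
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second[1] >= first[2]:
                # Later segments start even later, so none can overlap `first`
                break
            if first[0] != second[0]:
                overlaps.append((first, second))
    return overlaps

# Invented sample segments
segments = [
    ("survivor", 10.0, 14.5, "ikh bin geboyrn in Varshe"),
    ("interviewer", 13.8, 15.0, "in Varshe"),
    ("survivor", 15.2, 18.0, "yo"),
]

for first, second in find_overlaps(segments):
    print(first[0], "overlaps with", second[0])
# survivor overlaps with interviewer
```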

Capitalization

Capitalization in the transliteration carries meaning:

  • Yiddish words are transliterated entirely in lowercase (e.g., haynt bin ikh fir un zibetsik yor).
  • Names (including personal names, place names, and organizations) begin with a single uppercase letter (e.g., Moyshe; Varshe; Komyug [Komunistishe Yugnt]).
  • Non-Yiddish borrowings, code-switches, and mixed forms are marked in ALL CAPS and spelled according to the norms of the original language (e.g., THAT’S RIGHT, dos iz dos; di milkhome hot zikh geFINISHt; nisht OŚWIĘCIM).

Note that we fully spell out non-Yiddish abbreviations (e.g., MISTER rather than “Mr.”) and initialisms (e.g., F.B.I., with uppercase letters separated by periods); Hebrew years, however, are spelled out as though Yiddish words (e.g., tof shin alef). Borrowings from Modern Hebrew, Russian, and other languages written in non-Latin alphabets are likewise transliterated in ALL CAPS, though we are less consistent in applying a standard romanization to them.
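Since capitalization is meaningful, it can be used to sort tokens into rough categories when analyzing downloaded transcripts. The Python sketch below is a simplified heuristic, not a tool provided by the CSYE; the function name classify and the category labels are invented. It looks only at letter case, so special markers such as UNK would be labeled as non-Yiddish unless they are filtered out first.

```python
def classify(token):
    """Guess a token's category from its capitalization alone."""
    letters = [ch for ch in token if ch.isalpha()]
    if not letters:
        return "other"          # punctuation or digits only
    if all(ch.islower() for ch in letters):
        return "yiddish"        # e.g., haynt, mame’s
    if all(ch.isupper() for ch in letters):
        return "non_yiddish"    # e.g., RIGHT, F.B.I., OŚWIĘCIM
    if letters[0].isupper() and all(ch.islower() for ch in letters[1:]):
        return "name"           # e.g., Moyshe, Varshe
    return "mixed"              # e.g., geFINISHt

for token in "di milkhome hot zikh geFINISHt".split():
    print(token, classify(token))
# di yiddish
# milkhome yiddish
# hot yiddish
# zikh yiddish
# geFINISHt mixed
```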

Other Conventions

Our transcription conventions include the following markers:

  • UNK: Special marker for a word or words that the transcriber could not understand
  • <vort>: Speech transcribed with uncertainty (the transcriber’s best guess appears inside the angle brackets)
  • ge<UNK>evet: Words where only some portion was uncertain
  • ikh bin ge- geven: False starts and other partial words (followed by a trailing hyphen)
  • "kum aher": Quoted speech appears in quotation marks
  • ah, aha, eh, ehm, uh, uhm, hm, mhm (transcribed as heard): Special markers for filled pauses [Note the use of uhm rather than um, because the latter is coincidentally a Yiddish word]
  • spn: Special marker for non-speech sounds occurring mid-utterance (laughter, throat clearing, etc.); there is also a special marker for a dental click, tsk
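When working with downloaded transcripts, it is often useful to separate these markers from ordinary words, for example to exclude filled pauses from word counts. The following Python sketch tags each token according to the conventions listed above. The function name tag_tokens and the tag labels are invented, and the sample utterance is made up; treat this as a starting point rather than an official parser.

```python
import re

FILLED_PAUSES = {"ah", "aha", "eh", "ehm", "uh", "uhm", "hm", "mhm"}
NON_SPEECH = {"spn", "tsk"}

# A token is either a chunk containing an <...> span (which may include
# spaces inside the brackets) or a plain run of non-space characters.
TOKEN = re.compile(r"\S*<[^<>]*>\S*|\S+")

def tag_tokens(utterance):
    """Return (token, tag) pairs for one transcribed utterance."""
    tagged = []
    for token in TOKEN.findall(utterance):
        core = token.strip('.,?!;:"“”')  # ignore surrounding punctuation
        if "<" in core:
            whole = core.startswith("<") and core.endswith(">")
            tag = "uncertain" if whole else "partly_uncertain"
        elif core == "UNK":
            tag = "unintelligible"
        elif core in NON_SPEECH:
            tag = "non_speech"
        elif core in FILLED_PAUSES:
            tag = "filled_pause"
        elif len(core) > 1 and core.endswith("-"):
            tag = "false_start"
        else:
            tag = "word"
        tagged.append((token, tag))
    return tagged

for token, tag in tag_tokens("ikh bin ge- geven uhm in <Vilne> UNK spn"):
    print(token, tag)
```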

Downloading Transcripts

All transcript files are freely available for download from our public GitHub repository. The full repository can be downloaded as a ZIP archive. Individual audio and transcript files can also be downloaded directly from each testimony page.

Transcripts are available in the following formats:

  • Plain text (.txt): Tab-separated files where each line contains the speaker’s name, the start and end times of the utterance (in seconds), and the transcribed content; this is the most convenient format for text analysis and corpus searches
  • ELAN (.eaf): Transcript files with a separate tier for each speaker, for use with ELAN annotation software
  • Praat TextGrid (.TextGrid): Transcript files with a separate tier for each speaker, for use with Praat phonetics software
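As an illustration of the plain text format, the Python sketch below reads tab-separated lines into structured records. It assumes four columns in the order described above (speaker, start time, end time, content) and no header row; check a downloaded file before relying on this. The names Utterance and parse_transcript, and the speaker label in the sample line, are invented for illustration.

```python
from typing import NamedTuple

class Utterance(NamedTuple):
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str

def parse_transcript(lines):
    """Parse tab-separated transcript lines into Utterance records."""
    utterances = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Assumed column order: speaker, start, end, content
        speaker, start, end, text = line.split("\t", 3)
        utterances.append(Utterance(speaker, float(start), float(end), text))
    return utterances

# Invented sample line
sample = ["SPEAKER_A\t12.5\t15.75\thaynt bin ikh fir un zibetsik yor\n"]
for utt in parse_transcript(sample):
    print(utt.speaker, utt.end - utt.start, utt.text)
# SPEAKER_A 3.25 haynt bin ikh fir un zibetsik yor
```

To read a real file, pass an open file object instead of the sample list, for example parse_transcript(open(path, encoding="utf-8")), where path points to a downloaded .txt transcript.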

Within each format, files are organized into reviewed/ and unreviewed/ subdirectories. Reviewed transcripts are available in both Latin transliteration (.la) and the Hebrew-based Yiddish alphabet (.he); unreviewed transcripts are available in Latin transliteration (.la) only.

Researchers and other advanced users: Downloading audio and transcript files opens up possibilities for offline research. Transcript files in plain text format can be searched using command-line tools like grep for fast and flexible pattern matching across the full corpus. Audio files in .m4a format can be converted to .wav for use with acoustic analysis software like Praat using a tool like FFmpeg (ffmpeg -i input.m4a output.wav). Praat TextGrid files downloaded from the repository can be opened alongside the converted audio directly in Praat. Scripts to facilitate use of the corpus — including a script for downloading CSYE data in the format required by the Montreal Forced Aligner — are available in the CSYE-scripts repository on GitHub.
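For users who prefer a script to grep, the same kind of corpus-wide search can be sketched in Python. This example assumes that the repository has been downloaded and unzipped locally, that plain text transcripts end in .txt, and that the transcribed content is the last tab-separated field on each line. The function name grep_corpus and the folder name in the usage note are hypothetical.

```python
import re
from pathlib import Path

def grep_corpus(root, pattern, suffix=".txt"):
    """Search the transcribed content of every transcript file under root.

    Returns a list of (file name, line number, content) for each match.
    """
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob("*" + suffix)):
        with path.open(encoding="utf-8") as handle:
            for line_no, line in enumerate(handle, start=1):
                # Assumed layout: the content is the last tab-separated field
                text = line.rstrip("\n").split("\t")[-1]
                if regex.search(text):
                    hits.append((path.name, line_no, text))
    return hits
```

For example, grep_corpus("csye-download", r"\bmilkh\b") would list every utterance containing the whole word milkh, where csye-download stands in for wherever the files were saved.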

Submitting Corrections

Transcription is a difficult and often subtle task, and despite our best efforts, errors may appear in published transcripts (especially in unreviewed files). If you notice a typographical error or mistranscription — including cases where a word is marked UNK or with <angle brackets> that you are able to identify — we encourage you to submit a correction by opening a support ticket on our GitHub repository. You will need to create a free GitHub account if you do not already have one.


Terms of Use

The interviews included in the Corpus of Spoken Yiddish in Europe are made available by the USC Shoah Foundation. The USC Shoah Foundation owns the intellectual property rights, including copyrights, to all its video-recorded interviews and all their metadata, as well as the code and database structures that power the Visual History Archive and related storage, production, and post-production systems. Those who access the CSYE are obligated to abide by the USC Shoah Foundation Terms of Use, pertaining specifically to USC SF materials. Users of the CSYE acknowledge that they have read and agree to these terms.

The USC Shoah Foundation has permitted the CSYE development team to provide download access to the original audio of the testimony interviews covered under the CSYE’s license agreement. However, the videos of these interviews are being shared with CSYE users on a streaming-only basis. Any attempt to download interview video files is strictly prohibited.

With the exception of the interview media files and their associated metadata, all materials on this website are distributed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


Citation Information

Any publication that draws on USC Shoah Foundation interviews or metadata, in whole or in part, should cite the interviews as sources; see the template below. We also request that you cite the CSYE by including a reference to our article in Language Documentation & Conservation. Citations should follow standard reference practice.

  • Interviewee’s name. Interview [#]. Tape [#] (if applicable). Timestamp/range of the reference (if applicable). USC Shoah Foundation Visual History Archive. USC Shoah Foundation. Year of interview. Date of access.

  • Bleaman, Isaac L., and Chaya R. Nove. 2025. The Corpus of Spoken Yiddish in Europe: Goals, Methods, and Applications. Language Documentation & Conservation 19: 142–157. https://hdl.handle.net/10125/74812.

To help you put together your source lists and bibliographies, an example reference is provided at the bottom of every testimony page in the CSYE.


Last updated: June 4, 2026