The Corpus of Spoken Yiddish in Europe (CSYE) provides free access to Holocaust survivor testimonies in Yiddish with high-quality transcripts and research tools. This guide explains how to navigate the corpus website, search through its materials, and understand our transcription system. The guide ends with instructions on submitting corrections, our Terms of Use documents, and citation information.
The corpus consists of video-recorded testimony interviews in Yiddish from the USC Shoah Foundation Visual History Archive (VHA), along with accurate transcripts produced by our expert team of transcribers and reviewers. Each interview is divided into “tapes,” corresponding to digitized video cassettes of at most 30 minutes, and featured on a standalone testimony page. These pages contain:
The interviews in the CSYE are listed on the Testimony Index page.
In addition to these hundreds of testimony pages, the corpus website also contains other tools and resources accessible from the top navigation menu:
There are several ways to find testimonies in the CSYE: the Testimony Index, the interactive Map, and the Search tools. Each is described below.
The Testimony Index lists all interviews currently available in the CSYE. Each row displays the survivor’s name, VHA interview code, city of birth (the customary English/official place name and Yiddish place name), gender, age, and broad Yiddish dialect (Northeastern Yiddish, Central Yiddish, or Southeastern Yiddish). Metadata is drawn from the VHA, with corrections by the CSYE team where needed. You can filter the list by name, city, or interview code using the search box, or use the dropdown menus to filter by speaker dialect or gender. The columns are also sortable.
Clicking on a survivor’s name opens their individual testimony page, which includes a photo, metadata (date of birth, date of interview and age at time of interview, interviewer name, interview duration, and transcript status), and maps showing the survivor’s place of birth and place of interview. Below the metadata, each transcribed tape is available for streaming with embedded subtitles. A searchable transcript table allows you to search within that testimony and jump directly to the relevant moment in the appropriate video tape. Audio and transcript files are available for download, and a ready-made citation is provided at the bottom of the page, in case you use the testimony in your research or teaching.
The Map displays the birthplaces of all survivors in the CSYE, with each point color-coded by dialect. Clicking on a point opens a link to that survivor’s testimony page. (A separate map of interview locations is also available.)
The map includes several optional layers that can be enabled or disabled using the layers button in the upper-right corner:
For full details on map features and data sources, see the Notes section on the Map page.
The Search tool allows you to query words and phrases across all CSYE transcripts in Latin transliteration. Results are displayed by testimony, and clicking on a result jumps directly to the relevant moment in the video. (A separate Hebrew-alphabet search is available for reviewed transcripts.)
By default, searches match complete words only (e.g., searching milkh finds only milkh ‘milk,’ not milkhome ‘war’). Several options are available to refine your search:
khoydesh|khadoshim will match both singular and plural forms of the word for ‘month.’ A tutorial on regular expression syntax is available here.
The Word Maps tool extends search into the geographic dimension: enter a word to see which survivors use it and where they are from, or enter two words to compare their regional distribution. Regular expressions are supported here, too.
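For readers new to regular expressions, the whole-word default and the alternation example can be tried out offline with Python's re module. This is a minimal sketch, not the site's actual implementation: the helper name find_matches and the sample word list are invented for illustration, and it assumes that whole-word search behaves like wrapping the query in word boundaries.

```python
import re

def find_matches(query, text, whole_word=True):
    """Return every match of the regular expression `query` in `text`."""
    # Assumption: whole-word search is equivalent to adding word-boundary
    # anchors around the query. The CSYE search tool may work differently.
    pattern = rf"\b(?:{query})\b" if whole_word else query
    return re.findall(pattern, text)

# Illustrative word list built from the examples in this guide
# (not an excerpt from a transcript).
sample = "milkh milkhome khoydesh khadoshim"

print(find_matches("milkh", sample))                    # ['milkh']
print(find_matches("milkh", sample, whole_word=False))  # ['milkh', 'milkh']
print(find_matches("khoydesh|khadoshim", sample))       # ['khoydesh', 'khadoshim']
```

With whole-word matching switched off, the second call also finds milkh inside milkhome, which is why the default is the safer starting point.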
Transcripts in the CSYE are produced in Latin characters by a team of expert transcribers and reviewers using a consistent set of conventions. Here is a guide to reading them.
All transcripts are available in Latin transliteration. A subset of transcripts have undergone additional review and are also available in the Hebrew-based Yiddish alphabet. Where both orthographies are available, users can toggle between them on testimony pages using the subtitles button (“CC”) in the video player or using the tabs above the transcript table.
Importantly, Latin spellings reflect standard YIVO transliteration rather than a fully faithful transcription of the speaker’s regional dialect. For example, just as all Yiddish publishers would write קומען און גײן ‘come and go’ regardless of the writer’s dialect, so too do we transliterate that phrase as kumen un geyn regardless of the regional pronunciation (Northeastern Yiddish: “kumen un geyn”; Central Yiddish: “kimen in gayn”; Southeastern Yiddish: “kimen in geyn”). This makes it easier to search consistently across dialects. However, we do faithfully transcribe distinctive local vocabulary, local pronunciations that are not predictable from broader systemic sound changes (e.g., gevelt ‘wanted’ in Northeastern Yiddish where other dialects have gevolt), and other forms that seem to be idiosyncratic to the speaker. We never “correct” a speaker’s grammar, e.g., to reflect standard case and gender markings.
We have adopted a few modifications to standard YIVO transliteration (and to standard Yiddish spelling, in reviewed transcripts). For example, we include hyphens inside all compound nouns (mes-les; tsind-bombe) and we use apostrophes in possessives (mame’s; Zalmen’s). We also spell out all numbers as words (fir un tsvantsik, not “24”).
Finally, note that overlapping speech — when the survivor and interviewer are speaking simultaneously — is represented by overlapping time-aligned segments in the transcripts, rather than by any special notation within the text itself.
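Because overlap is encoded only in the timing, it has to be recovered by comparing segment boundaries. The following is a minimal sketch of that comparison, assuming each segment has already been loaded as a speaker label, a start time, an end time, and its text. The Segment type, the speaker labels, and the sample times are hypothetical and do not reflect the layout of any particular CSYE file format.

```python
from typing import NamedTuple

class Segment(NamedTuple):
    speaker: str
    start: float  # seconds from the beginning of the tape
    end: float    # seconds from the beginning of the tape
    text: str

def overlapping_pairs(segments):
    """Return pairs of segments by different speakers whose time spans intersect."""
    ordered = sorted(segments, key=lambda s: s.start)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.start >= first.end:
                break  # every later segment starts later still, so none overlaps `first`
            if first.speaker != second.speaker:
                pairs.append((first, second))
    return pairs

# Hypothetical segments; the times and placeholder texts are invented.
segments = [
    Segment("interviewer", 0.0, 2.5, "..."),
    Segment("survivor", 2.0, 6.0, "..."),
    Segment("interviewer", 6.0, 7.0, "..."),
]

pairs = overlapping_pairs(segments)
for first, second in pairs:
    print(f"{first.speaker} {first.start}-{first.end} overlaps "
          f"{second.speaker} {second.start}-{second.end}")
```

Segments that merely touch (one ends exactly when the next begins) are not counted as overlapping in this sketch.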
Capitalization in the transliteration carries meaning:
Note that we fully spell out non-Yiddish abbreviations (e.g., MISTER rather than “Mr.”) and write initialisms as uppercase letters separated by periods (e.g., F.B.I.); Hebrew years, however, are spelled out as though they were Yiddish words (e.g., tof shin alef). Borrowings from Modern Hebrew, Russian, and other languages written in non-Latin alphabets are likewise transliterated in ALL CAPS, but our romanization of them is less consistent.
Our transcription conventions include the following markers:
All transcript files are freely available for download from our public GitHub repository. The full repository can be downloaded as a ZIP archive. Individual audio and transcript files can also be downloaded directly from each testimony page.
Transcripts are available in the following formats:
Within each format, files are organized into reviewed/ and unreviewed/ subdirectories. Reviewed transcripts are available in both Latin transliteration (.la) and the Hebrew-based Yiddish alphabet (.he); unreviewed transcripts are available in Latin transliteration (.la) only.
Researchers and other advanced users: Downloading audio and transcript files opens up possibilities for offline research. Transcript files in plain text format can be searched using command-line tools like grep for fast and flexible pattern matching across the full corpus. Audio files in .m4a format can be converted to .wav for use with acoustic analysis software like Praat using a tool like FFmpeg (ffmpeg -i input.m4a output.wav). Praat TextGrid files downloaded from the repository can be opened alongside the converted audio directly in Praat. Scripts to facilitate use of the corpus — including a script for downloading CSYE data in the format required by the Montreal Forced Aligner — are available in the CSYE-scripts repository on GitHub.
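For users who prefer a script to the command line, a grep-style search over downloaded plain text transcripts can be sketched in Python as follows. The function name search_transcripts, the .txt extension, and the file name example.la.txt are assumptions made for illustration; adjust them to match the files you actually downloaded. The demonstration writes a throwaway file to a temporary directory so that the block runs on its own.

```python
import re
import tempfile
from pathlib import Path

def search_transcripts(root, pattern, glob="*.txt"):
    """Yield (path, line_number, line) for each line under `root` matching `pattern`."""
    regex = re.compile(pattern)
    # rglob descends into subdirectories such as reviewed/ and unreviewed/.
    for path in sorted(Path(root).rglob(glob)):
        with path.open(encoding="utf-8") as handle:
            for line_number, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield path, line_number, line.rstrip("\n")

# Self-contained demonstration on an invented file (not real corpus data).
with tempfile.TemporaryDirectory() as tmp:
    reviewed = Path(tmp) / "reviewed"
    reviewed.mkdir()
    (reviewed / "example.la.txt").write_text(
        "milkh\nmilkhome\nkhoydesh\nkhadoshim\n", encoding="utf-8"
    )
    hits = [
        (path.name, number, line)
        for path, number, line in search_transcripts(tmp, r"\b(?:khoydesh|khadoshim)\b")
    ]

print(hits)  # [('example.la.txt', 3, 'khoydesh'), ('example.la.txt', 4, 'khadoshim')]
```

To search a real download, pass the path of the unzipped repository (or one of its format folders) as root in place of the temporary directory.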
Transcription is a difficult and often subtle task, and despite our best efforts, errors may appear in published transcripts (especially in unreviewed files). If you notice a typographical error or mistranscription — including cases where a word is marked UNK or with <angle brackets> that you are able to identify — we encourage you to submit a correction by opening a support ticket on our GitHub repository. You will need to create a free GitHub account if you do not already have one.
The interviews included in the Corpus of Spoken Yiddish in Europe are made available by the USC Shoah Foundation. The USC Shoah Foundation owns the intellectual property rights, including copyrights, to all its video-recorded interviews and all their metadata, as well as the code and database structures that power the Visual History Archive and related storage, production, and post-production systems. Those who access the CSYE are obligated to abide by the USC Shoah Foundation Terms of Use, pertaining specifically to USC SF materials. Users of the CSYE acknowledge that they have read and agree to these terms.
The USC Shoah Foundation has permitted the CSYE development team to provide download access to the original audio of the testimony interviews covered under the CSYE’s license agreement. However, the videos of these interviews are being shared with CSYE users on a streaming-only basis. Any attempt to download interview video files is strictly prohibited.
With the exception of the interview media files and their associated metadata, all materials on this website are distributed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. 
Any published information from USC Shoah Foundation interviews and metadata, in whole or in part, should cite the interviews as sources; see the template below. We also request that you cite the CSYE by including a reference to our article in Language Documentation & Conservation. Citations should follow standard reference practice.
Interviewee’s name. Interview [#]. Tape [#] (if applicable). Timestamp/range of the reference (if applicable). USC Shoah Foundation Visual History Archive. USC Shoah Foundation. Year of interview. Date of access.
Bleaman, Isaac L., and Chaya R. Nove. 2025. The Corpus of Spoken Yiddish in Europe: Goals, Methods, and Applications. Language Documentation & Conservation 19: 142–157. https://hdl.handle.net/10125/74812.
To help you put together your source lists and bibliographies, an example reference is provided at the bottom of every testimony page in the CSYE.
Last updated: June 4, 2026