Preprocessing the data
The first task is to read the file that holds the corpus metadata. The metadata.csv file has one column with the audio filename and one with its transcription, separated by the | character. The relevant code is in the text-clustering.ipynb notebook:
import pandas as pd
# Read the first two pipe-separated columns of the reduced CSV file.
# The file has no header row, so the column names are supplied here.
data = pd.read_csv('./data/metadata.csv', sep='|', usecols=range(2),
                   names=['audiofile', 'transcription'])
data.head()
>> audiofile   transcription
0  LJ001-0001  Printing, in the only sense with which ...
1  LJ001-0002  in being comparatively modern.
2  LJ001-0003  For although the Chinese took impressio...
3  LJ001-0004  produced the block books, which were th...
4  LJ001-0005  the invention of movable metal letters ...
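To make the file layout concrete, the following is a minimal sketch that parses the same pipe-separated format using only the Python standard library. It is not the notebook's code: the read_metadata helper and the in-memory sample (including the invented SAMPLE-0001 row) are hypothetical, and disabling quote handling is an assumption that may or may not be needed for the real file.

```python
import csv
import io

# Hypothetical in-memory sample in the same pipe-separated layout.
# The second row is invented for illustration only.
sample = (
    "LJ001-0002|in being comparatively modern.\n"
    'SAMPLE-0001|He said "hello", then left.\n'
)


def read_metadata(handle):
    """Return (audiofile, transcription) pairs from a pipe-separated file."""
    # QUOTE_NONE treats double quotes as ordinary characters, so quoted
    # speech inside a transcription cannot merge fields or rows (assumption:
    # the transcriptions may contain unbalanced quotes).
    reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    # Keep only the first two fields, mirroring usecols=range(2).
    return [tuple(row[:2]) for row in reader if row]


rows = read_metadata(io.StringIO(sample))
for audiofile, transcription in rows:
    print(audiofile, "->", transcription)
```

Slicing each row to its first two fields plays the same role as usecols=range(2) in the pandas call: any additional fields on a line are ignored.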
Unfortunately, the dataset lacks any information...