Corpusreader for TAC dataset - need usage instructions #750

@lashmore

It is hard to work out how the TACReader class is meant to be used. What path should I pass as "corpusRoot"? Below is the file hierarchy of the raw TAC 2014-2015 data; the 2015 folder structure is similar to that of 2014.

From what I can tell, TACReader parses XML documents. The only folder containing XML data is source_documents, where the .txt files contain XML markup. Does TACReader parse ONLY the files in source_documents, or does it also read from other folders in the hierarchy?

[Screenshot of the file hierarchy: Screen Shot 2021-08-30 at 2 50 10 PM]

Here is how I'm trying to use TACReader, followed by the error message I get. Note that I've tried several different paths for corpusRoot, and they all produce the same error. I'm working completely blind here, so any help would be much appreciated!

import edu.illinois.cs.cogcomp.nlp.corpusreaders.TACReader;

public class PreprocessTAC {
    public static void main(String[] args) throws Exception {
        String path = "/path/to/tac_kbp_eng_event_arg_comp_train_eval_2014-2015/data/";
        TACReader reader_tac = new TACReader(path, false);
    }
}

Error message:

Exception in thread "main" java.lang.NullPointerException: Cannot read the array length because "<local4>" is null
	at edu.illinois.cs.cogcomp.core.io.IOUtils.lsFilesRecursive(IOUtils.java:145)
	at edu.illinois.cs.cogcomp.nlp.corpusreaders.TACReader.getFileListing(TACReader.java:239)
	at edu.illinois.cs.cogcomp.nlp.corpusreaders.XmlDocumentReader.initializeReader(XmlDocumentReader.java:107)
	at edu.illinois.cs.cogcomp.nlp.corpusreaders.AnnotationReader.<init>(AnnotationReader.java:47)
	at edu.illinois.cs.cogcomp.nlp.corpusreaders.AbstractIncrementalCorpusReader.<init>(AbstractIncrementalCorpusReader.java:61)
	at edu.illinois.cs.cogcomp.nlp.corpusreaders.XmlDocumentReader.<init>(XmlDocumentReader.java:89)
	at edu.illinois.cs.cogcomp.nlp.corpusreaders.TACReader.<init>(TACReader.java:113)
	at PreprocessTAC.main(PreprocessTAC.java:7)
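
A note for anyone reproducing this: the message "Cannot read the array length because ... is null" is consistent with java.io.File.listFiles() returning null, which it does when the path is not a directory or cannot be read. Whether IOUtils.lsFilesRecursive actually calls listFiles() at line 145 is an assumption, not something confirmed from the library source here. The sketch below is a hypothetical helper (CorpusRootCheck and diagnose are illustrative names, not part of cogcomp-nlp) that checks a candidate corpusRoot with plain JDK calls before it is handed to a reader.

```java
import java.io.File;

/**
 * Hypothetical helper, not part of cogcomp-nlp: reports whether a candidate
 * corpus root is a directory that can be listed.
 */
final class CorpusRootCheck {

    /** Returns a short diagnosis string for the given path. */
    static String diagnose(String corpusRoot) {
        File root = new File(corpusRoot);
        if (!root.exists()) {
            return "MISSING: " + root.getAbsolutePath();
        }
        if (!root.isDirectory()) {
            return "NOT_A_DIRECTORY: " + root.getAbsolutePath();
        }
        // File.listFiles() returns null if the directory cannot be read.
        File[] children = root.listFiles();
        if (children == null) {
            return "UNREADABLE: " + root.getAbsolutePath();
        }
        return "OK: " + children.length + " entries in " + root.getAbsolutePath();
    }

    public static void main(String[] args) {
        // Pass the candidate corpusRoot as the first argument; defaults to the current directory.
        String path = args.length > 0 ? args[0] : ".";
        System.out.println(diagnose(path));
    }
}
```

Running this against each path tried for corpusRoot shows whether the JVM can see and list the directory at all. It does not say which subfolder TACReader expects, or whether the reader appends further path components to corpusRoot before listing files.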
