
Darg-Iztech/GreyLiterature


GreyLiterature

Grey literature answer quality / user reputation measurement with BERT and DistilBERT.

Training models using DP, SE, or other datasets

Dataset:

When training a model, the --data_dir and --labels arguments are always required.

  • --data_dir ➡️ The directory containing the raw.csv file, which will be split into train, dev, and test sets. The raw.csv files for DP and SE are available at IZTECH Cloud Repository.
  • --labels ➡️ The name of the column containing the ground-truth author classes. Typically, it is set to 'sum_class', 'mean_class', or 'median_class'.

When working on other data sets, be sure to organize the raw.csv file to include the following columns:

user_id,question_title,question_text,answer_text,answer_score,user_answer_count,sum_class,mean_class,median_class

To see a sample raw.csv file ➡️ data/test/raw.csv
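
Before training on a custom data set, it can help to confirm that the header of raw.csv contains every column listed above. The following is a minimal sketch using only the standard library; the helper name missing_columns is hypothetical and not part of this repository.

```python
import csv
import io

# Column names copied from the list above; the helper itself is illustrative.
REQUIRED_COLUMNS = [
    "user_id", "question_title", "question_text", "answer_text",
    "answer_score", "user_answer_count",
    "sum_class", "mean_class", "median_class",
]


def missing_columns(csv_file):
    """Return the required columns that are absent from the CSV header."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


# In-memory example; for a real file use open("data/dp/raw.csv", newline="").
sample = io.StringIO("user_id,question_title,answer_text\n1,Some title,Some answer\n")
print(missing_columns(sample))
```

An empty list means the header is complete; any names returned need to be added to raw.csv before training.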

Options and arguments:

Some default training arguments are:

  • --sequence = 'TQA' ➡️ Uses Title+Question+Answer sequence. Alternatively, use 'A', 'TA', or 'QA'.
  • --model = 'bert' ➡️ Uses the BERT model. Alternatively, use 'distilbert'.
  • --device = 'cpu' ➡️ Runs on the CPU. Alternatively, use 'cuda'.
  • --crop = 1.0 ➡️ Uses 100% of the answers and runs multi-class classification. Alternatively, set a value less than 1.0. For example, --crop = 0.25 performs binary classification on the top and bottom 25% of the answers.
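
To illustrate what --crop does, here is a rough sketch of keeping only the top and bottom fraction of the answers and labelling them 1 and 0. It is not the repository's implementation: the function name crop_rows, the ranking by answer_score, and the restriction to values of at most 0.5 in the binary case are all assumptions made for this example.

```python
def crop_rows(rows, crop, score_key="answer_score"):
    """Keep the top and bottom `crop` fraction of rows, labelled 1 and 0.

    Assumption: rows are ranked by `score_key`; the actual code may rank
    by a different column.
    """
    if not 0.0 < crop <= 1.0:
        raise ValueError("crop must be in (0, 1]")
    if crop == 1.0:
        # Nothing is cropped; the multi-class labels are left untouched.
        return list(rows)
    if crop > 0.5:
        # This sketch only handles non-overlapping top and bottom slices.
        raise ValueError("this sketch supports crop <= 0.5 for the binary case")
    ranked = sorted(rows, key=lambda row: row[score_key])
    k = int(len(ranked) * crop)
    bottom = [dict(row, label=0) for row in ranked[:k]]
    top = [dict(row, label=1) for row in ranked[len(ranked) - k:]]
    return bottom + top


rows = [{"answer_score": score} for score in range(8)]
for row in crop_rows(rows, 0.25):
    print(row)
```

With eight rows and crop = 0.25, the two lowest-scored rows receive label 0 and the two highest-scored rows receive label 1; the middle four are dropped.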

To see all training arguments and options, run:

python3 main.py --help

Example Training Command:

python3 main.py --model='distilbert' --data_dir='data/dp' --labels='median_class' --device='cuda' --crop=0.25

⚠️ Here, the data/dp directory must contain the raw.csv file, which will be split into train, dev, and test sets.

⚠️ Since --sequence='TQA' by default, train, dev and test sets are stored under data/dp/TQA.

⚠️ Since --crop is less than 1.0, this command runs binary classification.
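
The relation between --sequence, the input text, and the output directory can be sketched as follows. The mapping of the letters T, Q, and A to raw.csv columns and the plain-space separator are assumptions for illustration; build_text and split_dir are hypothetical helpers, not functions from this repository.

```python
import os

# Assumed mapping from sequence letters to raw.csv columns.
SEQUENCE_COLUMNS = {
    "T": "question_title",
    "Q": "question_text",
    "A": "answer_text",
}


def build_text(row, sequence="TQA", sep=" "):
    """Concatenate the columns selected by `sequence`, in order."""
    return sep.join(row[SEQUENCE_COLUMNS[letter]] for letter in sequence)


def split_dir(data_dir, sequence="TQA"):
    """Directory where the train, dev, and test sets for `sequence` are stored."""
    return os.path.join(data_dir, sequence)


row = {
    "question_title": "What are UI Patterns?",
    "question_text": "How do interaction and interface patterns differ?",
    "answer_text": "They are often used interchangeably.",
}
print(build_text(row, "TA"))
print(split_dir("data/dp"))
```

Running with --sequence='QA', for example, would under these assumptions drop the title from the input text and store the splits under data/dp/QA.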

Testing with a Pre-trained Model

Step 1) Download the pre-trained model with 72% accuracy from HERE (~1.2 GB pth.tar file).

Step 2) Use the predict.py module to predict the reputability of specific author(s).

The predict.py module takes two arguments:

  • --checkpoint_path (required) ➡️ Path to the pth.tar file downloaded in Step 1.
  • --test_path (optional) ➡️ Path to a CSV file containing the questions and answers whose authors' reputability will be predicted. The file must include the title, question, answer, and label columns. If --test_path is not set, the title, question, answer, and label are read from console input.

To see a sample test.csv file ➡️ data/test/test.csv
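
A test file with the required title, question, answer, and label columns can be produced with the standard csv module. This is only a sketch: write_test_csv is a hypothetical helper, and the example row is made up for illustration.

```python
import csv
import io

# Column names required by predict.py according to the description above.
FIELDNAMES = ["title", "question", "answer", "label"]


def write_test_csv(csv_file, examples):
    """Write a header row followed by one row per example."""
    writer = csv.DictWriter(csv_file, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(examples)


examples = [
    {
        "title": "What are UI Patterns?",
        "question": "How do interaction and interface patterns differ?",
        "answer": "I've heard them used interchangeably.",
        "label": 1,  # expected label, 0 or 1
    },
]

# In-memory example; for a real file use open("my_test.csv", "w", newline="").
buffer = io.StringIO()
write_test_csv(buffer, examples)
print(buffer.getvalue())
```

The csv module quotes fields that contain commas or line breaks, so answers can be written as plain text without manual escaping.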

Example Testing Commands:

Example 1) --test_path is set to test.csv file that includes 3 answers from 3 authors:

python3 predict.py --checkpoint_path='models/checkpoint.pth.tar' --test_path='data/test/test.csv'
------------------
Expected:  [1 1 0 0]
Predicted: [1 1 0 0]

Example 2) --test_path is not set:

python3 predict.py --checkpoint_path='models/checkpoint.pth.tar'
------------------
Enter title (as plain text): What are UI Patterns?
Enter question (as plain text): What is the difference between User Interaction Design Patterns and User Interface Design Patterns?
Enter answer (as plain text): I've heard them used interchangeably, but people might be defining one to be the design patterns used to implement a user interface (for example, using the Command pattern to implement undo and redo operations supported by buttons on the UI and/or keyboard shortcuts) and the other to be patterns that appear in the user interface, such as the Wizard pattern and then defining characteristics about that pattern (name, intent, applicability, known uses, and more).
Enter expected label (as 0 or 1): 1
------------------
Expected:  [1]
Predicted: [1]
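
The Expected and Predicted lists printed by predict.py can be compared to obtain an accuracy figure. The helper below is a minimal sketch and is not part of the repository; it simply counts matching positions.

```python
def accuracy(expected, predicted):
    """Fraction of positions where the predicted label equals the expected one."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        raise ValueError("at least one label is required")
    matches = sum(e == p for e, p in zip(expected, predicted))
    return matches / len(expected)


# Lists taken from Example 1 above.
print(accuracy([1, 1, 0, 0], [1, 1, 0, 0]))
```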

References

Modified version of the code in https://github.com/isspek/west_iyte_plausability_news_detection
