A Course Search Feature with Different Document Representation Methods

-- Weijie Jiang (Jenny), jiangwj[at]berkeley[dot]edu

This project builds a course search backend in the context of UC Berkeley using three traditional and modern document representation methods: TF-IDF, BM25, and Sentence-BERT. The backend is implemented in Python 3 and demonstrated in Jupyter notebooks.

Install required packages (two options):

1. Python 3.8 is recommended. Here are all the packages used in this project:

  • packages: pandas, scikit-learn, nltk, numpy, torch, transformers, sentence_transformers, IPython, rank_bm25, jupyter (string is part of the Python standard library and needs no installation)

  • run pip3 install [package name] to install each of them

2. Use an Anaconda virtual environment

Create a new Anaconda virtual environment for this project by running:

  • conda create -n course_search python=3.8
  • conda activate course_search

Then either manually install all the required packages listed in option 1 (recommended), or go to the main directory Course_Search and run:

  • conda env update --file environment.yml (this may take a while)

Data Preprocessing:

  1. Go to the main directory Course_Search and run: python data_preprocessing.py. Two files will be generated: course_id.pkl (dictionaries for indexing courses) and courseId_description.json (a dictionary that maps course IDs to course descriptions).
  2. Go to folder tf-idf and run: python data_preprocessing.py. Two files will be generated in folder tf-idf: tfidf.npy (the TF-IDF course description matrix) and word_dict.json (a dictionary for indexing words). A sketch of this step follows the list.
  3. Go to folder sentence_BERT and run: python generate_course_embedding_sentBert.py, which takes a couple of minutes; a file course_embeddings_sentBert.npy (the Sentence-BERT course description matrix) will be generated.
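The TF-IDF step can be approximated with scikit-learn. This is only a minimal sketch of the idea, not the exact contents of tf-idf/data_preprocessing.py; the tokenization and stop-word choices are assumptions, and only the output file names follow the list above.

```python
# Minimal sketch of the TF-IDF preprocessing step (not the exact contents
# of tf-idf/data_preprocessing.py): vectorize the course descriptions and
# save the matrix and word index under the file names listed above.
import json
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

with open("courseId_description.json") as f:
    courseId_description = json.load(f)

course_ids = list(courseId_description.keys())
descriptions = [courseId_description[cid] for cid in course_ids]

# Tokenization and stop-word settings here are assumptions.
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
tfidf = vectorizer.fit_transform(descriptions)  # shape: (courses, vocabulary)

np.save("tfidf.npy", tfidf.toarray())
with open("word_dict.json", "w") as f:
    json.dump({w: int(i) for w, i in vectorizer.vocabulary_.items()}, f)
```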

Offline Evaluation:

The offline evaluation task is to predict the most similar course to a given course. Running the following commands prints the Recall@10 and Mean/Median Rank scores for each method (a sketch of how these metrics can be computed follows the list).

  1. Go to folder tf-idf and run: python validation.py
  2. Go to folder BM25 and run: python validation.py
  3. Go to folder sentence_BERT and run: python validation.py
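For reference, here is a minimal sketch of how Recall@10 and Mean/Median Rank could be computed from a course-by-course similarity matrix; the ground-truth pairs and any tie-breaking details in the actual validation.py scripts are assumptions.

```python
# Sketch of the evaluation metrics, assuming a precomputed (n, n) course
# similarity matrix `sim` and ground_truth[i] = index of the course labeled
# most similar to course i.
import numpy as np

def evaluate(sim, ground_truth):
    ranks = []
    for i, target in enumerate(ground_truth):
        scores = sim[i].copy()
        scores[i] = -np.inf                  # never retrieve the query course itself
        order = np.argsort(-scores)          # best match first
        ranks.append(int(np.where(order == target)[0][0]) + 1)  # 1-based rank
    ranks = np.asarray(ranks)
    return {
        "Recall@10": float(np.mean(ranks <= 10)),
        "Mean Rank": float(ranks.mean()),
        "Median Rank": float(np.median(ranks)),
    }
```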

Online Course Search:

Run each Jupyter notebook below to experience the search feature hosted by each document representation method.
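As a rough illustration of what such a search feature looks like, here is a minimal query loop using rank_bm25 as the example backend; the naive whitespace tokenization and lowercasing are assumptions, and the notebooks may preprocess queries differently.

```python
# Minimal sketch of a query loop like the ones the notebooks host.
import json
from rank_bm25 import BM25Okapi

with open("courseId_description.json") as f:
    courseId_description = json.load(f)

course_ids = list(courseId_description.keys())
corpus = [courseId_description[cid].lower().split() for cid in course_ids]
bm25 = BM25Okapi(corpus)

def search(query, k=5):
    scores = bm25.get_scores(query.lower().split())
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return [(course_ids[i], round(float(scores[i]), 3)) for i in top]

print(search("how to manage money"))
```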

Search Exemplars:

In order to examine the advantages and disadvantages of each method, several representative examples are selected and analyzed in this section. The search results from the official Berkeley class search website (https://classes.berkeley.edu/search/class/) are also compared against the methods implemented in this project.

Query 1: “how to manage money”

This query is very colloquial, but a novice student might use it to search for courses related to finance and investment. The word “money” may be a synonym for specialized terms in course descriptions.

TF-IDF: The top 5 results turned out to be unrelated to finance, largely because “manage” appears so many times in the course descriptions that it dominates the relevance score.

[screenshot: TF-IDF top 5 results for Query 1]

BM25: The results clearly improved, in that all the top 5 courses fall under the topics of investment strategies or money management. This example demonstrates the advantage of BM25 over TF-IDF in dealing with term saturation: the relevance score calculated by TF-IDF increases linearly with term frequency, while for BM25 it grows quickly at first and then levels off (illustrated numerically after the screenshot below). BM25 is therefore able to balance the relevance of each word in a query much better than TF-IDF when the frequencies of the query words vary a lot.

[screenshot: BM25 top 5 results for Query 1]
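The saturation effect can be shown numerically. The k1 value below is rank_bm25's default, an assumption about the configuration used here; the length-normalization term is omitted for clarity.

```python
# TF-IDF's term-frequency factor grows linearly with tf, while BM25's
# saturates toward k1 + 1 (here 2.5).
k1 = 1.5
for tf in [1, 2, 5, 10, 50]:
    bm25_factor = tf * (k1 + 1) / (tf + k1)
    print(f"tf = {tf:2d}   tf-idf factor = {tf:2d}   bm25 factor = {bm25_factor:.2f}")
```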

Sentence-BERT: Sentence-BERT learns condensed representations of courses, and the query-document matching process is not based on term matching. The top 5 courses ranked by the model clearly share more semantic similarity with the query “how to manage money”, even though the terms “money” and “manage” sometimes do not appear in the course descriptions. For example, in the first result, “investment”, “value assets”, “cash flows”, and “pricing risk” are all relevant to the query, but TF-IDF and BM25 could not retrieve it since none of these terms match any word in the query.

[screenshot: Sentence-BERT top 5 results for Query 1]
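A minimal sketch of this kind of semantic matching with sentence_transformers follows; the model name is illustrative (the actual script may use a different pretrained model), and the saved embeddings must have been generated by the same model for the dimensions to line up.

```python
# Sketch of Sentence-BERT query-document matching via cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")        # illustrative model name
course_embeddings = np.load("course_embeddings_sentBert.npy")

query_embedding = model.encode("how to manage money")
scores = util.cos_sim(query_embedding, course_embeddings)[0]
top5 = np.argsort(-scores.numpy())[:5]                 # indices of the 5 best matches
```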

Official Website: The results returned by the official website appear to be retrieved by more naive term matching. It did, however, tend to diversify the courses across departments, although none of them were from the business school.

[screenshot: official website results for Query 1]

Query 2: “basic programming”

Students with non-technical majors may search for a course to learn basic programming. The stem of “programming”, “program”, is a homograph with unrelated meanings, such as “an integrated course of academic studies” and “a sequence of instructions that a computer can interpret and execute”. “Basic” is also a homograph: it can mean “fundamental” or refer to the high-level programming language BASIC.

TF-IDF: The search results include a mix of courses covering the two different meanings of “program” (results 0, 1, 3 vs. 2, 4). However, although the meaning of “program” in results 0, 1, and 3 aligns with its meaning in the query, only result 0 (Information 206A) is a course that teaches basic programming.

[screenshot: TF-IDF top 5 results for Query 2]

BM25: The search results demonstrate BM25's two characteristics in dealing with term saturation and document length. The results tend to balance the words “basic” and “program”: all of them contain both words in their course descriptions, unlike the TF-IDF results above, which include many more occurrences of “programming/program” than of “basic”. BM25 also assigns more weight to shorter documents (b=0.75), so the course descriptions in these results are shorter on average than those returned by TF-IDF (see the length-normalization sketch after the screenshot below). Like TF-IDF, BM25 cannot disambiguate homographs, which makes result 2 irrelevant to the query. But since BM25 balances the two query words well, and a course whose description contains similar numbers of “basic” and “programming” is usually a fundamental programming course, the results are slightly better than those retrieved by TF-IDF.

[screenshot: BM25 top 5 results for Query 2]
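A small sketch of the length-normalization term (parameter values are rank_bm25's defaults, used here as assumptions) shows how the same term frequency scores higher in a shorter-than-average document.

```python
# BM25 term-frequency factor with length normalization:
# tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
k1, b = 1.5, 0.75

def bm25_tf_factor(tf, doc_len, avg_len):
    norm = 1 - b + b * doc_len / avg_len
    return tf * (k1 + 1) / (tf + k1 * norm)

for doc_len in [50, 100, 200]:          # average document length = 100
    print(doc_len, round(bm25_tf_factor(3, doc_len, 100), 3))
```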

Sentence-BERT: The semantically meaningful document embeddings learned by Sentence-BERT allow the model to understand that the query is about a basic programming course. All the results are courses that teach basic programming. Although Information 206A ranked 6th, the top 5 results look even more basic, such as Pascal programming and symbolic programming. This example demonstrates the strength of Sentence-BERT in dealing with lexical relations (especially homography).

[screenshot: Sentence-BERT top results for Query 2]

Official Website: All the results returned by the official website seem to interpret “program” as “an integrated course of academic studies” rather than as computer programming.

[screenshot: official website results for Query 2]

Query 3: “a data science course which has low requirement for math”

This longer query includes two disciplines, data science and math, with positive polarity toward data science and negative polarity toward math. It might be issued by students who lack a strong math background but are interested in learning data science.

TF-IDF: For this query, TF-IDF returned very diverse data science courses from different departments, such as Sociology, Geography, Data Science, Physics, and Global Studies. None of the top 5 results contain “math” in their descriptions, probably because “data” or “data science” occurs so many times in those descriptions that it dominates the relevance score. This characteristic of TF-IDF actually contributes to its success in retrieving data science courses with low math requirements, because a more applied data science course usually mentions data or data science more frequently than math.

[screenshot: TF-IDF top 5 results for Query 3]

BM25: In contrast to TF-IDF, BM25 balanced the weight of each word in the query; as a result, the top 3 courses are all related to both math and data, which diverges from the original intent of the query.

[screenshot: BM25 top results for Query 3]

Sentence-BERT: Although Sentence-BERT tends to outperform traditional bag-of-words methods in understanding the semantic meaning of a document, it did not show that strength on this query. The top result, Statistics 28, contains more mathematical content than results 1, 2, and 4, although it is a lower-division course; the model might have inferred the low math requirement from the occurrence of “lower division”. However, result 3 is not a data science course at all, which suggests the model did not penalize the occurrence of “math” in course descriptions by understanding the combination of “low requirement” and “math”.

[screenshot: Sentence-BERT top results for Query 3]

Official Website: The results fully oriented toward data science without penalizing the occurrence of math-related words such as “probability” and “statistical”. They stand in stark contrast to the results retrieved by the three methods above.

[screenshot: official website results for Query 3]

Summary:

This project implemented a course search backend in the context of UC Berkeley with three document representation methods. TF-IDF and BM25 are traditional bag-of-words methods that represent a document with sparse features, while Sentence-BERT is a modern NLP model that represents a document with condensed features learned from a large corpus and has demonstrated success in capturing complex semantic and lexical relationships in language. An offline evaluation of course similarity using the three types of course description representations showed that BM25 performed best at retrieving the most similar course, although the discrepancies among the three methods are very small. Comparison and analysis of online search results substantiated the advantages and disadvantages of each method, which resonate with each method's design. In general, TF-IDF tended to retrieve results that were sometimes dominated by a single word with many occurrences, while BM25 dealt better with the term saturation problem and tended to favor shorter documents. Sentence-BERT demonstrated an advantage over the other two methods in understanding a certain level of semantic and lexical complexity, but failed to understand some combinations of words with different polarities. It is worth mentioning that none of the methods exhibited clear superiority over the others, although they all tended to retrieve more reasonable results than the official website. It might be practical to integrate different document representation methods, for example using BM25 to generate candidate sets and Sentence-BERT to rank them, so as to further improve retrieval results (a minimal sketch of such a pipeline follows).
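Here is a hedged sketch of such a two-stage pipeline; all names are illustrative and assume a BM25 index, a SentenceTransformer model, course embeddings, and a course ID list built as in the sketches above.

```python
# Two-stage retrieval: BM25 generates candidates, Sentence-BERT re-ranks.
import numpy as np
from sentence_transformers import util

def hybrid_search(query, bm25, model, course_embeddings, course_ids,
                  k_candidates=50, k_final=5):
    # Stage 1: cheap lexical retrieval with BM25.
    scores = bm25.get_scores(query.lower().split())
    candidates = sorted(range(len(scores)), key=lambda i: -scores[i])[:k_candidates]
    # Stage 2: semantic re-ranking by cosine similarity of embeddings.
    q = model.encode(query)
    sims = util.cos_sim(q, course_embeddings[candidates])[0]
    order = np.argsort(-sims.numpy())
    return [course_ids[candidates[i]] for i in order[:k_final]]
```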

About

Final project for INFO 202 - Information Organization and Retrieval
