Webpage Correlation Engine

Mohit Mishra, Indian Institute of Technology (BHU) Varanasi, Language: Java

This project/software basically clusters out similar pages in one clusters.

To do this, we use the concept of document clustering (easily available on the internet for understanding). The question now is, given a bunch of URLs, how woul you classify which set of urls are similar/different to/from each other. For this, the problem becomes simple to understand if we can "transform" the URL problem to a text document, and then apply document clustering. However, there is a major concern:

Make sure relevant content is chosen for the document text.
Noise removal

The algorithm used here, however, doesn't consider the noise at all. The noise is self-removed while performing document cosine similarity. Having mentioned this, cosine similarity metric is used to quantitatively determine the closeness between two text docs (and hence two webpages).

Adaptive K-Means Clustering is used to classify the webpages. Adaptive in the sense that the algorithm/program itselfs computes the "optimized" value of k instead of statically defining it and experimenting over a range of values of k. This is based on the ratio of intra-cluster distance and inter-cluster distance. The threshold, in our case, worked out best at 0.5 (might be different from cases to cases).

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
bin		bin
src		src
README.md		README.md
stop-words_english_1_en.txt		stop-words_english_1_en.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Webpage Correlation Engine

About

Uh oh!

Releases

Packages

Languages

karzler/WebpageCorrelationEngine

Folders and files

Latest commit

History

Repository files navigation

Webpage Correlation Engine

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages