TopWORDS

Brief Description

This project is an implementation of the TopWORDS algorithm proposed in the following paper:

Deng K, Bol P K, Li K J, et al. On the unsupervised analysis of domain-specific Chinese texts[J]. Proceedings of the National Academy of Sciences, 2016: 201516510.

TopWORDS performs word discovery and text segmentation simultaneously on Chinese texts. It is designed to be fast and to use very little memory. In my test, it took around 5 minutes to segment "The Story of Stone" on an Intel i3-4160 CPU with less than 2 GB of memory. This implementation is based on Spark 1.6.x, so it can run either on a local machine with a specified number of threads or on a YARN cluster for large volumes of text.

For more on the theory behind the algorithm, see http://qf6101.github.io/machine%20learning/2016/07/01/TopWORDS (in Chinese)

Local Machine Mode

Download Spark 1.6.x from http://spark.apache.org/downloads.html
Set the parameters in deploy/sbin/topwords_local.sh (to run the "The Story of Stone" example, you only need to set SPARK_HOME)
Run the script: bash deploy/sbin/topwords_local.sh
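
Before launching, it can help to confirm that the Spark directory you configured is a usable install. The following is a minimal sketch and not part of this project: the function names spark_home_ok and run_topwords_local are hypothetical, and the only check performed is that bin/spark-submit exists and is executable under the given directory. It assumes you call it from the repository root.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; not shipped with TopWORDS.

# Succeeds if the given directory looks like a Spark install,
# i.e. it contains an executable bin/spark-submit.
spark_home_ok() {
  local home="$1"
  [ -n "$home" ] && [ -x "$home/bin/spark-submit" ]
}

# Runs the local example only when the Spark directory passes the check.
run_topwords_local() {
  local home="$1"
  if ! spark_home_ok "$home"; then
    echo "not a Spark install: '$home'" >&2
    return 1
  fi
  bash deploy/sbin/topwords_local.sh
}
```

For example, run_topwords_local /path/to/spark-1.6.x (a placeholder path) starts the example only if the check passes. This does not replace setting SPARK_HOME inside the script itself.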

Yarn Cluster Mode

Set the parameters in deploy/sbin/topwords_yarn.sh
Run the script: bash deploy/sbin/topwords_yarn.sh (you may need to initialize the keytab in advance)
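
On a Kerberos-secured cluster, initializing the keytab usually means obtaining a ticket with kinit before submitting the job. The following is a minimal sketch, assuming a Kerberos client is installed; the function name init_keytab, the keytab path, and the principal are placeholders and do not come from this project.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not shipped with TopWORDS.

# Obtain a Kerberos ticket from a keytab file.
# Arguments: $1 = path to the keytab file, $2 = principal (both supplied by you).
init_keytab() {
  local keytab="$1" principal="$2"
  if [ ! -r "$keytab" ]; then
    echo "keytab not readable: '$keytab'" >&2
    return 1
  fi
  if [ -z "$principal" ]; then
    echo "principal must not be empty" >&2
    return 1
  fi
  # kinit -kt <keytab> <principal> requests a ticket without a password prompt.
  kinit -kt "$keytab" "$principal"
}
```

A typical call would be init_keytab /path/to/user.keytab user@EXAMPLE.COM && bash deploy/sbin/topwords_yarn.sh, so that the job is submitted only if the ticket was obtained.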

API Usage

Please refer to src/test/scala/io/github/qf6101/topwords/TestTopWORDS.scala

