GitHub - mathbouchard/fastText: Library for fast text representation and classification.

Introduction

fastText is a library for efficient learning of word representations and sentence classification. This fork provides a way to minimize the model size without loosing too much information about the vectors. The small memory footprint enables to use fasttext word to vector conversion and language identification as a service on a small server. A grpc client is also included to build such a service.

Approach

The size of kept words can be limited (by default to 50000). All the subwords are kept but they are replaced by a representation where we kept the minimum and maximum value of the vector and where all the other component are represented as the range between min and max on 4 bits. According to our experience, the incurred lost of precision seems to have minimal practical impacts.

Installation and usage

Install the modified fasttext

$ cd fastText
$ make

To generate the binary file used in the grpc server :

$ ./fasttext generate-compact ../fasttext_format_reader/wiki.fr.bin

A prior transformation can also be applied on the vector :

$ ./fasttext generate-compact ../fasttext_format_reader/wiki.fr.bin ../fasttext_format_reader/transformation.bin

Where the file transformation.bin is raw float buffer of a (dim x dim)-matrix and dim is the dimension of the word vector.

To build the grpc service :

$ cd grpc_server
$ make

And to run the service without DetectLanguages:

$ ./wv_server ../wiki_data.bin 192.168.2.46:50052

Where wiki_data.bin is the generated binary file in the first step.

To run the service with DetectLanguages:

$ ./wv_server ../wiki_data_fr.bin 192.168.2.46:50052 lid.176.bin

Where lid.176.bin can be download at [https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/lid.176.bin].

To try the service with the provided client execute:

$ ./wv_client salut vieux 192.168.2.46:50052

License

fastText is BSD-licensed. We also provide an additional patent grant.

Name		Name	Last commit message	Last commit date
Latest commit History 284 Commits
.circleci		.circleci
docs		docs
grpc_server		grpc_server
python		python
scripts		scripts
src		src
tests		tests
website		website
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Makefile		Makefile
PATENTS		PATENTS
README.md		README.md
classification-example.sh		classification-example.sh
classification-results.sh		classification-results.sh
eval.py		eval.py
get-wikimedia.sh		get-wikimedia.sh
prepare_map.py		prepare_map.py
pretrained-vectors.md		pretrained-vectors.md
quantization-example.sh		quantization-example.sh
runtests.py		runtests.py
setup.cfg		setup.cfg
setup.py		setup.py
wikifil.pl		wikifil.pl
word-vector-example.sh		word-vector-example.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Table of contents

Introduction

Approach

Installation and usage

License

About

Uh oh!

Releases

Packages

Languages

License

mathbouchard/fastText

Folders and files

Latest commit

History

Repository files navigation

Table of contents

Introduction

Approach

Installation and usage

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages