README.md: 25 lines changed (25 additions, 0 deletions)
@@ -44,6 +44,8 @@ _Please read the [contribution guidelines](contributing.md) before contributing.
 * [NLP in Vietnamese](#nlp-in-vietnamese)
 * [NLP for Dutch](#nlp-for-dutch)
 * [NLP in Indonesian](#nlp-in-indonesian)
+* [NLP in Urdu](#nlp-in-urdu)
+* [NLP in Persian](#nlp-in-persian)
 * [Other Languages](#other-languages)
 * [Credits](#credits)
 
@@ -528,6 +530,29 @@ NLP as API with higher level functionality such as NER, Topic tagging and so on
 ### Libraries
 - [Natural Language Processing library](https://github.com/urduhack/urduhack) for (🇵🇰) Urdu language
 
+## NLP in Persian
+
+[Back to Top](#contents)
+
+### Libraries
+- [Hazm](https://github.com/sobhe/hazm): Python library for digesting Persian text.
+- [Parsivar](https://github.com/ICTRC/Parsivar): A Language Processing Toolkit for Persian
+- [Perke](https://github.com/AlirezaTheH/perke): Perke is a Python keyphrase extraction package for the Persian language. It provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new models.
+- [Perstem](https://github.com/jonsafari/perstem): Persian stemmer, morphological analyzer, transliterator, and partial part-of-speech tagger
+- [ParsiAnalyzer](https://github.com/NarimanN2/ParsiAnalyzer): Persian Analyzer For Elasticsearch
+- [virastar](https://github.com/aziz/virastar): Cleaning up Persian text!
+
+### Datasets
+- [Bijankhan Corpus](https://dbrg.ut.ac.ir/بیژن%E2%80%8Cخان/): Bijankhan corpus is a tagged corpus that is suitable for natural language processing research on the Persian (Farsi) language. This collection is gathered from daily news and common texts. All documents in the collection are categorized into different subjects such as political, cultural and so on. In total, there are 4300 different subjects. The Bijankhan collection contains about 2.6 million manually tagged words with a tag set that contains 40 Persian POS tags.
+- [Uppsala Persian Corpus (UPC)](https://sites.google.com/site/mojganserajicom/home/upc): Uppsala Persian Corpus (UPC) is a large, freely available Persian corpus. The corpus is a modified version of the Bijankhan corpus with additional sentence segmentation and consistent tokenization containing 2,704,028 tokens and annotated with 31 part-of-speech tags. The part-of-speech tags are listed with explanations in [this table](https://sites.google.com/site/mojganserajicom/home/upc/Table_tag.pdf).
+- [Large-Scale Colloquial Persian](http://hdl.handle.net/11234/1-3195): Large Scale Colloquial Persian Dataset (LSCP) is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. LSCP includes 120M sentences from 27M casual Persian tweets, annotated with syntactic dependency relations, part-of-speech tags, sentiment polarity, and automatic translations of the original Persian sentences into English (EN), German (DE), Czech (CS), Italian (IT) and Hindi (HI). Learn more about this project at [LSCP webpage](https://iasbs.ac.ir/~ansari/lscp/).
+- [ArmanPersoNERCorpus](https://github.com/HaniehP/PersianNER): The dataset includes 250,015 tokens and 7,682 Persian sentences in total. It is available in 3 folds to be used in turn as training and test sets. Each file contains one token, along with its manually annotated named-entity tag, per line. Each sentence is separated with a newline. The NER tags are in IOB format.
+- [FarsiYar PersianNER](https://github.com/Text-Mining/Persian-NER): The dataset includes about 25,000,000 tokens and about 1,000,000 Persian sentences in total based on [Persian Wikipedia Corpus](https://github.com/Text-Mining/Persian-Wikipedia-Corpus). The NER tags are in IOB format. More than 1000 volunteers contributed tag improvements to this dataset via a web panel or Android app. Updated tags are released every two weeks.
+- [PERLEX](http://farsbase.net/PERLEX.html): The first Persian dataset for relation extraction, which is an expert-translated version of the “Semeval-2010-Task-8” dataset. Link to the relevant publication.
+- [Persian Syntactic Dependency Treebank](http://dadegan.ir/catalog/perdt): This treebank is supplied free for noncommercial use; for commercial use, contact the maintainers. It contains 29,982 annotated sentences, including samples from almost all verbs of the Persian valency lexicon.
+- [Uppsala Persian Dependency Treebank (UPDT)](http://stp.lingfil.uu.se/~mojgan/UPDT.html): Dependency-based syntactically annotated corpus.
+- [Hamshahri](https://dbrg.ut.ac.ir/hamshahri/): The Hamshahri collection is a standard, reliable Persian text collection that was used at the Cross Language Evaluation Forum (CLEF) in 2008 and 2009 to evaluate Persian information retrieval systems.
+
 ## Other Languages
 
 - Russian: [pymorphy2](https://github.com/kmike/pymorphy2) - a good pos-tagger for Russian
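
The ArmanPersoNERCorpus and FarsiYar PersianNER entries describe a one-token-per-line layout with IOB tags. A minimal sketch of reading such a file might look like the following. It assumes the token and tag are separated by whitespace and that sentences are separated by a blank line; neither detail is confirmed by the entries above, so check the actual files first. The function names `read_iob` and `entities`, the sample tokens, and the tag names `PER` and `LOC` are all hypothetical.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_iob(lines: Iterable[str]) -> List[Sentence]:
    """Group (token, tag) pairs into sentences; a blank line ends a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumption: the tag is the last whitespace-separated field.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            raise ValueError(f"expected 'token tag', got: {line!r}")
        current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences


def entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (entity_type, text) spans from B-/I-/O tags."""
    spans: List[Tuple[str, str]] = []
    kind, words = None, []
    for token, tag in sentence:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != kind
        )
        if starts_new:
            if words:
                spans.append((kind, " ".join(words)))
            kind, words = tag[2:], [token]
        elif tag.startswith("I-"):
            words.append(token)
        else:  # an "O" tag closes any open span
            if words:
                spans.append((kind, " ".join(words)))
            kind, words = None, []
    if words:
        spans.append((kind, " ".join(words)))
    return spans


# Invented sample data, not taken from either corpus.
sample = ["Ali B-PER", "Karimi I-PER", "visited O", "Tehran B-LOC", "", "Hello O"]
for sent in read_iob(sample):
    print(entities(sent))
```

To read a real file, pass an open file handle (for example `open(path, encoding="utf-8")`) to `read_iob`, since it accepts any iterable of lines.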