1. Common Natural Language Processing Tools (original link)
Dependency Parser
CaboCha: A tool for Japanese dependency structure analysis based on cascaded chunking.
KNP: A Japanese dependency parser that also includes some form of predicate-argument analysis.
MaltParser: A parser based on the shift-reduce method.
MSTParser: A tool for dependency parsing based on maximum spanning trees.
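To make the shift-reduce idea behind a parser such as MaltParser more concrete, here is a minimal sketch of arc-standard transitions. It is not MaltParser's API: the function name `parse`, the action labels, and the example sentence are all hypothetical, and the action sequence is supplied by hand, whereas a real parser would predict each action with a trained classifier.

```python
from typing import List, Tuple


def parse(words: List[str], actions: List[str]) -> List[Tuple[int, int]]:
    """Apply arc-standard transitions and return (head, dependent) index pairs.

    Hypothetical helper for illustration only; the actions are given,
    not predicted.
    """
    stack: List[int] = []
    buffer = list(range(len(words)))
    arcs: List[Tuple[int, int]] = []
    for action in actions:
        if action == "SHIFT":
            # Move the next input word onto the stack.
            stack.append(buffer.pop(0))
        elif action == "LEFT":
            # The top of the stack becomes the head of the item beneath it.
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif action == "RIGHT":
            # The item beneath the top becomes the head of the top.
            dep = stack.pop()
            arcs.append((stack[-1], dep))
        else:
            raise ValueError(f"unknown action: {action}")
    return arcs


words = ["She", "reads", "books"]
actions = ["SHIFT", "SHIFT", "LEFT", "SHIFT", "RIGHT"]
arcs = parse(words, actions)
for head, dep in arcs:
    print(f"{words[head]} -> {words[dep]}")
```

In this toy run, "reads" ends up as the head of both "She" and "books", and is the only word left on the stack, which makes it the root.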
Finite State Models
Kyfd: A decoder for text-processing systems built using weighted finite state transducers.
OpenFST: A library implementing many operations over weighted finite state transducers (WFSTs) to allow for easy building of finite-state models.
General NLP Libraries
NLTK: A general library for NLP written in Python.
OpenNLP: A library written in Java that implements many different NLP tools.
Stanford CoreNLP: A library including many of the NLP tools developed at Stanford.
Language Modeling
IRSTLM: A toolkit for training and storing language models.
kenlm: A tool for memory and time-efficient storage of language models.
Kylm: A language modeling toolkit (written by the author of the original list) that allows for weighted finite state transducer output and modeling of unknown words. Implemented in Java.
RandLM: A tool for randomized language models that are able to handle massive models in a small memory space.
SRILM: An efficient n-gram language modeling toolkit with a wide range of features, including a variety of smoothing techniques (such as Kneser-Ney), class-based models, and model merging.
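As a rough illustration of what an n-gram language modeling toolkit computes, the following sketch trains a bigram model with add-one (Laplace) smoothing. This is deliberately simpler than the Kneser-Ney smoothing mentioned above and does not use the API of any toolkit in this list. The function names `train_bigram` and `prob` and the toy corpus are assumptions made for the example.

```python
from collections import Counter
from typing import List, Set, Tuple


def train_bigram(sentences: List[List[str]]) -> Tuple[Counter, Counter, Set[str]]:
    """Collect context counts, bigram counts, and the prediction vocabulary."""
    contexts: Counter = Counter()
    bigrams: Counter = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        contexts.update(tokens[:-1])             # every token that precedes another
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    # Words that can be predicted: all corpus words plus the end marker.
    vocab = {w for sent in sentences for w in sent} | {"</s>"}
    return contexts, bigrams, vocab


def prob(prev: str, word: str, contexts: Counter, bigrams: Counter,
         vocab: Set[str]) -> float:
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


corpus = [["a", "b"], ["a", "c"]]
contexts, bigrams, vocab = train_bigram(corpus)
p_ab = prob("a", "b", contexts, bigrams, vocab)
total = sum(prob("a", w, contexts, bigrams, vocab) for w in vocab)
print(p_ab, total)
```

Because one pseudo-count is added for every word in the vocabulary, the smoothed probabilities for a given context still sum to one, and unseen bigrams receive a small non-zero probability.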
Machine Learning
AROW++: An implementation of Adaptive Regularization of Weight Vectors, an online learning algorithm that is robust to noise.
Classias: A library implementing many different kinds of classifier algorithms, both online and batch.
CRF++: An implementation of conditional random fields, a standard sequence prediction method. Can be customized with feature templates.
CRFsuite: A very fast implementation of conditional random fields.
LIBLINEAR: A library implementing linear support vector machines and logistic regression. Training is extremely fast.
LIBSVM: A full-featured package for learning support vector machines.
Mallet: A machine learning package for use in natural language processing. It implements hidden Markov models, maximum entropy Markov models, and conditional random fields. Written in Java.
SVM-Light: An efficient SVM library.
Weka: A machine learning library supporting a large number of machine learning algorithms.
Machine Translation Alignment
Berkeley Aligner: An alignment toolkit implementing both supervised and unsupervised word alignment models.
GIZA++: A standard tool for creating word alignments using the IBM models.
pialign: A phrase aligner based on inversion transduction grammars that can create compact but effective translation models.
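The IBM models used by GIZA++ learn word translation probabilities from sentence pairs with expectation-maximization. The sketch below covers only a simplified IBM Model 1, without a NULL source word, to show the idea. It is not GIZA++ code, and the function name `ibm_model1` and the three toy sentence pairs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SentencePair = Tuple[List[str], List[str]]


def ibm_model1(pairs: List[SentencePair],
               iterations: int = 10) -> Dict[Tuple[str, str], float]:
    """Estimate t[(f, e)] = P(target word f | source word e) with EM."""
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    # Start from a uniform distribution over target words.
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts.
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize the counts per source word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return dict(t)


pairs = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
t = ibm_model1(pairs)
print(round(t[("das", "the")], 3), round(t[("Buch", "book")], 3))
```

On this toy data, the co-occurrence pattern alone is enough for "das" to become the most probable translation of "the", and "Buch" the most probable translation of "book".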
Machine Translation Decoder
cdec: A parsing-based decoder implementing tree and forest translation.
Joshua: A decoder implementing syntax-based translation.
Moses: A popular statistical machine translation decoder that supports phrase-based and tree-based models.
Travatar: A tree-to-string decoder for syntax-based translation.
Machine Translation Evaluation
METEOR: A tool for the METEOR metric, which aims for accurate evaluation by using a number of methods such as synonym regularization, stemming, and consideration of reordering.
multeval: A tool for evaluating machine translation results that considers the statistical significance of the results for several evaluation measures.
RIBES: An evaluation measure for the accuracy of word reordering.
Morphological Analysis
ChaSen: A morphological analysis tool using HMMs. Site is in Japanese.
JUMAN: A tool for Japanese morphological analysis.
KyTea: A tool for word segmentation and morphological analysis that is relatively robust to unknown words and easily domain adaptable.
MeCab: A tool for morphological analysis using conditional random fields (CRFs). Site is in Japanese.
Sen: A Japanese morphological analysis system written in Java.
Phrase Structure Parsing
Berkeley Parser: A context free grammar parser with models for (at least) English, Arabic, Chinese, Bulgarian, French, and German.
Charniak Parser: A lexicalized PCFG parser for English, commonly paired with a discriminative reranker.
Egret: A PCFG parser that can output n-best lists and packed forests.
EVALB: A tool for evaluating parsing accuracy.
Stanford Parser: A parser that can output both CFG parses and dependencies. Can parse English, Chinese, Arabic, French, and German.
Pronunciation Estimation
KyTea: A toolkit for word segmentation and pronunciation estimation.
mpaligner: A program for aligning graphemes to phonemes for training pronunciation estimation systems, mainly for use with Japanese (site is also in Japanese).
Phonetisaurus: A WFST-based toolkit for grapheme-to-phoneme and phoneme-to-grapheme conversion.
Speech Recognition
CMU Sphinx: A widely used speech recognition program.
Juicer: A WFST-based speech recognition decoder.
Julius: An open-source decoder for large vocabulary automatic speech recognition.
2. Chart for Choosing Open-Source NLP Tools (original link)
3. Summary of Python NLP Tools (original link)