1. Common Natural Language Processing Tools (link to the original article)

Dependency Parser

CaboCha: A tool for Japanese dependency structure analysis based on cascaded chunking.

KNP: A Japanese dependency parser that also includes some form of predicate-argument analysis.

MaltParser: A parser based on the shift-reduce method.

MSTParser: A tool for dependency parsing based on maximum spanning trees.
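
To make the "shift-reduce method" mentioned for MaltParser a little more concrete, here is a minimal Python sketch of the arc-standard transition system, one common shift-reduce scheme. The function name `arc_standard_parse` and the hand-written action sequence are assumptions for illustration only; a real parser predicts each action with a trained classifier, which is not shown here.

```python
def arc_standard_parse(words, actions):
    """Apply arc-standard transitions and return {dependent_index: head_index}.

    Words are numbered from 1; index 0 is an artificial ROOT node.
    """
    stack = [0]                                  # starts with ROOT only
    buffer = list(range(1, len(words) + 1))      # words still to be read
    heads = {}
    for action in actions:
        if action == "SHIFT":
            # Move the next word from the buffer onto the stack.
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":
            # The second item on the stack becomes a dependent of the top item.
            dependent = stack.pop(-2)
            heads[dependent] = stack[-1]
        elif action == "RIGHT-ARC":
            # The top item becomes a dependent of the item below it.
            dependent = stack.pop()
            heads[dependent] = stack[-1]
        else:
            raise ValueError(f"unknown action: {action}")
    return heads


# Hypothetical example: "She" and "fish" attach to "eats", which attaches to ROOT.
words = ["She", "eats", "fish"]
actions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"]
heads = arc_standard_parse(words, actions)
for dependent, head in sorted(heads.items()):
    head_word = "ROOT" if head == 0 else words[head - 1]
    print(words[dependent - 1], "<-", head_word)
```

The parser reads the sentence once from left to right, so the number of transitions grows linearly with sentence length.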

Finite State Models

Kyfd: A decoder for text-processing systems built using weighted finite state transducers.

OpenFST: A library implementing many operations over weighted finite state transducers (WFSTs) to allow for easy building of finite-state models.
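
For readers unfamiliar with weighted finite state transducers, the sketch below shows the basic idea in plain Python: each transition reads an input symbol, emits an output symbol, and adds a weight to the path cost. It does not use OpenFST or Kyfd; the `transduce` function, the dictionary layout, and the toy machine are all hypothetical, and the machine is assumed to be deterministic, which real WFST libraries do not require.

```python
def transduce(transitions, final_states, start_state, symbols):
    """Run a deterministic weighted transducer over a symbol sequence.

    transitions maps (state, input_symbol) -> (next_state, output_symbol, weight).
    An output_symbol of None stands for an empty (epsilon) output.
    Returns (output_symbols, total_weight), or None if the input is rejected.
    """
    state = start_state
    output = []
    total_weight = 0.0
    for symbol in symbols:
        key = (state, symbol)
        if key not in transitions:
            return None                      # no matching transition
        state, out_symbol, weight = transitions[key]
        if out_symbol is not None:
            output.append(out_symbol)
        total_weight += weight               # weights add up along the path
    if state not in final_states:
        return None                          # ended in a non-final state
    return output, total_weight


# Toy machine (hypothetical): reads "a b" and writes "x y" with cost 0.5 + 1.0.
toy_transitions = {
    (0, "a"): (1, "x", 0.5),
    (1, "b"): (2, "y", 1.0),
}
print(transduce(toy_transitions, {2}, 0, ["a", "b"]))
```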

General NLP Libraries

NLTK: A general library for NLP written in Python.

OpenNLP: A library written in Java that implements many different NLP tools.

Stanford CoreNLP: A library including many of the NLP tools developed at Stanford.

Language Modeling

IRSTLM: A toolkit for training and storing language models.

kenlm: A tool for memory and time-efficient storage of language models.

Kylm: A language modeling toolkit (written by the author of the original list) that allows for weighted finite state transducer output and modeling of unknown words. Implemented in Java.

RandLM: A tool for randomized language models that are able to handle massive models in a small memory space.

SRILM: An efficient n-gram language modeling toolkit with a wide range of features, including a variety of smoothing techniques (such as Kneser-Ney), class-based models, and model merging.
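
Since Kneser-Ney smoothing comes up in the SRILM entry, here is a minimal sketch of interpolated Kneser-Ney for bigrams only, using a single fixed discount. The function `train_kn_bigram`, the discount value of 0.75, and the toy corpus are assumptions for illustration; this is not how SRILM or any of the toolkits above is implemented or invoked.

```python
from collections import Counter, defaultdict


def train_kn_bigram(sentences, discount=0.75):
    """Return a function prob(word, context) with interpolated Kneser-Ney smoothing."""
    bigrams = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        bigrams.update(zip(tokens, tokens[1:]))

    context_count = Counter()       # how often each context word occurs
    followers = defaultdict(set)    # distinct words seen after a context
    preceders = defaultdict(set)    # distinct contexts seen before a word
    for (context, word), count in bigrams.items():
        context_count[context] += count
        followers[context].add(word)
        preceders[word].add(context)
    bigram_types = len(bigrams)

    def prob(word, context):
        # Continuation probability: in how many distinct contexts does word appear?
        p_continuation = len(preceders[word]) / bigram_types
        if context_count[context] == 0:
            return p_continuation   # unseen context: fall back entirely
        discounted = max(bigrams[(context, word)] - discount, 0) / context_count[context]
        backoff_weight = discount * len(followers[context]) / context_count[context]
        return discounted + backoff_weight * p_continuation

    return prob


# Hypothetical toy corpus.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
prob = train_kn_bigram(corpus)
print(prob("cat", "the"), prob("ran", "the"))
```

The discount removes a little probability mass from every seen bigram and redistributes it according to how many different contexts each word appears in, so an unseen bigram still receives a non-zero probability.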

Machine Learning

AROW++: An implementation of Adaptive Regularization of Weight Vectors, an online learning algorithm that is robust to noise.

Classias: A library implementing many different kinds of classifier algorithms, both online and batch.

CRF++: An implementation of conditional random fields, a standard sequence prediction method. Can be customized with feature templates.

CRFsuite: A very fast implementation of conditional random fields.

LIBLINEAR: A library implementing linear support vector machines and logistic regression. Training is extremely fast.

LIBSVM: A full-featured package for learning support vector machines.

Mallet: A machine learning package for use in natural language processing. It implements hidden Markov models, maximum entropy Markov models, and conditional random fields. Written in Java.

SVM-Light: An efficient SVM library.

Weka: A machine learning library supporting a large number of machine learning algorithms.

Machine Translation Alignment

Berkeley Aligner: An alignment toolkit implementing both supervised and unsupervised word alignment models.

GIZA++: A standard tool for creating word alignments using the IBM models.

pialign: A phrase aligner based on inversion transduction grammars that can create compact but effective translation models.
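
As a rough picture of what "the IBM models" in the GIZA++ entry refers to, the following is a minimal sketch of EM training for IBM Model 1, the simplest of them. The function `train_ibm_model1` and the three-sentence corpus are hypothetical, and the sketch leaves out the NULL source word and the higher IBM models that a full aligner would include.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) with EM; returns {(target, source): prob}."""
    target_vocab = {t for _, target in pairs for t in target}
    uniform = 1.0 / len(target_vocab)
    # Start with uniform probabilities for every co-occurring word pair.
    t_prob = {(t, s): uniform for source, target in pairs for s in source for t in target}

    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (target, source) links
        total = defaultdict(float)   # expected count of each source word
        for source, target in pairs:
            for t in target:
                # E-step: spread one unit of count for t over the source words.
                norm = sum(t_prob[(t, s)] for s in source)
                for s in source:
                    fraction = t_prob[(t, s)] / norm
                    count[(t, s)] += fraction
                    total[s] += fraction
        # M-step: renormalize the expected counts per source word.
        t_prob = {(t, s): c / total[s] for (t, s), c in count.items()}
    return t_prob


# Hypothetical toy corpus (German source, English target).
corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
table = train_ibm_model1(corpus)
for source_word in ["das", "Buch"]:
    best = max(["the", "house", "book", "a"], key=lambda e: table.get((e, source_word), 0.0))
    print(source_word, "->", best)
```

Word pairs that never co-occur are simply absent from the returned table, which is why the lookup uses `get` with a default of 0.0.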

Machine Translation Decoder

cdec: A parsing-based decoder implementing tree and forest translation.

Joshua: A decoder implementing syntax-based translation.

Moses: A popular statistical machine translation decoder that supports phrase-based and tree-based models.

Travatar: A tree-to-string decoder for syntax-based translation.

Machine Translation Evaluation

METEOR: A tool for the METEOR metric, which performs accurate evaluation using a number of methods such as synonym matching, stemming, and accounting for reordering.

multeval: A tool for evaluating machine translation results that considers the statistical significance of the results for several evaluation measures.

RIBES: An evaluation measure for the accuracy of word reordering.
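
One way to put a number on word reordering accuracy is a rank correlation between the word order of a translation and that of a reference. The sketch below computes a normalized Kendall's tau in Python. It is only an illustration of the general idea and not the RIBES tool or its full formula; the function names are hypothetical, and the position lookup assumes that no word is repeated within a sentence.

```python
from itertools import combinations


def word_positions(hypothesis, reference):
    """Reference index of each hypothesis word (assumes no repeated words)."""
    return [reference.index(word) for word in hypothesis if word in reference]


def normalized_kendall_tau(positions):
    """Fraction of position pairs that appear in the same order as in the reference."""
    pairs = list(combinations(positions, 2))
    if not pairs:
        return 0.0   # convention chosen here: fewer than two matched words scores 0
    concordant = sum(1 for first, second in pairs if first < second)
    return concordant / len(pairs)


# Hypothetical example: one adjacent swap in a four-word sentence.
reference = "he bought a book".split()
hypothesis = "he a bought book".split()
print(normalized_kendall_tau(word_positions(hypothesis, reference)))
```

A translation in exactly the reference order scores 1.0 and a fully reversed one scores 0.0.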

Morphological Analysis

Chasen: A morphological analysis tool using HMMs. Site is in Japanese.

JUMAN: A tool for Japanese morphological analysis.

KyTea: A tool for word segmentation and morphological analysis that is relatively robust to unknown words and easy to adapt to new domains.

MeCab: A tool for morphological analysis using conditional random fields (CRFs). Site is in Japanese.

Sen: A Japanese morphological analysis system written in Java.

Phrase Structure Parsing

Berkeley Parser: A context-free grammar parser with models for (at least) English, Arabic, Chinese, Bulgarian, French, and German.

Charniak Parser: A discriminative CFG parser for English.

Egret: A PCFG parser that can output n-best lists and packed forests.

EVALB: A tool for evaluating parsing accuracy.

Stanford Parser: A parser that can output both CFG parses and dependencies. Can parse English, Chinese, Arabic, French, and German.
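
Parsing accuracy of the kind EVALB reports is usually expressed as labeled bracket precision, recall, and F1. The following Python sketch shows that calculation on brackets given as `(label, start, end)` tuples. The function `bracket_scores` and the example brackets are hypothetical, and the real tool applies additional conventions that are not reproduced here.

```python
from collections import Counter


def bracket_scores(gold_brackets, predicted_brackets):
    """Return (precision, recall, f1) over labeled brackets (label, start, end)."""
    gold = Counter(gold_brackets)
    predicted = Counter(predicted_brackets)
    # Multiset intersection: a bracket matches only as often as it occurs in both.
    matched = sum((gold & predicted).values())
    precision = matched / sum(predicted.values()) if predicted else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Hypothetical example: the predicted NP span is one word too long.
gold = [("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3)]
predicted = [("S", 0, 3), ("NP", 0, 2), ("VP", 1, 3)]
print(bracket_scores(gold, predicted))
```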

Pronunciation Estimation

KyTea: A toolkit for word segmentation and pronunciation estimation.

mpaligner: A program for aligning graphemes to phonemes for training pronunciation estimation systems, mainly for use with Japanese (site is also in Japanese).

Phonetisaurus: A WFST-based toolkit for grapheme-to-phoneme and phoneme-to-grapheme conversion.

Speech Recognition

CMU Sphinx: A widely used speech recognition program.

Juicer: A WFST-based speech recognition decoder.

Julius: An open-source decoder for large vocabulary automatic speech recognition.

A Roundup of Common Natural Language Processing Tools and How to Choose Among Them

February 24, 2015 • Life