Tools for Corpus Linguistics

A comprehensive list of 96 tools used in corpus analysis.

Tool | Description | Categories | Platform | Pricing
@nnotate | Semi-automatic annotation of corpus data | Annotation | Solaris, Linux | Free (with licence agreement)
aConCorde | Multilingual concordance tool (English and Arabic) | Concordancer | Linux, MacOSX, Windows | Free
Shalmaneser / SALTA | Semantic Parser/POS Tagger for English | Parser, POS Tagger | | Free (with licence agreement)
AMALGAM | Tool for grammatical annotation (POS and phrase structure); tags text submitted by email | Annotation | Web | Free
ANNIS | Search and visualization tool for multi-layer linguistic corpora with diverse types of annotation | Search, Visualization | Web (or Linux, Mac, Windows) | Free
AntCLAWSGUI | Front-end interface for the CLAWS tagger | POS Tagger | Windows | Free
AntConc | Corpus analysis toolkit | Wordlists, Concordancer | Linux, MacOSX, Windows | Free
AntFileConverter | Freeware tool to convert PDF and Word (DOCX) files into plain text | Converter | Windows, MacOSX | Free
AntMover | Tool for text structure (moves) analysis | Text analysis | Windows | Free
AntPConc | Corpus analysis toolkit for files encoded with UTF-8 | Wordlists, Concordancer | Windows, MacOSX | Free
AntWordProfiler | Tool for profiling vocabulary level and text complexity | Assessing text complexity | Linux, MacOSX, Windows | Free
Atomic | Multi-layer corpus annotation platform | Annotation | Linux, MacOSX, Windows | Free
BNCweb | Web-based client program for searching and retrieving lexical, grammatical and textual data from the British National Corpus (BNC) | Analysis, Concordancer | Web | Free
Bow | Statistical language modeling, text retrieval, classification and clustering | Text analysis | UNIX, Linux | Free
Chared | Tool for detecting the character encoding of a text | Text analysis | Python 2.6 or later | Free
CLaRK | XML-based system for corpora development | Compilation | | Free (with licence agreement)
CLAWS POS-Tagger | CLAWS POS tagger | POS Tagger | Web | Via licence or in-house tagging at Lancaster
CLiC | Corpus tool to support the analysis of literary texts | Concordancer | Web | Free
Collocate | Tool for the extraction of concordances and collocations | Concordancer | Windows | 35 USD
Concordancer | Online tool for frequency counts and text clouds | Concordancer | Web | Free
CorpKit | Corpus toolkit with an emphasis on visualization and annotated corpora | Wordlists, Parsing, Concordancer, Visualization | Linux, MacOSX, Windows (Python) | Free
Corpus-Tools | Text annotation and analysis tool | Text analysis | | Free
Corpus Presenter | Tree tagger and corpus analysis software | Wordlists, Parsing, Concordancer, Visualization | Windows | Free
CorpusSearchLite | Searches parsed corpora in the Penn Treebank format | ? | ? | ?
CQPweb | Overview of and access to a wide range of corpora | Database | Web | Free (once registered)
Dexter | Tool for text annotation | Annotation | Linux, MacOSX, Windows | Free
DISCO | Corpus pre-processing tool for a variety of languages that allows retrieval of the semantic similarity between arbitrary words and phrases | Tokenization, Annotation | Windows, Linux, Solaris, and MacOS | Free
ELAN | Transcription and annotation of sound or video files | Transcription, Annotation | Linux, MacOSX, Windows | Free
EncodeAnt | Tool for the detection and conversion of character encodings | Converter | Windows, MacOSX | Free
EXMARaLDA | Tool for transcription, annotation and corpus analysis of spoken data | Transcription, Annotation, Analysis | | Free
FireAnt | Social media analysis toolkit | Downloader, Converter | Windows, MacOSX | Free
FrameNet | Dictionary of more than 10,000 word senses, tagged for semantic roles (according to Fillmorean Frame Semantics) | Semantic Parser | Web | Free
Google Ngrams | Ngram viewer for the whole of Google Books | Ngrams | Web | Free
GraphColl | Tool for building and exploring networks of linguistic collocations | Visualization | Windows, MacOSX | Free
Gsearch | Tool for syntactic pattern matching | ? | ? | Down
HeidelGram Web-Based Tools | Basic corpus analysis toolkit for the HeidelGram Corpus | Wordlists, Concordancer | Web | Free
HGSimpleCorpusNetwork | Batch frequency analysis on corrupted (e.g. OCR) corpus data and generation of network analysis data | Wordlists, Network Analysis | Multi (Python) | Free, Open Source
HTST Samuels | Historical Thesaurus Semantic Tagger via web interface | Semantic Tagger | Web | Free
ICARUS | Search and visualization tool for dependency trees | Visualization | | Free
IMS Corpus Workbench | Tool for sorting frequencies in corpora | Wordlists, Concordancer | Web and local version | Free
jTokenizer | Tool for tokenizing natural language | Tokenizer | | Free
JusText | Tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages | Boilerplate remover | Python | Free
LancsBox | Software package for the analysis of language data and corpora | Wordlists, Concordancer, Statistical analysis, Visualization | | Free
Linguistica | Word segmentation and morphological analysis? | Segmentation, Morphological tagger | Linux, MacOSX, Windows | Free
MALLET | Package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text | Statistical NLP | Windows | Free
VU Amsterdam Metaphor Identification Corpus | Corpus tool for metaphor identification | Metaphor identifier | Web and local version | Free
MLCT | Tool for building and processing corpora | Concordancer, Sentence Boundary Detector | | Free
MonoConc Esy | Concordancing and text search tool that allows primary and secondary concordancing | Concordancer, Sentence Boundary Detector | | Free for non-commercial research
MorphAdorner | Tool for performing morphological tagging of texts | Morphological Tagger | | Free
Natural Language Toolkit | Platform for building Python programs to work with human language data | Tokenizer, Tagger | Unix, MacOSX, Windows (+Python 3.4) | Free
NooJ | Tags texts and corpora (i.e. sets of text files) at the orthographical, lexical, morphological, syntactic and semantic levels | Multilevel Tagger | Windows, Mac OS X, Linux and BSD Unix | Free
NoSketch Engine | Word sketches, thesaurus, keyword computation, corpus creation | Corpus creation, Semantic analysis, Word lists | | Free
Onion | Tool for removing duplicate parts from large collections of texts | Duplicate remover | | Free
OpenConc | Tool for concordancing | Concordancer | | Free
PALinkA | Annotation tool | Annotation | | Down
ParaConc | Bilingual/multilingual concordancer | Concordancer | | Non-Free
Pareidoscope | Collection of tools for determining the association between arbitrary linguistic structures, such as between words (collocations), between words and structures (collostructions) or between structures | Collocation, Constructions | | Free
Pepper | Conversion between linguistic formats, e.g. between TEI, ANNIS, Tiger XML and EXMARaLDA | Conversion | | Free
PhraseContext | Tool for wordlists, concordancing, collocation, TTR | Wordlists, Concordancer | | 35€
ProtAnt | Tool for prototypical text analysis | Wordlists | Windows, MacOSX | Free
pysupersensetagger | Analyses texts for multiword expressions (MWEs) and supersenses | Text analysis | Unix, Mac OS X (Python) | Free
RSTTool | Tool that can annotate texts for constituency and rhetorical structure | Annotation | Windows, Macintosh, UNIX and Linux | Free
Salt | Meta models for linguistic data | Meta modelling | | Free
SarAnt | Tool for batch search and replace | Editing, Searching | Windows | Free
SegmentAnt | Tool for the segmentation of Japanese and Chinese | Segmentation, Tokenizing | Windows, MacOSX, Linux | Free
Simple Concordance Program | Tool for concordance and word listing that works with many languages | Concordancer | Windows, MacOSX | Free
SketchEngine | Word sketches, thesaurus, keyword computation, corpus creation | Corpus creation, Semantic analysis, Word lists | | 30-day trial or 4.85€/month
SpiderLing | Software for obtaining text from the web for building text corpora | Crawler | | Free
SPre | Tool for segmenting and annotating texts | Annotation | | Free
Stylo for R | Tool for computational stylistic analysis (authorship attribution, genre analysis) | Text analysis | | Free
Stanford Log-linear POS Tagger | POS tagger (with Penn Treebank tagset) for English, Arabic, Chinese, German | POS Tagger | | Free
Synpathy | Tool for manual syntactic annotation | Annotation | Windows, MacOSX, Linux | Free
TagAnt | Part-of-speech tagging tool built on TreeTagger | POS Tagger | Windows, MacOSX, Linux | Free
TASX-Annotator | Tool for multilevel annotation and transcription of (multi-channel) video and audio data | Multilevel Tagger, Transcription | Windows, MacOSX, Linux, Solaris | Down
TextSTAT | Tool for creation and manipulation of linguistic data from different languages | Corpus creation, Concordancer | Windows, GNU/Linux and MacOS | Free
TigerSearch | Tool for searching syntactically annotated and POS-tagged corpora | Search tool | | Free
Tree Editor TrEd 2.0 | Graphical editor and viewer for tree-like structures | Visualization | Windows, GNU/Linux and MacOS | Free
TreeTagger | Tool for annotating text with part-of-speech and lemma information | POS Tagger, Annotation | Windows, MacOSX, Linux | Free
TurboParser | Multilingual dependency parser based on linear programming | Parser | | Free
Tweet NLP | Tweet tokenizer, POS tagger, hierarchical word clusters, and a dependency parser for tweets, along with annotated corpora and web-based annotation tools | Clusters, POS Tagger, Tokenizer, Parser | | Free
UAM CorpusTool | Text annotation tool and statistics for various types of linguistic analysis | Annotation | | Free
UAM ImageTool | Image annotation tool for visual data corpora | Annotation | | Free
Unitok | Tool that splits texts into tokens | Tokenizer | | Free
VARD | Spelling variant detection and deletion in historical corpora (particularly Early Modern English) | Variant detector | | Free (with academic email)
VariAnt | Tool for the detection of spelling variants | Variant detector | Windows | Free
WConcord 3.0 | Full-featured concordancer | Concordancer | | Free
Wmatrix | Tool for corpus analysis and comparison | Wordlists, Concordancer, POS Tagger, Semantic Tagger | Web | £50 per username per year
Wordsmith | One of the most established corpus toolkits | Concordancer, Wordlists, Statistics | Windows | 60€ per licence
Wordstatix | Corpus analysis tool | Concordancer | | Free
Worldbuilder (should soon be available) | Tool for annotation and visualisation in analyses applying Text World Theory | Annotation, Visualization | ? | ?
Xaira | Indexing and analysis of XML resources | Indexing | Windows | Free, Open Source
Phonological CorpusTools (PCT) | Phonological analysis on transcribed corpora | Phonology | Multi (Python) | Free
BootCat | Tool for crawling and compiling data from the web with a list of seed words | Crawler, Compilation | |
gensim | Deep learning via word2vec | word2vec | Multi (Python) | Free, Open Source
PyXMLConc | Concordancer for XML files with automatic tag and attribute detection | Concordancer | Multi (Python), Windows | Free, Open Source
Textanz | Language analysis program that produces frequency lists, word lists and part-of-speech tags | Wordlists, Concordancer, POS Tagger, Dictionary | Any OS | Free, Open Source
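
"Wordlists" and "Concordancer" are the two most common categories in the list. As a rough illustration of what tools in those categories compute, here is a minimal sketch in plain Python (standard library only). It is not taken from any tool listed here: the function names `tokenize`, `word_list` and `kwic` are hypothetical, and the regex tokenizer is a simplifying assumption that real tools replace with language-aware tokenization.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive regex; an assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_list(tokens, n=10):
    """Return the n most frequent tokens with their counts (a basic word list)."""
    return Counter(tokens).most_common(n)


def kwic(tokens, keyword, window=3):
    """Return keyword-in-context hits as (left context, keyword, right context)."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


# Tiny made-up sample text, only for demonstration.
sample = "The cat sat on the mat. The dog chased the cat across the yard."
tokens = tokenize(sample)

print(word_list(tokens, 2))
for left, node, right in kwic(tokens, "cat", window=2):
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>20} [{node}] {right}")
```

The listed tools add much more on top of this (sorting of contexts, collocation statistics, tagged or XML input), so the sketch is only meant to make the category labels concrete.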