Tools for Corpus Linguistics

A comprehensive list of 111 tools used in corpus analysis.

Please feel free to contribute by suggesting new tools or by pointing out mistakes in the data.

Tool | Description | Categories | Platform | Pricing
@nnotate | Semi-automatic annotation of corpus data | Annotation | Solaris, Linux | Free (with licence agreement)
aConCorde | Multilingual concordance tool (English and Arabic) | Concordancer | Linux, MacOSX, Windows | Free
almaneser / SALTA | Semantic Parser/POS Tagger for English | Parser, POS Tagger | | Free (with licence agreement)
AMALGAM | Tool for grammatical annotation (POS and phrase structure). Tagging a text that was entered via email. | Annotation | Web | Free
ANNIS | Search and visualization tool for multi-layer linguistic corpora with diverse types of annotation | Search, Visualization | Web (or Linux, Mac, Windows) | Free
AntCLAWSGUI | Front-end interface for CLAWS tagger | POS Tagger | Windows | Free
AntConc | Corpus analysis toolkit | Wordlists, Concordancer | Linux, MacOSX, Windows | Free
AntFileConverter | Freeware tool to convert PDF and Word (DOCX) files into plain text | Converter | Windows, MacOSX | Free
AntMover | Tool for text structure (moves) analysis | Text analysis | Windows | Free
AntPConc | Corpus analysis toolkit for files encoded with UTF-8 | Wordlists, Concordancer | Windows, MacOSX | Free
AntWordProfiler | Tool for profiling vocabulary level and text complexity | Assessing text complexity | Linux, MacOSX, Windows | Free
Atomic | Multi-layer corpus annotation platform. | Annotation | Linux, MacOSX, Windows | Free
BNCWeb | BNCweb is a web-based client program for searching and retrieving lexical, grammatical and textual data from the British National Corpus (BNC). | Analysis, Concordancer | Web | Free
BootCat | Tool for crawling and compiling data from the web with a list of seed words. | Crawler, Compilation | |
Bow | Statistical Language Modeling, Text Retrieval, Classification and Clustering | Text analysis | UNIX, Linux | Free
Chared | Tool for detecting the character encoding of a text | Text analysis | Python 2.6 or later | Free
CLaRK | XML Based System For Corpora Development | Compilation | | Free (with licence agreement)
CLAWS POS-Tagger | CLAWS POS Tagger | POS Tagger | Web | Via licence or in-house tagging at Lancaster
CLiC | A corpus tool to support the analysis of literary texts. | Concordancer | Web | Free
Collocate | Tool for the extraction of concordances and collocations | Concordancer | Windows | 35 USD
Concordancer | Online tool for frequency counts and text clouds | Concordancer | Web | Free
CorpKit | An advanced modern corpus toolkit with an emphasis on visualization and annotated corpora. | Wordlists, Parsing, Concordancer, Visualization | Linux, MacOSX, Windows (Python) | Free
CorporaCoCo | A set of R functions used to compare co-occurrence between corpora | Collocations | R | Free
Corpus Presenter | Tree tagger and corpus analysis software | Wordlists, Parsing, Concordancer, Visualization | Windows | Free
Corpus-Tools | Text annotation and analysis tool | Text analysis | | Free
CorpusSearchLite | Searches parsed corpora in the Penn Treebank format | Searching | ? | ?
CPQWeb | Overview of and access to a wide range of corpora | Database | Web | Free (once registered)
Dexter | Tool for text annotation | Annotation | Linux, MacOSX, Windows | Free
DISCO | Corpus pre-processing tool for a variety of languages that allows users to retrieve the semantic similarity between arbitrary words and phrases | Tokenization, Annotation | Windows, Linux, Solaris, and MacOS | Free
ELAN | Transcription and annotation of sound or video files | Transcription, Annotation | Linux, MacOSX, Windows | Free
EncodeAnt | Tool for the detection and conversion of character encodings | Converter | Windows, MacOSX | Free
EXMARaLDA | Tool for transcription, annotation, corpus analysis of spoken data | Transcription, Annotation, Analysis | | Free
FireAnt | Social media analysis toolkit | Downloader, converter | Windows, MacOSX | Free
FrameNet | Dictionary of more than 10,000 word senses, tagged for semantic roles (according to Fillmorean Frame Semantics) | Semantic Parser | Web | Free
gensim | Deep learning via word2vec | word2vec | Multi (Python) | Free, Open Source
Google Ngrams | An ngram-viewer for the whole of Google Books | ngrams | Web | Free
GraphColl | Tool for building and exploring networks of linguistic collocations | Visualization | Windows, MacOSX | Free
Gsearch | Tool for syntactic pattern matching | pattern matching | ? | Down
HeidelGram Web-Based Tools | Basic corpus analysis toolkit for the HeidelGram Corpus | Wordlists, Concordancer | Web | Free
HGSimpleCorpusNetwork | Batch frequency analysis on corrupted (e.g. OCR) corpus data and generation of network analysis data. | Wordlists, Network Analysis | Multi (Python) | Free, Open Source
HTST Samuels | Historical Thesaurus Semantic Tagger via web-interface | Semantic Tagger | Web | Free
ICARUS | Search and visualization tool for dependency trees | Visualization | | Free
IMS Corpus Workbench | Tool for sorting frequencies in corpora | Wordlists, Concordancer | Web and local version | Free
jTokenizer | Tokenizing natural language | Tokenizer | | Free
JusText | Tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages | Boilerplate remover | Python | Free
KoGra-R | An R-based online tool that provides statistical measures for corpus-based frequencies | Statistics, Frequency Analysis | Web | Free
LancsBox | The Lancaster Desktop Corpus Toolbox; Software package for the analysis of language data and corpora | Collocations, Frequency Analysis | Java | Free (CC)
Linguistica | Word segmentation and morphological analysis? | Segmentation, Morphological tagger | Linux, MacOSX, Windows | Free
MALLET | Package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text | Statistical NLP | Windows | Free
MLCT | Tool for building and processing corpora | Concordancer, Sentence Boundary Detector | | Free
MonoConc Esy | Concordancing and text search tool that allows primary and secondary concordancing | Concordancer, Sentence Boundary Detector | | Free for non-commercial research
MorphAdorner | Tool for performing morphological tagging of texts | Morphological Tagger | | Free
Natural Language Toolkit | Platform for building Python programs to work with human language data | Tokenizer, Tagger | Unix, MacOSX, Windows (+Python 3.4) | Free
NooJ | Tags texts and corpora (i.e. sets of text files) at the Orthographical, Lexical, Morphological, Syntactic and Semantic levels | Multilevel Tagger | Windows, Mac OS X, LINUX and BSD Unix | Free
NoSketch Engine | Word sketches, thesaurus, keyword computation, corpus creation | Corpus creation, semantic analysis, word lists | | Free
Onion | Tool for removing duplicate parts from large collections of texts | Duplicate remover | | Free
Online Graded Text Editor | Tool for profiling a text's vocabulary level and complexity | text analysis, editing, vocabulary | OSX, Windows | Free
OpenConc | Tool for concordancing | Concordancer | | Free
PALinkA | Annotation tool | Annotation | | Down
ParaConc | A bilingual/multilingual concordancer | Concordancer | | Non-Free
Pareidoscope | Pareidoscope is a collection of tools for determining the association between arbitrary linguistic structures, such as collocations, collostructions or between structures. | Collocation, Constructions | | Free
Pepper | Conversion between linguistic formats, e.g. from TEI to ANNIS to Tiger XML to EXMARaLDA. | Conversion | | Free
Phonological CorpusTools (PCT) | Phonological analysis on transcribed corpora | Phonology | Multi (Python) | Free
PhraseContext | Tool for wordlists, concordancing, collocation, TTR | Wordlists, Concordancer | | 35€
PRAAT | A tool for doing phonetics by computer | phonetics, spoken | Windows, Mac, Linux | Open Source
ProtAnt | Tool for prototypical text analysis | Wordlists | Windows, MacOSX | Free
pysupersensetagger | Analyses texts for MWE and supersenses. | Text analysis | Unix, Mac OS X (Python) | Free
PyXMLConc | Concordancer for XML files with automatic tag and attribute detection. | Concordancer | Multi (Python), Windows | Free, Open Source
RSTTool | Tool that can annotate texts for constituency and rhetorical structure | Annotation | Windows, Macintosh, UNIX and LINUX | Free
Salt | Meta models for linguistic data. | Meta modelling | | Free
SarAnt | Tool for batch search and replacing | Editing, searching | Windows | Free
SegmentAnt | Tool for the segmentation of Japanese and Chinese | Segmentation, Tokenizing | Windows, MacOSX, Linux | Free
Shinyconc | ShinyConc is a framework for generating custom web-based concordancers and is written in R and R Shiny. | concordancer, kwic, R | Open Source / R | Free
Simple Concordance Program | Tool for concordance and word listing that works with many languages | Concordancer | Windows, MacOSX | Free
SketchEngine | Word sketches, thesaurus, keyword computation, corpus creation | Corpus creation, semantic analysis, word lists | | 30 day trial or 4,85€/month
SpiderLing | Software for obtaining text from the web useful for building text corpora | Crawler | | Free
SPre | Tool for segmenting and annotating texts | Annotation | | Free
Stanford Log-linear POS Tagger | POS Tagger (with Penn Treebank Tagset) for English, Arabic, Chinese, German | POS Tagger | | Free
Stanford Topic Modeling Toolbox | The Stanford Topic Modeling Toolbox (TMT) allows users to perform topic modeling on texts imported from spreadsheets. It supports both LDA and labelled LDA. | Topic Modeling | Java | Free
Stylo for R | Tool for computational stylistic analysis (authorship attribution, genre analysis) | Text analysis | | Free
Synpathy | Tool for manual syntactic annotation | Annotation | Windows, MacOSX, Linux | Free
TAALES | TAALES measures over 400 indices of lexical sophistication. | lexical sophistication | Mac, Linux, Windows | Open Source
TagAnt | Part-of-speech tagging tool built on Tree Tagger | POS Tagger | Windows, MacOSX, Linux | Free
TASX-Annotator | Tool for multilevel annotation and transcription of (multi-channel) video and audio data. | Multilevel Tagger, Transcription | Windows, MacOSX, Linux, Solaris | Down
Text Variation Explorer | The Text Variation Explorer (TVE) is a tool for exploring the effect of window size on various common linguistic measures. It visualizes these measures and allows for PCA/Cluster analysis. | visualization, variation analysis | Java | Free
Textanz | Language analysis program that produces frequency lists, word lists, parts of speech tags. | Wordlists, Concordancer, POS Tagger, Dictionary | Any OS | Free, Open Source
TextSTAT | Tool for creation and manipulation of linguistic data from different languages | Corpus creation, concordancer | Windows, GNU/Linux and MacOS | Free
Thesaurus.com | English language thesaurus with links to English dictionary and translation sites. | EFL, ESL, Linguistics | Not sure, I'm not a programmer or geek. | Free
TigerSearch | Tool for searching syntactically and POS-tagged corpora | Search tool | | Free
Tree Editor TrEd 2.0 | Graphical editor and viewer for tree-like structures. | Visualization | Windows, GNU/Linux and MacOS | Free
TreeTagger | Tool for annotating text with part-of-speech and lemma information | POS Tagger, Annotation | Windows, MacOSX, Linux | Free
TurboParser | Multilingual dependency parser with linear programming | Parser | | Free
Tweet NLP | Tweet tokenizer, POS Tagger, hierarchical word clusters, and a dependency parser for tweets, along with annotated corpora and web-based annotation tools. | Clusters: POS Tagger, Tokenizer, Parser | | Free
UAM CorpusTool | Text annotation tool and statistics for various types of linguistic analysis | Annotation | | Free
UAM ImageTool | Image annotation tool for visual data corpora | Annotation | | Free
Unitok | Tool that splits texts into tokens | Tokenizer | | Free
VARD | Spelling variant detection and deletion in historical corpora (particularly EModE) | Variant detector | | Free (with academic email)
VariAnt | Tool for the detection of spelling variants | Variant detector | Windows | Free
VU Amsterdam Metaphor Identification Corpus | Corpus tool for metaphor identification | Metaphor identifier | Web and local version | Free
WConcord 3.0 | A full featured concordancer | Concordancer | | Free
WebLicht | WebLicht is an execution environment for automatic annotation of text corpora embedded in the CLARIN-D project. | annotation | Web | Free (CLARIN-D Account needed)
Wmatrix | Tool for corpus analysis and comparison | Wordlists, Concordancer, POS Tagger, Semantic Tagger | Web | £50 per username per year
WordFish | Extract political positions from text documents. | political science | R | Free
Wordscores | A tool (approach) to extract dimensional information from political texts | political science | | Free
Wordsmith | One of the most established corpus toolkits | Concordancer, Wordlists, Statistics | Windows | 60€ per licence
Wordstatix | Corpus analysis tool | Concordancer | | Free
Worldbuilder | (should soon be available) Tool for annotation and visualisation in analysis applying text-world-theory | Annotation, Visualization | ? | ?
Xaira | Indexing and analysis of XML resources | Indexing | Windows | Free, Open Source
kdiff3 | KDiff3 is a diff and merge program. | Comparison | Windows, Linux, OSX | Free, Open Source
TAACO | TAACO is a tool that calculates 150 indices of textual/lexical cohesion. | cohesion, lexical sophistication | All | Free, Open Source
HeidelTime | A multilingual, domain-sensitive temporal tagger | temporal tagger, timex3 | Java | Free, Open Source
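
Several categories in the table (Wordlists, Concordancer, and the TTR measure mentioned under PhraseContext) name standard corpus operations. As a rough illustration only, and not a description of how any listed tool is implemented, the following standard-library Python sketch builds a frequency word list, a keyword-in-context (KWIC) concordance, and a type-token ratio. The function names, the tokenizing regular expression, and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical tokenizer pattern: runs of letters, optionally joined by
# an apostrophe or hyphen (e.g. "don't", "multi-layer").
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def word_list(tokens, top=None):
    """Return (word, count) pairs sorted by descending frequency."""
    return Counter(tokens).most_common(top)


def kwic(tokens, node, window=3):
    """Return (left, node, right) tuples with up to `window` tokens of
    context on each side of every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


if __name__ == "__main__":
    sample = "The cat sat on the mat. The dog sat on the log."
    toks = tokenize(sample)
    print(word_list(toks, top=3))
    for left, node, right in kwic(toks, "sat", window=2):
        print(f"{left:>12} [{node}] {right}")
    print(round(type_token_ratio(toks), 3))
```

Dedicated tools in the list go well beyond this sketch, for example in handling annotation layers, encodings, sorting of concordance lines, and statistical association measures.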