Tools for Corpus Linguistics

A comprehensive list of 188 tools used in corpus analysis.

Please feel free to contribute by suggesting new tools or by pointing out mistakes in the data.

Tags

annotation
concordancer
parser
pos tagger
search
visualization
wordlists
compilation
text analysis
converter
n-grams
p-frames
lexical bundles
lexical frames
text complexity
collocation
statistics
segmentation
coding
ddl
pedagogy
analysis
crawler
parallel
tagging
colligation
parsing
collocations
exploration
searching
database
dialogues
cleaning
annotations
tokenization
transcription
downloader
readability
semantic parser
word2vec
ngrams
pattern matching
temporal tagger
timex3
network analysis
semantic tagger
ICE
tokenizer
boilerplate remover
patterns
comparison
keywords
sociolinguistics
frequency analysis
lexis
lemmatizer
news
data
machine learning
morphological tagger
statistical nlp
MDA
sentence boundary ...
tagger
multilevel tagger
corpus creation
semantic analysis
duplicate remover
editing
vocabulary
constructions
regex
conversion
phonology
speech
prosody
spoken
phonetics
query
thesaurus
meta modelling
tokenizing
kwic
r
topic modeling
cohesion
lexical sophistication
word clouds
variation analysis
dictionary
text-processing
python
phraseology
xml
frequency
SPAADIA
efl
esl
linguistics
search tool
multi-layer
variant detector
reading
metaphor identific ...
metaphors
ebooks
political science
indexing
chinese
graphs
rhetorical analysis
textual criticism
witnesses
close reading
stylometry
management
twitter
web-based
coherence
lexical analysis
style
video
discourse
images
multilevel
qda
mixed methods
markup
anc
sampling
matching

Tools

| Tool | Description | Categories | Platform | Pricing |
|---|---|---|---|---|
| @nnotate | Semi-automatic annotation of corpus data | annotation | Solaris, Linux | Free (with licence agreement) |
| aConCorde | Multilingual concordance tool (English and Arabic) | concordancer | Linux, Mac, Windows | Free |
| almaneser / SALTA | Semantic parser/POS tagger for English | parser, pos tagger | | Free (with licence agreement) |
| AMALGAM | Tool for grammatical annotation (POS and phrase structure); tags texts submitted via email | annotation | Web | Free |
| ANNIS | Search and visualization tool for multi-layer linguistic corpora with diverse types of annotation | search, visualization | Web (or Linux, Mac, Windows) | Free |
| AntCLAWSGUI | Front-end interface for the CLAWS tagger | pos tagger | Windows | Free |
| AntConc | Corpus analysis toolkit | wordlists, concordancer | Linux, Mac, Windows | Free |
| AntCorGen | A freeware discipline-specific corpus creation tool | compilation, text analysis | Windows, Mac, Linux | Free |
| AntFileConverter | Freeware tool to convert PDF and Word (DOCX) files into plain text | converter | Windows, Mac | Free |
| AntFileSplitter | A freeware text file splitting tool | compilation | Windows, Mac, Linux | Free |
| AntGram | A freeware n-gram and p-frame (open-slot n-gram) generation tool | text analysis, n-grams, p-frames, lexical bundles, lexical frames | Windows, Mac, Linux | Free |
| AntMover | Tool for text structure (moves) analysis | text analysis | Windows | Free |
| AntPConc | Corpus analysis toolkit for files encoded with UTF-8 | wordlists, concordancer | Windows, Mac | Free |
| AntWordProfiler | Tool for profiling vocabulary level and text complexity | text complexity | Linux, Mac, Windows | Free |
| Atomic | Multi-layer corpus annotation platform | annotation | Linux, Mac, Windows | Free |
| BFSU Collocator | A collocation analysis toolkit | collocation, statistics | Windows | Free |
| BFSU English Sentence Segmenter | A simple sentence segmenter | segmentation | Windows | Free |
| BFSU Qualitative Coder | A tool for manual coding of corpora | coding, annotation | Windows | Free |
| BFSU Sentence Collector | A pedagogic concordancer | concordancer, ddl, pedagogy | Windows | Free |
| BFSU Stanford Parser | A simple parser | parser | Windows | Free |
| BNCWeb | BNCweb is a web-based client program for searching and retrieving lexical, grammatical and textual data from the British National Corpus (BNC) | analysis, concordancer | Web | Free |
| BootCat | Tool for crawling and compiling data from the web with a list of seed words | crawler, compilation | | |
| Bow | Statistical language modeling, text retrieval, classification and clustering | text analysis | UNIX, Linux | Free |
| BFSU ParaConc | A parallel concordancer | concordancer, parallel | Windows | Free |
| BFSU PowerConc | A fairly powerful concordancer | concordancer | Windows | Free |
| BFSU Stanford POS Tagger | A PoS tagger | pos tagger, tagging | Windows | Free |
| CasualConc | CasualConc is a concordance program that runs natively on Mac OS X 10.9 or later | concordancer | OSX | Free |
| Chared | Tool for detecting the character encoding of a text | text analysis | Python 2.6 or later | Free |
| Chi-Square and Log Likelihood Calculator | A simple tool for calculating chi-squared and log-likelihood (LL) | statistics | Windows | Free |
| CLaRK | XML-based system for corpora development | compilation | | Free (with licence agreement) |
| CLAWS POS-Tagger | CLAWS POS tagger | pos tagger | Web | Via licence or in-house tagging at Lancaster |
| CLiC | A corpus tool to support the analysis of literary texts | concordancer | Web | Free |
| Colligator 2.0 | A colligation query/analysis toolkit | colligation | Windows | Free |
| Collocate | Tool for the extraction of concordances and collocations | concordancer | Windows | 35 USD |
| Concordance Randomizer | A concordance randomizer | concordancer | Windows | Free |
| Concordancer | Online tool for frequency counts and text clouds | concordancer | Web | Free |
| CorpKit | An advanced modern corpus toolkit with an emphasis on visualization and annotated corpora | wordlists, parsing, concordancer, visualization | Linux, Mac, Windows (Python) | Free |
| CorporaCoCo | A set of R functions used to compare co-occurrence between corpora | collocations | R | Free |
| Corpus Presenter | Tree tagger and corpus analysis software | wordlists, parsing, concordancer, visualization | Windows | Free |
| Corpus-Tools | Text annotation and analysis tool | text analysis | | Free |
| CorpusExplorer | A complex corpus analysis toolkit combining 45 interactive tools | visualization, exploration, tagging, text analysis | Windows | Free, Open Source |
| CorpusSearchLite | Searches parsed corpora in the Penn Treebank format | searching | | |
| CPQWeb | Overview of and access to a wide range of corpora | database | Web | Free (once registered) |
| DART | An annotation tool and research environment for annotating dialogues | dialogues, annotation | Windows | Free |
| DeTagging Tool | A tool that strips annotation/tags from files | cleaning, annotations | Windows | Free |
| Dexter | Tool for text annotation | annotation | Linux, Mac, Windows | Free |
| DISCO | Corpus pre-processing tool for a variety of languages that allows retrieval of the semantic similarity between arbitrary words and phrases | tokenization, annotation | Windows, Linux, Solaris, and MacOS | Free |
| ELAN | Transcription and annotation of sound or video files | transcription, annotation | Linux, Mac, Windows | Free |
| EncodeAnt | Tool for the detection and conversion of character encodings | converter | Windows, Mac | Free |
| EXMARaLDA | Tool for transcription, annotation, and corpus analysis of spoken data | transcription, annotation, analysis | | Free |
| FireAnt | Social media analysis toolkit | downloader, converter | Windows, Mac | Free |
| Flesh PC | Calculates Flesch scores | readability, statistics | Windows | Free |
| FrameNet | Dictionary of more than 10,000 word senses, tagged for semantic roles (according to Fillmorean Frame Semantics) | semantic parser | Web | Free |
| gensim | Deep learning via word2vec | word2vec | Multi (Python) | Free, Open Source |
| Google Ngrams | An n-gram viewer for the whole of Google Books | ngrams | Web | Free |
| GraphColl | Tool for building and exploring networks of linguistic collocations | visualization | Windows, Mac | Free |
| Gsearch | Tool for syntactic pattern matching | pattern matching | ? | Down |
| HeidelGram Web-Based Tools | Basic corpus analysis toolkit for the HeidelGram Corpus | wordlists, concordancer | Web | Free |
| HeidelTime | A multilingual, domain-sensitive temporal tagger | temporal tagger, timex3 | Java | Free, Open Source |
| HGSimpleCorpusNetwork | Batch frequency analysis on corrupted (e.g. OCR) corpus data and generation of network analysis data | wordlists, network analysis | Multi (Python) | Free, Open Source |
| HTST Samuels | Historical Thesaurus Semantic Tagger via web interface | semantic tagger | Web | Free |
| ICARUS | Search and visualization tool for dependency trees | visualization | | Free |
| ICEweb | A tool for compiling, downloading, and analyzing web corpora in accordance with the ICE | ICE, compilation, crawler | Windows | Free |
| IMS Corpus Workbench | Tool for sorting frequencies in corpora | wordlists, concordancer | Web and local version | Free |
| jTokenizer | Tokenizing natural language | tokenizer | | Free |
| JusText | Tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages | boilerplate remover | Python | Free |
| Kaleidographic | A dynamic and interactive visualization tool for multivariate data | visualization | Web | Free |
| KAT Tool | Grouping patterns based on search terms | patterns, concordancer | Windows | Free |
| kdiff3 | KDiff3 is a diff and merge program | comparison | Windows, Linux, OSX | Free, Open Source |
| Keyword Plus | A keyword generation/analysis tool | keywords | Windows | Free |
| Khepri | A view-based tool for exploring (historical sociolinguistic) data | sociolinguistics, visualization | JavaScript, Web | Free, Open Source |
| KoGra-R | An R-based online tool that provides statistical measures for corpus-based frequencies | statistics, frequency analysis | Web | Free |
| LancsBox | The Lancaster Desktop Corpus Toolbox; software package for the analysis of language data and corpora | collocations, frequency analysis | Java | Free (CC) |
| LEXA | A complex lemmatizer | lexis, lemmatizer | | Free |
| LexisNexis | A database of current and archived news articles, along with other (business) data | news, data | Web | Commercial |
| LightSide | A machine learning workbench | machine learning | Linux, Windows | Free, Open Source |
| Linguistica | Word segmentation and morphological analysis | segmentation, morphological tagger | Linux, Mac, Windows | Free |
| MALLET | Package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text | statistical nlp | Windows | Free |
| MAT - Multidimensional Analysis Tagger | A tagger for MDA (Biber et al.) | tagging, MDA | Windows, Mac | Free |
| MLCT | Tool for building and processing corpora | concordancer, sentence boundary detector | | Free |
| MonoConc Esy | Concordancing and text search tool that allows primary and secondary concordancing | concordancer, sentence boundary detector | | Free for non-commercial research |
| MorphAdorner | Tool for performing morphological tagging of texts | morphological tagger | | Free |
| Natural Language Toolkit | Platform for building Python programs to work with human language data | tokenizer, tagger | Unix, Mac, Windows (+Python 3.4) | Free |
| NooJ | Tags texts and corpora (i.e. sets of text files) at the orthographical, lexical, morphological, syntactic and semantic levels | multilevel tagger | Windows, Mac, Linux and BSD Unix | Free |
| NoSketch Engine | Word sketches, thesaurus, keyword computation, corpus creation | corpus creation, semantic analysis, wordlists | | Free |
| Onion | Tool for removing duplicate parts from large collections of texts | duplicate remover | | Free |
| Online Graded Text Editor | Tool for profiling a text's vocabulary level and complexity | text analysis, editing, vocabulary | OSX, Windows | Free |
| OpenConc | Tool for concordancing | concordancer | | Free |
| PALinkA | Annotation tool | annotation | | Down |
| ParaConc | A bilingual/multilingual concordancer | concordancer | | Non-Free |
| Pareidoscope | Pareidoscope is a collection of tools for determining the association between arbitrary linguistic structures, such as collocations, collostructions, or associations between structures | collocation, constructions | | Free |
| PatCount | A pattern counting tool with powerful statistical capabilities and regex support | patterns | Windows | Free |
| Pattern Builder | A tool that helps with regular expressions and PoS tags | regex, tagging | Windows | Free |
| Pepper | Conversion between linguistic formats, e.g. from TEI to ANNIS to Tiger XML to EXMARaLDA | conversion | | Free |
| Phonological CorpusTools (PCT) | Phonological analysis on transcribed corpora | phonology | Multi (Python) | Free |
| PhraseContext | Tool for wordlists, concordancing, collocation, TTR | wordlists, concordancer | | 35€ |
| Praaline | Praaline is a system for metadata management, annotation, visualisation and analysis of spoken language corpora | speech, prosody, spoken, annotation, concordancer, search, visualization, converter, analysis | Windows, Mac, Linux | Free / Open Source (GPL3) |
| PRAAT | A tool for doing phonetics by computer | phonetics, spoken | Windows, Mac, Linux | Open Source |
| ProtAnt | Tool for prototypical text analysis | wordlists | Windows, Mac | Free |
| pysupersensetagger | Analyses texts for MWE and supersenses | text analysis | Unix, Mac (Python) | Free |
| PyXMLConc | Concordancer for XML files with automatic tag and attribute detection | concordancer | Multi (Python), Windows | Free, Open Source |
| Query Tool for the Edinburgh Associative Thesaurus | A query tool for the EAT | query, thesaurus | Windows | Free |
| Readability Analyzer | A tool for generating various readability statistics | readability, statistics | Windows | Free |
| RSTTool | Tool that can annotate texts for constituency and rhetorical structure | annotation | Windows, Macintosh, UNIX and LINUX | Free |
| Salt | Meta models for linguistic data | meta modelling | | Free |
| SarAnt | Tool for batch search and replace | editing, searching | Windows | Free |
| SegmentAnt | Tool for the segmentation of Japanese and Chinese | segmentation, tokenizing | Windows, Mac, Linux | Free |
| Shinyconc | ShinyConc is a framework for generating custom web-based concordancers and is written in R and R Shiny | concordancer, kwic, r | Open Source / R | Free |
| Simple Concordance Program | Tool for concordance and word listing that works with many languages | concordancer | Windows, Mac | Free |
| SketchEngine | Word sketches, thesaurus, keyword computation, corpus creation | corpus creation, semantic analysis, wordlists | | 30 day trial or 4,85€/month |
| SpiderLing | Software for obtaining text from the web, useful for building text corpora | crawler | | Free |
| SPre | Tool for segmenting and annotating texts | annotation | | Free |
| Stanford Log-linear POS Tagger | POS tagger (with Penn Treebank tagset) for English, Arabic, Chinese, German | pos tagger | | Free |
| Stanford Topic Modeling Toolbox | The Stanford Topic Modeling Toolbox (TMT) allows users to perform topic modeling on texts imported from spreadsheets. It supports both LDA and labelled LDA. | topic modeling | Java | Free |
| Stylo for R | Tool for computational stylistic analysis (authorship attribution, genre analysis) | text analysis | | Free |
| Sub-Corpus Creator | A tool for creating sub-corpora based on searches and metadata | compilation | Windows | Free |
| Synpathy | Tool for manual syntactic annotation | annotation | Windows, Mac, Linux | Free |
| TAACO | TAACO is a tool that calculates 150 indices of textual/lexical cohesion | cohesion, lexical sophistication | All | Free, Open Source |
| TAALES | TAALES measures over 400 indices of lexical sophistication | lexical sophistication | Mac, Linux, Windows | Open Source |
| TagAnt | Part-of-speech tagging tool built on TreeTagger | pos tagger | Windows, Mac, Linux | Free |
| Tagxedo | A tool for generating word clouds | word clouds, visualization | Web | Free |
| TASX-Annotator | Tool for multilevel annotation and transcription of (multi-channel) video and audio data | multilevel tagger, transcription | Windows, Mac, Linux, Solaris | Down |
| Text Analysis Computing Tools (TACT) | A simple, fairly old concordancer | concordancer | | Commercial |
| Text Variation Explorer | The Text Variation Explorer (TVE) is a tool for exploring the effect of window size on various common linguistic measures. It visualizes these measures and allows for PCA/cluster analysis. | visualization, variation analysis | Java | Free |
| Text Visualization Browser | A survey/gallery of text visualizations | visualization | Web | Free |
| Textanz | Language analysis program that produces frequency lists, word lists, and part-of-speech tags | wordlists, concordancer, pos tagger, dictionary | Any OS | Free, Open Source |
| TextArc | A tool for visualizing the structure of texts | visualization | | |
| TextDirectory | TextDirectory is a tool for aggregating text files based on various filters and transformation functions | compilation, text-processing, python | Windows, Linux, OSX | Free, Open Source |
| Textplot | A tool for mapping a document into a network of terms in order to visualize the topic structure | visualization, network analysis | Python | Free, Open Source |
| TextSmith Tools | A tool for genre-informed phraseological profiles | phraseology, segmentation | Windows | Free |
| TextSTAT | Tool for creation and manipulation of linguistic data from different languages | corpus creation, concordancer | Windows, GNU/Linux and MacOS | Free |
| The (Phonetic) Transcription Editor | An editor for creating phonetic transcriptions | transcription | Windows | Free |
| The Simple Corpus Tool | A corpus analysis toolkit that supports XML annotations | concordancer, annotation, xml, frequency | Windows | Free |
| The Simple PoS Tagger | A simple PoS tagger utilizing the Perl module Lingua::EN::Tagger | pos tagger, tagger | Windows | Free |
| The SPAADIA concordancer | A concordancer for the SPAADIA corpus | concordancer, SPAADIA | Windows | Free |
| The Text Feature Analyser | A tool for investigating textual features and various measures | text analysis, concordancer | Windows | Free |
| Thesaurus.com | English language thesaurus with links to English dictionary and translation sites | efl, esl, linguistics | Web | Free |
| TigerSearch | Tool for searching syntactically and POS-tagged corpora | search tool | | Free |
| TnT - Thorsten Brants's PoS Tagger | A simple PoS tagger | pos tagger, tagger | Windows/Unix | Available via Stanford |
| Tree Editor TrEd 2.0 | Graphical editor and viewer for tree-like structures | visualization | Windows, GNU/Linux and MacOS | Free |
| TreeTagger | Tool for annotating text with part-of-speech and lemma information | pos tagger, annotation | Windows, Mac, Linux | Free |
| TurboParser | Multilingual dependency parser with linear programming | parser | | Free |
| Tweet NLP | Tweet tokenizer, POS tagger, hierarchical word clusters, and a dependency parser for tweets, along with annotated corpora and web-based annotation tools. Clusters: http://www.cs.cmu.edu/~ark/TweetNLP/cluster_viewer.html | pos tagger, tokenizer, parser | | Free |
| TXM | XML & TEI compatible text analysis software based on TreeTagger, the CQP search engine and the R statistical environment | text analysis, concordancer, r, statistics, search tool, tokenizer, xml | Windows, Mac, Linux, Tomcat | Free |
| UAM CorpusTool | Text annotation tool and statistics for various types of linguistic analysis and multilayer annotation | annotation, multi-layer | | Free |
| UAM ImageTool | Image annotation tool for visual data corpora | annotation | | Free |
| Unitok | Tool that splits texts into tokens | tokenizer | | Free |
| VARD | Spelling variant detection and deletion in historical corpora (particularly EModE) | variant detector | | Free (with academic email) |
| VariAnt | Tool for the detection of spelling variants | variant detector | Windows | Free |
| Voyant | A web-based reading/analysis toolkit for digital texts | reading, text analysis | Web | Free |
| VU Amsterdam Metaphor Identification Corpus | Corpus tool for metaphor identification | metaphor identification, metaphors | Web and local version | Free |
| WConcord 3.0 | A full-featured concordancer | concordancer | | Free |
| WebLicht | WebLicht is an execution environment for automatic annotation of text corpora, embedded in the CLARIN-D project | annotation | Web | Free (CLARIN-D account needed) |
| Wmatrix | Tool for corpus analysis and comparison | wordlists, concordancer, pos tagger, semantic tagger | Web | £50 per username per year |
| WordCruncher | A tool for analyzing ebooks | concordancer, frequency, ebooks | Windows, Mac, iOS | Free |
| WordFish | Extracts political positions from text documents | political science | R | Free |
| Wordscores | A tool (approach) to extract dimensional information from political texts | political science | | Free |
| Wordsmith | One of the most established corpus toolkits, providing a variety of functionality | concordancer, wordlists, statistics | Windows | 60€ per licence |
| Wordstatix | Corpus analysis tool | concordancer | | Free |
| Worldbuilder | Tool for annotation and visualisation in analysis applying text-world-theory | annotation, visualization | | |
| Wordle | A tool for generating word clouds | word clouds, visualization | Web | Free |
| Xaira | Indexing and analysis of XML resources | indexing | Windows | Free, Open Source |
| YACSI Chinese Tokeniser / PoS Tagger | A Chinese tokenizer and PoS tagger | chinese, tokenizer, pos tagger | Windows | Free |
| Gephi | A toolkit for network analysis | network analysis, graphs | Windows, Linux, Mac | Free |
| DocuScope | A tool for computer-aided rhetorical analysis | rhetorical analysis, text analysis, visualization | Windows (Java) | Free |
| juxta | Comparing and collating multiple witnesses to single textual works | textual criticism, witnesses | Windows, Unix, Linux, Mac | Free |
| WordHoard | Close reading and scholarly analysis of deeply tagged texts | close reading | Windows, Unix, Linux, Mac | Free |
| Intelligent Archive | Managing corpora for stylometry | stylometry, management | Windows, Unix, Linux, Mac | Free |
| Twarc | A command line tool (and Python library) for archiving Twitter JSON | twitter | Python, Windows, Linux, Mac | Free, Open Source |
| WebAnno | A web-based annotation tool | annotation, web-based | Web | Free |
| Coh-Metrix | A web-based system to compute cohesion and coherence metrics | cohesion, coherence | Web | Free |
| LIWC | A tool that tries to compute scores for different emotions, thinking styles, and social concerns | lexical analysis, style | Web | Free (but commercial) |
| ANVIL | A tool for video annotation | video, annotation | Windows, Linux, Mac | Free |
| LDA-Toolkit | A toolkit for linguistic discourse and image analysis | discourse, images | Windows | Free |
| FLAIR (2.0) | An online tool for language teachers and learners that analyzes grammatical constructions and readability on the fly | constructions, readability | Web | Free |
| DisMo | An automatic multi-level annotator for spoken language corpora | spoken, multilevel, multi-layer, pos tagger, annotation | | |
| TagCrowd | A simple tool for generating tag/word clouds online | word clouds, visualization | Web | Free |
| MMAX2 | A multi-level annotation tool | annotation, multilevel, multi-layer | Java | Free, Open Source |
| KorAP | A complex platform for corpus analysis developed at the IDS in Mannheim | analysis, multilevel, multi-layer | Web | Free, Open Source |
| kfNgram | A simple tool for generating n-grams | n-grams | Windows | Free |
| MAXQDA | Sophisticated QDA software that works with multimodal data and supports mixed methods approaches | qda, mixed methods | Windows, Mac, Android, iOS | Commercial |
| ATLAS.ti | Sophisticated QDA software for mixed methods approaches | qda, mixed methods | Windows, Mac, Android, iOS | Commercial |
| Pipoca (formerly openQDA) | Web-based QDA software | qda, mixed methods | Web | Free, Open Source |
| f4analyse | QDA software specifically geared towards interview (spoken) data | qda, spoken | Windows, Mac, Linux | Commercial |
| f4transkript | Software for transcribing audio data | transcription, spoken | Windows, Mac, Linux | Commercial |
| CATMA (Computer Assisted Text Markup and Analysis) | A complex annotation and analysis package | markup, analysis, visualization | Web | Commercial |
| ANC2go | A web service that allows users to create custom sub-corpora of the ANC | anc, sampling | Web | Free |
| CoMOn | A tool for corpus matching analysis | matching | Web | Free |