A hopefully comprehensive list of currently 283 tools used in corpus compilation and analysis.
This list is kept up to date by its users. Hence, please feel free to contribute by suggesting new tools.
You can also make suggestions, e.g., corrections, regarding individual tools by clicking the ✎ symbol. As this is a non-commercial side (side, side) project, checking and incorporating updates usually takes some time.
There is also a comprehensive list of all tags in the database.
Tool | Description | Tags | Platforms | Pricing |
---|---|---|---|---|
@nnotate ✎ | Semi-automatic annotation of corpus data | annotation | Solaris, Linux | Free (with licence agreement) |
aConCorde ✎ | Multilingual concordance tool (English and Arabic) | concordancer | Linux, Mac, Windows | Free |
ACTRES Corpus Browser ✎ | A tool for retrieving tagged information in more than one language. | tagging | Web | Commercial |
ACTRES Corpus Manager ✎ | A corpus compilation and analysis platform with a focus on multilingual and parallel corpora. | compilation, corpus management, annotation, multilingual | Web | Commercial |
ACTRES Rhetorical Move Tagger ✎ | A tool for tagging rhetorical moves. | tagging, rhetorics | Web | Commercial |
almaneser / SALTA ✎ | Semantic Parser and PoS Tagger for English | parser, pos tagger, tagging | | Free (with licence agreement) |
AMALGAM ✎ | Tool for grammatical annotation (PoS and phrase structure) that tags texts submitted via email. | annotation | Web | Free |
AMesure ✎ | A web-based system to analyse the reading complexity of French texts | text complexity, readability | Web | Free |
ANC2go ✎ | A web service that allows users to create custom sub-corpora of the ANC | ANC, sampling | Web | Free |
ANNIS ✎ | Search and visualization tool for multi-layer linguistic corpora with diverse types of annotation | search, visualization | Web (or Linux, Mac, Windows) | Free |
AntCLAWSGUI ✎ | Front-end interface for CLAWS tagger | pos tagger, tagging | Windows | Free |
AntConc ✎ | Corpus analysis toolkit | wordlists, concordancer, keywords | Linux, Mac, Windows | Free |
AntCorGen ✎ | A freeware discipline-specific corpus creation tool. | compilation, text analysis | Windows, Mac, Linux | Free |
AntFileConverter ✎ | Freeware tool to convert PDF and Word (DOCX) files into plain text | converter | Windows, Mac | Free |
AntFileSplitter ✎ | A freeware text file splitting tool. | compilation | Windows, Mac, Linux | Free |
AntGram ✎ | A freeware n-gram and p-frame (open-slot n-gram) generation tool. | text analysis, n-grams, p-frames, lexical bundles, lexical frames | Windows, Mac, Linux | Free |
ANTLR ✎ | ANother Tool for Language Recognition is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. | parser generator | Linux, Mac, Windows | Free, Open Source |
AntMover ✎ | Tool for text structure (moves) analysis | text analysis | Windows | Free |
AntPConc ✎ | Corpus analysis toolkit designed for working with parallel corpora. | wordlists, concordancer | Windows, Mac | Free |
AntWordProfiler ✎ | Tool for profiling vocabulary level and text complexity | text complexity | Linux, Mac, Windows | Free |
ANVIL ✎ | A tool for video annotation. | video, annotation | Windows, Linux, Mac | Free |
ATLAS.ti ✎ | Sophisticated QDA software for mixed methods approaches | qda, mixed methods | Windows, Mac, Android, iOS | Commercial |
Atomic ✎ | Multi-layer corpus annotation platform. | annotation | Linux, Mac, Windows | Free |
Authorial Voice Analyzer (AVA) ✎ | A tool for the analysis of interactional metadiscourse features. | discourse, voice | Mac | Free |
BFSU Collocator ✎ | A collocation analysis toolkit | collocation, statistics | Windows | Free |
BFSU ConcGram Lite ✎ | A tool for retrieving bigrams with directional variations. | bigrams, concgrams | Windows | Free |
BFSU English Sentence Segmenter ✎ | A simple sentence segmenter | segmentation | Windows | Free |
BFSU ParaConc ✎ | A parallel concordancer | concordancer, parallel | Windows | Free |
BFSU PowerConc ✎ | A fairly powerful concordancer | concordancer | Windows | Free |
BFSU Qualitative Coder ✎ | A tool for manual coding of corpora | coding, annotation | Windows | Free |
BFSU Sentence Collector ✎ | A pedagogic concordancer | concordancer, ddl, pedagogy, language learning | Windows | Free |
BFSU Stanford Parser ✎ | A simple parser | parser | Windows | Free |
BFSU Stanford PoS Tagger (Light) ✎ | A GUI for the Stanford PoS tagger | pos tagger, tagging | Windows | Free |
BNCWeb ✎ | BNCweb is a web-based client program for searching and retrieving lexical, grammatical and textual data from the British National Corpus (BNC). | analysis, concordancer | Web | Free |
BootCat ✎ | Tool for crawling and compiling data from the web with a list of seed words. | crawler, compilation | ||
Bow ✎ | Statistical Language Modeling, Text Retrieval, Classification and Clustering | text analysis | UNIX, Linux | Free |
buzz ✎ | A python-based linguistic analysis tool. | parsing, concordancer, visualization | Python | Free, Open Source |
Calc: Corpus Calculator ✎ | A web-based tool to calculate basic corpus statistics, for example, comparing frequencies across corpora. | statistics | Web | Free |
CasualConc ✎ | CasualConc is a concordance program that runs natively on macOS. | concordancer | OSX | Free |
CATMA (Computer Assisted Text Markup and Analysis) ✎ | An undogmatic, complex annotation and analysis package. | markup, analysis, visualization, annotation | Web | Free |
CEFRLex ✎ | A web-based tool to analyse the lexical complexity of words in texts according to the CEFR scale in various languages. | text complexity, readability, language learning | Web | Free |
Chared ✎ | Tool for detecting the character encoding of a text | text analysis | Python 2.6 or later | Free |
Chi-Square and Log Likelihood Calculator ✎ | A simple tool for calculating Chi-squared and LL | statistics | Windows | Free |
CLAN ✎ | A tool for searching and analyzing child language data in the CHAT transcription format. | search, wordlists, collocation, child language, CHILDES | Windows, Mac, Unix | Free, Open Source |
CLaRK ✎ | XML Based System For Corpora Development | compilation | | Free (with licence agreement) |
CLAWS PoS-Tagger ✎ | The CLAWS part-of-speech tagger. | pos tagger, tagging | Web | Via licence or in-house tagging at Lancaster |
CLiC ✎ | A corpus tool to support the analysis of literary texts. | concordancer | Web | Free |
COCA_MWU20 ColloGram ✎ | A collocation analysis tool based on a COCA collocation family list. | collocation | Windows | Free |
Coh-Metrix ✎ | Coh-Metrix is a system for computing computational cohesion and coherence metrics for written and spoken texts. It allows readers, writers, educators, and researchers to instantly gauge the difficulty of written text for the target audience. | cohesion, coherence, readability, textual analysis | Web | Free |
Colligator 2.0 ✎ | A colligation query/analysis toolkit | colligation | Windows | Free |
Collocate ✎ | Tool for the extraction of concordances and collocations | concordancer | Windows | 35 USD |
CoMOn ✎ | A tool for corpus matching analysis | matching | Web | Free |
Compleat Lexical Tutor ✎ | A website featuring various tools and materials for data-driven language learning. | vocabulary, language learning, lexis, web-based, ddl | Web | Free |
ConcGramCore ✎ | A modern rewrite of ConcGram (Greaves 2005) that allows efficiently searching for concgrams. | collocation, concgram | Windows | Open Source |
Concordance Randomizer ✎ | A concordance randomizer | concordancer | Windows | Free |
Concordancer ✎ | Online tool for frequency counts and text clouds | concordancer | Web | Free |
ConvoKit ✎ | A toolkit for extracting conversational features and analyzing social phenomena in conversations, using an interface inspired by (and compatible with) scikit-learn. | python, conversational analysis, social media | Python | Free, Open Source |
Coquery ✎ | A free corpus query tool to search, analyze, and visualize corpora | query, visualization | Linux, Mac, Windows | Free |
CorefAnnotator ✎ | An annotation tool for coreference. | coreference, annotation | Windows, Linux, Mac | Open Source |
CorpKit ✎ | An advanced modern corpus toolkit with an emphasis on visualization and annotated corpora. | wordlists, parsing, concordancer, visualization | Linux, Mac, Windows (Python) | Free |
Corpona ✎ | A Python library for processing XML- and JSON-based corpora. | library, XML, JSON, annotation | Python | Open Source |
CorporaCoCo ✎ | A set of R functions used to compare co-occurrence between corpora | collocation | R | Free |
Corpus Presenter ✎ | Tree tagger and corpus analysis software | wordlists, parsing, concordancer, visualization | Windows | Free |
Corpus Text Processor ✎ | Corpus Text Processor is a downloadable application that provides batched operations for common corpus processing tasks such as encoding or standardization. | compilation, corpus management, text processing | Windows, Mac | Free, Open Source |
CorpusExplorer ✎ | A complex corpus analysis toolkit combining 45 interactive tools. | visualization, exploration, tagging, text analysis | Windows | Free, Open Source |
CorpusSearch ✎ | Searches parsed corpora in the Penn Treebank format | searching, penn treebank | ||
Corpustools ✎ | An R package for managing, querying, and analyzing texts. | text analysis, R | R | Free, Open Source |
Cortext Manager ✎ | A scriptable "ecosystem" for modeling and exploring corpora. Especially useful for creating topic models and co-occurrence networks. | NER, topic models, visualization, word2vec, collocation, keywords | Web | Free |
CQPweb ✎ | Overview of and access to a wide range of corpora | database | Web | Free (once registered) |
DART ✎ | An annotation tool and research environment for annotating dialogues. | dialogues, annotation | Windows | Free |
DepCluster ✎ | A tool used for lexeme-based collexeme analysis. | lexis, collexeme, CxG, LBCA | ||
DeTagging Tool ✎ | A tool that strips annotation/tags from files. | cleaning, annotations | Windows | Free |
Dexter ✎ | Tool for text annotation | annotation | Linux, Mac, Windows | Free |
DISCO ✎ | Corpus pre-processing tool for a variety of languages that allows users to retrieve the semantic similarity between arbitrary words and phrases | tokenization, annotation | Windows, Linux, Solaris, and MacOS | Free |
DisMo ✎ | An automatic multi-level annotator for spoken language corpora. | spoken, multilevel, multi-layer, pos tagger, annotation, tagging | ||
DocuScope ✎ | A tool for computer-aided rhetorical analysis | rhetorical analysis, text analysis, visualization | Windows (Java) | Free |
ELAN ✎ | Transcription and annotation of sound or video files | transcription, annotation | Linux, Mac, Windows | Free |
Emdros ✎ | A database engine for analyzed and annotated text. | database, annotation, query | Windows, Linux, Mac | Free, Open Source |
EncodeAnt ✎ | Tool for the detection and conversion of character encodings | converter | Windows, Mac | Free |
English Grammar Profiler ✎ | A CEFR grammar profiler for ESL/EFL. | grammar, parsing, CEFR, esl, efl | Web | Free |
EXMARaLDA ✎ | Tool for transcription, annotation, corpus analysis of spoken data | transcription, annotation, analysis | | Free |
f4analyse ✎ | QDA software specifically geared towards interview (spoken) data | qda, spoken | Windows, Mac, Linux | Commercial |
f4transkript ✎ | Software for transcribing audio data | transcription, spoken | Windows, Mac, Linux | Commercial |
FinMeter ✎ | A tool for analyzing Finnish poetry in terms of meter, rhyme, semantics, metaphors etc. | lexical analysis, rhetorical analysis, poem analysis, metaphor interpretation, metaphor identification, semantics, metaphors, finnish | Linux, Mac, Windows | Free |
FireAnt ✎ | Social media analysis toolkit | downloader, converter | Windows, Mac | Free |
FLAIR (2.0) ✎ | An online tool for language teachers and learners that analyzes grammatical constructions and readability on the fly. | constructions, readability | Web | Free |
Flesh PC ✎ | Calculates Flesch readability scores | readability, statistics | Windows | Free |
FrameNet ✎ | Dictionary of more than 10,000 word senses, tagged for semantic roles (according to Fillmorean Frame Semantics) | semantic parser | Web | Free |
Frequency Program (Paul Nation) ✎ | A tool that turns a text or texts into a word list with frequency figures. | vocabulary, frequency, lexis | Windows | Free |
gensim ✎ | Deep learning via word2vec | word2vec | Multi (Python) | Free, Open Source |
Gephi ✎ | A toolkit for network analysis | network analysis, graphs | Windows, Linux, Mac | Free |
GOLD Parsing System ✎ | A parsing system that can be used to develop programming languages, scripting languages and interpreters. | parser generator | Linux, Mac, Windows | Free |
Google Ngrams ✎ | An ngram-viewer for the whole of Google Books | ngrams | Web | Free |
GraphColl ✎ | Tool for building and exploring networks of linguistic collocations | visualization | Windows, Mac | Free |
Gsearch ✎ | Tool for syntactic pattern matching | pattern matching | ? | Down |
gwic ✎ | A very basic KWIC tool written in Go. | concordancer, KWIC | Windows, Mac, Linux | Open Source |
HeidelGram Web-Based Tools ✎ | Basic corpus analysis toolkit for the HeidelGram Corpus | wordlists, concordancer | Web | Free |
HeidelTime ✎ | A multilingual, domain-sensitive temporal tagger | temporal tagger, timex3 | Java | Free, Open Source |
Heimdall ✎ | A tool that searches a text for sequences written in other languages. | language detection | Linux, Windows, Mac | Open Source |
HGSimpleCorpusNetwork ✎ | Batch frequency analysis on corrupted (e.g. OCR) corpus data and generation of network analysis data. | wordlists, network analysis | Multi (Python) | Free, Open Source |
HTST Samuels ✎ | Historical Thesaurus Semantic Tagger via web-interface | semantic tagger | Web | Free |
ICARUS ✎ | Search and visualization tool for dependency trees | visualization | | Free |
ICECUP ✎ | The ICE Corpus Utility Program (ICECUP) is a corpus exploration tool for parsed corpora such as ICE-GB and DCPSE. | ICE, exploration | | Free |
ICEweb ✎ | A tool for compiling, downloading, and analyzing web corpora in accordance with the ICE | ICE, compilation, crawler | Windows | Free |
IMS Corpus Workbench ✎ | Tool for sorting frequencies in corpora | wordlists, concordancer | Web and local version | Free |
INCEpTION ✎ | A semantic annotation platform that offers intelligent annotation assistance and knowledge management | annotation, multi-layer annotation, computer-assisted annotation, web-based | Web | Free, Open Source |
Intelligent Archive ✎ | Managing corpora for stylometry | stylometry, management | Windows, Unix, Linux, Mac | Free |
JavaCC ✎ | A popular parser generator for use with Java applications. | parser generator | Linux, Mac, Windows | Free |
jTokenizer ✎ | Tokenizing natural language | tokenizer | | Free |
JusText ✎ | Tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages | boilerplate remover | Python | Free |
juxta ✎ | Comparing and collating multiple witnesses to single textual works | textual criticism, witnesses | Windows, Unix, Linux, Mac | Free |
Kaleidographic ✎ | A dynamic and interactive visualization tool for multivariate data. | visualization | Web | Free |
KAT Tool ✎ | Grouping patterns based on search terms | patterns, concordancer | Windows | Free |
kdiff3 ✎ | KDiff3 is a diff and merge program. | comparison | Windows, Linux, OSX | Free, Open Source |
Keyword Plus ✎ | A keyword generation/analysis tool | keywords | Windows | Free |
kfNgram ✎ | A simple tool for generating n-grams | n-grams, p-frames | Windows | Free |
KHCoder ✎ | A free software for quantitative content analysis or text mining that supports multiple languages. | correspondence, collocation analysis, frequency analysis | Windows, Mac, Linux | Free, Open Source |
Khepri ✎ | A view-based tool for exploring (historical sociolinguistic) data | sociolinguistics, visualization | JavaScript, Web | Free, Open Source |
KoGra-R ✎ | An R-based online tool that provides statistical measures for corpus-based frequencies | statistics, frequency analysis | Web | Free |
KorAP ✎ | A complex platform for corpus analysis developed at the IDS in Mannheim | analysis, multilevel, multi-layer | Web | Free, Open Source |
KWords ✎ | A tool for keyword identification and analysis. | keywords, CADS, concordancer, collocation analysis | Windows, Linux, Mac | Free |
LancsBox ✎ | The Lancaster Desktop Corpus Toolbox; Software package for the analysis of language data and corpora | collocation, frequency analysis, keywords | Java | Free (CC) |
langid.py ✎ | A standalone language identification tool written in Python. | language detection | Linux, Windows, Mac | Open Source |
LDA-Toolkit ✎ | A toolkit for linguistic discourse and image analysis. | discourse, images | Windows | Free |
Leipzig Corpus Miner ✎ | A modern text mining infrastructure for qualitative data analysis | qda, mixed methods, text mining, lexicometrics, topic models, information retrieval | Linux, Windows, Mac (via VM) | Free |
LEXA ✎ | A complex lemmatizer. | lexis, lemmatizer | | Free |
LexisNexis ✎ | A database containing (new and old) news articles. They also have other (business) data. | news, data | Web | Commercial |
Lexonomy ✎ | A tool for writing and publishing dictionaries and other dictionary-like things. | dictionary, publishing dictionary, annotation | Web | Free |
lexpan ✎ | A tool to analyze syntagmatic structures in corpora. Especially useful to analyze fillers and slots. | syntagmatic, slots | Windows, Linux, Mac | Free |
Lextutor Web Concordancers ✎ | Web concordancers targeted towards DDL | collocations, concordancer, DDL | Web | Free |
LightSide ✎ | A machine learning workbench. | machine learning | Linux, Windows | Free, Open Source |
LightTag ✎ | A commercial text annotation tool focused on managing and working with teams of annotators. | annotation, tagging, ai-tagging | Web | Commercial |
Linguistica ✎ | Word segmentation and morphological analysis? | segmentation, morphological tagger | Linux, Mac, Windows | Free |
Link Grammar Parser ✎ | A syntactic parser of English, Russian, Arabic and Persian (and others), based on Link Grammar. | parser, syntax, grammar | Linux, Mac, Windows | Free |
LIWC ✎ | A tool that tries to compute scores for different emotions, thinking styles, and social concerns. | lexical analysis, style | Web | Free (but Commercial) |
Log-Likelihood and Effect-Size Calculator ✎ | An online calculator for log-likelihood and effect sizes. | statistics | Web | Free |
MALLET ✎ | Package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text | statistical nlp | Windows | Free |
MaltOptimizer ✎ | A system for parser optimization using the open-source system MaltParser. | parser, dependency parsing | Windows, Mac, Linux | Free |
MaltParser ✎ | A system for data-driven dependency parsing, which can be used to induce a parsing model from treebank data and to parse new data using an induced model. | parser, dependency parsing | Windows, Mac, Linux | Free |
MAT - Multidimensional Analysis Tagger ✎ | A tagger for MDA (Biber et al.) by Andrea Nini. | tagging, MDA | Windows, Mac | Free |
MAXQDA ✎ | Sophisticated QDA software that works with multimodal data and supports mixed methods approaches | qda, mixed methods | Windows, Mac, Android, iOS | Commercial |
MLCT ✎ | Tool for building and processing corpora | concordancer, sentence boundary detector | | Free |
MMAX2 ✎ | A multi-level annotation tool | annotation, multilevel, multi-layer | Java | Free, Open Source |
MonoConc Esy ✎ | Concordancing and text search tool that allows primary and secondary concordancing | concordancer, sentence boundary detector | | Free for non-Commercial research |
MorphAdorner ✎ | Tool for performing morphological tagging of texts | morphological tagger | | Free |
Murre ✎ | A tool for normalising and generating dialectal Finnish and Swedish | python, variation, dialectal data, finnish | Linux, Mac, Windows | Free |
N-Gram Processor (NGP) ✎ | A perl based tool for the creation and processing of n-gram lists out of text files. | n-grams | Linux, Windows, Mac | Open Source |
NATAS ✎ | A spacy-based library for processing historical corpora (with a focus on neologisms). | historical, python, lexis | Linux, Windows, Mac | Open Source |
Natural Language Toolkit ✎ | Platform for building Python programs to work with human language data | tokenizer, tagger | Unix, Mac, Windows (+Python 3.4) | Free |
NooJ ✎ | Tags texts and corpora (i.e. sets of text files) at the Orthographical, Lexical, Morphological, Syntactic and Semantic levels | multilevel tagger | Windows, Mac, LINUX and BSD Unix | Free |
NoSketch Engine ✎ | Word sketches, thesaurus, keyword computation, corpus creation | corpus creation, semantic analysis, wordlists | | Free |
NVIVO ✎ | A commercial Computer-Assisted Qualitative Data Analysis Software (CAQDAS) package that works with both qualitative and mixed methods data | qda, mixed methods | Windows, Mac | Commercial |
OneClick Terms ✎ | An online term extractor with monolingual and bilingual term extraction capabilities. | keywords, term extraction, bilingual term extraction | Web | Free (limited version), 4.83€ / month |
Onion ✎ | Tool for removing duplicate parts from large collections of texts | duplicate remover | | Free |
Online Graded Text Editor ✎ | Tool for profiling a text's vocabulary level and complexity | text analysis, editing, vocabulary | OSX, Windows | Free |
OpenConc ✎ | Tool for concordancing | concordancer | | Free |
PACTE ✎ | A flexible collaborative text annotation platform that is currently in development. | annotation | Web | Free (for research) |
PALinkA ✎ | Annotation tool | annotation | | Down |
ParaConc ✎ | A bilingual/multilingual concordancer | concordancer | | Non-Free |
Pareidoscope ✎ | Pareidoscope is a collection of tools for determining the association between arbitrary linguistic structures, such as collocations, collostructions or between structures. | collocation, constructions | | Free |
PatCount ✎ | A pattern counting tool with powerful statistic capabilities and regex support | patterns | Windows | Free |
Pattern Builder ✎ | A tool helping with regular expressions and PoS tags | regex, tagging | Windows | Free |
Pepper ✎ | Conversion between linguistic formats, e.g. from TEI to ANNIS to Tiger XML to EXMARaLDA. | conversion | | Free |
Phonological CorpusTools (PCT) ✎ | Phonological analysis on transcribed corpora | phonology | Multi (Python) | Free |
PhraseContext ✎ | Tool for wordlists, concordancing, collocation, and TTR | wordlists, concordancer | | 35€ |
Pipoca (formerly openQDA) ✎ | A web-based QDA software | qda, mixed methods | Web | Free, Open Source |
Praaline ✎ | Praaline is a system for metadata management, annotation, visualisation and analysis of spoken language corpora. | speech, prosody, spoken, annotation, concordancer, search, visualization, converter, analysis | Windows, Mac, Linux | Free / Open Source (GPL3) |
PRAAT ✎ | A tool for doing phonetics by computer | phonetics, spoken | Windows, Mac, Linux | Open Source |
ProtAnt ✎ | Tool for prototypical text analysis | wordlists | Windows, Mac | Free |
pysupersensetagger ✎ | Analyses texts for MWE and supersenses. | text analysis | Unix, Mac (Python) | Free |
PyXMLConc ✎ | Concordancer for XML files with automatic tag and attribute detection. | concordancer | Multi (Python), Windows | Free, Open Source |
QDA Miner ✎ | A commercial QDA tool for coding, annotating, retrieving and analyzing collections of documents and images. | qda, mixed methods, text analysis | Windows | Commercial |
QualCoder ✎ | QualCoder is free, open source software for qualitative data analysis. | qda, text analysis | Linux, Mac, Windows | Free, Open Source |
Quanteda ✎ | An R package for the quantitative analysis of textual data. | R | Linux, Windows, Mac | Open Source |
Query Tool for the Edinburgh Associative Thesaurus ✎ | A query tool for the EAT | query, thesaurus | Windows | Free |
Range Program (formerly VocabProfiler) (Paul Nation) ✎ | A tool for analyzing the vocabulary load of texts. | vocabulary, lexis | Windows | Free |
RQDA ✎ | An R package for Qualitative Data Analysis (QDA). | qda | Windows, Linux/FreeBSD, Mac | Free |
Readability Analyzer ✎ | A tool for generating various readability statistics | readability, statistics | Windows | Free |
Readability Webfx ✎ | A tool to check how easy or difficult (readability) a given text is. | readability | Web | Free |
Rescribe ✎ | Rescribe is an OCR service/tool geared towards historical texts. | ocr | Windows, Linux, Mac | Free |
RSTTool ✎ | Tool that can annotate texts for constituency and rhetorical structure | annotation | Windows, Macintosh, UNIX and LINUX | Free |
Salt ✎ | Meta models for linguistic data. | meta modelling | | Free |
SarAnt ✎ | Tool for batch search and replacing | editing, searching | Windows | Free |
SegmentAnt ✎ | Tool for the segmentation of Japanese and Chinese | segmentation, tokenizing | Windows, Mac, Linux | Free |
Shinyconc ✎ | ShinyConc is a framework for generating custom web-based concordancers and is written in R and R Shiny. | concordancer, kwic, r | Open Source / R | Free |
Simple Concordance Program ✎ | Tool for concordance and word listing that works with many languages | concordancer | Windows, Mac | Free |
SKELL ✎ | A simple tool for language learners and teachers. | language learning, language teaching | Web | Free |
Sketch Engine ✎ | A corpus manager and text analysis software developed by Lexical Computing. | annotation, concordancer, tagging, sampling, search, visualization, wordlists, keywords, compilation, text analysis, n-grams, collocation, statistics, segmentation, analysis, crawler, parallel, colligation, annotations, tokenization, query, ngrams, boilerplate remover, comparison, frequency analysis, information retrieval, data, sentence boundary, corpus creation, duplicate remover, regex, thesaurus, meta modelling, dictionary, text-processing, xml, frequency, trends patterns, web-based, collocates, collocation analysis, word cloud, co-occurrence, KWIC, corpus management, multilingual, NLP, diachronic analysis, term extraction, keyword extraction, bilingual term extraction | | 30-day free trial then starts at 4.83 €/month |
SLATE ✎ | SLATE is a python-based CLI annotation tool. It is very lightweight and can be used for various types of span-based annotation. | annotation | Python | Free, Open Source |
SoMaJo ✎ | A tokenizer and sentence splitter for German and English web and social media texts. | tokenizer, sentence boundary detector | Linux, Mac, Windows | Free, Open Source |
SoMeWeTa ✎ | A part-of-speech tagger with support for domain adaptation and external resources. | tagging, pos, pos tagger | Linux, Mac, Windows | Free, Open Source |
SpiderLing ✎ | Software for obtaining text from the web useful for building text corpora | crawler | | Free |
SPPAS ✎ | A tool for the automatic annotation and analysis of speech. | speech, spoken, annotation | Windows, Mac, Linux | Free, Open Source |
SPre ✎ | Tool for segmenting and annotating texts | annotation | | Free |
Stanford Log-linear POS Tagger ✎ | PoS Tagger (with Penn Treebank Tagset) for English, Arabic, Chinese, German | pos tagger, tagging | | Free |
Stanford Topic Modeling Toolbox ✎ | The Stanford Topic Modeling Toolbox (TMT) allows users to perform topic modeling on texts imported from spreadsheets. It supports both LDA and labelled LDA. | topic modeling | Java | Free |
Stylo for R ✎ | Tool for computational stylistic analysis (authorship attribution, genre analysis) | text analysis | | Free |
Sub-Corpus Creator ✎ | A tool for creating sub-corpora based on searches and metadata | compilation | Windows | Free |
Synpathy ✎ | Tool for manual syntactic annotation | annotation | Windows, Mac, Linux | Free |
TAACO ✎ | TAACO is a tool that calculates 150 indices of textual/lexical cohesion. | cohesion, lexical sophistication | All | Free, Open Source |
TAALES ✎ | TAALES measures over 400 indices of lexical sophistication. | lexical sophistication | Mac, Linux, Windows | Open Source |
TagAnt ✎ | Part-of-speech tagging tool built on Tree Tagger | pos tagger, tagging | Windows, Mac, Linux | Free |
TagCrowd ✎ | A simple tool for generating tag/word clouds online | word clouds, visualization | Web | Free |
tagtog ✎ | A text annotation tool specifically built to train AI/ML models. | machine learning, annotation | Cloud-Based | Commercial |
Tagxedo ✎ | A tool for generating word clouds. | word clouds, visualization | Web | Free |
TASX-Annotator ✎ | Tool for multilevel annotation and transcription of (multi-channel) video and audio data. | multilevel tagger, transcription | Windows, Mac, Linux, Solaris | Down |
Text Analysis Computing Tools (TACT) ✎ | A simple, fairly old concordancer. | concordancer | | Commercial |
Text Variation Explorer ✎ | The Text Variation Explorer TVE is a tool for exploring the effect of window size on various common linguistic measures. It visualizes these measures and allows for PCA/Cluster analysis. | visualization, variation analysis | Java | Free |
Text Visualization Browser ✎ | A survey/gallery of text visualizations | visualization | Web | Free |
Textanz ✎ | Language analysis program that produces frequency lists, word lists, parts of speech tags. | wordlists, concordancer, pos tagger, dictionary | Any OS | Free, Open Source |
TextArc ✎ | A tool for visualizing the structure of texts. | visualization | ||
TextDirectory ✎ | TextDirectory is a tool for aggregating text files based on various filters and transformation functions. | compilation, text-processing, python | Windows, Linux, OSX | Free, Open Source |
Textplot ✎ | A tool for mapping a document into a network of terms in order to visualize the topic structure. | visualization, network analysis, semantics, graphs | Python | Free, Open Source |
TextSmith Tools ✎ | A tool for genre-informed phraseological profiles | phraseology, segmentation | Windows | Free |
TextSTAT ✎ | Tool for creation and manipulation of linguistic data from different languages | corpus creation, concordancer | Windows, GNU/Linux and MacOS | Free |
The (Phonetic) Transcription Editor ✎ | An editor for creating phonetic transcriptions | transcription | Windows | Free |
The Great American Word Mapper ✎ | A visualization tool for the top 100,000 words used in American English twitter data. | twitter, lexis, social media | Web | Free |
The Prime Machine ✎ | A user- and mobile-friendly corpus analysis toolkit (primarily concordancing) initially designed for English language teaching. | concordancer, language teaching, wordlist, keywords, efl, esl | MacOS, Windows, iOS, Android | Free |
The Simple Corpus Tool ✎ | A corpus analysis toolkit that supports XML annotations. | concordancer, annotation, xml, frequency | Windows | Free |
The Simple PoS Tagger ✎ | A simple PoS tagger utilizing Perl's Lingua::EN::Tagger | pos tagger, tagging | Windows | Free |
The SPAADIA concordancer ✎ | A concordancer for the SPAADIA corpus | concordancer, SPAADIA | Windows | Free |
The Text Feature Analyser ✎ | A tool for investigating textual features and various measures | text analysis, concordancer | Windows | Free |
Thesaurus.com ✎ | English language thesaurus with links to English dictionary and translation sites. | efl, esl, linguistics | Web | Free |
TigerSearch ✎ | Tool for searching syntactically and PoS-tagged corpora | search tool, pos tagger | | Free |
TnT - Thorsten Brants's PoS Tagger ✎ | A simple PoS-Tagger | pos tagger, tagger, tagging | Windows/Unix | Available via Stanford |
Trafilatura ✎ | Trafilatura is a Python package and command-line tool which seamlessly downloads, parses, and scrapes web page data. | corpus creation, python, R, compilation, crawler, boilerplate remover, data, xml, scraping | Python | Free, Open Source |
Tree Editor TrEd 2.0 ✎ | Graphical editor and viewer for tree-like structures. | visualization | Windows, GNU/Linux and MacOS | Free |
TreeTagger ✎ | Tool for annotating text with part-of-speech and lemma information | pos tagger, annotation | Windows, Mac, Linux | Free |
TurboParser ✎ | Multilingual dependency parser with linear programming | parser | | Free |
Twarc ✎ | A command line tool (and Python library) for archiving Twitter JSON | twitter, social media | Python, Windows, Linux, Mac | Free, Open Source |
Tweet NLP ✎ | Tweet tokenizer, PoS Tagger, hierarchical word clusters, and a dependency parser for tweets, along with annotated corpora and web-based annotation tools. Clusters: http://www.cs.cmu.edu/~ark/TweetNLP/cluster_viewer.html | pos tagger, tokenizer, parser | | Free |
TWINT ✎ | A Twitter scraping tool written in Python that allows for scraping Tweets from Twitter profiles without using Twitter's API. | twitter, social media, scraping | Linux, Windows, Mac | Open Source |
TXM ✎ | XML & TEI compatible text analysis software based on TreeTagger, the CQP search engine and the R statistical environment. | text analysis, concordancer, r, statistics, search tool, tokenizer, xml | Windows, Mac, Linux, Tomcat | Free |
UAM CorpusTool ✎ | Text annotation tool and statistics for various types of linguistic analysis and multilayer annotation | annotation, multi-layer annotation, computer-assisted annotation | | Free |
UAM ImageTool ✎ | Image annotation tool for visual data corpora | annotation | | Free |
UBIAI ✎ | A NLP-oriented text annotation platform for teams with comprehensive auto-annotation features. | annotation, NLP | Web | Commercial |
UCREL Semantic Analysis System (USAS) ✎ | An automatic semantic tagger for different languages (e.g., English, Chinese, Italian, Dutch, Portuguese, Spanish). | semantic annotation, tagging, semantics | | Free |
UCS Toolkit ✎ | A toolkit (libraries and scripts) for the statistical analysis of co-occurrence data. | collocation, coocurence, statistics | R, Perl | Free |
Unitok ✎ | An annotation-aware tokenizer that splits text into line-by-line tokens. | tokenizer | | Free |
UralicNLP ✎ | NLP tools (primarily) for Uralic languages | uralic, parser, pos tagger, tagging, inflection, morphological tagger | Linux, Mac, Windows | Free |
VARD ✎ | Spelling variant detection and deletion in historical corpora (particularly EModE) | variant detector | | Free (with academic email) |
VariAnt ✎ | Tool for the detection of spelling variants | variant detector | Windows | Free |
VideoAnt ✎ | A web-based tool to annotate and discuss web-hosted videos. | annotation, video | Web | Free |
Voyant Tools ✎ | A web-based reading/analysis toolkit for digital texts. | reading, text analysis, visualization, trends patterns | Web | Free, Open Source |
VU Amsterdam Metaphor Identification Corpus ✎ | Corpus tool for metaphor identification | metaphor identification, metaphors | Web and local version | Free |
WConcord 3.0 ✎ | A fully featured concordancer | concordancer | | Free |
WebAnno ✎ | A web-based annotation tool | annotation, web-based | Web | Free |
WebLicht ✎ | WebLicht is an execution environment for automatic annotation of text corpora embedded in the CLARIN-D project. | annotation | Web | Free (CLARIN-D Account needed) |
wiki2corpus ✎ | The tool downloads Wikipedia pages and converts them into clean text files. | wikipedia, web as corpus | Python | Free |
Wmatrix ✎ | Tool for corpus analysis and comparison. Provides access to CLAWS and USAS. | wordlists, concordancer, pos tagger, semantic tagger, keywords, web-based | Web | £50 per username per year |
WordCruncher ✎ | A tool for searching, studying, and analyzing digital texts and corpora. The tool has been tested for corpora up to a billion words. | concordancer, wordlists, collocates, n-grams, keywords, key phrases, ebooks | Windows, Mac, iOS | Free |
WordFish ✎ | Extract political positions from text documents. | political science | R | Free |
WordHoard ✎ | Close reading and scholarly analysis of deeply tagged texts | close reading | Windows, Unix, Linux, Mac | Free |
Wordle ✎ | A tool for generating word clouds. | word clouds, visualization | Web | Free |
WordMap ✎ | A simple web-based word-map / wordcloud generator. | visualization, web-based | Web | Free |
Wordscores ✎ | A tool (approach) to extract dimensional information from political texts | political science, information retrieval | | Free |
WordSift ✎ | A word cloud generator, with dynamic filters, links to images, and KWIC capabilities. Works with various types/formats of word lists. | word cloud, vocabulary profiling, lexis, vocabulary, language teaching | Web | Free |
Wordsmith ✎ | One of the most established corpus toolkits providing a variety of functionality | concordancer, wordlists, statistics, keywords | Windows | 60€ per licence |
wordspace ✎ | An R package for distributional semantics | semantics, distributional semantics, R | R | Free |
Wordstatix ✎ | Corpus analysis tool | concordancer | | Free |
WordWanderer ✎ | A web-based visualization/analysis tool which allows its users to "wander" a text. | visualization, concordancer | Web | Free |
Worldbuilder ✎ | Tool for annotation and visualisation in analyses applying Text World Theory | annotation, visualization | | |
Xaira ✎ | A tool for indexing and analyzing XML resources. | indexing, xml | Windows | Free, Open Source |
YACSI Chinese Tokeniser / PoS Tagger ✎ | A Chinese tokenizer and PoS tagger | chinese, tokenizer, pos tagger | Windows | Free |
YEDDA ✎ | YEDDA is a Python-based collaborative text span annotation tool with support for a very wide variety of languages including Chinese. | annotation | Python | Free, Open Source |
BabelNet ✎ | A multilingual encyclopedic dictionary featuring a semantic network/ontology. | dictionary, ontology, semantics, NLP | Web | Free |
FLAX ✎ | FLAX (Flexible Language Acquisition) is a set of tools and applications to automate the production and delivery of interactive digital language collections. | language learning, language teaching, text analysis | Java, Moodle | Free, Open Source |
Just the Word ✎ | A simple web interface for BNC data | concordancer, frequency analysis, BNC | Web | Free |
Orange Data Mining ✎ | An open source machine learning and data visualization platform based on workflows. | text analysis, visualization, time series | Windows, Unix, Linux, Mac | Free, Open Source |
QualCoder ✎ | An open source tool for qualitative data analysis that supports coding text and images. | qda, annotation | Windows, Mac, Linux, Python | Free, Open Source |
TEITOK ✎ | A web-based platform for viewing, creating, and editing corpora with rich textual mark-up and linguistic annotation. | visualization, TEI, mark-up, annotation | Linux, Mac | Free, Open Source |
Wordless ✎ | An integrated corpus tool with multilingual support for the study of language, literature, and translation. | concordancer, text analysis, statistics, readability | Windows, Mac, Linux, Python | Free, Open Source |
WebCorp Live ✎ | A tool for accessing the Web as a corpus. | web-as-a-corpus | Web | Free |
CorpusMate ✎ | A web-based, streamlined, and simplified language data analysis experience for younger learners. | language learning, language teaching, concordancer, frequency analysis, pattern | Web | Free |
MetaPak ✎ | A tool to assist metadiscourse analysis based on Hyland's framework. | metadiscourse | Windows | Free |
NeoSCA ✎ | A syntactic complexity analyzer for written English. It is a fork of L2SCA with various additional features. | syntactic complexity, constituency parsing, pattern matching, tregex, command line | Windows, Mac, Linux | Free, Open Source |
Sanchay ✎ | An open source multi-purpose platform focused on South Asian languages. | annotation, tagging, chunking | Windows, Linux | Free, Open Source |
LogosLink ✎ | A tool for corpus management and ontological augmentation for discourse analysis. | discourse analysis, corpus management | Windows | Free |
Word Frequency Analyser ✎ | A web-based tool for analyzing word frequencies that also produces frequency charts and word clouds. | pos tagger, tokenizer, lemmatizer, frequency analysis | Web | Free |
Discourse Analyzer ✎ | An AI (LLM) powered platform for conducting discourse analysis. | discourse analysis, llm, generative AI | Web | Paid |
Turkish-English Learner Corpus – Error Tagging ✎ | TELC is a lexical-error tagged learner corpus compiled in the Turkish setting. It features a web-based error tagging tool. | learner corpus, error tagging | Web | Free |
AutoSearch ✎ | A cloud-based corpus query engine that supports the upload of corpora. | concordancer, corpus query engine | Web | Free |
Text-Fabric ✎ | A Python library for processing corpora (especially based on ancient texts) as annotated graphs. | graph model, annotation, python | | Free, Open Source |
Last Updated: October 13, 2024.
In case you are interested, the data is also available in JSON format.