BootCaT | Tool for crawling the web and compiling corpus data from a list of seed words. | crawler, compilation | | |
ICEweb | A tool for compiling, downloading, and analyzing web corpora in accordance with the ICE | ICE, compilation, crawler | Windows | Free |
Sketch Engine | A corpus manager and text analysis software developed by Lexical Computing. | annotation, concordancer, tagging, sampling, search, visualization, wordlists, keywords, compilation, text analysis, n-grams, collocation, statistics, segmentation, analysis, crawler, parallel, colligation, annotations, tokenization, query, ngrams, boilerplate remover, comparison, frequency analysis, information retrieval, data, sentence boundary, corpus creation, duplicate remover, regex, thesaurus, meta modelling, dictionary, text-processing, xml, frequency, trends patterns, web-based, collocates, collocation analysis, word cloud, co-occurrence, KWIC, corpus management, multilingual, NLP, diachronic analysis, term extraction, keyword extraction, bilingual term extraction | | 30-day free trial, then from 4.83 €/month |
SpiderLing | Software for obtaining text from the web, useful for building text corpora. | crawler | | Free |
Trafilatura | A Python package and command-line tool that downloads, parses, and scrapes web page data. | corpus creation, python, R, compilation, crawler, boilerplate remover, data, xml, scraping | Python | Free, Open Source |