The first concrete realization of the DCA, and its first stage of elaboration, is the Diccionari de Textos Catalans Antics (DTCA), searchable on this website. It is a form-lemma dictionary that puts an exceptional wealth of information at the disposal of researchers and scholars, since all the texts entered into it have been lemmatized; the information is therefore provided not only by occasional forms but also by lemmas.
[restricted access] A collection of all surviving texts written in Old English: 3,047 texts covering a range of genres, including poetry, prose, interlinear glosses, glossaries, runic inscriptions, and inscriptions in Latin script.
« The corpus contains more than 360 million words of text, including 20 million words each year from 1990-2007, and it is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts. The corpus will also be updated at least twice each year from this point on, and will therefore serve as a unique record of linguistic changes in American English. The interface allows you to search for exact words or phrases, wildcards, lemmas, part of speech, or any combinations of these. You can search for surrounding words (collocates) within a ten-word window (e.g. all nouns somewhere near chain, all adjectives near woman, or all verbs near key). »
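The collocate search described above (all words of interest within a ten-word window of a target word) can be sketched in a few lines of Python. This is an illustration of the general technique, not the corpus's actual implementation; the function name and the toy token list are made up for the example.

```python
from collections import Counter

def collocates(tokens, target, window=10):
    """Count words co-occurring with `target` within +/- `window` tokens,
    in the spirit of the corpus interface's ten-word collocate search."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the target occurrence itself
                    counts[tokens[j]] += 1
    return counts

tokens = "the supply chain broke so the chain store closed".split()
print(collocates(tokens, "chain", window=2))
```

A real corpus engine would additionally filter collocates by part of speech (e.g. only nouns near *chain*), which presupposes a tagged corpus; the windowing logic stays the same.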
Tweets2011
As part of the TREC 2011 microblog track, Twitter provided identifiers for approximately 16 million tweets sampled between January 23 and February 8, 2011. The corpus is designed to be a reusable, representative sample of the Twittersphere; that is, both important and spam tweets are included.
Twitter corpus for sentiment analysis from a class (CS224N) at Stanford.
Class page:
https://sites.google.com/site/twittersentimenthelp/for-researchers#Where_is_the_Tweet_corpus_8553
http://www.stanford.edu/~alecmgo/cs224n
Corpex lets you swiftly browse through all the words of Wikipedia. The system shows you two statistics in four graphs. Corpex is also available as a RESTful web service. Corpex is still very much under development: the currently extracted data is still quite noisy, and better extraction and filtering approaches are being worked on. The source code is fully open source, and all the data is also freely available.
MemeTracker builds maps of the daily news cycle by analyzing around 900,000 news stories and blog posts per day from 1 million online sources, ranging from mass media to personal blogs. We track the quotes and phrases that appear most frequently over time across this entire online news spectrum. This makes it possible to see how different stories compete for news and blog coverage each day, and how certain stories persist while others fade quickly.
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
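Text classification of the kind the 20 Newsgroups collection is used for can be sketched with a minimal multinomial naive Bayes classifier. The toy documents and labels below are invented for the illustration and stand in for newsgroup posts; a real experiment would train on the actual 20,000-document collection.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """Train multinomial naive Bayes from (label, text) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in docs:
        class_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Pick the label maximizing log P(label) + sum of log P(word|label),
    with add-one (Laplace) smoothing over the vocabulary."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [
    ("sport", "the team won the hockey game"),
    ("sport", "baseball players hit the ball"),
    ("space", "the shuttle orbits the earth"),
    ("space", "nasa launched a new satellite"),
]
model = train(docs)
print(predict(model, "the hockey team played"))  # → sport
```

The same train/predict split, scaled up with proper tokenization and held-out evaluation, is the classic baseline reported on 20 Newsgroups.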
The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.