The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers, gathered from blogger.com in August 2004. The corpus contains a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per blogger.
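The per-blogger averages follow directly from the corpus totals. A minimal sketch of that arithmetic, assuming the word count is taken as exactly 140 million (the description only says "over 140 million", so the words figure is a lower bound); the variable names are illustrative only:

```python
# Corpus totals as stated in the description.
BLOGGERS = 19_320
POSTS = 681_288
WORDS = 140_000_000  # assumption: lower bound, the source says "over 140 million"

# Average posts and words per blogger.
posts_per_blogger = POSTS / BLOGGERS
words_per_blogger = WORDS / BLOGGERS

print(f"{posts_per_blogger:.1f} posts per blogger")
print(f"{words_per_blogger:.0f} words per blogger (lower bound)")
```

Both results round to the stated figures of about 35 posts and about 7,250 words per blogger.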
M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI), San Francisco, CA, Morgan Kaufmann, (2004)