Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. You can find the latest release on the download page. See the Getting Started guide for instructions on how to start using Tika.
A. Okada, and Y. Kambayashi. Advances in Web-Based Learning: First International Conference, ICWL 2002, Hong Kong, China, August 17-19, 2002. Proceedings, (2002)
K. Bischoff, C. Firan, W. Nejdl, and R. Paiu. CIKM '08: Proceeding of the 17th ACM conference on Information and knowledge management, page 193--202. New York, NY, USA, ACM, (2008)
I. Altingovde, Ö. Subakan, and Ö. Ulusoy. Information Processing & Management, 49 (3): 688--697 (2013). Personalization and Recommendation in Information Access.