1. Noncommercial Products:
PJX's features include significantly faster reading and writing of PDF documents, thread safety, “on demand” reading and parsing of PDF objects (which greatly reduces memory usage and processing time), incremental update support for fast modification of PDF documents, reading PDF documents from either disk or memory, thorough documentation of the class library interface, support for J2SE collection classes and NIO, access to form/field objects, rudimentary support for inserting images and watermarks, appending of large documents, and design patterns for recursive processing of PDF objects.
PDFBox is a Java PDF Library. This project will allow access to all of the components in a PDF document. More PDF manipulation features will be added as the project matures. This ships with a utility to take a PDF document and output a text file.
Xpdf is an open source viewer for Portable Document Format (PDF) files. (These are also sometimes called ‘Acrobat’ files, from the name of Adobe’s PDF software.) The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other utilities.
The POI project consists of APIs for manipulating various file formats based upon Microsoft’s OLE 2 Compound Document format using pure Java. In short, you can read and write MS Excel files using Java. Soon, you’ll be able to read and write Word files using Java. POI is your Java Excel solution as well as your Java Word solution. However, we have a complete API for porting other OLE 2 Compound Document formats and welcome others to participate.
2. Commercial Products:
Tonic Systems is the leading PowerPoint® automation specialist. Each of our products has been born from our experience developing solutions to real life business challenges, and each has been proven extensively in enterprise environments.
PDFTextStream is the ideal solution for Java applications and J2EE web services that need to rapidly and accurately extract text and document metadata from PDF files.
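The tools above mostly deal with extracting content from PDF files. Later I tried the Aperture framework and found it very easy to use. It has not been officially released yet, but the code in CVS already supports information extraction from the vast majority of document formats, including the Office family (Word, Excel, PowerPoint, Publisher, and so on; the coverage is surprisingly complete), the OpenOffice family, PDF, and plain text. It is also easy to extend (I wrote some extractors for database content myself). It feels like a one-size-fits-all, one-stop solution and is well worth a try.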
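
To make the idea of an extensible, one-stop extraction framework more concrete, here is a minimal sketch in plain Java of a registry that maps a MIME type to a text extractor. It only illustrates the general pattern. `TextExtractor`, `ExtractorRegistry`, `withDefaults`, and the MIME type strings are hypothetical names chosen for this example and are not Aperture's actual API.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical interface: one implementation per document format.
// This is NOT Aperture's API, only an illustration of the pattern.
interface TextExtractor {
    String extract(byte[] content);
}

// Hypothetical registry that picks an extractor by MIME type.
class ExtractorRegistry {
    private final Map<String, TextExtractor> byMimeType = new HashMap<>();

    // Adding support for a new format is just one more registration.
    void register(String mimeType, TextExtractor extractor) {
        byMimeType.put(mimeType, extractor);
    }

    // Returns the extracted text, or empty if no extractor is registered.
    Optional<String> extract(String mimeType, byte[] content) {
        TextExtractor extractor = byMimeType.get(mimeType);
        if (extractor == null) {
            return Optional.empty();
        }
        return Optional.of(extractor.extract(content));
    }

    // A registry with only a plain-text extractor pre-registered.
    static ExtractorRegistry withDefaults() {
        ExtractorRegistry registry = new ExtractorRegistry();
        registry.register("text/plain",
                bytes -> new String(bytes, StandardCharsets.UTF_8));
        return registry;
    }
}
```

In a real setup, the extractor for PDF would delegate to a library such as PDFBox and the Office extractors to POI; a custom extractor (for database content, say) would be registered the same way as the plain-text one here.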