niyue

Archive for the ‘programming’ Category

Content Extract

In java, programming on November 29, 2004 at 6:17 AM

1. Noncommercial Products:

http://www.etymon.com/epub.html

PJX specifically included significantly faster reading and writing of PDF documents, thread safety, “on demand” reading and parsing of PDF objects which greatly reduces memory usage and processing time, incremental update support to enable fast modification of PDF documents, reading PDF documents from either disk or memory, thorough documentation of the class library interface, support for J2SE collection classes and NIO, access to form/field objects, rudimentary support for insertion of images and watermarks, appending of large documents, and design patterns for recursive processing of PDF objects.

http://www.pdfbox.org/

PDFBox is a Java PDF Library. This project will allow access to all of the components in a PDF document. More PDF manipulation features will be added as the project matures. This ships with a utility to take a PDF document and output a text file.

http://www.foolabs.com/xpdf/about.html

Xpdf is an open source viewer for Portable Document Format (PDF) files. (These are also sometimes called ‘Acrobat’ files, from the name of Adobe’s PDF software.) The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other utilities.

http://jakarta.apache.org/poi/index.html

The POI project consists of APIs for manipulating various file formats based upon Microsoft’s OLE 2 Compound Document format using pure Java. In short, you can read and write MS Excel files using Java. Soon, you’ll be able to read and write Word files using Java. POI is your Java Excel solution as well as your Java Word solution. However, we have a complete API for porting other OLE 2 Compound Document formats and welcome others to participate.

2. Commercial Products:

http://tonicsystems.com/products/

Tonic Systems is the leading PowerPoint® automation specialist. Each of our products has been born from our experience developing solutions to real life business challenges, and each has been proven extensively in enterprise environments.

We have developed our range of products in response to customer demand. These 100% java, server-side products are robust and scalable to meet the needs of the most demanding environments.

http://snowtide.com/home/PDFTextStream/

PDFTextStream is the ideal solution for Java applications and J2EE web services that need to rapidly and accurately extract text and document metadata from PDF files.

Update:

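The tools above mostly deal with PDF content extraction. Since then I have used the Aperture framework and found it very easy to work with. It has not had an official release yet, but the code in CVS already supports information extraction from the vast majority of document formats, including the Office family (Word, Excel, PowerPoint, Publisher and so on, a surprisingly complete set), the OpenOffice family, PDF, and plain text. It is also easy to extend (I wrote some content extractors for databases myself). It feels like a one-size-fits-all, one-stop solution and is well worth a try.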

Jena with latest release of MySQL

In java, programming, semanticweb on November 18, 2004 at 12:20 AM

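Sigh, this is frustrating. I downloaded the latest MySQL release, 4.1.7, which is already a generally available version, yet for some reason there are still problems. When Jena creates a model in the database it throws an exception:

WARN [main] (DriverRDB.java:382) – Problem formatting database
>> java.sql.SQLException: Syntax error or access violation message from server: “Specified key was too long; max key length is 1024 bytes”

I asked Dave, the Jena maintainer (a very nice person who replied quickly and carefully, thank you). His conclusion was that the key length Jena sets in MySQL must not exceed 250, otherwise this exception is thrown, although by default it should not exceed 250. In the end I could only put it down to a conflict between Jena and MySQL 4.1.7. After switching to MySQL 4.0.22 the problem did indeed go away (though installing and configuring 4.0.22 is very different from 4.1.7 and much less user-friendly). I wasted a lot of time on this and learned very little from solving it. My only takeaway is that there are many kind people on the Jena-dev mailing list.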

Java Encoding Problem

In java, programming on November 16, 2004 at 11:18 AM

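For the past few days I have been plagued by Java encoding problems. I started by editing RDF in Protege and then parsing it with Jena, with Eclipse as the Java IDE. All of these are Java tools, and Java's I/O API is extremely complex, so when garbled characters appeared I had no idea which link in the chain was at fault. I have now finally found that none of them was: the problem was in my own code. It is not completely solved yet, but I at least understand exactly what is going on. Protege saves RDF files in UTF-8: no problem. Jena's RDF parsing does not involve character encoding: no problem. Eclipse picks its encoding from the platform the Java VM runs on, which on Windows (with a Chinese locale) is GBK. In the Java I/O API, FileReader and FileWriter use the platform's GBK encoding, and the standard output stream System.out uses GBK as well.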
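To make the mismatch concrete, here is a minimal sketch of what happens when bytes written in one charset are decoded with another. The class and method names (`EncodingMismatchDemo`, `decodeAs`) are made up for illustration, and the string literal uses Unicode escapes so that the example does not depend on the source file's own encoding. GBK is used when the running JVM supports it, with ISO-8859-1 as a stand-in otherwise.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingMismatchDemo {

    // Encode text with one charset, then decode the same bytes with another.
    // When the two charsets differ, non-ASCII text usually comes back garbled.
    static String decodeAs(String text, Charset writtenWith, Charset readWith) {
        byte[] bytes = text.getBytes(writtenWith);
        return new String(bytes, readWith);
    }

    public static void main(String[] args) {
        // The charset that FileReader/FileWriter use when none is specified.
        // (On JDK 18 and later this defaults to UTF-8; older JDKs follow the platform.)
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Assumption: GBK may not be present on every JVM, so fall back if needed.
        Charset platformLike = Charset.isSupported("GBK")
                ? Charset.forName("GBK")
                : StandardCharsets.ISO_8859_1;

        String original = "\u4e2d\u6587"; // two Chinese characters
        String garbled = decodeAs(original, StandardCharsets.UTF_8, platformLike);
        String intact = decodeAs(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        System.out.println("Decoded with wrong charset equals original? " + garbled.equals(original));
        System.out.println("Decoded with UTF-8 equals original? " + intact.equals(original));
    }
}
```

This mirrors the situation in the post: a UTF-8 file from Protege read through a reader that silently assumes the platform encoding.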

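At first I did not think about encoding at all and went straight to processing with Jena, which left me completely confused. Then I set Jena aside and read and wrote the RDF with plain I/O, and still got garbled text. Later I found that I should not use FileReader and FileWriter directly; instead I should use InputStreamReader and OutputStreamWriter and set the encoding of both to UTF-8 when handling the file. That way the file written to disk is no longer garbled. Output to the standard output stream is still garbled, though, which means the Eclipse console still shows garbage (because System.out uses GBK). This is only a brief account of the start and the result; I took far too many detours in between to recount them here.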
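The fix described above can be sketched roughly as follows. The class name `Utf8FileIo`, its helper methods, and the temporary file are assumptions for illustration, not the code from the original project. The point is only that the charset is passed explicitly to `InputStreamReader` and `OutputStreamWriter` rather than left to the platform default.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class Utf8FileIo {

    // Write text to a file as UTF-8, independent of the platform default charset.
    static void writeUtf8(File file, String text) throws IOException {
        try (Writer out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8))) {
            out.write(text);
        }
    }

    // Read a whole file, decoding its bytes as UTF-8.
    static String readUtf8(File file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("utf8-demo", ".rdf"); // hypothetical sample file
        tmp.deleteOnExit();
        String text = "\u4e2d\u6587 label"; // Chinese characters plus ASCII
        writeUtf8(tmp, text);
        System.out.println("Round trip intact? " + text.equals(readUtf8(tmp)));
    }
}
```

Whether the printed characters look right on a console is a separate matter, since that depends on the encoding of System.out and of the console itself.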

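I still have not figured out how to solve this. I have thought of two approaches, and neither is very good. One is to redirect System.out to a file, written in UTF-8, and view it in an external editor. That is clearly inconvenient and bad for debugging, and I would have to set up the redirect every time I use System.out. The other is to see whether System.out can be replaced so that its encoding becomes UTF-8. But I do not yet know how to do that, and even with UTF-8 I am not sure the Eclipse console would display it correctly (Eclipse uses the GBK encoding of its host environment, and I do not know whether the console changes its encoding to follow System.out). Sigh, I am out of ideas...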
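For the second idea, one possible sketch is to wrap the existing stream in a new `PrintStream` that encodes characters as UTF-8 and install it with `System.setOut`. The class name `Utf8StdOut` and the helper `utf8Stream` are hypothetical. This only controls which bytes the program emits. Whether they display correctly still depends on the console decoding them as UTF-8, which this code cannot influence.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class Utf8StdOut {

    // Wrap any byte stream in a PrintStream that encodes characters as UTF-8.
    static PrintStream utf8Stream(OutputStream target) throws UnsupportedEncodingException {
        return new PrintStream(target, true, "UTF-8"); // true = auto-flush on println
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Replace System.out once, early in the program; later println calls
        // then emit UTF-8 bytes instead of bytes in the platform encoding.
        System.setOut(utf8Stream(System.out));
        System.out.println("\u4e2d\u6587");
    }
}
```

Doing the replacement once at startup avoids having to redirect output by hand on every run, which was the main drawback of the first idea.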

EIP Questions (1)

In programming on November 11, 2004 at 12:33 AM

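I recently started work on an Enterprise Information Portal project. I am mainly responsible for managing the metadata of the documents in the portal. The idea is to describe documents with RDF and then use those descriptions for category browsing, search, and similar features in the portal.

The project has only just started and I have already run into many problems, with nobody around who can help. The main ones so far: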

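  1. How much use can metadata really be in an information portal, and what can it be used for? The only applications I can see so far are information browsing and retrieval. The Semantic Portal examples I have looked at also seem to use it only for these, which feels unremarkable. Integration with other systems seems rather far off. I am puzzled.
  2. Is it better to describe metadata with RDF or with OWL? Many metadata applications on the Web today are based on RDF, such as RSS and FOAF, and I have not seen any based on OWL. But RDF's expressive power is quite weak. It may meet the requirements for now, but moving to OWL later when new requirements appear would be troublesome. This really comes back to the first question: which applications can actually be anticipated?
  3. Encoding. This is a very practical problem. Protege saves the RDFS I edit in UTF-8, but parsing it with Jena on Windows goes wrong: none of the Chinese text is decoded correctly, and I do not know whether the problem lies with Java or with Jena.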