niyue

Archive for November 2004 | Monthly archive page

Content Extract

In java, programming on November 29, 2004 at 6:17 AM

1. Noncommercial Products:

http://www.etymon.com/epub.html

PJX specifically included significantly faster reading and writing of PDF documents, thread safety, “on demand” reading and parsing of PDF objects which greatly reduces memory usage and processing time, incremental update support to enable fast modification of PDF documents, reading PDF documents from either disk or memory, thorough documentation of the class library interface, support for J2SE collection classes and NIO, access to form/field objects, rudimentary support for insertion of images and watermarks, appending of large documents, and design patterns for recursive processing of PDF objects.

http://www.pdfbox.org/

PDFBox is a Java PDF Library. This project will allow access to all of the components in a PDF document. More PDF manipulation features will be added as the project matures. This ships with a utility to take a PDF document and output a text file.

http://www.foolabs.com/xpdf/about.html

Xpdf is an open source viewer for Portable Document Format (PDF) files. (These are also sometimes called ‘Acrobat’ files, from the name of Adobe’s PDF software.) The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other utilities.

http://jakarta.apache.org/poi/index.html

The POI project consists of APIs for manipulating various file formats based upon Microsoft’s OLE 2 Compound Document format using pure Java. In short, you can read and write MS Excel files using Java. Soon, you’ll be able to read and write Word files using Java. POI is your Java Excel solution as well as your Java Word solution. However, we have a complete API for porting other OLE 2 Compound Document formats and welcome others to participate.

2. Commercial Products:

http://tonicsystems.com/products/

Tonic Systems is the leading PowerPoint® automation specialist. Each of our products has been born from our experience developing solutions to real life business challenges, and each has been proven extensively in enterprise environments.

We have developed our range of products in response to customer demand. These 100% java, server-side products are robust and scalable to meet the needs of the most demanding environments.

http://snowtide.com/home/PDFTextStream/

PDFTextStream is the ideal solution for Java applications and J2EE web services that need to rapidly and accurately extract text and document metadata from PDF files.

Update:

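The items above are mostly about PDF content extraction. Later I tried the Aperture framework and found it very easy to use. It has not had an official release yet, but the code in CVS already supports information extraction from the vast majority of document formats, including the Office family (Word, Excel, PowerPoint, Publisher and so on, a surprisingly complete set), the OpenOffice family, PDF and plain text. It is also easy to extend (I wrote some database content extractors myself). It feels like a one-size-fits-all, one-stop solution and is well worth a try.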

Google Is Getting More and More Versatile

In other on November 23, 2004 at 9:39 AM

1. Looking up English names with Google

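While doing an English translation exercise today, I hit a question that required translating “叶利钦” (Yeltsin), and I did not know how his name is written in English, so naturally I thought of searching Google. I was not sure which keywords to use, so I tried my luck with “叶利钦 英文名” (“Yeltsin English name”). The search returned a pile of news about Yeltsin, and just as I was getting disappointed I noticed a small box at the very top: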

叶利钦: Yeltsin
Other online dictionaries: Dr.eye online dictionary

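It works like the weather forecast service Google provides, except that a forecast query has to include “tq”, as in “tq 上海” (“tq Shanghai”), whereas the name lookup above is clearly much smarter. Simply too cool!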

2. Looking up definitions with Google

Typing “define:” followed by a keyword into the Google search box returns definitions of that keyword:


Definitions of WiFi on the Web:

Wireless Fidelity – Otherwise known as Wireless Networking, commonly using the 802.11b protocol. Hardware that displays the WiFi logo claims 802.11b compliance should interconnect seamlessly
www.bb4g.org.uk/glossary.asp

is a way of transmitting information in wave form that is reasonably fast and is often used for notebooks.
highered.mcgraw-hill.com/sites/0072464011/student_view0/chapter6/glossary.html

The popular name for 802.11b wireless networking. This standard replaces the cables in an ethernet network.
home.attbi.com/~pdas/glossary.html

wireless local area network: a local area network that uses high frequency radio signals to transmit and receive data over distances of a few hundred feet; uses ethernet protocol
www.cogsci.princeton.edu/cgi-bin/webwn

Google’s feature list

Jena with the Latest Release of MySQL

In java, programming, semanticweb on November 18, 2004 at 12:20 AM

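Sigh, how frustrating. I downloaded the latest MySQL release, 4.1.7, which is already a generally available version, yet for some reason it still causes problems. When Jena creates a model in the database, it throws an exception:

WARN [main] (DriverRDB.java:382) – Problem formatting database
>> java.sql.SQLException: Syntax error or access violation message from server: “Specified key was too long; max key length is 1024 bytes”,

I asked Dave, the Jena maintainer (a very nice person who replied quickly and thoroughly, thank you). The conclusion was that the key length Jena sets in MySQL must not exceed 250, otherwise this exception is thrown; by default, however, it should not exceed 250. In the end I could only put it down to a conflict between Jena and MySQL 4.1.7. Switching to MySQL 4.0.22 did make the problem go away (although installing and configuring 4.0.22 is very different from 4.1.7 and much less user-friendly). I wasted a lot of time on this and learned little from solving it. My only takeaway is that there are many kind people on the Jena-dev mailing list.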

Java Encoding Problem

In java, programming on November 16, 2004 at 11:18 AM

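For the past few days I have been plagued by Java encoding problems. I started by editing RDF in Protege and then parsing it with Jena, with Eclipse as the Java IDE. All of these are Java programs, and the Java I/O API is extremely complicated, so when garbled text showed up I had no idea which link in the chain was at fault. I have finally found that none of them is at fault; my own code is. The problem is not fully solved yet, but I now understand exactly what is going on:

Protege saves RDF files in UTF-8, so no problem there.
Jena's RDF parsing does not deal with character encoding itself, so no problem there.
Eclipse picks its encoding from the platform the Java VM runs on; on my Chinese Windows system that is GBK.
In the Java I/O API, FileReader and FileWriter use the platform encoding (GBK), and the standard output stream System.out also uses GBK.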

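At first I did not think about encoding at all and went straight to processing with Jena, which left me completely confused. Then I set Jena aside and read and wrote the RDF with plain I/O, and the text was still garbled. Later I realized that I should not read and write with FileReader and FileWriter directly; I should use InputStreamReader and OutputStreamWriter and set the encoding of both to UTF-8 when handling the file. That way the file written to disk is no longer garbled. Output to the standard output stream is still garbled, though, which means the Eclipse console still shows garbage (because System.out uses GBK). This only summarizes the beginning and the end; I took far too many detours in between to recount them here.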
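The fix described above can be sketched roughly as follows. This is only a minimal illustration, assuming Java 7 or later (for StandardCharsets and try-with-resources; on older JVMs the constructors that take the charset name "UTF-8" do the same job). The class name, method names and the sample RDF snippet are made up for the example, and the Chinese text is written with \u escapes so that the encoding of the source file itself does not get in the way.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8FileIo {

    // Write text as UTF-8 bytes, regardless of the platform default encoding.
    public static void writeUtf8(String path, String text) throws IOException {
        try (Writer out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(path), StandardCharsets.UTF_8))) {
            out.write(text);
        }
    }

    // Read a whole file, decoding its bytes explicitly as UTF-8.
    public static String readUtf8(String path) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample content; \u4e2d\u6587 is the two-character word for "Chinese".
        String text = "<rdfs:label>\u4e2d\u6587</rdfs:label>";
        File tmp = File.createTempFile("utf8-demo", ".rdf");
        tmp.deleteOnExit();
        writeUtf8(tmp.getPath(), text);
        // The text survives the round trip because both sides agree on UTF-8.
        System.out.println(text.equals(readUtf8(tmp.getPath())));
    }
}
```

FileReader and FileWriter, by contrast, silently use the platform default, so the same program can behave differently on a GBK Windows machine and on a UTF-8 system.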

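I still have not worked out how to solve this. I have thought of two approaches, and neither is very good. One is to redirect System.out to a file, write it in UTF-8 and view it in an external editor, but that is clearly inconvenient: it makes debugging harder, and having to redirect System.out every time I use it is a nuisance. The other is to see whether System.out can be replaced so that its encoding becomes UTF-8. I do not know how to do that yet, and even after switching to UTF-8 I am not sure the Eclipse console could display it (Eclipse uses the platform's GBK encoding, and I do not know whether the console changes its encoding to follow System.out). Sigh, I am out of ideas...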
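For the second idea, one possible sketch is to wrap the existing stream in a new PrintStream that encodes as UTF-8 and install it with System.setOut. Both calls are part of the standard library; the class and method names below are invented for the example. Note that this only changes the bytes the program emits. Whether the console shows them correctly still depends on the console decoding UTF-8, which is exactly the open question in the post.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {

    // Wrap any byte stream in an auto-flushing PrintStream that encodes characters as UTF-8.
    public static PrintStream utf8PrintStream(OutputStream target)
            throws UnsupportedEncodingException {
        return new PrintStream(target, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Replace System.out once at startup; later println calls need no changes.
        System.setOut(utf8PrintStream(System.out));
        // Emitted as UTF-8 bytes; a GBK console would still display them as garbage.
        System.out.println("\u4e2d\u6587");
    }
}
```

Doing the replacement once in main avoids the per-use redirection that made the first idea unattractive.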

Software of Interest

In software on November 15, 2004 at 1:13 PM

1. Buddyspace, an IM tool for team collaboration

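Buddyspace is yet another IM tool, developed by Marc Eisenstadt and others at the Open University in the UK. It is free, Java-based software. Its biggest difference from other IM software is the addition of “virtual maps”: each person sits at a point on a map, so an office floor plan, for example, can show where each employee is, with a green dot meaning that the employee is online. If team members use it while working, it gives everyone a more realistic sense of presence. The software also lets a user create a chat room in which other users must “raise a hand” before speaking. All of these features are meant to make using the software feel closer to real life. If you are interested, take a look at the demo or download a copy to try.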

2. RADIO.BLOG

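After installing a program, a user can turn a selection of songs into a playlist that is displayed on the user's blog page, and visitors can then listen to it through a Flash player.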

Multi-Agent System Related Sites

In semanticweb on November 11, 2004 at 12:48 AM

1. Noncommercial Uses

http://agents.umbc.edu/
University of Maryland Baltimore County Agent Web

http://www.agentlink.org/
AgentLink III is the new European Co-ordination Action for Agent Based Computing, a network of researchers and developers with a common interest in agent technology. Launched on 1st January 2004, it follows on from AgentLink II, and will continue to provide resources and information on Agent-Based research across Europe.

http://jade.tilab.com/
JADE (Java Agent DEvelopment Framework) is a software framework fully implemented in Java language. It simplifies the implementation of multi-agent systems through a middle-ware that complies with the FIPA specifications and through a set of graphical tools that supports the debugging and deployment phases.

http://aglets.sourceforge.net/
Aglets is a Java mobile agent platform and library that eases the development of agent based applications. An aglet is a Java agent able to autonomously and spontaneously move from one host to another.

2. Commercial Uses

http://www.agent-software.com/
AOS’s flagship product, JACK™, provides the tools required to develop autonomous software systems that are both goal-directed and reactive. Commercially deployed worldwide, JACK-based systems are built from distributed reasoning entities that cooperate to achieve their goals.

EIP Questions (1)

In programming on November 11, 2004 at 12:33 AM

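I have recently started on an Enterprise Information Portal project. I am mainly responsible for managing the metadata of the documents in the portal. The idea is to describe documents in RDF and then use those descriptions for categorized browsing, search and similar features of the portal.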

The project has only just started and I have already run into many problems, with nobody around who can help. The main ones so far are:

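  1. How much use can metadata really have in an information portal, and what kinds of use? The only applications I can see so far are information browsing and retrieval. I looked at some Semantic Portal examples and they all seem to use metadata for the same things, which leaves no real highlight. Integration with other systems feels rather far off. I am confused.
  2. Should the metadata be described in RDF or in OWL? Many metadata applications on the Web today are based on RDF, such as RSS and FOAF; I have not seen any based on OWL. But RDF's expressive power is too weak. It may meet the current requirements, but moving to OWL later when new requirements appear would be troublesome. This really comes back to the first question: which applications can actually be expected?
  3. The encoding problem. This is a very practical issue: Protege saves RDFS in UTF-8, but parsing it with Jena on Windows goes wrong and none of the Chinese text decodes correctly. I do not know whether the problem lies in Java or in Jena.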
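The symptom in item 3 is typical of UTF-8 bytes being decoded with a different charset somewhere along the way. The sketch below reproduces that effect in isolation. It does not involve Jena or Protege at all; the class and method names are invented, and GBK is assumed as the "wrong" charset (with ISO-8859-1 as a stand-in on JVMs that do not ship GBK).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encode text as UTF-8, then decode the bytes with the given charset,
    // as a reader using the platform default would.
    public static String decodeWith(String text, Charset charset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, charset);
    }

    // GBK if this JVM supports it, otherwise ISO-8859-1 as a stand-in.
    public static Charset platformLikeCharset() {
        return Charset.isSupported("GBK")
                ? Charset.forName("GBK")
                : StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters
        String garbled = decodeWith(original, platformLikeCharset());
        System.out.println(original.equals(garbled)); // false: wrong charset
        System.out.println(original.equals(
                decodeWith(original, StandardCharsets.UTF_8))); // true: matching charset
    }
}
```

If the text only breaks when it passes through a Reader or Writer created without an explicit charset, the platform default is the likely culprit rather than the parser.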

A Map of Hyperlink Distribution on the WWW

In other on November 5, 2004 at 4:52 PM

The picture of Cyber Space, how amazing!

SuperLink Network

Semantic Web Related Sites

In semanticweb on November 4, 2004 at 2:37 PM

1. Noncommercial Uses

http://www.mindswap.org/
the first site on the Semantic Web

http://www.openrdf.org/
Sesame is an open source Java framework for storing, querying and reasoning with RDF and RDF Schema.

http://protege.stanford.edu/
Protégé is an ontology editor and a knowledge-base editor.

http://jena.sourceforge.net/
Jena is a Java framework for building Semantic Web applications. It provides a programmatic environment for RDF, RDFS and OWL, including a rule-based inference engine.

http://simile.mit.edu/
Semantic Interoperability of Metadata and Information in unLike Environments

http://www.dspace.org/
DSpace is a groundbreaking digital library system that captures, stores, indexes, preserves and redistributes the intellectual output of a university’s research faculty in digital formats.

http://kowari.sourceforge.net/
The Kowari Metastore™ is an Open Source, massively scalable, transaction-safe, purpose-built database for the storage and retrieval of metadata.

http://4suite.org/index.xhtml
4Suite is a platform for XML processing and knowledge-management. It allows users to take advantage of standard XML technologies rapidly and to develop and integrate Web-based applications.

http://kaon.semanticweb.org/
KAON is an open-source ontology management infrastructure targeted for business applications. It includes a comprehensive tool suite allowing easy ontology creation and management, as well as building ontology-based applications.

http://librdf.org/
Redland is a set of free software packages that provide support for the Resource Description Framework (RDF).

http://www.ninebynine.org/RDFNotes/Swish/Intro.html
Swish is a framework, written in the purely functional programming language Haskell, for performing deductions in RDF data using a variety of techniques. Swish is conceived as a toolkit for experimenting with RDF inference, and for implementing stand-alone RDF file processors (usable in similar style to CWM, but with a view to being extensible in declarative style through added Haskell function and data value declarations). It explores Haskell as “a scripting language for the Semantic Web”.

http://www.w3.org/2000/10/swap/doc/cwm.html
Cwm is a general-purpose data processor for the semantic web, somewhat like sed, awk, etc. for text files or XSLT for XML. It is a forward chaining reasoner which can be used for querying, checking, transforming and filtering information. Its core language is RDF, extended to include rules, and it uses RDF/XML or RDF/N3 (see Notation3 Primer) serializations as required.

http://www.ontotext.com/kim/

KIM is a software platform for:

  • Semantic annotation of text.
    At more length: automatic ontology population and open-domain dynamic semantic annotation of unstructured and semi-structured content for Semantic Web and KM applications.
  • Indexing and retrieval (an IE-enhanced search technology).
  • Query and exploration of formal knowledge.

2. Commercial Uses

http://www.tucanatech.com/

With Tucana Information Management Suite at the core of your Enterprise Information Integration (EII) strategy you bring all the power of enterprise knowledge together and put it in the hands of your engineers, scientists, bankers, salespeople or managers.

http://www.siderean.com/
Siderean’s flagship product, Seamark Server, is a faceted navigation platform that delivers an effective and economical standards-based solution that dramatically improves information access across distributed repositories of content, data, software components and digital assets in the enterprise.

http://aduna.biz/index.html

AutoFocus helps you to search and find information on your PC, network disks, mail boxes, websites and enterprise information sources.

The Evaluation of Semantic Web Portals

In semanticweb on November 3, 2004 at 9:57 AM

Semantic Web Portals – State of the Art Survey
Authors:
Holger Lausen, Michael Stollberg,
Rubén Lara Hernández, Ying Ding, Sung-Kook Han,
Dieter Fensel
DERI Technical Report 2004-04-03

Definition of a Semantic Web Portal:
1. It is a web portal. A web portal is a web site that collects information for a group of users that have common interests.
2. It is a web portal for a community to share and exchange information
3. It is a web portal based on semantic web technologies.

The paper evaluates SW portals on three layers:
1. Grounding Technologies
2. Information Processing
3. Information Access

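Grounding Technologies is further split into two layers: System Technology (database systems, document storage management and so on) and Semantic Web Technology (ontology representation and storage and so on, which also touches on System Technology).
Information Processing mainly covers workflow management (creating, publishing, organizing, accessing and maintaining documents) and collaboration (both synchronous and asynchronous). The paper also distinguishes between a group and a community, which is cool.
Information Access consists of usability and overall evaluation; I did not find much useful information in that part.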

Finally, the paper compares and evaluates four SW portals; the results are shown in the table below:

The Evaluation Result of SW Portal