Archive for the ‘java’ Category

Summary of "I Love Lucene"

In java, programming on January 18, 2005 at 1:57 PM

A few days ago I read an article on TheServerSide, I Love Lucene, and found it very helpful, so I wrote a short summary to organize my thoughts.

I Love Lucene

by Dion Almaer January 2005

Introduction

Briefly describes the search solution TheServerSide used previously and uses it to introduce Lucene.

High level infrastructure

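Presents the Lucene-based solution at a high level. It is split into two main parts: building the index and searching the index. The main interfaces of the two parts, IndexBuilder and IndexSearch, are introduced in turn.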

Building the Index: Details of the index building process

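The most important part of the article. It covers the following four topics:

1. The fields that should be indexed

2. Indexing modes: incremental indexing and batch indexing

3. Types of index sources

4. Ranking of indexed results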

What fields should compose our index?

Discusses which data type to use for each index field.

What types of indexing?

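The article combines incremental and batch indexing. An interval is defined: a batch index is run once every interval, and incremental indexing is performed in the time between batch runs. It also explains how to delete index records in Lucene.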
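The combined schedule described here can be sketched roughly as follows. This is only an illustration in plain Java, not the article's code and not Lucene's API: the class `IndexSchedule`, the method `nextMode`, and the 24-hour interval are all hypothetical names and values chosen for the example.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: decides whether the next indexing run is a full batch rebuild or an incremental update. */
final class IndexSchedule {
    enum Mode { BATCH, INCREMENTAL }

    private final Duration batchInterval;

    IndexSchedule(Duration batchInterval) {
        this.batchInterval = batchInterval;
    }

    /** Returns BATCH if no batch run has happened yet or the interval has elapsed, otherwise INCREMENTAL. */
    Mode nextMode(Instant lastBatch, Instant now) {
        if (lastBatch == null) {
            return Mode.BATCH;
        }
        Duration elapsed = Duration.between(lastBatch, now);
        return elapsed.compareTo(batchInterval) >= 0 ? Mode.BATCH : Mode.INCREMENTAL;
    }

    public static void main(String[] args) {
        IndexSchedule schedule = new IndexSchedule(Duration.ofHours(24)); // assumed interval
        Instant lastBatch = Instant.parse("2005-01-18T00:00:00Z");
        // Six hours after the last batch run: still inside the interval.
        System.out.println(schedule.nextMode(lastBatch, Instant.parse("2005-01-18T06:00:00Z")));
        // Exactly one interval later: time for the next batch run.
        System.out.println(schedule.nextMode(lastBatch, Instant.parse("2005-01-19T00:00:00Z")));
    }
}
```

A scheduler would call `nextMode` before each run and then either rebuild the whole index or add only the records changed since the previous run.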

What to index?

ThreadIndexSource

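Covers indexing different kinds of sources, such as data in a database and files in a file system. It also describes a small trick that came up when indexing posts in the TheServerSide forums.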

How to tweak the ranking of records?

Assigns different weights to different fields so that a document is ranked more sensibly.
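To make the idea of per-field weights concrete, here is a minimal sketch in plain Java. It is not Lucene's scoring formula (which also takes term frequency, inverse document frequency and length normalization into account) and not the article's code. The class `FieldBoostSketch`, the field names and the weights are assumptions made up for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of per-field weighting; not Lucene's scoring formula. */
final class FieldBoostSketch {

    /** Sums, over all fields, the number of query-term matches multiplied by that field's boost. */
    static double score(Map<String, Integer> matchesPerField, Map<String, Double> boosts) {
        double total = 0.0;
        for (Map.Entry<String, Integer> entry : matchesPerField.entrySet()) {
            // Fields without a configured boost count with weight 1.0.
            double boost = boosts.getOrDefault(entry.getKey(), 1.0);
            total += entry.getValue() * boost;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("title", 2.0); // assumed weights, for illustration only
        boosts.put("body", 1.0);

        Map<String, Integer> titleHit = Collections.singletonMap("title", 1);
        Map<String, Integer> bodyHit = Collections.singletonMap("body", 1);
        // A match in the boosted title field outranks the same match in the body.
        System.out.println(score(titleHit, boosts) > score(bodyHit, boosts));
    }
}
```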

Searching the index

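With Lucene, building the index appears to be the complicated part, while searching it is extremely simple. The article covers searching in a short section, mainly introducing the search method of the IndexSearch class and the query parsing class CustomQueryParser.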

Configuration: One place to rule them all

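This part mainly explains how to use an XML file to configure search parameters (such as the index location and field weights) dynamically. It has little to do with Lucene itself. It is mostly about IoC (inversion of control), and it shows how to use Apache Digester.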
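As a rough sketch of what reading such settings from XML involves, the example below uses the JDK's built-in DOM parser rather than Apache Digester, so that it runs without extra libraries. The XML layout, the element and attribute names, and the `SearchConfig` class are all invented for illustration and are not taken from the article.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical search settings loaded from XML with the JDK's DOM parser (not Digester). */
final class SearchConfig {
    final String indexDir;
    final Map<String, Double> fieldBoosts = new LinkedHashMap<>();

    private SearchConfig(String indexDir) {
        this.indexDir = indexDir;
    }

    /** Parses a document shaped like the made-up example in main(). */
    static SearchConfig parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element root = doc.getDocumentElement();
        SearchConfig config = new SearchConfig(root.getAttribute("indexDir"));
        NodeList fields = root.getElementsByTagName("field");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            config.fieldBoosts.put(field.getAttribute("name"),
                    Double.parseDouble(field.getAttribute("boost")));
        }
        return config;
    }

    public static void main(String[] args) throws Exception {
        // Invented configuration: an index location plus one weight per field.
        String xml = "<search indexDir=\"/tmp/index\">"
                + "<field name=\"title\" boost=\"2.0\"/>"
                + "<field name=\"body\" boost=\"1.0\"/>"
                + "</search>";
        SearchConfig config = parse(xml);
        System.out.println(config.indexDir + " " + config.fieldBoosts);
    }
}
```

Keeping these values in a file means the index location or the field weights can change without recompiling the search code, which is the point the article makes with Digester.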

XML Configuration File

Digester Rules File

Web Tier: TheSeeeeeeeeeeeerverSide?

The web interface that users search through, built with an MVC structure.

SearchAssembler Web Action

Builds the query from the user's input, hands the query to IndexSearch for processing, and is also responsible for wrapping the query results.

Search View

The presentation layer uses JSP (for legacy reasons). Judging from replies to the post on TheServerSide, it seems that TheServerSide will move to Apache Tapestry in the future.

Conclusion

The conclusion: Lucene is very good.


Deployment Descriptor Elements and Tag Library Descriptor

In java on January 18, 2005 at 8:58 AM

The following document covers the meaning of every element in web.xml, the servlet deployment descriptor file.

web.xml Deployment Descriptor Elements

The following document describes the steps for creating a Tag Library descriptor and the meaning of each element in it.

Creating a Tag Library Descriptor

The following sections describe how to create a tag library descriptor (TLD) file:


Lucene in Action Has Been Published

In java, other, programming on January 11, 2005 at 7:14 AM

http://www.lucenebook.com/

Lucene In Action At Amazon

by Erik Hatcher, Otis Gospodnetic

Lucene in Action

Lucene is a gem in the open-source world–a highly scalable, fast search engine. It delivers performance and is disarmingly easy to use. Lucene in Action is the authoritative guide to Lucene. It describes how to index your data, including types you definitely need to know such as MS Word, PDF, HTML, and XML. It introduces you to searching, sorting, filtering, and highlighting search results. Lucene powers search in surprising places–in discussion groups at Fortune 100 companies, in commercial issue trackers, in email search from Microsoft, in the Nutch web search engine (that scales to billions of pages). It is used by diverse companies including Akamai, Overture, Technorati, HotJobs, Epiphany, FedEx, Mayo Clinic, MIT, New Scientist Magazine, and many others. Adding search to your application can be easy. With many reusable examples and good advice on best practices, Lucene in Action shows you how.

What’s Inside

How to integrate Lucene into your applications
Ready-to-use framework for rich document handling
Case studies including Nutch, TheServerSide, jGuru, etc.
Lucene ports to Perl, Python, C#/.Net, and C++
Sorting, filtering, term vectors, multiple, and remote index searching
The new SpanQuery family, extending query parser, hit collecting
Performance testing and tuning
Lucene add-ons (hit highlighting, synonym lookup, and others)
Foreword by Doug Cutting, the inventor of Lucene

WHAT THE READERS SAY ABOUT THIS BOOK…

“I bought the Lucene in Action ebook, which is excellent and I can strongly recommend [it]. …Thanks to the authors for Lucene in Action, it’s given me the high level best practices I was needing.” — Steve S.

ABOUT THE AUTHORS…

A committer on the Ant, Lucene, and Tapestry open-source projects, Erik Hatcher is coauthor of Manning’s award-winning Java Development with Ant. Otis Gospodnetić is a Lucene committer, a member of Apache Jakarta Project Management Committee, and maintainer of the jGuru’s Lucene FAQ. Both authors have published numerous technical articles including several on Lucene.

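This is exactly the book I want. I don't know how long it will take before it is brought to China, so I'm looking forward to it... Many good books on Amazon are unavailable in China. For Semantic Web books, there is nothing apart from Dr. Song Wei's 《语义网简明教程》 (A Concise Tutorial on the Semantic Web), and even that one is not a foreign publication. Technology lagging a step behind and technical books lagging a step behind are certainly related!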


Java Release Dates and Code Names

In java, other on December 30, 2004 at 3:18 PM

Released versions:
Version	Code name	Meaning	Release date
JDK 1.1.4	Sparkler	Gem	1997-09-12
JDK 1.1.5	Pumpkin	Pumpkin	1997-12-13
JDK 1.1.6	Abigail	Abigail (a woman's name)	1998-04-24
JDK 1.1.7	Brutus	Brutus (Roman politician and general)	1998-09-28
JDK 1.1.8	Chelsea	Chelsea (a city name)	1999-04-08
J2SE 1.2	Playground	Playground	1998-12-04
J2SE 1.2.1	none	None	1999-03-30
J2SE 1.2.2	Cricket	Cricket	1999-07-08
J2SE 1.3	Kestrel	American kestrel	2000-05-08
J2SE 1.3.1	Ladybird	Ladybird	2001-05-17
J2SE 1.4.0	Merlin	Merlin (a falcon)	2002-02-13
J2SE 1.4.1	Grasshopper	Grasshopper	2002-09-16
J2SE 1.4.2	Mantis	Mantis	2003-06-26
Future versions:
Version	Code name	Meaning	Status
J2SE 5.0 (1.5.0)	Tiger	Tiger	Beta released
J2SE 5.1 (1.5.1)	Dragonfly	Dragonfly	Not yet released
J2SE 6.0 (1.6.0)	Mustang	Mustang (wild horse)	Not yet released

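I have also heard that J2SE 7.0 is called Dolphin. As you can see, there has been a pattern since 1.2.2: every name is an insect or an animal. It seems J2SE should really be renamed J2ZE (Java 2 Zoo Edition).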


Building Java Application UIs with XML

In java, programming on December 28, 2004 at 12:58 AM

1.2

SwiXml is a small GUI generating engine for Java applications and applets. Graphical User Interfaces are described in XML documents that are parsed at runtime and rendered into javax.swing objects.

SwingML
Swing Markup Language

SwingML is an effort to create a markup language to render in a web browser JFC/Swing based graphical user interfaces.

:: Java User Interface Design ::

We make Java look good and work well

JGoodies focuses on Java look, UI design and usability. We provide articles, libraries, example applications, desktop patterns and a Swing application architecture.

Jelly : Executable XML

Jelly is a tool for turning XML into executable code. So Jelly is a Java and XML based scripting and processing engine. Jelly can be used as a more flexible and powerful front end to Ant such as in the Maven project, as a testing framework such as JellyUnit, in an integration or workflow system such as werkflow, or as a page templating system inside engines like Cocoon.

Thinlet is a GUI toolkit in a single Java class. It parses the hierarchy and properties of the GUI, handles user interaction, and calls business logic. It separates the graphic presentation (described in an XML file) from the application methods (written as Java code).

JDesktop Network Components (JDNC)

The goal of the JDesktop Network Components (JDNC) project is to significantly reduce the effort and expertise required to build rich, data-centric, Java desktop clients for J2EE-based network services. These clients are representative of what enterprise developers typically build, such as SQL database frontends, forms-based workflow, data visualization applications, and the like.

JDNC leverages the power of J2SE and Swing while providing a higher level API, as well as an optional XML markup language, which enables common user-interface functionality to be constructed more quickly, without requiring extensive Swing or GUI programming skill. Additionally, JDNC simplifies the task of connecting a rich client to a J2EE backend, including JDBC and WebServices.


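Documentation, Documentation

In java, learning on December 3, 2004 at 6:55 AM

I spent a long time on Pluto today, reading the Apache site and developerWorks side by side and combining the Pluto material from both, but I still could not figure out how to deploy a portlet on Pluto. There is too little documentation, and what exists is out of date. After more than two hours I had not even gotten a HelloWorld portlet working. Sigh. You only realize how little documentation there is when you need it.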


Content Extract

In java, programming on November 29, 2004 at 6:17 AM

1. Noncommercial Products:

http://www.etymon.com/epub.html

PJX specifically included significantly faster reading and writing of PDF documents, thread safety, “on demand” reading and parsing of PDF objects which greatly reduces memory usage and processing time, incremental update support to enable fast modification of PDF documents, reading PDF documents from either disk or memory, thorough documentation of the class library interface, support for J2SE collection classes and NIO, access to form/field objects, rudimentary support for insertion of images and watermarks, appending of large documents, and design patterns for recursive processing of PDF objects.

http://www.pdfbox.org/

PDFBox is a Java PDF Library. This project will allow access to all of the components in a PDF document. More PDF manipulation features will be added as the project matures. This ships with a utility to take a PDF document and output a text file.

http://www.foolabs.com/xpdf/about.html

Xpdf is an open source viewer for Portable Document Format (PDF) files. (These are also sometimes called ‘Acrobat’ files, from the name of Adobe’s PDF software.) The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other utilities.

http://jakarta.apache.org/poi/index.html

The POI project consists of APIs for manipulating various file formats based upon Microsoft’s OLE 2 Compound Document format using pure Java. In short, you can read and write MS Excel files using Java. Soon, you’ll be able to read and write Word files using Java. POI is your Java Excel solution as well as your Java Word solution. However, we have a complete API for porting other OLE 2 Compound Document formats and welcome others to participate.

2. Commercial Products:

http://tonicsystems.com/products/

Tonic Systems is the leading PowerPoint® automation specialist. Each of our products has been born from our experience developing solutions to real life business challenges, and each has been proven extensively in enterprise environments.

We have developed our range of products in response to customer demand. These 100% java, server-side products are robust and scalable to meet the needs of the most demanding environments.

http://snowtide.com/home/PDFTextStream/

PDFTextStream is the ideal solution for Java applications and J2EE web services that need to rapidly and accurately extract text and document metadata from PDF files.

Update:

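The tools above are mostly for extracting content from PDF. Later I used the Aperture framework and found it very easy to use. Although it has not been officially released yet, the code in CVS already supports information extraction from the vast majority of document formats, including the Office family (Word, Excel, PowerPoint, Publisher and so on, surprisingly complete), the OpenOffice family, PDF and plain text. It is also easy to extend (I wrote some content extraction for databases myself). It feels like a one-size-fits-all, one-stop solution and is well worth a try.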


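Jena with latest release of MySQL

In java, programming, semanticweb on November 18, 2004 at 12:20 AM

Sigh, this was frustrating. I downloaded the latest release, MySQL 4.1.7, which is already a generally available version, yet for some reason there was still a problem. When Jena created a model in the database, it threw an exception:

WARN [main] (DriverRDB.java:382) - Problem formatting database
>> java.sql.SQLException: Syntax error or access violation message from server: "Specified key was too long; max key length is 1024 bytes"

I asked Dave, the Jena maintainer (a very nice person who replied quickly and carefully, thank you). The conclusion was that the key length Jena sets in MySQL must not exceed 250, otherwise the exception is thrown, but by default it should not exceed 250. In the end I could only attribute it to a conflict between Jena and MySQL 4.1.7. After switching to MySQL 4.0.22 the problem indeed went away (although the installation and setup process of 4.0.22 is far worse than that of 4.1.7 and not user-friendly). I wasted a lot of time on this problem and gained little from solving it. My only takeaway is that there are many good people on the Jena-dev mailing list.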


Java Encoding Problem

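In java, programming on November 16, 2004 at 11:18 AM

For the past few days I have been plagued by Java encoding problems. I edit RDF with Protege and then parse it with Jena, and my Java IDE is Eclipse. All of these are Java tools, and the Java I/O API is extremely complicated, so when garbled text appeared I had no idea which step had gone wrong. I have finally found that none of these steps is at fault and that the problem is in my own code. It is not fully solved yet, though. I have only fully understood what is going on. Protege saves RDF files in UTF-8: no problem. Jena's RDF parsing does not deal with character encoding: no problem. Eclipse chooses its encoding based on the platform the Java VM runs on, which is GBK on Chinese-locale Windows. In the Java I/O API, FileReader and FileWriter use the platform's GBK encoding, and the standard output stream System.out uses GBK as well.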

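At first I did not think about encoding and went straight to processing with Jena, which left me completely confused. Then I set Jena aside and read and wrote the RDF directly with I/O, but the result was still garbled. Later I found that I should not use FileReader and FileWriter directly. I should use InputStreamReader and OutputStreamWriter and set the encoding of both to UTF-8 when handling the file. That way the file written to disk is no longer garbled. But output to the standard output stream is still garbled, which means the Eclipse console still shows garbage (because System.out uses GBK). This is only a brief account of the start and the outcome. I took far too many detours in between to recount them.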
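The InputStreamReader and OutputStreamWriter approach can be sketched as follows. This is a minimal example, not the code from the post: the class `Utf8FileIo` and the temporary file are hypothetical. It uses `StandardCharsets.UTF_8`, which needs Java 7 or later. On the JDK 1.4 of the time, the charset name `"UTF-8"` would be passed as a string instead. The Chinese sample text is written with Unicode escapes so that the source file itself does not depend on any encoding.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

/** Reads and writes text with an explicit UTF-8 charset instead of the platform default. */
final class Utf8FileIo {

    /** Writes the text to the file, encoding characters as UTF-8. */
    static void write(File file, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    /** Reads the whole file, decoding its bytes as UTF-8. */
    static String read(File file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("utf8-demo", ".rdf"); // hypothetical temp file
        file.deleteOnExit();
        String text = "\u4e2d\u6587 RDF label"; // two Chinese characters followed by ASCII text
        write(file, text);
        // The round trip preserves the text regardless of the platform default encoding.
        System.out.println(read(file).equals(text));
    }
}
```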

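I still have not figured out how to solve this. I have thought of two approaches, neither very good. One is to redirect System.out to a file, write it in UTF-8, and view it in an external editor, but that is clearly inconvenient: it hampers debugging, and having to redirect every time System.out is involved is a nuisance. The other is to see whether the System.out class can be rewritten to change its encoding to UTF-8. But I do not know how to rewrite it, and even after changing it to UTF-8 I am not sure the Eclipse console could display it (Eclipse uses its container's GBK encoding, and I do not know whether the console changes its encoding to match System.out). Sigh, I am out of ideas...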
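For the second approach, one point may help: System.out is a PrintStream field rather than a class, so nothing needs to be rewritten. It can be replaced with System.setOut. The sketch below wraps the existing stream in a PrintStream that encodes as UTF-8. The class `Utf8Console` and the helper `utf8` are hypothetical names. Whether the console then renders the text correctly still depends on the encoding the console itself decodes with, which this code cannot control.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

/** Hypothetical helper that makes console output use UTF-8 instead of the platform default. */
final class Utf8Console {

    /** Wraps a stream in an auto-flushing PrintStream that encodes characters as UTF-8. */
    static PrintStream utf8(OutputStream target) throws UnsupportedEncodingException {
        return new PrintStream(target, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Replace the default System.out once, at program start.
        System.setOut(utf8(System.out));
        System.out.println("\u4e2d\u6587"); // two Chinese characters, emitted as UTF-8 bytes
    }
}
```

Because the replacement happens once at startup, the rest of the program keeps calling System.out as before, which avoids the per-use redirection worried about above.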


niyue

tao of yue
