Created in: 2006-05-27 21:52:09
Author: martin
Size: 14823 bytes
Last updated: 2006-05-27 21:52:09
This test application uses the jLibrary API to extract documents published in Spain's Official State Bulletin (Boletín Oficial del Estado, BOE). The bulletin carries information on many topics related to the running of the state, and it is usually published daily.
It includes public tenders, entrance examinations, political information, and so on. It is a very valuable source of information for individuals and companies.
The application uses a simple parser to obtain the contents of all the PDF documents published during an entire month. Once the documents have been retrieved, they are stored in a jLibrary repository using the jLibrary API. Note that a repository named "boe" must already exist for this example to run successfully.
This is a simple example of how to use the jLibrary API programmatically to grab and store content crawled from somewhere else. You could extend it by adding categories, relations, or more information to the documents, but I have kept things very simple. The example can also serve as a starting point for other extraction systems, for example ones that read from databases.
This example assumes that you have already created a repository named "boe".
In the first lines of the BOECreatorTest file you can see how to use the jLibrary API to connect to that repository:
// Credentials for the default administrator account
Credentials credentials = new Credentials();
credentials.setUser(User.ADMIN_NAME);
credentials.setPassword(User.DEFAULT_PASSWORD);
// Profile describing a local server
ServerProfile profile = new ServerProfile();
profile.setType(ServerProfile.LOCAL);
profile.setLocation("");
profile.setName("Local server");
// Obtain the client interface and the services it exposes
ClientInterface jci = profile.getJCI();
RepositoryService rs = jci.getRepositoryService();
SecurityService ss = jci.getSecurityService();
// Log in to the "boe" repository to obtain a ticket
Ticket ticket = ss.login(credentials, "boe");
Now that you have a ticket for the repository, the next step is to fetch the repository itself:
Repository repository = rs.findRepository("boe", ticket);
repository.setServerProfile(profile);
As you can see in the example above, once you have found the repository you should set its associated server profile. This step isn't mandatory, but it is useful when you are building on the jLibrary client interface, because that interface expects to find the server profile directly on the repository objects.
This example stores the crawled documents in a year/month/day directory structure, so whenever one of these directories does not exist the sample has to create it. The following snippet shows how the sample creates a directory:
// Start from the repository root
Directory directory = repository.getRoot();
Directory yearDir = null;
// Reuse the year directory if it already exists, otherwise create it
if (existsChild(ticket, rs, directory, String.valueOf(year))) {
    yearDir = (Directory) getChild(directory, String.valueOf(year));
} else {
    System.out.println("[BOECreatorTest] Creating directory " + String.valueOf(year));
    yearDir = rs.createDirectory(ticket, String.valueOf(year), "BOE for year : " + year, directory.getId());
}
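The same get-or-create step repeats for the month and day levels. As a rough, self-contained sketch of that pattern, the code below walks a year/month/day chain over an in-memory tree. DirSketch and its methods are hypothetical stand-ins invented for illustration; they are not jLibrary classes, and a real implementation would call the repository service as shown above.

```java
import java.time.LocalDate;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a repository directory; not a jLibrary class.
class DirSketch {
    final String name;
    final Map<String, DirSketch> children = new LinkedHashMap<>();

    DirSketch(String name) {
        this.name = name;
    }

    // Returns the child with the given name, creating it only when it is missing.
    DirSketch getOrCreate(String childName) {
        return children.computeIfAbsent(childName, DirSketch::new);
    }

    // Walks the year/month/day chain below root, creating missing levels.
    static DirSketch dayDirectory(DirSketch root, LocalDate date) {
        DirSketch yearDir = root.getOrCreate(String.valueOf(date.getYear()));
        DirSketch monthDir = yearDir.getOrCreate(String.valueOf(date.getMonthValue()));
        return monthDir.getOrCreate(String.valueOf(date.getDayOfMonth()));
    }

    public static void main(String[] args) {
        DirSketch root = new DirSketch("root");
        DirSketch first = dayDirectory(root, LocalDate.of(2006, 5, 27));
        DirSketch second = dayDirectory(root, LocalDate.of(2006, 5, 28));
        // Both days share the same year and month directories.
        System.out.println(first.name + " and " + second.name + " live under one month directory");
    }
}
```

Asking for the same date twice returns the same directory object, which is the property the existence check in the sample is there to guarantee.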
Getting the children of a directory is easy, but it requires some background. If you are familiar with the repository configuration options, you will know that jLibrary repositories can be lazy or non-lazy. A non-lazy repository, when you fetch it, carries the information for all of its nodes inside the returned Repository instance. A lazy repository, when you fetch it, carries only the first level of information, that is, the root node.
Handling non-lazy repositories is trivial: you already have all the information, so to get the children of a node you only have to call the getNodes() method of the Node class.
Handling lazy repositories is a bit different. Every node has a hasChildren() method that tells you whether the node has children. You have to combine this method with getNodes() to decide whether the node's children still need to be looked up. Look at the existsChild sample method:
private boolean existsChild(Ticket ticket,
                            RepositoryService rs,
                            Directory parent,
                            String childName) throws Exception {
    // No children loaded yet although the node reports having some
    if (((parent.getNodes() == null) || (parent.getNodes().size() == 0)) && (parent.hasChildren())) {
        // This is a lazy node: fetch its children and cache them on the parent
        Collection child = rs.findNodeChildren(ticket, parent.getId());
        parent.setNodes(new HashSet(child));
    }
    if (parent.getNodes() != null) {
        Iterator it = parent.getNodes().iterator();
        while (it.hasNext()) {
            // A raw Iterator returns Object, so the cast is required
            Node node = (Node) it.next();
            if (node.getName().equals(childName)) {
                return true;
            }
        }
    }
    return false;
}
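The directory snippet earlier also calls a getChild helper that is not shown on this page. The sketch below is only a guess at its general shape, written against hypothetical stand-in types so that it compiles on its own: LazyNode replaces the jLibrary Node class, and the loader function plays the role of rs.findNodeChildren. It reuses the lazy check from existsChild and returns the matching child instead of a boolean.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

// Hypothetical stand-in for a repository node; not the jLibrary Node class.
class LazyNode {
    final String id;
    final String name;
    final boolean hasChildren;
    Set<LazyNode> nodes; // stays null until the children have been fetched

    LazyNode(String id, String name, boolean hasChildren) {
        this.id = id;
        this.name = name;
        this.hasChildren = hasChildren;
    }
}

class LazyChildren {
    // Fetches the children once when the node is lazy, mirroring the check in existsChild.
    static void ensureLoaded(LazyNode parent, Function<String, Collection<LazyNode>> loader) {
        boolean nothingLoaded = parent.nodes == null || parent.nodes.isEmpty();
        if (nothingLoaded && parent.hasChildren) {
            parent.nodes = new HashSet<>(loader.apply(parent.id));
        }
    }

    // Returns the child with the given name, or null when there is none.
    static LazyNode getChild(LazyNode parent, String childName,
                             Function<String, Collection<LazyNode>> loader) {
        ensureLoaded(parent, loader);
        if (parent.nodes == null) {
            return null;
        }
        for (LazyNode node : parent.nodes) {
            if (node.name.equals(childName)) {
                return node;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        LazyNode root = new LazyNode("1", "root", true);
        // Stand-in for a repository lookup: the root has a single child named "2006".
        Function<String, Collection<LazyNode>> loader =
                id -> Arrays.asList(new LazyNode("2", "2006", false));
        System.out.println("2006 found: " + (getChild(root, "2006", loader) != null));
        System.out.println("2005 found: " + (getChild(root, "2005", loader) != null));
    }
}
```

Because the fetched children are cached on the parent, the second lookup does not call the loader again.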
To create a document, first build a Document instance with all its attributes; this class holds all the information for the document. Building the instance is straightforward if you already have the data:
private static DocumentProperties createDocumentProperties(
        Ticket ticket, Repository repository, Directory parent, BOEFile boeFile)
        throws Exception {
    // Extract the header metadata from the PDF content
    PDFExtractor extractor = new PDFExtractor();
    ByteArrayInputStream bais = new ByteArrayInputStream(boeFile.getContent());
    HeaderMetaData hmd = extractor.extractHeader(bais);
    bais.close();
    // Build the document metadata, falling back to defaults for missing values
    DocumentMetaData metadata = new DocumentMetaData();
    metadata.setAuthor(Author.UNKNOWN);
    metadata.setKeywords(hmd.getKeywords() != null ? hmd.getKeywords() : "boe");
    metadata.setLanguage(hmd.getLanguage() != null ? hmd.getLanguage() : "es_ES");
    metadata.setTitle(hmd.getTitle() != null ? hmd.getTitle() : boeFile.getName());
    metadata.setUrl(boeFile.getURL());
    metadata.setDate(new Date());
    // Build the document node itself
    Document document = new Document();
    document.setTypecode(Types.getTypeForFile(boeFile.getName()));
    document.setName(boeFile.getName());
    document.setDescription(hmd.getDescription() != null ? hmd.getDescription() : metadata.getTitle());
    document.setExternal(false);
    document.setMetaData(metadata);
    document.setNodes(new TreeSet());
    document.setNotes(new HashSet());
    document.setParent(parent.getId());
    document.setReference(false);
    document.setRepository(parent.getRepository());
    document.setImportance(Node.IMPORTANCE_MEDIUM);
    document.setCreator(ticket.getUser().getId());
    document.setSize(new BigDecimal(boeFile.getContent().length));
    document.setDate(new Date());
    document.setPath("/" + boeFile.getName());
    // Dump the properties and add the category returned by findUnknownCategory
    Category unknownCategory = findUnknownCategory(repository);
    DocumentProperties docProperties = document.dumpProperties();
    docProperties.addProperty(DocumentProperties.DOCUMENT_ADD_CATEGORY,
            unknownCategory.getId());
    // Attach the binary content of the file
    docProperties.addProperty(DocumentProperties.DOCUMENT_CONTENT, boeFile.getContent());
    return docProperties;
}
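The method above repeats the same "use the extracted value, or fall back to a default" check for keywords, language, title and description. One possible way to tidy that up is a small generic helper, sketched below. The Fallbacks class and orDefault method are hypothetical names introduced here for illustration; they are not part of the sample or of jLibrary.

```java
// Hypothetical helper, not part of jLibrary or the BOECreatorTest sample.
class Fallbacks {
    // Returns value when it is non-null, otherwise the fallback.
    static <T> T orDefault(T value, T fallback) {
        return value != null ? value : fallback;
    }

    public static void main(String[] args) {
        String extractedLanguage = null;      // as if the PDF header had no language
        String extractedKeywords = "tenders"; // as if the PDF header had keywords
        System.out.println(orDefault(extractedLanguage, "es_ES"));
        System.out.println(orDefault(extractedKeywords, "boe"));
    }
}
```

With such a helper, a line like the keywords assignment could read metadata.setKeywords(orDefault(hmd.getKeywords(), "boe")), which also avoids calling the getter twice.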
Once you have the document properties, creating the document is as easy as creating a directory:
DocumentProperties properties = createDocumentProperties(ticket, repository, dayDir, boeFile);
// Create the document in the repository
Document document = rs.createDocument(ticket, properties);
// Add the new document to the in-memory day directory as well
dayDir.getNodes().add(document);
Finally, when you are done with the repository you should always close your connection to it to release resources. This is as simple as calling the disconnect method of the security service:
// Logout
ss.disconnect(ticket);
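If the crawl fails halfway through, a plain call at the end of the program is never reached. A common Java way to guard against that is a try/finally block, sketched below. SessionSketch is a hypothetical stand-in that only records calls; it is not the jLibrary SecurityService, and its string ticket is an assumption made so that the example runs on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a login/disconnect service; not a jLibrary class.
class SessionSketch {
    final List<String> log = new ArrayList<>();

    String login(String repositoryName) {
        log.add("login " + repositoryName);
        return "ticket-" + repositoryName;
    }

    void disconnect(String ticket) {
        log.add("disconnect " + ticket);
    }

    // Runs the work and always disconnects afterwards, even when the work throws.
    void withSession(String repositoryName, Runnable work) {
        String ticket = login(repositoryName);
        try {
            work.run();
        } finally {
            disconnect(ticket);
        }
    }

    public static void main(String[] args) {
        SessionSketch service = new SessionSketch();
        try {
            service.withSession("boe", () -> {
                throw new IllegalStateException("crawl failed");
            });
        } catch (IllegalStateException e) {
            System.out.println("work failed: " + e.getMessage());
        }
        // The log still ends with a disconnect entry.
        System.out.println(service.log);
    }
}
```

In the sample itself the equivalent would be to wrap the crawling code in try and move ss.disconnect(ticket) into the finally block.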
If you run the program you will get a repository with all the content of the Spanish BOE. Here is a screenshot:

As you can see, the jLibrary API is a higher-level interface for working with jLibrary repositories. It abstracts away the low-level JCR details and lets you work with repository data easily. Everything in this example could also be done by hand in a web application using the JCR API, but for working with jLibrary repositories this is a much more productive approach.
You can find more information about this API by browsing the source code, reading the javadoc, or visiting the forums and mailing lists.
-Martin