jLibrary

Open Source Document Management System from your Desktop


Information

Created in: 2006-05-27 21:52:09

Author: martin

Size: 14823 bytes

Last updated: 2006-05-27 21:52:09

Using the jLibrary API to add information to a repository

This test application uses the jLibrary API to extract documents published in Spain's official state bulletin, the Boletín Oficial del Estado (BOE). The bulletin carries information on many topics related to the running of the state, and it is usually published daily.

It covers public tenders, entrance examinations, political information, and so on. It is a very valuable source of information for individuals and companies.

The application uses a simple parser to obtain all the PDF documents published during an entire month. Once the documents have been downloaded, they are stored in a jLibrary repository using the jLibrary API. Note that a repository named "boe" must already exist for this example to run successfully.

This is a simple example of how to use the jLibrary API programmatically to grab and store content crawled from elsewhere. You could extend it by adding categories, relations, or more information to the documents, but I have kept things simple. The example can also serve as a starting point for other extraction systems, for example ones that pull content from databases.

Example structure

This example is very simple and consists of only three classes:
  • BOECreatorTest: The main class. It crawls the information for an entire month. Note that this is a huge amount of data (more than 600 MB). If you prefer, you can modify it to extract the information for a single day only.
  • BOEParser: The parser class. It accesses the HTML pages and collects the PDF files they link to. It uses the Apache HttpClient library to download the binary content and the HtmlParser library to parse the web pages.
  • BOEFile: A simple holder class that stores the contents of each downloaded entry.
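
The holder class described in the last bullet could look roughly like the sketch below. This is not the actual BOEFile source: the class name DownloadedEntry is hypothetical, and only the three accessors (getName(), getContent(), getURL()) mirror calls that appear later in this article. The defensive copies are my own assumption about how such a holder might protect its data.

```java
import java.util.Arrays;

// Hypothetical stand-in for a holder such as BOEFile; not the real class.
final class DownloadedEntry {
    private final String name;    // file name of the downloaded PDF
    private final String url;     // address the PDF was fetched from
    private final byte[] content; // raw bytes of the PDF

    DownloadedEntry(String name, String url, byte[] content) {
        this.name = name;
        this.url = url;
        // Copy the array so later changes by the caller do not leak in.
        this.content = Arrays.copyOf(content, content.length);
    }

    String getName() { return name; }

    String getURL() { return url; }

    // Returns a copy so callers cannot modify the stored bytes.
    byte[] getContent() { return Arrays.copyOf(content, content.length); }

    public static void main(String[] args) {
        DownloadedEntry entry = new DownloadedEntry(
                "sample.pdf", "http://example.org/sample.pdf", new byte[] {1, 2, 3});
        System.out.println(entry.getName() + " (" + entry.getContent().length + " bytes)");
    }
}
```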

Connecting to a repository

This example assumes that you have created a repository named "boe".

In the first lines of the BOECreatorTest file you can see how to use the jLibrary API to connect to that repository:

[1]   Credentials credentials = new Credentials();
      credentials.setUser(User.ADMIN_NAME);
      credentials.setPassword(User.DEFAULT_PASSWORD);
 
[2]   ServerProfile profile = new ServerProfile();
      profile.setType(ServerProfile.LOCAL);
      profile.setLocation("");
      profile.setName("Local server");

[3]   ClientInterface jci = profile.getJCI();
[4]   RepositoryService rs = jci.getRepositoryService();
[5]   SecurityService ss = jci.getSecurityService();
[6]   Ticket ticket = ss.login(credentials, "boe");
  • [1] The first thing you must do is build a Credentials object. Here I use the default admin user name and password.

  • [2] jLibrary uses the concept of server profiles. A ServerProfile instance defines the name, location and type of the repository you are going to access. In this example I have created a server profile for a local repository, but accessing a remote repository would be as easy as changing the setType call to profile.setType(ServerProfile.REMOTE) and setting the correct location.

  • [3] To access server services, jLibrary uses the concept of client interfaces (JCI: jLibrary Client Interface). From a profile you can obtain a suitable client interface. jLibrary automatically creates the appropriate client interface for the kind of server you are going to access. For example, if you access a remote server, jLibrary will instantiate a web-services-based client interface for it, and you do not have to write a single line of web services code.

  • [4][5] From a client interface you can obtain the services that give access to the information. jLibrary exposes three different services: search, security management and repository access.

  • [6] The final step is logging in to the repository, using the login method of the security service. This method returns a Ticket instance. The ticket identifies your session with the jLibrary server, and you must pass it to every operation you perform against the server.

Finding a repository

Now that you have a ticket for the repository, the next step is to obtain the repository itself:

      Repository repository = rs.findRepository("boe", ticket);
      repository.setServerProfile(profile);

As the example above shows, once you have found the repository you should set its associated server profile. This step is not mandatory, but it is useful when you build on the jLibrary client interface, because that interface expects to find the server profile directly from the repository objects.

Creating directories

This example stores the crawled documents in a year/month/day directory structure, so whenever one of these directories does not exist the program has to create it. The following snippet shows how the sample creates a directory:

[1]   Directory directory = repository.getRoot();
      Directory yearDir = null;
[2]   if (existsChild(ticket, rs, directory, String.valueOf(year))) {
          yearDir = (Directory)getChild(directory, String.valueOf(year));
      } else {
          System.out.println("[BOECreatorTest] Creating directory " + String.valueOf(year));
[3]       yearDir = rs.createDirectory(ticket, String.valueOf(year), "BOE for year : " + year, directory.getId());
      }
  • [1] Here we get the repository's root directory. It will be the parent of the directory we are going to create.
  • [2] At this point we check whether the directory has already been created. We will look at this method in detail in the next section.

  • [3] The createDirectory method of the repository service creates a directory. It takes four parameters: the ticket that identifies our session on the server, the name of the new directory, the description of the new directory, and the id of the parent directory. As you can see, creating directories, like many other jLibrary operations, is really simple with the jLibrary API.
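
The same check-then-create step repeats for the month and day levels. The sketch below shows that "get or create" pattern on an in-memory tree so it can be run on its own. DirSketch, getOrCreate and ensureDayDir are hypothetical names, not jLibrary classes or methods. In the real program, each getOrCreate would correspond to the existsChild/getChild check and the rs.createDirectory call shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a repository directory; not a jLibrary class.
final class DirSketch {
    final String name;
    final Map<String, DirSketch> children = new LinkedHashMap<String, DirSketch>();

    DirSketch(String name) {
        this.name = name;
    }

    // Returns the named child, creating it only when it is missing.
    DirSketch getOrCreate(String childName) {
        DirSketch child = children.get(childName);
        if (child == null) {
            child = new DirSketch(childName);
            children.put(childName, child);
        }
        return child;
    }

    // Walks year/month/day below this directory, creating any missing level.
    DirSketch ensureDayDir(int year, int month, int day) {
        return getOrCreate(String.valueOf(year))
                .getOrCreate(String.valueOf(month))
                .getOrCreate(String.valueOf(day));
    }
}
```

Calling ensureDayDir twice with the same date returns the same directory object, so running the crawl again does not create duplicate directories in this sketch.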

Getting the children of a directory

Getting the children of a directory is easy but requires some background. If you are familiar with the repository configuration options, you will know that jLibrary repositories can be lazy or non-lazy. When you fetch a non-lazy repository, the returned Repository instance contains the information for all of its nodes. When you fetch a lazy repository, it contains only the first level of information, that is, the root node.

Handling non-lazy repositories is trivial: you already have all the information, so to get the children of a node you only have to call the getNodes() method of the Node class.

Handling lazy repositories is a bit different. Every node has a hasChildren() method that tells you whether the node has children. You have to combine this method with getNodes() to decide whether the node's children still need to be fetched. Look at the existsChild sample method:

      private boolean existsChild(Ticket ticket,
                                  RepositoryService rs, 
                                  Directory parent,
                                  String childName) throws Exception {
		
[1]       if (((parent.getNodes() == null) || (parent.getNodes().size() == 0))&& (parent.hasChildren())) {
              // This is a lazy node
[2]           Collection child = rs.findNodeChildren(ticket, parent.getId());
              parent.setNodes(new HashSet(child));
          }
          if (parent.getNodes() != null) {
              Iterator it = parent.getNodes().iterator();
              while (it.hasNext()) {
                  Node node = (Node) it.next();
                  if (node.getName().equals(childName)) {
                      return true;
                  }
              }
          }
          return false;
      }
  • [1] Here we use the getNodes() and hasChildren() methods to check whether the node's children have already been loaded.

  • [2] If they have not been loaded, we use the findNodeChildren method of the repository service to obtain the node's children. This method is very simple and takes only a ticket and the id of the parent directory.
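
The load-on-first-use idea behind existsChild can be tried in isolation with the sketch below. It does not use the jLibrary API: LazyNodeSketch is a hypothetical class, and the loader function is an assumed stand-in for a remote call such as rs.findNodeChildren. The point it illustrates is that the children are fetched at most once and later lookups are answered from the local copy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for a lazily loaded node; not a jLibrary class.
final class LazyNodeSketch {
    final String id;
    final boolean hasChildren;        // known up front, like hasChildren()
    private List<String> childNames;  // null until the children are fetched

    LazyNodeSketch(String id, boolean hasChildren) {
        this.id = id;
        this.hasChildren = hasChildren;
    }

    // Fetches the children on first use only, then answers from the local copy.
    boolean existsChild(String childName, Function<String, List<String>> loader) {
        if (childNames == null) {
            childNames = new ArrayList<String>();
            if (hasChildren) {
                // The loader plays the role of the remote lookup.
                childNames.addAll(loader.apply(id));
            }
        }
        return childNames.contains(childName);
    }
}
```

A node whose hasChildren flag is false never triggers the loader, which is what saves a remote call for empty directories.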

Building a document

To create a document, the first thing you have to do is build a Document instance with all its attributes. This class holds all the information about the document. Building a Document instance is straightforward if you already have the data:

      private static DocumentProperties createDocumentProperties(
          Ticket ticket,Repository repository, Directory parent, BOEFile boeFile)
              throws Exception {
 
[1]       PDFExtractor extractor = new PDFExtractor();
          ByteArrayInputStream bais = new ByteArrayInputStream(boeFile.getContent());
          HeaderMetaData hmd = extractor.extractHeader(bais);
          bais.close();
		
[2]       DocumentMetaData metadata = new DocumentMetaData();
          metadata.setAuthor(Author.UNKNOWN);
          metadata.setKeywords(hmd.getKeywords() !=  null ? hmd.getKeywords() : "boe");
          metadata.setLanguage(hmd.getLanguage() != null ? hmd.getLanguage() : "es_ES");
          metadata.setTitle(hmd.getTitle() != null ? hmd.getTitle() : boeFile.getName());
          metadata.setUrl(boeFile.getURL());
          metadata.setDate(new Date());
				
[3]       Document document = new Document();
          document.setTypecode(Types.getTypeForFile(boeFile.getName()));
          document.setName(boeFile.getName());
          document.setDescription(hmd.getDescription() != null ? hmd.getDescription() : metadata.getTitle());
          document.setExternal(false);
          document.setMetaData(metadata);
          document.setNodes(new TreeSet());
          document.setNotes(new HashSet());
          document.setParent(parent.getId());
          document.setReference(false);
          document.setRepository(parent.getRepository());
          document.setImportance(Node.IMPORTANCE_MEDIUM);
          document.setCreator(ticket.getUser().getId());
          document.setSize(new BigDecimal(boeFile.getContent().length));
          document.setDate(new Date());
          document.setPath("/" + boeFile.getName());

[4]       Category unknownCategory = findUnknownCategory(repository);		
          DocumentProperties docProperties = document.dumpProperties();
          docProperties.addProperty(DocumentProperties.DOCUMENT_ADD_CATEGORY,
              unknownCategory.getId());

[5]       docProperties.addProperty(DocumentProperties.DOCUMENT_CONTENT, boeFile.getContent());

          return docProperties;
      }
  • [1] In this example I use one of the jLibrary text filters to extract the document metadata automatically. jLibrary has several text filters and has contributed them to the Apache Jackrabbit project. You can find them in the org.jlibrary.core.search.extraction package.

  • [2] One of the attributes of a document is its metadata. Here we fill the metadata fields with the values extracted in the previous step. Because the PDF extractor returns null when it cannot find a value, we must supply defaults.

  • [3] Now it is time to create the Document instance and set all the required fields.

  • [4] One of the required fields is a category for the document. Here I set the unknown category. Take a look at the example source code to see how to find that category; it is very simple. Note that we could also have looked up the category using the repository service methods.

  • [5] Finally, we add the binary content to the properties. Note that this property is not mandatory. This is very important: if you only want to update a document's name, you do not need to send the whole document content through a remote call.
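
The null checks in step [2] all follow one pattern: use the extracted value if there is one, otherwise fall back to a default. A small helper makes that pattern explicit. The sketch below is only an illustration; MetadataDefaults and orDefault are hypothetical names that are not part of jLibrary, and the example source keeps the inline ?: expressions.

```java
// Hypothetical helper; the article's code uses inline ?: expressions instead.
final class MetadataDefaults {
    private MetadataDefaults() {
    }

    // Returns the extracted value, or the fallback when extraction yielded null.
    static <T> T orDefault(T extracted, T fallback) {
        return extracted != null ? extracted : fallback;
    }

    public static void main(String[] args) {
        String extractedKeywords = null; // simulates a PDF without keywords
        System.out.println(orDefault(extractedKeywords, "boe")); // prints "boe"
    }
}
```

With such a helper, a line like the keywords assignment could read metadata.setKeywords(orDefault(hmd.getKeywords(), "boe")).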

Creating a document

Once you have the document properties, creating the document is as easy as creating a directory:

      DocumentProperties properties = createDocumentProperties(ticket, repository, dayDir, boeFile);
[1]   Document document = rs.createDocument(ticket, properties);
[2]   dayDir.getNodes().add(document);
  • [1] The createDocument method of the repository service creates a document. It is really simple and takes only the ticket and the document properties.
  • [2] Once a document has been created, it should be added to the local directory instance that was created previously.

Disconnecting from a repository

Finally, when you are done with the repository you should always close your connection to release resources. This is as simple as calling the disconnect method of the security service:

      // Logout
      ss.disconnect(ticket);
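
One way to make sure the disconnect call always runs, even when the crawl fails halfway, is to wrap the work in try/finally. The sketch below shows the pattern with a hypothetical SessionSketch class that merely records calls. It is not the jLibrary SecurityService, and its login and disconnect signatures are simplified assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a session-based service; not the jLibrary SecurityService.
final class SessionSketch {
    final List<String> log = new ArrayList<String>();

    // Pretends to log in and returns a made-up ticket string.
    String login(String repositoryName) {
        log.add("login " + repositoryName);
        return "ticket-for-" + repositoryName;
    }

    void disconnect(String ticket) {
        log.add("disconnect " + ticket);
    }

    // Runs the work and always disconnects, even when the work throws.
    void withSession(String repositoryName, Runnable work) {
        String ticket = login(repositoryName);
        try {
            work.run();
        } finally {
            disconnect(ticket);
        }
    }
}
```

In the example program the equivalent would be to place the crawling and storing code inside the try block and ss.disconnect(ticket) inside the finally block.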

Conclusions

If you run the program, you will get a repository with all the content of the Spanish BOE.

[Screenshot of the resulting repository]

As you can see, the jLibrary API is a higher-level interface for working with jLibrary repositories. It abstracts away the low-level JCR details and lets you work easily with repository data. Note that everything in this example could also be done by hand using the JCR API, for instance in a web application, but for working with jLibrary repositories this is a very productive way of doing things.

You can find more information about this API by browsing the source code, reading the Javadoc, or visiting the forums and mailing lists.

-Martin

 

Copyright © 2004-2006 Martín Pérez Mariñán & others. Created with jLibrary. Design by Andreas Viklund.

Eclipse, Built on Eclipse and Eclipse Ready are trademarks of Eclipse Foundation, Inc.
