From Mainframe to NoSQL – Part 2

So last we left off, I had described a scenario where a prospect had asked how mainframe data (as described by COBOL copybooks) could be ingested into MarkLogic.  I described an approach that combines LegStar (and its ability to translate copybook descriptions into XSD) with MarkLogic and the MarkLogic Java API.  Because I was using a prospect’s proprietary copybook, I didn’t have sharable samples at the time.  Now, thanks to some custom copybook creation, I finally have something more concrete to share.

In the last post I also alluded to not only creating a mocked-up copybook but also needing to find an editor that would allow me to work with EBCDIC data.  After searching for something native for Mac, I had a small eureka moment and realized that I could use the same tools that I had used for the EBCDIC to ASCII translation in the first place (i.e. LegStar) to simply go in the other direction.  So instead of editing things in an EBCDIC editor, I was able to create a quick and dirty conversion program, and the full example download also includes that sample program for generating the EBCDIC data.
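For the plain-text fields at least, going “in the other direction” doesn’t need anything exotic. The sketch below is not the LegStar-based conversion program from the download; it is a minimal stand-in that uses Java’s built-in IBM037 charset. That choice is an assumption: the right EBCDIC code page depends on the mainframe, and while this charset ships with standard JDKs it may be missing from trimmed-down runtimes. The class and method names are hypothetical. It pads a PIC X(n) field with EBCDIC spaces (0x40) and does not handle binary fields such as COMP-3.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class EbcdicFieldSketch {

    // Assumption: code page 037 (US/Canada EBCDIC). Swap in the code page your host actually uses.
    static final Charset EBCDIC = Charset.forName("IBM037");

    // Encode text as a fixed-width PIC X(width) field, right-padded with EBCDIC spaces.
    static byte[] encodeField(String text, int width) {
        byte[] encoded = text.getBytes(EBCDIC);
        if (encoded.length > width) {
            throw new IllegalArgumentException("'" + text + "' does not fit in PIC X(" + width + ")");
        }
        byte[] field = new byte[width];
        Arrays.fill(field, (byte) 0x40);                 // 0x40 is the EBCDIC space
        System.arraycopy(encoded, 0, field, 0, encoded.length);
        return field;
    }

    // Decode a fixed-width field back to a Java String, dropping the trailing padding.
    static String decodeField(byte[] field) {
        return new String(field, EBCDIC).replaceAll("\\s+$", "");
    }

    public static void main(String[] args) {
        byte[] lastName = encodeField("SMITH", 15);      // e.g. CUST-LAST-NAME PIC X(15)
        System.out.println(lastName.length + " bytes, decodes to " + decodeField(lastName));
    }
}
```

A real generator still has to deal with the packed CUST-ID field and lay the fields out in copybook order, which is where a copybook-aware tool earns its keep.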

On to the working example.  For this, let’s start with the prerequisites:

  1. If you don’t already have a copy, download the developer version of MarkLogic from our developer site. Installation steps are platform-specific but in all cases are very straightforward.
  2. If you don’t already have one set up, create a REST API instance (and corresponding database) inside of your local MarkLogic installation as per these instructions.
  3. Also if you don’t already have a copy, download Eclipse as well. The standard version is fine.
  4. After Eclipse is installed and set up, the next thing to do is install the LegStar plug-in for Eclipse.  This is done by going to Help -> Install New Software in Eclipse and entering http://www.legsem.com/legstar/eclipse/update as the URL in the text box on the pop-up window (on my version the text box is titled “Work with:”).  Once entered, the “Legstar Eclipse plugins” should show up for selection.  Select them all and click “Finish”.
  5. Finally (for now), create a new Java project (File -> New -> Java Project). Give it any name you like.

Once that’s done, the basics are in place and it’s time to look at some mainframe data and associated structure.  We will start with a sample copybook as follows:

01 CUST-RECORD.
     05 CUST-ID PIC 9(5) COMP-3.
     05 CUST-NAME.
         10 CUST-LAST-NAME PIC X(15).
         10 CUST-FIRST-NAME PIC X(10).
     05 STREET-ADDRESS PIC X(20).
     05 CITY PIC X(20).
     05 US-STATE PIC X(02).
     05 OTHER-STATE-PROVINCE PIC X(20).
     05 COUNTRY-CODE PIC X(3).
     05 POSTAL-CODE PIC X(10).
     05 NOTES PIC X(40).

The above example is a simple one, contrived to show a few things that map well to XML, namely:

  • A hierarchy. This is evidenced in the CUST-NAME grouping example.
  • A diversity of data types (OK I threw in a number amongst the text but you get the idea).
  • Fields which may or may not contain values (US-STATE and OTHER-STATE-PROVINCE).
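To put numbers on the layout, here is a small sketch (not part of the download, and not how LegStar does it internally) that works out the record length implied by the copybook and shows how a COMP-3 field like CUST-ID is stored. The class and method names are made up for illustration, and it assumes the common packed-decimal convention: one digit per nibble, with the sign in the last nibble (0xF for unsigned, 0xC positive, 0xD negative).

```java
import java.util.Arrays;

class PackedDecimalSketch {

    // A COMP-3 field needs one nibble per digit plus one sign nibble, rounded up to whole bytes.
    static int packedLength(int digits) {
        return (digits + 2) / 2;
    }

    // Pack an unsigned value into COMP-3 form (sign nibble 0xF), left-padded with zero digits.
    static byte[] pack(long value, int digits) {
        if (value < 0 || Long.toString(value).length() > digits) {
            throw new IllegalArgumentException("value does not fit in PIC 9(" + digits + ")");
        }
        byte[] out = new byte[packedLength(digits)];
        int lastNibble = out.length * 2 - 1;
        setNibble(out, lastNibble, 0xF);                  // sign nibble
        long remaining = value;
        for (int n = lastNibble - 1; n >= 0; n--) {       // fill digits right to left
            setNibble(out, n, (int) (remaining % 10));
            remaining /= 10;
        }
        return out;
    }

    // Unpack a COMP-3 field; a final nibble of 0xD marks a negative value.
    static long unpack(byte[] packed) {
        long value = 0;
        for (int i = 0; i < packed.length; i++) {
            int hi = (packed[i] >> 4) & 0x0F;
            int lo = packed[i] & 0x0F;
            value = value * 10 + hi;
            if (i < packed.length - 1) {
                value = value * 10 + lo;
            } else if (lo == 0x0D) {
                value = -value;
            }
        }
        return value;
    }

    // Record length for CUST-RECORD: the packed CUST-ID plus the PIC X widths in copybook order.
    static int custRecordLength() {
        int[] textFieldWidths = {15, 10, 20, 20, 2, 20, 3, 10, 40};
        return packedLength(5) + Arrays.stream(textFieldWidths).sum();
    }

    private static void setNibble(byte[] buf, int index, int nibble) {
        int shift = (index % 2 == 0) ? 4 : 0;             // even index = high nibble
        buf[index / 2] |= (byte) (nibble << shift);
    }

    public static void main(String[] args) {
        byte[] packed = pack(12345, 5);
        System.out.printf("%02X %02X %02X%n", packed[0], packed[1], packed[2]);
        System.out.println(unpack(packed));
        System.out.println(custRecordLength());
    }
}
```

By this arithmetic CUST-ID occupies 3 bytes and the whole record 143 bytes (3 packed plus 140 of text).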

For the last point, I should note that there are better and more complex ways to do this with COBOL copybooks, some of which were actually used in the customer example that I could not share (a REDEFINES clause, for one). Additionally, there are other concepts that map well to XML, such as repeating groups (an OCCURS clause in COBOL), that were also used in the customer example.  However, since the scope of this post is not so much to teach COBOL (and I would be a terrible teacher in this regard anyway) but simply to demonstrate some interoperability, I kept things simple.

So now that we have a copybook, the first thing to do is generate an XSD from it.  Assuming Eclipse and LegStar are installed correctly, this is accomplished by choosing “New structures mapping” from the LegStar menu as follows:

CreateXSD-1

A popup will then appear where you would choose the XSD name as follows:

SelectXSD

Finally, the copybook mapping itself is specified:

PasteCopybook

The last step can be accomplished either by pasting in the COBOL fragment (as illustrated) or by selecting the copybook file. The result will be an XSD, created in the Java project root directory, which can be browsed (courtesy of the XSD viewer) by simply double-clicking on the file.  The resulting viewer window should look as follows:

viewXSD

The next step is to generate the transformers.  One of the features of the LegStar package is the ability to create data transformers from a LegStar-generated XSD.  Doing this simply involves a few more clicks in Eclipse as follows:

Generate1

Generate2

You’ll then notice that your project has been populated with a number of generated classes, consisting of a few POJOs and many transformation-related classes.  To finish though, you still have to write some code yourself, but not that much.  At a high level, the code involves doing the following:

  • Open the EBCDIC file for reading
  • At each record, transform into XML
  • Write the XML to the MarkLogic database

The main code loop is below:

// Accumulates the bytes of the current record until the record delimiter is seen
List<Byte> bytes = new ArrayList<Byte>();
FileInputStream fis = new FileInputStream(fileName);
int nextByte = -1;

int rec=0;
while ((nextByte=fis.read()) != -1)
{
	if (nextByte != EBCDIC_LF)
	{
		bytes.add((byte)nextByte);
	}
	else
	{
		System.out.println("Found newline, processing record "+rec+"\n\t");

		// Load into a byte array
		int i=0;
		byte[] barray=new byte[bytes.size()];
		for(Byte b:bytes) barray[i++] = b.byteValue();

		StringWriter sw = new StringWriter();
		// Get the XML version of the record according to the copy-book
		xmlTransform.toXml(barray, sw);
		System.out.println("\t"+sw);

		// Get a POJO representation to get the ID easily - but a POJO might also be used for persistence (perhaps in another example...)
		CustRecord cr = objectTransform.toJava(barray);
		long custId = cr.getCustId();

		StringHandle sh = new StringHandle(sw.toString());
		// Persist the XML document into the DB, giving it a URI based on the customer ID in the record
		docMgr.write("/mainframe/customer_"+custId+".xml", sh);
		System.out.println("Record "+rec+" written to the database");
		rec++;
		bytes.clear();
	}
}
fis.close(); // release the file handle once all records have been processed

In addition to the core logic of transforming each record to an XML document and loading it into the DB, there are some other necessary code items, such as preparing the EBCDIC byte array and getting a POJO representation of a record to pull out the ID used for constructing the URIs of the respective XML documents. In fact, if all of the JAXB annotations were present in the LegStar-generated classes, we could have skipped the step of creating the XML representation of each record and simply fed the POJO to MarkLogic using MarkLogic’s JAXBHandle. However, for illustrative purposes and so as not to muck with the generated POJO classes, the above example suffices.
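One thing to watch in the loop above: records are split on a delimiter byte, but a binary field such as the COMP-3 CUST-ID can legitimately contain that same byte value. Assuming EBCDIC_LF is 0x25 (the usual EBCDIC line feed; the constant’s actual value isn’t shown in this post), a customer ID of 25000 would pack to 0x25 0x00 0x0F and cut the record short. Since the copybook describes a fixed-length record (which works out to 143 bytes: 3 for the packed ID plus 140 of text), a more defensive option is to read by length instead. The helper below is a hypothetical sketch of that idea, not code from the download; the class name, method name and the delimiterLength parameter are all made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class FixedLengthReaderSketch {

    // Reads records of exactly recordLength bytes, skipping delimiterLength bytes after each one.
    // Pass delimiterLength = 0 if the file has no separators between records.
    static List<byte[]> readRecords(InputStream in, int recordLength, int delimiterLength) {
        List<byte[]> records = new ArrayList<>();
        DataInputStream data = new DataInputStream(in);
        try {
            int first;
            while ((first = data.read()) != -1) {             // -1 here means a clean end of file
                byte[] record = new byte[recordLength];
                record[0] = (byte) first;
                data.readFully(record, 1, recordLength - 1);  // throws EOFException on a short record
                records.add(record);
                for (int i = 0; i < delimiterLength; i++) {   // tolerate a missing final delimiter
                    if (data.read() == -1) {
                        break;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException("Truncated or unreadable record", e);
        }
        return records;
    }

    public static void main(String[] args) {
        // Two tiny 4-byte "records", each followed by a 0x25 delimiter; the first one starts with 0x25.
        byte[] sample = {0x25, 0x00, 0x0F, 0x40, 0x25, 0x12, 0x34, 0x5F, 0x40, 0x25};
        List<byte[]> records = readRecords(new ByteArrayInputStream(sample), 4, 1);
        System.out.println(records.size() + " records read");
    }
}
```

Each byte array returned this way could be handed to the same toXml and toJava calls used in the loop above; for the sample copybook the call would be readRecords(fis, 143, 1), assuming a single delimiter byte follows each record.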

Outside of the code loop there are housekeeping items including creating the transformation object and creating the MarkLogic DatabaseClient and XMLDocumentManager objects (i.e. the connection to the DB) as in the snippet below:

// MarkLogic connection
DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8072, "admin", "admin", Authentication.DIGEST);
XMLDocumentManager docMgr = client.newXMLDocumentManager();

// XML transformer for converting copybook-described data to XML
CustRecordXmlTransformers xmlTransform = new CustRecordXmlTransformers();

// A POJO transformer (used to pull the object ID easily)
CustRecordTransformers objectTransform = new CustRecordTransformers();

And aside from your try/catch blocks, helper variable declarations, etc. that’s pretty much what’s needed to get things done.

So what does the data look like once it’s been loaded?  For that we go to MarkLogic’s query console to check the contents of the database (by clicking on the query console explore button). A screen shot is below:

explore-db

If we then click on the 3rd record, we can see the XML in all of its glory:

record-3

The result is nicely formatted XML, without the need to pre-create any models in the DB.  Pretty cool stuff (again, if you’re a geek).

For the full code example, including sample copybooks and sample data (plus the utility program to generate the EBCDIC data), you may download it from here.

For any questions, ping me on Twitter @kenkrupa
