MJ ETL Strategies
The paramount aspect of MJ's Extract, Transform and Load process is determining how to best map MAPPER data reports to entities.
For example, several MAPPER reports may comprise the data set for a single entity, where each MAPPER report (or range of MAPPER reports) represents a grouping, category or "partition" of data. Grouping or categorization of data by location may imply an attribute common to data stored at the location. In an inventory system, storage might be arranged so all items in reports 20 to 30 of drawer B are back ordered.
Mapping the data fields in a MAPPER report to entity attributes is also important. Attributes are stored as columns in the table where entity data is persisted; needless to say, column definition and structure is a key part of a migrated application's data model.
Useful things to consider about MAPPER report data before migrating include:
After migration, how would your organization like to use the data? Is the data a useful input for business intelligence tools? Will the data be exported to another application, or updated from another process?
Data that is properly typed and stored is easer to share, exchange, integrate and calculate. This means storing calendar information such as shipping date as a date type and not a string, and numbers like quantity and price as integer and decimal (respectively), and not as characters.
Similarly, discrete, enumerated values such as status codes and indicators should identified as such and stored appropriately. In conjunction with Hibernate, MJ provides support for such enumerated data.
How does the MAPPER application currently use the data? For example, does the application mix date, time and status text in a single, custom formatted field? In such a case, it may be easier to continue to treat the data as string even though it contains chronological information.
Determine if the line type (leftmost character on a line) identifies a distinct entity type, which is typical, or marks additional attributes of an existing entity.
In a MAPPER report, lines containing column-formatted fields usually begin with a tab character and belong to the primary entity the MAPPER report represents. However, trailer lines sometimes follow a tab line that specify an entirely different entity. For example, in the sample MAPPER report Footwear Orders - Information (RID 12D3954), trailer asterisk lines contain the customer information for an order item. MJ considers entity data embedded in trailer lines as inline.
For backward compatibility in a migrated application, the line type may be preserved in the entity as a "data origin" attribute. Since the tab character isn't the best discriminator in a modern relational table, this attribute is usually enumerated for clarity.
Identify an entity's business key and foreign key(s). A business key could be an order ID, part number or customer name, or a combination of fields. MJ introduces a primary surrogate (synthetic) key for each entity, but only business keys impart meaningful identities, relationships and constraints to the data.
Consider how an entity is accessed in the MAPPER application, such as
@BFN
or@HSH
. MJ makes special provisions to support binary search and hash lookup, but such requirements should be known in advance.Explore an iterative approach to refine entity and attributes characteristics as part of MJ's ETL development process. For example, it may make sense to initially define attributes as string types, allowing a sample of data to be loaded and analyzed using SQL or other relational tools. Canonical data types can then be determined, specified in the Spring configuration and the ETL process re-run.