Documentation

No results
    gitHub

    Reverse-engineering

    The reverse-engineering function is used to import information into pre-existing or empty data models.  Reverse engineering is a useful feature to facilitate the modeling and documentation of existing instances.  After the reverse engineering has been performed, the user can enrich the model created, by filling descriptions, recording constraints that were not obvious from the stored data, documenting relationships, etc…

     

    Hackolade supports the reverse engineer of different types of data sources:

    • a JSON or YAML document: schema inference of JSON or YAML documents into a Hackolade Studio data model for any target
    • a JSON Schema or YAML Schema: conversion of JSON Schema files into a Hackolade Studio data model for any target
    • a Data Definition Language file (DDL): from Oracle, Microsoft SQL Server, MySQL, PostgreSQL, Hadoop Hive, Snowflake, Teradata
    • an XSD schema file from another ER tool, such as erwin, ER/Studio, PowerDesigner, or other
    • an Excel file template: first export a Hackolade model (even an empty one) to generate an Excel file for the target of your choice.  Then you may bulk edit your model or create a new one before ingesting it back into the application.
    • target-specific instances

     

    This feature is also available via the Command-Line Interface.

    Reverse-engineering process

    This process includes a series of steps outlined below.  After going through the steps, the user is free to enrich the model for documentation purposes, and to generate forward-engineering scripts.

     

    The time taken to reverse-engineer a MongoDB instance depends on several factors:

    • sample size defined in Tools > Options > Reverse-Engineering > Record Sampling
    • number of collections
    • size of the respective sampled documents

    The process will stop sampling if the total size of the sample reaches 500 megabytes.

    If your documents are fairly heterogeneous, a high sampling size will only increase processing time with no added value.  So you may want to experiment by starting with a really low sampling size, for example just 1 document per collection, and make sure that it all works fine.  Then gradually increase to 10, 50, 100, 250, 500, etc...  until you find the acceptable balance between processing time and a representative sample.

    You can also see (and send us if the necessary) the log file by going to Help > Access Application Logs > HackoladeRE.log 

    Choose the model target

    Create a new model and select the target of your choice.  The reverse-engineering process is fairly similar for all targets, with local nuances when needed.  But the terminology and connection protocols can be very different.

    Choose the reverse-engineering function

    This screen will vary depending on the selected target.

    Tools  Reverse-Engineer  MongoDB Collection

     

    Establishing a connection

    The options here will vary depending on the selected target.  Refer to the pages below to find the specific details.  The images below refer to MongoDB.

    Reverse-Engineering Connection Settings

     

    TIP:  note, if you have instances with a very large number of databases, you may wish to declare a particular database directly in the connection string by appending the address with: /<name of the database you wish to access directly>, for example: localhost/NBA

     

    Reverse-Engineering Connection Selection

     

    For more information on the various authentication and encryption protocols, please consult these pages.

     

    Selecting one or more collections/entities

    Reverse-Engineer MongoDB Collection selection

     

    Sampling of one or more collections/entities

    Sampling is currently limited for performance reasons to at most 10.000 documents, adjustable via a user parameter in Tools > Options > Reverse-Engineering, or up to 100k documents with the Command-Line Interface

     

    Since MongoDB 3.2, collections are sampled with the $sample operator via the /core/aggregation-pipeline. This provides efficient random sampling without replacement over the entire collection, or over the subset of documents specified by a query.

     

    In MongoDB 3.0 and 2.6, collections are sampled via a backwards-compatible algorithm. It takes place in three stages:

    - Hackolade opens a cursor on the desired collection, sorted in descending order of the _id field.

    - sampleSize documents are randomly selected from the stream, using reservoir sampling.

    - Hackolade performs a query to select the chosen documents directly via _id.

     

    Note that this sampling mechanism is originally provided in Open Source by MongoDB, and adapted to the specific needs of Hackolade.

     

    Probabilistic inference of schema

    From the sample set gathered above, measurable metrics can be applied to the schema, including schema depth and width, and class interactions, as described here:

    RevEng schema inference

     

    Note that this inference mechanism is originally provided in Open Source by MongoDB, and adapted to the specific needs of Hackolade.

     

    In this step, when the instance being reverse-engineered is v3.2 and above, stored validator setup is leveraged as well, using the db.getCollectionInfos() method 

     

    Special cases

    Pattern field detection

    As of v2.5.0 of Hackolade, we've also introduced the ability to detect pattern fields during the reverse-engineering process.  For any sub-object at a given level, we can infer the presence of a pattern field if we detect the recurrence of similar regex patterns in field names.

    Reverse-Engineering pattern field detection 

     

    This pattern field recognition was further enhanced in v4.2.3 with the following logic:

    a) attribute must be a complex data type (object or array)

    b) in a list of successive complex attributes in a branch of the schema hierarchy, the structure of each of these complex attributes must be exactly identical 

    c) the names of these successive complex attributes with identical structure must be similar as per either:

    - an ID-like structure of the same length of at least 12 characters that can only include hexadecimal numbers and dashes;

    - or by by using Levenshtein distance. The distance parameter may be modified in Tools > Options > Reverse-Engineering (default = 1)  An online Levenshtein distance calculator can be found here.

     

    Inference of Foreign Key relationships

    With v4.2.13, the option was introduced to attempt foreign key relationship inference.  This features is only available for the MongoDB target, and is for documentation purposes only, as Foreign Key relationships are not enforced by the database.  For each field in a collection with the data type ObjectID, the application finds if the sampled value can be linked to a document in the same collection (for a recursive FK) or another collection in the database.  The cardinality is set by default to "0...n" on the child side, but can be adjusted manually by the user.

     

    Transformation into a Hackolade collection schema

    In this final step, the derived schema needs to be converted and persisted into the Hackolade notation, so it can be visualized, adapted and enriched in the application.

     

    Conflict resolution during merge into an existing model

    When reverse-engineering into an existing model, it is possible that source objects already match existing objects in the model.  This could be container names and/or entity names.  When the process detects such case, the user is asked to decide on the logic to be used during processing:

     

    Conflict resolution dialog

     

    Keep both: leaves the objects in the mode untouched, and adds new objects in the model

    Replace: overwrites the existing objects with the new ones

    Merge: combines the new objects with the existing objects in the model.  New detected attributes will be added.  Existing properties of attributes are maintained.

    Cancel: abort process

     

    By checking the option "Apply to all conflicts", the user tells the process to apply the same choice to all subsequently detected conflicts, so the question is not asked again.

     

    The feature is also available as an argument in revEng Command-Line Interface commands, but can only "apply to all conflicts".