Mapping Unstructured Data: How-to
When creating an ODBC driver, it is essential to convert data from the data source into a relational format. This is necessary because ODBC and SQL require and operate on data that is organized with a fixed schema and returned in a tabular structure. By utilizing the SimbaEngine SDK, you can efficiently map your data into a relational format, enabling you to build an ODBC driver compatible with various application vendors, including Power BI, Tableau, Qlik Sense, Alteryx, Informatica, MicroStrategy, Logi Symphony, Excel, and more. This post will explore a few scenarios and demonstrate how each can be mapped into a relational format.
A straightforward example is data stored in CSV files. In this scenario, let’s assume that each row in the CSV file has the same number of columns. An example of such data might look like this:
Row 1,3.450,Data Value 1,1999-09-08 Row 2,6.789,Data Value 2,2000-01-23
In this example, it’s easy to map to a relational model because the data is already represented relationally. The only difficulty is determining the metadata for the columns in the CSV. You could do a few things: map the columns to string, create another CSV file specifically for holding metadata about the other CSV files, create another file (such as XML) that is associated with each CSV file and contains the metadata about that file, or put in a header line containing the metadata to name a few. The only important thing is that you can create code which exposes this data and metadata.
A more complex example involves using a JSON data source. There are several ways to map this type of data. Consider the following (contrived) data sample:
{
"order": {
"id": "1",
"value": "customer1",
"receipt": {
"items": [
{
"value": "item1",
"price": "5.55"
},
{
"value": "item2",
"price": "4.22"
}
]
}
},
"order": {
"id": "2",
"value": "customer2",
"receipt": {
"items": [
{
"value": "item3",
"price": "3.45"
}
]
}
}
}
Notice that in the second JSON document, there is one element in the “items” array, while there are two in the first. To map this data to a relational model, there are two different approaches. In both cases, it is essential to know the complete set of fields present in all the JSON documents. This can be achieved by having the user enter them manually, have the driver sample the available documents (perhaps scanning through the first N documents), or by using a combination of these methods.
Once you have identified the complete set of fields, you can begin mapping this data to a relational model. A straightforward approach is to flatten the data into columns, with each element in the array becoming a separate column. This way, the data would be represented as:
| id | value | receipt.items[0].value | receipt.items[0].price | receipt.items[1].value | receipt.items[0].price |
|---|---|---|---|---|---|
| 1 | customer1 | item1 | 5.55 | item2 | 4.22 |
| 2 | customer2 | item3 | 3.45 | <NULL> | <NULL> |
In the second row, there is no second element in the array, so NULL values are used instead when returning the relational data. Similarly, if a field appears in one document but not in another, NULL will be substituted for the missing field. While there are other approaches for mapping this data, we will leave those as an exercise for the reader.
A third and final example would be taking a key-value store and mapping that to a relational model. There are many different types of key-value stores, with many different subtleties, and it would make for a very long post if we were to consider all of them. Instead, let’s imagine a hypothetical key-value store which follows the example below:
key1
value1
value2
child2
key2
value3
In this case we have two “master” keys (key1 and key2), where key1 has two values (value1 and value2), and value2 also has a child (child2). key2 only has one value (value3) with no other children. A simple mapping of this would be to treat the keys as columns, their immediate values as columns, and any subsequent children as columns as well. Applying this to the above example would then result in the following mapping:
| key | firstLevel | secondLevel |
|---|---|---|
| key1 | value1 | <NULL> |
| key1 | value2 | child2 |
| key2 | value3 | <NULL> |
Note that the column names have been synthesized and the types are not specified for this example. It would be easy to adapt the strategies employed in the previous example, which is also left as an exercise for the reader.
There are many other ways to represent this data relationally and numerous other data models that can be mapped. Over the years, we have encountered a variety of data types and successfully mapped all of them into sensible relational representations. Once this mapping is complete, it takes just five days to have your read-only ODBC driver up and running, connecting to the application vendor of your choice!