Whenever we speak of big data we immediately speak about structured, partially structured and unstructured data. In this blog I will examine this broad categorization of data and attempt to provide a finer description of the categories.
I will use the following dimensions to give a finer description of the categories:
- Availability of meta-data that describes the data
- Semantics of the data
- Format of the data
- Content of the data
It is important to have a good understanding of these characteristics of the data, as this helps in understanding what information we need to get out of the data and how we can use the extracted information.
Structured Data
This category of data has well-defined meta-data describing the content, and the content is primarily scalar. I use scalar in a loose sense to mean that the data content can easily be mapped into an appropriately typed data structure and can be manipulated (e.g. aggregated) and visualized. The classic example is relational data. Typically, the semantics of the data content are established by the context of the data, for example sales data, customer data, service data or order data.
The format of structured data is typically tabular, JSON or XML.
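To make the loose notion of "scalar" concrete, here is a minimal Python sketch. It assumes a small, made-up sales table in CSV form; the column names (`region`, `units`, `price`) and the `SaleRecord` type are hypothetical and chosen only for illustration. The header row plays the role of the meta-data, each row maps directly onto a typed record, and the records can then be aggregated.

```python
import csv
import io
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical sales data; the columns are illustrative only.
RAW_CSV = """region,units,price
east,3,10.0
west,2,20.0
east,1,10.0
"""


@dataclass
class SaleRecord:
    region: str
    units: int
    price: float


def load_sales(text: str) -> list[SaleRecord]:
    """Map each CSV row onto a typed record, using the header as meta-data."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        SaleRecord(row["region"], int(row["units"]), float(row["price"]))
        for row in reader
    ]


def revenue_by_region(sales: list[SaleRecord]) -> dict[str, float]:
    """Aggregate revenue (units * price) per region."""
    totals: dict[str, float] = defaultdict(float)
    for sale in sales:
        totals[sale.region] += sale.units * sale.price
    return dict(totals)


sales = load_sales(RAW_CSV)
totals = revenue_by_region(sales)
print(totals)
```

Because every value has an obvious type, no interpretation step is needed between loading the data and aggregating it, which is what sets this category apart from the two that follow.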
Partially Structured Data
This category of data typically has either explicit or implicit meta-data describing the content, and the content can be a combination of scalar values and free-form text. One example is relational data with a string column that holds user comments, such as a service request transaction with a column for user comments. Another is log data or machine-generated data, which has implicit meta-data and whose content typically combines scalar values (for example IP address, date) with complex text strings encoding multiple pieces of key information (for example browser and OS info). Typically, the semantics of the data content are established by the context of the data.
The format of partially structured data is mostly tabular, with JSON and XML in the mix.
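The log data case can be sketched in Python as follows. This is only an illustration under stated assumptions: the log line is made up, its layout is assumed to follow the widely used "combined" web access log layout, and the OS detection is a deliberately crude heuristic. The names `LOG_PATTERN` and `parse_line` are hypothetical.

```python
import re
from datetime import datetime

# A made-up access log line; the "combined" layout is an assumption.
LOG_LINE = (
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/118.0"'
)

# The implicit meta-data (field order and meaning) is encoded in the pattern.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line: str) -> dict:
    """Split one log line into typed scalar fields plus the raw agent text."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError("line does not match the assumed layout")
    fields = match.groupdict()
    # Scalar values map straight onto typed values.
    fields["ts"] = datetime.strptime(fields["ts"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # The agent string packs several facts into one text field; this
    # substring check is a crude heuristic, not a real user-agent parser.
    agent = fields["agent"]
    if "Windows" in agent:
        fields["os"] = "Windows"
    elif "Linux" in agent:
        fields["os"] = "Linux"
    else:
        fields["os"] = "other"
    return fields


record = parse_line(LOG_LINE)
print(record["ip"], record["status"], record["os"])
```

The IP address, timestamp and status code behave like structured data once parsed, while the agent string still needs its own interpretation step. That mix is what makes the data partially structured.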
Unstructured Data
This category of data either has meta-data (implicit or explicit) describing content that can be complex (for example a Word or PDF document, video or music), or has no meta-data at all (for example a plain text file or music).
In the former case the content is mainly non-scalar and can include text, binary, semi-structured and structured content. This is the most complex of all the categories, since the data often has a complex format. If meta-data about the unstructured data is available, the first step is to use it to identify the sub-categories and then focus on each sub-category in turn. If there is no meta-data, extraction requires proprietary knowledge of the data and can be complex.
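As a rough illustration of using meta-data first, the Python sketch below treats the file extension as implicit meta-data and groups files into sub-categories by the top-level part of their guessed MIME type. The file names are made up and the helper names are hypothetical. Real documents usually carry richer meta-data than an extension, so this is only a starting point.

```python
import mimetypes
from collections import defaultdict

# Hypothetical file names; only the extensions matter for this sketch.
FILES = ["report.pdf", "notes.txt", "demo.mp4", "song.mp3", "README"]


def sub_category(filename: str) -> str:
    """Return the top-level MIME type, or "unknown" without usable meta-data."""
    mime, _encoding = mimetypes.guess_type(filename)
    if mime is None:
        return "unknown"
    return mime.split("/")[0]


def group_by_sub_category(filenames: list[str]) -> dict[str, list[str]]:
    """Bucket files so each sub-category can be processed on its own."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in filenames:
        groups[sub_category(name)].append(name)
    return dict(groups)


groups = group_by_sub_category(FILES)
for category, names in groups.items():
    print(category, names)
```

Files that land in the "unknown" bucket correspond to the no-meta-data case, where knowledge of the data itself is needed before anything can be extracted.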
In the latter case the content is typically small or large volumes of text, which require manual interpretation or NLP to extract information.
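For plain text with no meta-data, even a very simple text-processing step can surface some information. The Python sketch below counts the most frequent non-trivial words in a made-up user comment. The comment text, the tiny stop-word list and the `top_terms` helper are all hypothetical, and term counting is only a stand-in for a real NLP pipeline.

```python
import re
from collections import Counter

# A made-up free-form comment with no accompanying meta-data.
COMMENT = (
    "The printer jams every morning. "
    "Replacing the tray did not fix the printer."
)

# A deliberately tiny stop-word list, for illustration only.
STOPWORDS = {"the", "did", "not", "every", "a", "an", "of", "to"}


def top_terms(text: str, n: int = 3) -> list[tuple[str, int]]:
    """Return the n most frequent lowercase words that are not stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(token for token in tokens if token not in STOPWORDS)
    return counts.most_common(n)


print(top_terms(COMMENT, 1))  # [('printer', 2)]
```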
Understanding the categories of data and their sources has a profound impact on preparing the data for downstream analysis.