Data Engineering — Day 25

Annamariya Tharayil
Oct 24, 2020
Photo by Mika Baumeister on Unsplash

Today I would like to write about Data Engineering. I will try to cover some of the topics that are part of the Data Engineering process.

After the internet revolution, companies that moved online started generating data. Initially, they did not pay it much attention, as the amount was small. Over time, companies grew in size and many more established an online presence. At some point, companies felt the need to be visible online in order to attract more customers.

Growth of Data

The above image depicts the exponential growth in the amount of data generated over the years.

From where exactly is the data generated?

Data can come from many different sources: images that we share on social media, files uploaded to the internet, blogs that we post, items that we buy from an e-commerce website, and so on. Everything that we do while using a machine becomes part of the data.

How can we store such a massive amount of data?

It is not feasible to store data at this scale in a single traditional database. Apart from its size, the data also comes in many different varieties. How can we manage to store such data? We can use Big Data Technologies, which support not only storage but also analysis, mining, and visualization.

Characteristics of Big Data are as follows:

  1. Velocity → Data flows into the system continuously from multiple sources and is often expected to be processed in real time.
  2. Variety → Big Data can include many kinds of data (images, video, text, etc.).
  3. Volume → As mentioned above, a massive amount of data is expected to be processed.

Big Data Ecosystem

Data engineering involves collecting data from the source, storing it, transforming it into meaningful data, finding trends in it, and visualizing it.

  1. Collecting Data → We can collect data directly from the users of our application. Such data is called “Primary Data”. We can also obtain data from Primary Data providers. Such data is called “Secondary Data”. Data can be collected in batches or as a continuous stream.
  2. Data Storage → When storing data, we need to decide on its format: structured, semi-structured, or unstructured. The format determines which technology we use for storage. We also need to decide whether to store our data in the cloud or to set up our own Hadoop clusters. The ETL process can be part of this stage. ETL consists of three interrelated steps: Extracting data from the source, Transforming it into the required format, and Loading it into the storage area.
  3. Data Mining/Analytics → Data mining involves analyzing data to find trends, patterns, and new information. For example, it can identify the interests of users from their purchase patterns: people who buy chocolate tend to buy ice cream.
  4. Data Visualizations → Data Visualization is a representation of Data in a visual format. It could be graphs, pie charts, etc.
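
To make the ETL and data mining steps a little more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the purchase log `RAW_CSV`, the `purchases` table, and the function names are all made up for this example, and an in-memory SQLite database stands in for a real storage layer such as a cloud warehouse or a Hadoop cluster. The last step counts which pairs of items show up in the same order, which is one simple way to spot a pattern like the chocolate and ice cream one.

```python
import csv
import io
import sqlite3
from collections import Counter
from itertools import combinations

# Hypothetical raw purchase log (order_id, item). In practice the source
# could be a file, an API, or a stream; the values here are made up.
RAW_CSV = """order_id,item
1, Chocolate
1,ice cream
2,chocolate
2,Ice Cream
2,bread
3,bread
3,milk
4,chocolate
4,ice cream
"""


def extract(raw_text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: convert ids to integers and normalise item names."""
    return [(int(r["order_id"]), r["item"].strip().lower()) for r in rows]


def load(records, conn):
    """Load: write the cleaned records into the storage area (SQLite here)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases (order_id INTEGER, item TEXT)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", records)
    conn.commit()


def co_purchase_counts(conn):
    """Mining: count how often each pair of items appears in the same order."""
    baskets = {}
    for order_id, item in conn.execute("SELECT order_id, item FROM purchases"):
        baskets.setdefault(order_id, set()).add(item)
    pairs = Counter()
    for items in baskets.values():
        # Sort so that each pair is always counted under the same key.
        pairs.update(combinations(sorted(items), 2))
    return pairs


conn = sqlite3.connect(":memory:")  # assumption: in-memory stand-in for real storage
load(transform(extract(RAW_CSV)), conn)
pair_counts = co_purchase_counts(conn)
print(pair_counts.most_common(1))  # [(('chocolate', 'ice cream'), 3)]
```

Real data mining would use proper association measures such as support and confidence on far more data, but the flow is the same: extract, transform, load, and then look for patterns.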

I have tried to mention a few concepts of Data Engineering in this post. I will write another post explaining Big Data Technologies in depth some other day.

I would love to hear your feedback about my posts. Do let me know if you have any comments or feedback.
