What is Data Science?
Data can be defined as information representing a qualitative status (surveys and opinion polls for instance) or a quantitative measure of magnitude. Unstructured data comprises text, images, audio and video files whereas structured data is usually in tabular / rectangular form (think Excel files) and comprises columns (called variables) and rows (called observations).
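To make the vocabulary of columns (variables) and rows (observations) concrete, here is a minimal Python sketch that reads a tiny table using only the standard library. The city names and numbers are invented purely for illustration, and the helper name `load_table` is hypothetical.

```python
import csv
import io

# A tiny, made-up table in CSV form: the first line holds the column names
# (variables) and every following line is one row (observation).
RAW_TABLE = """city,visitors,avg_temp_c
Lisbon,1200,21.5
Oslo,800,9.0
Lima,950,19.2
"""

def load_table(text):
    """Parse CSV text into a list of dictionaries, one per observation."""
    return list(csv.DictReader(io.StringIO(text)))

table = load_table(RAW_TABLE)
variables = list(table[0].keys())  # the column names
observations = len(table)          # the number of rows

print("variables:", variables)
print("observations:", observations)
```

Each dictionary in `table` is one observation, and its keys are the variables. Unstructured data such as images or audio has no such ready-made grid, which is why it usually needs extra processing before analysis.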
Data science is a discipline that uses computer programming to analyze data and derive from it understanding, insight and knowledge. A data scientist is then someone who integrates coding with statistical knowledge to extract insights from data.
A data scientist analyzes different sources of data depending on the industry and field of study: geospatial, scientific, financial, political, transportation, and tourism data, to name a few.
It is estimated that 3.5 quintillion bytes of data were created each day in 2023 (a quintillion is a billion billion, or 1 followed by 18 zeros). With the advent of affordable sensors, single-board microcontrollers and computers, the Internet of Things (IoT) has become a major contributor of data. About 73.1 ZB (zettabytes) are expected to be generated by the year 2025 (one zettabyte is equivalent to one trillion gigabytes; for data storage units please see references). Because of this, a new industry has emerged that deals with the storage of massive amounts of data.
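The unit conversions above can be checked with a few lines of arithmetic. This is a small sketch assuming decimal (SI) units, where a gigabyte is 10^9 bytes and a zettabyte is 10^21 bytes; the function names are my own, and the daily figure is simply the one quoted in the text.

```python
# Decimal (SI) storage units expressed in bytes (an assumption of this sketch).
GIGABYTE = 10**9
ZETTABYTE = 10**21
QUINTILLION = 10**18  # a billion billion: 1 followed by 18 zeros

def gigabytes_per_zettabyte():
    """How many gigabytes fit in one zettabyte."""
    return ZETTABYTE // GIGABYTE

def to_zettabytes(n_bytes):
    """Convert a number of bytes to zettabytes."""
    return n_bytes / ZETTABYTE

# 3.5 quintillion bytes per day, written with integers to avoid rounding.
daily_bytes = 35 * 10**17

print(gigabytes_per_zettabyte())   # one trillion
print(to_zettabytes(daily_bytes))  # the daily figure expressed in zettabytes
```

Expressing both figures in the same unit makes them easier to compare than reading "quintillion bytes" and "zettabytes" side by side.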
Depending on the type of data to be stored and subsequently processed, two main types of storage systems exist today: data warehouses and data lakes.
Data warehouses bring together data from disparate sources into a single repository, and host the data in a clean and organized manner that allows for immediate analysis. A data lake, on the other hand, stores data as-is, allowing a wider range of analyses to be performed. A data mart is a smaller, more focused version of a data warehouse that deals with a domain-specific subset of the data, allowing insights to be discovered faster. These types of data storage systems can be built on the cloud, on an organization’s own hardware infrastructure, or on a combination of both.
Databases are different from data warehouses and lakes in that they are built for fast querying and transaction processing instead of analysis. Their focus is on updating real-time data.
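The difference between transaction processing and analysis can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for both kinds of workload; the `orders` table and its values are invented for illustration, and a real data warehouse would normally be a separate system rather than the same SQLite file.

```python
import sqlite3

# In-memory database with a made-up "orders" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, region TEXT, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders (region, amount, status) VALUES (?, ?, ?)",
    [
        ("north", 120.0, "pending"),
        ("south", 80.0, "pending"),
        ("north", 50.0, "shipped"),
    ],
)
conn.commit()

# Transaction-style work: change one record as it happens.
# Using the connection as a context manager commits on success
# and rolls back if an exception is raised.
with conn:
    conn.execute("UPDATE orders SET status = 'shipped' WHERE id = ?", (1,))

status = conn.execute("SELECT status FROM orders WHERE id = 1").fetchone()[0]

# Analysis-style work: summarize many records at once.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(status)
print(totals)
conn.close()
```

The first query touches a single row and must be fast and reliable; the second scans the whole table to produce a summary, which is the kind of question warehouses and lakes are organized to answer at scale.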
The life cycle of data science involves the stages of data ingestion, data storage and processing, data analysis and modeling, and data storytelling (communication). Ingestion refers to all processes associated with collecting data, from manual entry to web scraping to real-time streaming of data from devices. Storage and processing involves pre-processing the data and deciding how to store it to facilitate subsequent analysis. Data analysis in turn is concerned with identifying biases, outliers, patterns, ranges, and the distribution of values for single variables, as well as correlations among variables. It drives the generation of hypotheses for testing, and modeling to predict future outcomes on new data. Storytelling refers to communicating the insights found in the data in a way that is easily understood by decision makers.
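The four stages can be sketched end to end on a toy dataset. The example below is only an illustration under stated assumptions: the temperature and sales figures are invented, the function names are hypothetical, and the outlier rule (values more than three median absolute deviations from the median) is just one of many possible choices.

```python
import csv
import io
import math
import statistics

# Made-up raw data: one row has a missing temperature, one has an odd sales value.
RAW = """day,temperature_c,sales
1,20,40
2,22,44
3,,46
4,24,48
5,26,52
6,28,300
"""

def ingest(text):
    """Ingestion: collect raw records (here, from CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))

def process(rows):
    """Processing: drop rows with missing fields and convert text to numbers."""
    clean = []
    for row in rows:
        if all(value.strip() for value in row.values()):
            clean.append({key: float(value) for key, value in row.items()})
    return clean

def find_outliers(values, k=3.0):
    """Analysis: flag values more than k median absolute deviations from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return [v for v in values if abs(v - med) > k * mad]

def pearson(xs, ys):
    """Analysis: Pearson correlation coefficient between two variables."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rows = process(ingest(RAW))
outliers = find_outliers([r["sales"] for r in rows])
typical = [r for r in rows if r["sales"] not in outliers]
r_value = pearson(
    [r["temperature_c"] for r in typical],
    [r["sales"] for r in typical],
)

# Storytelling: turn the numbers into a sentence a decision maker can read.
summary = (
    f"Out of {len(rows)} complete days, {len(outliers)} had unusual sales. "
    f"On the remaining days, sales moved with temperature (r = {r_value:.2f})."
)
print(summary)
```

In practice each stage is far more involved, but the shape is the same: collect, clean, analyze, and then explain what was found in plain language.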
The skills necessary to become a data scientist include computer programming, statistical analysis and mathematical knowledge, data modeling (machine learning) and data visualization. The industry has grown to encompass several job titles within the data science landscape that go beyond that of the data scientist. I mention a few of them here:
Data Analyst: analyze and visualize data, and are often asked to communicate the results of their analysis.
Marketing Analyst/Scientist: use scientific methods to engage with marketing data, supporting decision-making by extracting insights from customer behavior.
Business Analyst: help to optimize business systems and processes through data-driven decisions.
Business Intelligence Developer: design strategies that allow businesses to find relevant information and make decisions promptly and efficiently.
Data Engineer: develop and maintain data infrastructure and make it available to data scientists and analysts.
Machine Learning Engineer: design, build and maintain artificial intelligence algorithms and put them into production.
Data Modeler: design, improve and maintain data models.
Database Administrator: administer and maintain databases.
Data Architect: develop the architecture for data management that best supports business needs.
Software Engineer: build software that uses data infrastructure to empower end users to work with data.
Data Storyteller: create the narrative that best expresses the findings in the data.
It is always a good idea to glance at job openings for any of these titles in your industry of interest, to get a better idea of the skills required so you can prepare accordingly.
Call to Action:
What sources of data do you generate on a daily basis? Write them down!
How would you store and pre-process the data you generate to facilitate further analysis?
What type of data professional would you hire to analyze your data, and what insights could be discovered? Are any of the possible insights relevant in helping you spend less and earn more money?
Is the organization you work for collecting data? How does it use that data to make data-driven decisions?
References:
About the author: Martin Calvino is a Visiting Professor at Torcuato Di Tella University; a Computational Biologist at The Human Genetics Institute of New Jersey — Rutgers University; and a Multimedia Artist. You can follow him on Instagram @from.data.to.art