In today's digital age, data science has become a top skill, and the data scientist's role is widely described as the sexiest job of the century.
Data science projects do not follow a clean life cycle with well-defined steps like the software development life cycle (SDLC); instead, they are non-linear, highly iterative, and cycle repeatedly between the data science team and various other teams in an organization.
SAS Institute, a leader in analytics, developed its own data mining method called SEMMA (Sample, Explore, Modify, Model, and Assess).
However, many companies have adopted CRISP-DM (Cross-Industry Standard Process for Data Mining), a standard data science workflow developed in 1997 by a consortium of companies including SPSS, Teradata, Daimler, and NCR Corporation.
Whichever method is used, the process is broadly similar and involves the following steps:
Business Understanding: This is the first and most fundamental step, because a data scientist must clearly understand the business problem before moving forward.
Data Acquisition: Based on the business problem, the next step is to understand and acquire the data that is needed: identify the sources where it is available and who is responsible for providing it. The data can come from many sources, such as customer data, demographic data, third-party data, weblogs, social media data, and streaming data like sensor readings, audio, or video. The main challenge is to determine whether the data is up to date and clean enough for model consumption. With the Internet of Things in full swing, acquiring data into a Big Data platform is an important step.
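As an illustration, a minimal acquisition sketch in Python, assuming pandas and SQLAlchemy are available and using hypothetical file names, connection strings, and table names, might pull data from a flat file and a relational database into one place:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical sources: a CSV export of customer data and a demographics table
# in a relational warehouse. File name, connection string, and table are assumptions.
customers = pd.read_csv("customer_data.csv", parse_dates=["signup_date"])

engine = create_engine("postgresql://user:password@dbhost:5432/warehouse")
demographics = pd.read_sql_table("demographics", engine)

# Combine the sources on a shared customer identifier for downstream preparation.
raw_data = customers.merge(demographics, on="customer_id", how="left")
print(raw_data.shape)
```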
Data Preparation: This is also called the data wrangling phase, and it takes almost 60% of the overall project time. The collected data has to be formatted, treated for missing values, abnormalities, and seasonality, and made ready for model consumption.
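A minimal wrangling sketch, continuing the hypothetical data above (all column names are assumptions), could handle formatting, missing values, and obvious abnormalities like this:

```python
import pandas as pd

# raw_data is assumed to come from the acquisition step; column names are hypothetical.
df = raw_data.copy()

# Standardize formats: consistent column names and parsed dates.
df.columns = [c.strip().lower() for c in df.columns]
df["purchase_date"] = pd.to_datetime(df["purchase_date"], errors="coerce")

# Treat missing values: fill numeric gaps with the median, drop rows missing the target.
df["income"] = df["income"].fillna(df["income"].median())
df = df.dropna(subset=["churned"])

# Remove abnormal values, e.g. ages outside a plausible range.
df = df[(df["age"] >= 18) & (df["age"] <= 100)]
```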
Modeling: This is the core activity of a data science project; it requires writing, running, and refining programs to analyze the data and derive meaningful business insights from it. Open-source tools such as R and Python, and commercial tools such as SAS and IBM SPSS, are often used to create the statistical models, and various machine learning techniques are applied to the data depending on the business problem.
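As a sketch of this phase in Python with scikit-learn (feature and target names are hypothetical, and a random forest is only one of many techniques that could be applied), a model might be trained roughly as follows:

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features and target taken from the prepared data frame df.
X = df[["age", "income", "tenure_months"]]
y = df["churned"]

# Hold out part of the data so competing models can be compared fairly later.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
```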
Evaluation: There are several methods for comparing the developed models and selecting the best one for deployment. Typical comparison metrics are AUC (area under the curve), the confusion matrix, gain/lift charts, and root mean squared error (RMSE).
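Continuing the hypothetical example above, several of those metrics can be computed with scikit-learn along these lines:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix, mean_squared_error

# Score the held-out test set with the candidate model.
probabilities = model.predict_proba(X_test)[:, 1]
predictions = model.predict(X_test)

print("AUC:", roc_auc_score(y_test, probabilities))
print("Confusion matrix:\n", confusion_matrix(y_test, predictions))
# RMSE is more typical for regression models; it is shown here only for illustration.
print("RMSE:", np.sqrt(mean_squared_error(y_test, predictions)))
```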
Deployment: Once the most suitable model is identified, it is further tested with live data and then deployed into the production environment.
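One common deployment pattern, though by no means the only one, is to serialize the chosen model and load it in a production scoring service. A minimal sketch, assuming joblib and the hypothetical model above:

```python
import joblib
import pandas as pd

# Persist the selected model so the production environment can load it.
joblib.dump(model, "churn_model.joblib")

# In the production scoring service, load the model and score incoming records.
production_model = joblib.load("churn_model.joblib")
new_records = pd.DataFrame(
    [{"age": 42, "income": 58000, "tenure_months": 18}]
)
print(production_model.predict_proba(new_records)[:, 1])
```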
There are further steps as well, such as monitoring the live model's performance, watching for any degradation, and developing new models that are again compared against the live model.
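A simple champion/challenger comparison, assuming freshly labeled production data (X_recent, y_recent) and a retrained challenger_model, both hypothetical, might look like:

```python
from sklearn.metrics import roc_auc_score

# X_recent, y_recent: recently labeled production data (assumed to be available).
live_auc = roc_auc_score(y_recent, production_model.predict_proba(X_recent)[:, 1])
challenger_auc = roc_auc_score(y_recent, challenger_model.predict_proba(X_recent)[:, 1])

# Promote the challenger only if it clearly outperforms the live model.
if challenger_auc > live_auc + 0.01:
    print("Challenger outperforms the live model; consider promoting it.")
```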
Data science has evolved beyond conventional predictive modeling into recommendation engines, text mining, deep learning, and artificial intelligence. The foundation remains the same: gathering data, cleaning it, and then applying various algorithms.