In today's digital age, data science has become a top skill, and data scientist is often called the sexiest job of the century.
Data science projects do not have a clean life-cycle with well-defined steps like the software development life-cycle (SDLC); they are non-linear, highly iterative, and cycle between the data science team and various other teams in an organization.
SAS Institute, a leader in analytics, developed its own data-mining method called SEMMA (Sample, Explore, Modify, Model, and Assess).
However, many companies have adopted a standard data science workflow called CRISP-DM (CRoss Industry Standard Process for Data Mining), developed in 1997 by a consortium of companies including SPSS, Teradata, Daimler, and NCR Corporation.
Whichever method is used, the process is similar and involves the following steps:
Business Understanding: This is the first and most fundamental step, as understanding the business problem is essential for a data scientist to move forward.
Data Acquisition: Based on the business problem, the next step is to understand and acquire the data that is needed: identify the sources where it is available and who is responsible for providing it. It can come from many places, such as customer data, demographic data, third-party data, weblogs, social media data, and streaming data like sensor, audio, or video feeds. The main challenge is deciding whether the data is up to date and clean enough for model consumption. With the Internet of Things in full swing, acquiring data into a Big Data platform is an important step.
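As a minimal sketch of acquiring and combining data from different sources, the snippet below loads two hypothetical feeds (a customer CSV export and a weblog JSON payload, both invented for illustration) into pandas and joins them on a shared key:

```python
# Acquisition sketch: pull data from two hypothetical sources into pandas.
import io
import pandas as pd

# In practice these would be files, database queries, or API responses;
# the column names here are made up for the example.
customer_csv = io.StringIO("id,segment\n1,retail\n2,enterprise")
weblog_json = '[{"id": 1, "visits": 12}, {"id": 2, "visits": 7}]'

customers = pd.read_csv(customer_csv)
weblogs = pd.read_json(io.StringIO(weblog_json))

# Combine the sources on a shared key before downstream preparation.
combined = customers.merge(weblogs, on="id")
print(combined)
```

Real acquisition pipelines add scheduling, access control, and freshness checks on top of this kind of join.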
Data Preparation: This is also called the data wrangling phase, and it takes almost 60% of overall project time. Collected data has to be formatted, treated for missing values, and cleaned of abnormalities or seasonality to make it ready for model consumption.
Modelling: This is the core activity of a data science project, requiring the writing, running, and refining of programs to analyse data and derive meaningful business insights. Open-source tools like R and Python, and commercial tools like SAS and IBM SPSS, are often used to create the statistical models. Various machine learning techniques are applied to the data depending on the business problem.
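A minimal modelling sketch in Python with scikit-learn, using a synthetic dataset in place of real business data:

```python
# Train a simple classifier with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a prepared business dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit the model on training data, then score it on held-out data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

In practice this step is iterated: feature choices, algorithms, and hyperparameters are all refined as insights emerge.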
Evaluation: There are several methods for comparing the developed models before choosing the best one for deployment. Typical comparison metrics include AUC (area under the ROC curve), the confusion matrix, gain/lift charts, and root mean squared error (RMSE).
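The metrics named above can be computed with scikit-learn; the small label and score vectors here are hand-made for illustration:

```python
# Compute AUC, a confusion matrix, and RMSE with scikit-learn.
import numpy as np
from sklearn.metrics import confusion_matrix, mean_squared_error, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0])            # actual class labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2])  # predicted probabilities
y_pred = (y_score >= 0.5).astype(int)            # threshold into hard labels

auc = roc_auc_score(y_true, y_score)             # ranking quality of scores
cm = confusion_matrix(y_true, y_pred)            # [[TN, FP], [FN, TP]]
rmse = mean_squared_error(y_true, y_score) ** 0.5  # error of the raw scores
print(f"AUC: {auc:.2f}")
print(cm)
print(f"RMSE: {rmse:.2f}")
```

AUC and the confusion matrix suit classification models, while RMSE is the usual choice for regression; the comparison metric should match the business problem.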
Deployment: Once the most suitable model is identified, it is further tested with live data and then deployed into the production environment.
Further steps follow deployment, such as monitoring the live model's performance, watching for any degradation, and developing new models that are again compared against the live one.
Data science has evolved beyond traditional predictive modeling into recommendation engines, text mining, deep learning, and artificial intelligence. The foundation still remains the same: gathering data, cleaning it, and then applying various algorithms.