Data Science is the practice of analysing raw data with statistics and machine learning techniques in order to draw insights from it. In simple words, Data Science is the process of using data to find solutions to, or predict outcomes for, a problem statement.
It is an interdisciplinary field which combines techniques and processes from computer science, statistics, mathematics, information science, graphics, business domain knowledge and other related scientific techniques and algorithms.
It is the process of extracting meaningful information from a given set of data points. The data can be structured or unstructured. It is basically a “data-driven” technology which uses a combination of interdisciplinary techniques to extract useful information. The data involved is often so large that finding the correct interrelations between data points means analysing a huge number of datasets, so Data Science makes use of various forms of distributed systems. Data Science also depends heavily on related fields like Artificial Intelligence, Machine Learning and Deep Learning.
Components of Data Science
Data Science is an umbrella term covering three components:
• Organising the data – After successfully applying the data handling mechanisms on the data, there comes the next part of organising the data. Organising the data refers to the planning and execution of the physical storage system and structure of the data.
• Packaging the data – In this process, the data is wrapped in logical elements so that it is in a presentable format. This is where prototypes are created, statistics are applied to the data and suitable visualisations are produced.
• Delivering the data – This process makes sure that the final output delivered to the end client is accurate and is delivered to the concerned client.
Data Science Life Cycle
Data Science is a rapidly evolving field, but its life cycle can be summed up in seven stages.
• Business Understanding – It is the complete understanding of the business requirement and its specifications in the correct context. It should answer each and every possible business domain question in the correct context. It should classify the specifications depending upon their context for easy design and processing. Any anomaly should be detected at this stage.
• Data Mining – This is the stage where data related to the business requirements is gathered. Finding the right data takes time and effort. We should query the source of the data. If the data is in a database, the job becomes simpler; otherwise we may need to collect it ourselves, for example by scraping web pages.
• Data Cleaning – This is also a time-consuming process, and it follows the collection of data. Inconsistencies in how the data is represented, such as misspellings, make data cleaning and preparation necessary. Missing data must also be taken care of, otherwise it can cause a lot of errors when creating a model.
• Data Exploration – This is the analysis part after the data has been cleaned. This stage is where we look for useful patterns in our data. Pandas is a useful tool to analyse a given subset of data. It can be used to plot a histogram or any other distribution curve to analyse the general trend or simply to visualise the data. Using these findings we can build a hypothesis for our problem statement.
• Feature Engineering – A feature is a measurable property or attribute of a phenomenon. If we are predicting the performance of a student in a class, then a probable feature would be their IQ level. This stage directly influences the accuracy of the next stage, that is, of the model we build.
• Predictive Modelling – It is at this stage that Machine Learning finally comes into the picture. A good modelling process does not just train a model and obsess over its accuracy; it also applies statistical methods to test that the outcomes from the model are accurate.
It is at this stage that the Data Scientist should carefully decide which model to use. The choice of model depends on various factors such as the size and quality of the data, how much computational time and effort can be invested, and the type of output we want for our problem statement. The accuracy of the model can be evaluated with k-fold cross-validation, or measured as PCC (Percent Correct Classification).
• Data Visualisation – It is a combined field of statistics, mathematics, psychology, communication, graphics and art whose aim is to communicate effectively and in a visually appealing manner. It is the stage where one can present the outcome against the different business requirements in a way that different business audiences can understand.
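The cleaning and modelling stages above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the student records, the `LABEL_FIXES` table and the function names are all hypothetical, and a very simple nearest-centroid classifier stands in for whichever model a real project would choose. It normalises misspelt labels, fills missing values with the column mean, and then estimates PCC with k-fold cross-validation.

```python
import random
from statistics import mean

# Hypothetical raw records: (hours_studied, attendance_pct, outcome).
# None marks a missing value; outcome spellings are deliberately inconsistent.
RAW_RECORDS = [
    (1.0, 50.0, "fail"), (2.0, 55.0, "Fail"), (1.5, None, "FAIL "),
    (2.5, 60.0, "fial"), (3.0, 52.0, "fail"), (2.0, 58.0, "fail"),
    (8.0, 90.0, "pass"), (9.0, 95.0, "Pass"), (7.5, None, "passed"),
    (8.5, 88.0, "pass"), (9.5, 92.0, " pass"), (8.0, 94.0, "PASS"),
]

# Known spelling variants mapped to one canonical label (assumed for this example).
LABEL_FIXES = {"fial": "fail", "passed": "pass"}


def clean(records):
    """Normalise label spellings and fill missing attendance with the column mean."""
    fill = mean(att for _, att, _ in records if att is not None)
    cleaned = []
    for hours, att, label in records:
        label = label.strip().lower()
        label = LABEL_FIXES.get(label, label)
        cleaned.append(((hours, fill if att is None else att), label))
    return cleaned


def centroids(train):
    """Return the mean feature vector of each class in the training rows."""
    by_label = {}
    for features, label in train:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(col) for col in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(model, features):
    """Pick the class whose centroid is closest (squared Euclidean distance)."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(model, key=lambda label: dist(model[label]))


def k_fold_pcc(data, k=4, seed=0):
    """Estimate PCC (percent correctly classified) with k-fold cross-validation."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint test folds
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [data[i] for i in indices if i not in held_out]
        model = centroids(train)  # fit only on the other k-1 folds
        correct += sum(predict(model, data[i][0]) == data[i][1] for i in fold)
    return 100.0 * correct / len(data)


if __name__ == "__main__":
    rows = clean(RAW_RECORDS)
    print(f"PCC over 4 folds: {k_fold_pcc(rows):.1f}%")
```

Every record is used for testing exactly once, so the reported PCC is based on predictions for data the model did not see during training. One simplification to be aware of: the fill value for missing attendance is computed from all records before splitting, whereas a stricter setup would compute it from the training folds only.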
Applications of Data Science
• Healthcare – With large volumes of useful clinical data becoming available through Data Science, medical practitioners are able to diagnose diseases faster and carry out advanced research. New treatment options are being explored for existing and newly found diseases.
• Autonomous cars – Various car manufacturing companies like Tesla, Renault Nissan, Volkswagen and Ford use predictive analytics for their self-driving cars. A large number of sensors are installed all over the vehicle surface to capture real-time data. Using the combined technologies of Machine Learning, Data Science and predictive analytics, various features like automatic speed adjustment, lane detection and drunk-driving detection can be implemented in a car.
• Logistics – Data Science can help drivers locate optimal and safe driving routes. It can also track the vehicle in case of breakdown or failure.
• Entertainment – Have you ever wondered how Spotify suggests the perfect song for your mood or how Netflix suggests your next watch? YouTube can bring up the next recipe you will be interested in cooking – all based on your past and present activities or searches. This is all made possible by the combined technology of Data Science and Machine Learning.
• Finance and Stocks – The stock exchange is one of the best examples of the application of Predictive Analytics. Various financial companies use Data Science mechanisms to extract vital information and process it to understand current and future trends.
• Cyber Security – Data Science and Machine Learning can be used to detect thousands of new malware samples on a daily basis and help protect your system from cyber threats.
• Targeted sales and advertising – Flipkart and Amazon can show advertisements based on your past shopping history. The digital banners that appear on the websites you visit are all chosen by Data Science algorithms. These are all targeted based on a user’s search and behavioural patterns.
• Speech recognition – Google Voice, Siri and Cortana are all examples of speech recognition based on Data Science algorithms. You simply speak your message to these assistants and they are there to help you out. The speech is processed and converted to text by the use of Natural Language Processing algorithms, which are also a Data Science concept.
• Airline route planning – The airline industry uses Data Science extensively to improve its strategies and investments. To improve profitability, airlines can analyse questions such as whether to make a stopover between two destinations or fly directly, how much to invest in customer loyalty programmes, and how long a flight can be delayed.
• Gaming – Gaming companies like Zynga, Nintendo and EA Sports have taken the gaming experience to a new level – all made possible because of Data Science.
• Augmented reality – Games like Pokemon GO have become highly popular. This kind of game augments reality using computing knowledge and Data Science algorithms to provide the best viewing and gaming experience.
M.Tech (VLSI Design and Embedded system)
BS Abdur Rahman University