    DATA SCIENCE AND DATA SCIENTISTS

    No discussion of Big Data is complete without discussing Data Science and its relation to Big Data.

    Data Science can be considered the extraction of knowledge from large volumes of data, whether structured (e.g. RDBMS, Excel) or unstructured (e.g. emails, videos, photos, social media, and other user-generated content). It may also be considered a continuation of the fields of data mining and predictive analytics.

    Data Scientists are qualified people with the strength and patience to tunnel through large amounts of information, and the technical skills to write algorithms that extract insights from these mountains of information. Data scientists apply expertise in data preparation, statistics, and machine learning to investigate complex problems in many domains, such as marketing optimization, fraud detection, and setting public policy.

    While some see no distinction between data science and statistics, others consider it a distinct field with specific skill sets, training techniques and goals. For the purpose of this note, we will assume that Data Science is more than just statistics.

    THREE FACETS OF DATA SCIENCE

    The Lynda.com course Techniques and Concepts of Big Data, with Barton Poulson, describes three facets of Data Science: coding, statistics and domain knowledge. It also discusses the Data Science Venn Diagram.

    • Statistics is the mathematical knowledge or training (e.g. probability) that helps in generating the right results.
    • Domain knowledge is knowledge about the domain in which the research is done (e.g. Marketing) and is very important for proper research. According to researchers such as Svetlana Sicular of Gartner, it is easier to teach domain experts Hadoop than to have Hadoop experts gain the domain knowledge.
    • Coding knowledge, even a little, can be handy in many areas, such as exploring and manipulating data sets and transforming data from various sources into common formats before processing. Coding knowledge also helps with the algorithmic thinking needed to work through a problem.

    Another version of the Venn diagram that I found describes the three facets as:

    • Math and statistics knowledge (statistics)
    • Substantive expertise (domain knowledge)
    • Hacking skills (coding).

    You can read more in the reference links.

     

    Combination of different facets of Data Science

    According to the Venn Diagram, each combination of skills has its own significance:

    • The combination of Statistics and Domain knowledge is often what traditional researchers possess.
    • Statistics and Coding together can result in machine learning based research and applications. An email spam filter is an example.
    • The combination of Domain knowledge and Coding, without Statistics, is considered a danger zone, because you are very unlikely to draw sound conclusions without statistics.
    • Finally, the combination of Statistics, Domain knowledge and Coding is what can be called Data Science.
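
    The combinations above can be sketched as a small lookup in Python. This is only an illustration of the list, not something taken from the course or the diagram itself: the names Skill, REGIONS and venn_region are hypothetical, and the label returned for a single facet or no facet is an assumption added so that every input gets an answer.

```python
from enum import Enum


class Skill(Enum):
    """The three facets of Data Science described above."""
    STATISTICS = "statistics"
    DOMAIN = "domain knowledge"
    CODING = "coding"


# Region labels follow the bullet list above. The wording of each label
# is a paraphrase chosen for this sketch.
REGIONS = {
    frozenset({Skill.STATISTICS, Skill.DOMAIN}): "traditional research",
    frozenset({Skill.STATISTICS, Skill.CODING}): "machine learning",
    frozenset({Skill.DOMAIN, Skill.CODING}): "danger zone",
    frozenset({Skill.STATISTICS, Skill.DOMAIN, Skill.CODING}): "data science",
}


def venn_region(skills):
    """Return the Venn diagram region for a collection of Skill values.

    Anything that is not one of the four listed combinations (a single
    facet, or none) falls back to an assumed catch-all label.
    """
    return REGIONS.get(frozenset(skills), "single facet or none")


if __name__ == "__main__":
    print(venn_region([Skill.STATISTICS, Skill.CODING]))
    print(venn_region([Skill.DOMAIN, Skill.CODING]))
    print(venn_region(list(Skill)))
```

    Using a frozenset as the key means the order in which skills are listed does not matter, which matches how regions of a Venn diagram work.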

     

    TYPES AND SKILLS OF DATA SCIENTISTS

    Almost everyone talks about “data science,” “big data,” and “analytics.” However, there is a lack of clarity around the skill sets and capabilities of their practitioners. This lack of clarity has frequently led to missed opportunities.

    To address this issue, the authors of the book “Analyzing the Analyzers” surveyed several hundred practitioners via the Web to explore the varieties of skills, experiences, and viewpoints in the emerging data science community, and documented their findings in the book. Here is a quick summary:

    Data scientists were classified into four categories, with subtypes:

    1. Data Developer
      • Developer, Engineer
    2. Data Researcher
      • Researcher, Scientist, Statistician
    3. Data Creative
      • Jack of all trades, Artist, Hacker
    4. Data Businessperson
      • Leader, Businessperson, Entrepreneur

    The book also classified the skill sets into five categories, with sub-skills:

    1. Business
      • Product Development, Business
    2. ML/Big Data
      • Unstructured data, Structured Data, Machine Learning, Big and Distributed Data
    3. Math/OR
      • Optimization, Math, Graphical Models, Bayesian/Monte Carlo statistics, Algorithms, Simulation
    4. Programming
      • System Administration, Back End Programming, Front End Programming
    5. Statistics
      • Visualization, Temporal Statistics, Surveys and Marketing, Spatial Statistics, Science, Data Manipulation, Classical Statistics

    Finally, the book reports how strongly each skill category is represented in each data scientist category.

    Every data scientist category had some knowledge from all skill set categories, but the distribution across skill set categories varied from one data scientist category to another. For instance, the Data Businessperson had a high percentage of Business skills, and the Data Developer had high percentages of ML/Big Data and Math/OR skills.

    You can find the distribution as per the research in “Chapter 3: A Survey of, and About, Professionals”, under the heading “Combining Skills and Self-ID”.

    REFERENCES: 

    1. https://en.wikipedia.org/wiki/Data_science
    2. https://www.facebook.com/dan.ariely/posts/904383595868
    3. https://en.wikipedia.org/wiki/Machine_learning
    4. Lynda.com’s Techniques and Concepts of Big Data with Barton Poulson
    5. http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
    6. Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work, by Harlan Harris, Sean Murphy, and Marck Vaisman
