12 July 2018

Data Scientist

This post summarizes the soft and technical skills required to become a Data Scientist or Data Analyst.

Definitions:
#1: A role that analyses and interprets complex digital data, especially in order to assist a business in its decision-making.
#2: Responsible for collecting, analyzing and interpreting large amounts of data to identify ways to help a business gain an advantage.
#3: “A data scientist is: better at statistics than any software engineer and better at software engineering than any statistician”
Data science requires knowledge of a number of big data platforms and tools, including Hadoop, Pig, Hive, Spark and MapReduce, and programming languages that include structured query language (SQL), Python, Scala and Perl, as well as statistical computing languages such as R.
Data scientist vs. data analyst: although many of the skills overlap, there are significant differences.
·      The data analyst role varies depending on the company. In general, these professionals collect data, process that data and perform statistical analysis using standard statistical tools and techniques. Analysts also identify patterns and make correlations in data sets to identify new opportunities for improvements in business processes, products or services.
·      Data scientists are responsible for those tasks and many more. These professionals are equipped to analyze big data using advanced analytics tools and are expected to have the research background to develop new algorithms for specific problems. They may also be tasked with exploring data without a specific problem to solve.

Data scientist skills
  • Soft skills required for data scientists include intellectual curiosity combined with skepticism and intuition, along with creativity.
  • Interpersonal skills are also a critical part of the role, and many employers want their data scientists to be data storytellers who know how to present data insights to people at all levels of an organization.
  • They also need leadership skills to steer data-driven decision-making processes in an organization.
  • And finally, the technical skill set: Spark, Hadoop, Hive, Pig, SQL, Neo4j, MySQL, Python, R, Scala, TensorFlow, A/B testing, NLP, and anything machine learning.
Eight Data Science Skills That Will Get You Hired
  1. Programming Skills: a statistical programming language, like R or Python, and a database querying language like SQL.
  2. Statistics
  3. Machine Learning
  4. Multivariable Calculus & Linear Algebra
  5. Data Wrangling
  6. Data Visualization & Communication
  7. Software Engineering
  8. Data Intuition
Statistical tools used for data analysis
  1. Descriptive Statistics
  2. Inferential Statistics
  3. Analysis of Variance (ANOVA)
  4. Correlation and Regression
  5. Bayesian analysis
  6. Factor Analysis
  7. Discriminant Analysis
  8. Cluster Analysis
  9. Survival Analysis
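Two of the statistical tools listed above, descriptive statistics and correlation, can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a recommended workflow: the helper name pearson_r and the hours/scores numbers are made up for the example.

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical data: hours studied and test scores for five people.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

# Descriptive statistics summarize a single variable.
summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "stdev": statistics.stdev(scores),  # sample standard deviation
}

# Correlation measures how strongly two variables move together.
r = pearson_r(hours, scores)
print(summary, round(r, 3))
```

In practice, libraries such as pandas or R's built-in functions would usually do this work; the point here is only to show what the terms mean.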
Technical Skills
  • Math (e.g. linear algebra, calculus and probability)
  • Statistics (e.g. hypothesis testing and summary statistics)
  • Machine learning tools & techniques (e.g. k-nearest neighbors, random forests, ensemble methods, etc.)
  • Software engineering skills (e.g. distributed computing, algorithms and data structures)
  • Data mining, data cleaning and munging, including predictive analysis
  • Data visualization (e.g. ggplot and d3.js) and reporting techniques
  • Unstructured data techniques
  • R and/or SAS languages
  • SQL databases and database querying languages
  • Python (most common), C/C++, Java, Perl
  • Big data platforms like Hadoop, Hive & Pig
  • Cloud tools like Amazon S3
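To make one of the machine learning techniques named above more concrete, here is a rough sketch of k-nearest neighbors classification in plain Python. The function name knn_predict and the toy points are assumptions for illustration; a real project would more likely use an existing machine learning library.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k closest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k nearest neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D training data with two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.1, 1.0)))
print(knn_predict(points, labels, (5.0, 4.9)))
```

The choice of k and of the distance measure both affect the result, which is part of the "data intuition" skill mentioned earlier.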
Books
  • Essentials of Biostatistics : An Overview with the help of Software
  • Biostatistics book
Certifications

Reference material:
Summary of data science software:
  • The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
  • The base Apache Hadoop framework is composed of the following modules: Hadoop Common; Hadoop Distributed File System (HDFS); Hadoop YARN; Hadoop MapReduce.
  • Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on their MapReduce and Google File System. The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell scripts.
  • Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. 
  • A/B testing is a way to compare two versions of a single variable, typically by testing a subject's response to variable A against variable B, and determining which of the two variables is more effective.
  • Statistical natural-language processing (SNLP)
  • R is a programming language (it is a GNU project) and free software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing.
  • The SAS language is a computer programming language used for statistical analysis.
  • Apache Hadoop (/həˈduːp/) is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
  • Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query and analysis.
  • Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.
  • Amazon S3 is object storage built to store and retrieve any amount of data from anywhere: web sites and mobile apps, corporate applications, and data from IoT sensors or devices. It is designed to deliver 99.999999999% durability, and stores data for millions of applications used by market leaders in every industry.
  • Neo4j is a graph database management system developed by Neo4j, Inc. Neo4j is available in a GPL3-licensed open-source "community edition", with online backup and high availability extensions licensed under the terms of the Affero General Public License.
  • Scala combines object-oriented and functional programming in one concise, high-level language.
  • MySQL is an open-source relational database management system.
  • Apache Spark is an open-source cluster-computing framework. Components: Spark Core ; Spark SQL ; Spark Streaming ; MLlib ; GraphX 
  • TensorFlow is an open-source software library for dataflow programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks.
  • In web analytics, A/B testing (bucket tests or split-run testing) is a controlled experiment with two variants, A and B. It is a form of statistical hypothesis testing or "two-sample hypothesis testing" as used in the field of statistics.
  • Natural-language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. 
  • Python is an interpreted high-level programming language for general-purpose programming. Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional and procedural, and has a large and comprehensive standard library.
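Since A/B testing is described above as a form of two-sample hypothesis testing, a small Python sketch may help show what that means. It uses a two-proportion z-test, which is one common choice and not the only one. The function name two_proportion_z_test and the conversion numbers are hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) comparing the conversion rates of A and B."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: 1000 visitors per variant.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(round(z, 2), p_value)
```

A small p-value suggests the difference between the variants is unlikely to be due to chance alone. The normal approximation used here assumes reasonably large samples in both groups.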

Data Science: Course / Diploma in Bangalore - Nov 2023

    Note: A diploma and a course are different. A course is for students or working professionals.

Regarding Programming:
  • People without a computer science background can opt for Front End Developer / Data Analyst:
    • Front End Developer (FED): CSS, HTML5, Python / GoLang / YAML programming
      • Prospects: Good for people without a core programming background. The market is there, but > 60 yrs...!?
    • Data Analyst: Data Science / Analyst experience
      • Prospects: Good to have. Data science focused on (special) education or academia has a wide range of opportunities.
    • A Program Manager role with PMP certification and some experience also helps in finding opportunities across the globe and across industries.
