This blog summarizes the soft and technical skills required to become a Data Scientist / Analyst.
Definitions:
# 1: Role to analyse and interpret complex digital data, especially in order to assist a business in its decision-making.
# 2: Responsible for collecting, analyzing and interpreting large amounts of data to identify ways to help a business to gain.
# 3: “A data scientist is: better at statistics than any software engineer and better at software engineering than any statistician”
Data science requires knowledge of a number of big data platforms and tools, including Hadoop, Pig, Hive, Spark and MapReduce, and programming languages that include structured query language (SQL), Python, Scala and Perl, as well as statistical computing languages such as R.
Data scientist vs. Data analyst: although many of the skills overlap, there are significant differences.
· The data analyst role varies depending on the company. In general, these professionals collect data, process that data and perform statistical analysis using standard statistical tools and techniques. Analysts also identify patterns and make correlations in data sets to identify new opportunities for improvements in business processes, products or services.
· Data scientists are responsible for those tasks and many more. These professionals are equipped to analyze big data using advanced analytics tools and are expected to have the research background to develop new algorithms for specific problems. They may also be tasked with exploring data without a specific problem to solve.
Data scientist skills
- Soft skills required for data scientists include intellectual curiosity combined with skepticism and intuition, along with creativity.
- Interpersonal skills are also a critical part of the role, and many employers want their data scientists to be data storytellers who know how to present data insights to people at all levels of an organization.
- They also need leadership skills to steer data-driven decision-making processes in an organization.
- And finally, the technical skill set: Spark, Hadoop, Hive, Pig, SQL, Neo4j, MySQL, Python, R, Scala, TensorFlow, A/B testing, NLP, and anything related to machine learning.
Eight Data Science Skills That Will Get You Hired | Statistical tools used for data analysis
Technical Skills
- Math (e.g. linear algebra, calculus and probability)
- Statistics (e.g. hypothesis testing and summary statistics)
- Machine learning tools & techniques (e.g. k-nearest neighbors, random forests, ensemble methods, etc.)
- Software engineering skills (e.g. distributed computing, algorithms and data structures)
- Data mining, data cleaning and munging, incl. predictive analysis
- Data visualization (e.g. ggplot and d3.js) and reporting techniques
- Unstructured data techniques
- R and/or SAS languages
- SQL databases and database querying languages
- Python (most common), C/C++, Java, Perl
- Big data platforms like Hadoop, Hive & Pig
- Cloud tools like Amazon S3
- Essentials of Biostatistics: An Overview with the help of Software
- Biostatistics book
- Certified Analytics Professional (CAP) from INFORMS
- Cloudera Certified Professional: Data Scientist (CCP:DS)
- EMC: Data Science Associate (EMCDSA)
- SAS Certified Predictive Modeler using SAS Enterprise Miner 7
- PG Program in Data Science
- https://towardsdatascience.com/why-so-many-data-scientists-are-leaving-their-jobs-a1f0329d7ea4
- https://blog.udacity.com/2014/11/data-science-job-skills.html
- IBM Cognitive Class - https://cognitiveclass.ai
- The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
- The base Apache Hadoop framework is composed of the following modules: Hadoop Common ; HDFS ; Hadoop YARN ; Hadoop MapReduce
- Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on their MapReduce and Google File System. The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell scripts.
- Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
- A/B testing is a way to compare two versions of a single variable, typically by testing a subject's response to variable A against variable B, and determining which of the two variables is more effective.
- Statistical natural-language processing (SNLP)
- R is a programming language and free software environment for statistical computing and graphics. It is a GNU project and is supported by the R Foundation for Statistical Computing.
- The SAS language is a computer programming language used for statistical analysis.
- Apache Hadoop (/həˈduːp/) is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
- Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query and analysis.
- Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.
- Amazon S3 is object storage built to store and retrieve any amount of data from anywhere: web sites and mobile apps, corporate applications, and data from IoT sensors or devices. It is designed to deliver 99.999999999% (eleven nines) durability, and stores data for millions of applications used by market leaders in every industry.
- Neo4j is a graph database management system developed by Neo4j, Inc. Neo4j is available in a GPL3-licensed open-source "community edition", with online backup and high availability extensions licensed under the terms of the Affero General Public License.
- Scala combines object-oriented and functional programming in one concise, high-level language.
- MySQL is an open-source relational database management system.
- Apache Spark is an open-source cluster-computing framework. Components: Spark Core ; Spark SQL ; Spark Streaming ; MLlib ; GraphX
- TensorFlow is an open-source software library for dataflow programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks.
- In web analytics, A/B testing (bucket tests or split-run testing) is a controlled experiment with two variants, A and B. It is a form of statistical hypothesis testing or "two-sample hypothesis testing" as used in the field of statistics.
- Natural-language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
- Python is an interpreted high-level programming language for general-purpose programming. Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional and procedural, and has a large and comprehensive standard library.
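The A/B testing entries above describe it as a form of two-sample hypothesis testing. As a rough illustration, here is a minimal sketch of one common way to do that, a two-proportion z-test, using only the Python standard library. The function name `two_proportion_z_test` and the visitor and conversion counts are hypothetical and chosen only for this example; real experiments often need more care (sample-size planning, multiple comparisons, and so on).

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.

    Illustrative helper (not from any library). Assumes the pooled rate
    is strictly between 0 and 1, otherwise the standard error is zero.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: conversions and visitors for variants A and B.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value (for example, below a threshold such as 0.05 chosen before the experiment) suggests the difference between the two variants is unlikely to be due to chance alone.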
Data Science: Course / Diploma in Bangalore - Nov 2023
- Data Science Introduction: https://www.w3schools.com/datascience/ds_introduction.asp
- Data Science(DS) Simplification: https://www.youtube.com/watch?v=6USOEP1d9HA
- IBM Data Science: https://www.coursera.org/professional-certificates/ibm-data-science
- Udemy Data Science: https://www.udemy.com/course/the-data-science-course-complete-data-science-bootcamp/
- Data Science courses in B'lore, such as Data Matix or AnalytixLabs: the fee will be 50k to 90k
- IIM, B'lore costs ~7 lakhs
- IISc, Business Analytics costs Rs. 2.5 Lakhs
Note: A diploma and a course are different. A course is for students or working professionals.
Regarding Programming:
- People without a computer science background can opt for Front End Developer or Data Analyst roles:
- Front End Developer (FED): CSS, HTML5, Python / GoLang / YAML programming
- Prospects: Good for people without a core programming background. The market is there, but for those over 60 yrs...!?
- Data Analyst: Data Science / Analyst experience
- Prospects: Good to have. Data Science focused on (Special) Education or Academia has a wide range of opportunities.
- A Program Manager role with PMP certification and some experience also helps to find opportunities across the globe and across industries.