Data scientist: a machine that turns coffee into linear models.
It seems like everyone and their manager wants a data scientist in their company to boost profits and use #bigdata, yet there does not seem to be a good definition of what a data scientist is supposed to do, or even what kind of knowledge and expertise he or she must possess. Attempts range from Drew Conway’s famous Venn diagram, which probably oversimplifies things, to the recent lengthy discussion on Cross Validated, the aptly named Stack Exchange for statisticians, which probably overcomplicates them.
I will not try to present a succinct yet encompassing definition, which would just get lost in the sea of failed attempts. But we can at least enumerate the plethora of interdisciplinary skills that data scientists are expected to have. The degree requirements alone showcase the versatility of this position: Computer Science, Statistics, Applied Math, Physics, Engineering, or basically any quantitative field will do, and the degree can be a BSc, MSc or PhD. Turning to the skills, we can split them into a few broad areas of expertise, and the more of them a candidate possesses, the better. So basically, you’re expected to be familiar with every concept described below.
- R & Python - You want a scripting language for fast prototyping, and these two are equipped with excellent data manipulation (numpy, pandas) and visualization (ggplot, matplotlib) libraries, in addition to machine learning frameworks (scikit-learn).
- Parallelism & MapReduce - And the flavor-of-the-month implementation, which used to be Hadoop but is being overtaken by Spark.
- Algorithms & CS fundamentals - Knowing your time/space complexity proves very useful if you happen to use a model with O(n³) training time, which is going to take years in the age of #bigdata.
- Databases & Relational Algebra - Although 90% of the time you'll get a CSV file, it doesn't hurt to be familiar with SQL, NoSQL etc.
- Linear & Matrix Algebra - The backbone of machine learning and statistics, a must-know since half of machine learning is matrix products (OLS, neural nets, PCA, Gaussian processes, recommendation systems, the list goes on).
- Probability - It's important to know your distributions and their assumptions, e.g. when you use a Gaussian for non-normal data (which probably happens often).
- Frequentist statistics - Correlation, t-tests, maximum likelihood estimation and statistical significance more broadly are the bare minimum, but really, this is one area where it is vital to have some background so as to avoid pitfalls such as p-value hacking or false positives.
- Bayesian statistics - When you realize that all of the above is just Bayesian inference with a flat prior and that it's actually more intuitive than the ridiculous definition of a frequentist confidence interval*, you will come to see its superiority, only to realize that obtaining full posterior probability distributions is often intractable for the even moderately large data sets that you will mostly encounter.
- Supervised learning - Whenever you hear about ML in the news, it's yet another breakthrough in, and application of, supervised algorithms, and you should definitely know as many as you can. From tree-based methods and their famous representative, random forests, to deep learning and convolutional nets, to max-margin methods (SVM), there is an endless supply, though the aforementioned are ones that consistently perform well and are commonly used.
- Unsupervised learning - The untamed wilderness, where the data can always be clustered and the evaluation criteria don't matter. Although it's a legitimate area of research, it's most often used as a mere preprocessing step (PCA) for supervised methods. However, the clustering methods are still useful when the goal is to find some structure.
- Overfitting - If there is one key insight from ML, it's that you shouldn't test on the same data set you trained on. Although this is something that anyone with at least a bit of ML knowledge will know, there is a surprising number of people who are not familiar with this concept. For those who think this is obvious, there is a simple mistake that many make, which is running feature selection on the whole data set before setting aside the test set. Since feature selection works by looking at the relationship between the features and what you're trying to predict, you have now implicitly used the test set's values that you will later be trying to predict, which will probably lead to overfitting and an overly optimistic test score.
* Simply put: in the limit of an infinite number of random samples of the given size, 95% of the confidence intervals computed from those samples will contain the true value of the parameter.
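The feature-selection pitfall from the Overfitting item is easy to reproduce. Below is a minimal sketch in plain Python, not a recipe: the data is pure noise, so no honest model should beat a coin flip. The helper names (`select_features`, `run_experiment`), the nearest-centroid classifier, and the sizes (40 samples, 500 features, 10 kept) are all assumptions picked for illustration.

```python
import random


def class_mean_gap(rows, labels, j):
    """Difference between the class-1 and class-0 means of feature j."""
    ones = [r[j] for r, y in zip(rows, labels) if y == 1]
    zeros = [r[j] for r, y in zip(rows, labels) if y == 0]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)


def select_features(rows, labels, k):
    """Keep the k features whose class means differ the most."""
    ranked = sorted(range(len(rows[0])),
                    key=lambda j: abs(class_mean_gap(rows, labels, j)),
                    reverse=True)
    return ranked[:k]


def centroid(rows, labels, cls, feats):
    """Mean of the selected features over all rows of one class."""
    members = [r for r, y in zip(rows, labels) if y == cls]
    return [sum(r[j] for r in members) / len(members) for j in feats]


def nearest_centroid_accuracy(train, y_train, test, y_test, feats):
    """Fit a nearest-centroid classifier on train, score it on test."""
    c0 = centroid(train, y_train, 0, feats)
    c1 = centroid(train, y_train, 1, feats)
    hits = 0
    for row, y in zip(test, y_test):
        d0 = sum((row[j] - c) ** 2 for j, c in zip(feats, c0))
        d1 = sum((row[j] - c) ** 2 for j, c in zip(feats, c1))
        hits += int((1 if d1 < d0 else 0) == y)
    return hits / len(test)


def run_experiment(n_repeats=20, n_samples=40, n_features=500, k=10, seed=0):
    """Return (leaky, honest) test accuracy averaged over n_repeats noise data sets."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_samples)]  # balanced, unrelated to the features
    half = n_samples // 2
    leaky_scores, honest_scores = [], []
    for _ in range(n_repeats):
        rows = [[rng.gauss(0, 1) for _ in range(n_features)]
                for _ in range(n_samples)]
        train, test = rows[:half], rows[half:]
        y_train, y_test = labels[:half], labels[half:]
        # Mistake: the selection step sees the test rows and their labels.
        leaky = select_features(rows, labels, k)
        # Correct: the selection step sees the training rows only.
        honest = select_features(train, y_train, k)
        leaky_scores.append(
            nearest_centroid_accuracy(train, y_train, test, y_test, leaky))
        honest_scores.append(
            nearest_centroid_accuracy(train, y_train, test, y_test, honest))
    return sum(leaky_scores) / n_repeats, sum(honest_scores) / n_repeats


if __name__ == "__main__":
    leaky_acc, honest_acc = run_experiment()
    print(f"selection on all data:   {leaky_acc:.2f}")
    print(f"selection on train only: {honest_acc:.2f}")
```

The classifier is trained on the training half in both cases; the only difference is which rows the selection step was allowed to look at. The leaky variant should typically report a test accuracy well above 50% on data that contains no signal at all, while the honest variant should hover around chance.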
Lastly, what does the workflow of a data scientist look like? This is probably the least contentious area to define:
- Data acquisition - Although sometimes you have to scrape the data from the web etc., most of the time you are just given a #bigdata file.
- Data cleaning/munging (the hard part) - If you are blessed enough to obtain a relatively clean data set, then you are either a kaggler or really lucky, because an exorbitant amount of time is spent shaping up your features, removing redundant and noisy ones etc. All this is only exacerbated by the fact that this step is by far the most tedious.
- Data modelling - Now that you've shaped up your data, you can finally show off your machine learning and statistics skills to amaze everyone with the predictive power at your disposal. Turns out, the simple models usually work well enough, and no one likes the black-box random forests and neural nets, since you don't know how your model arrived at the answer. Well, at least you used a library and didn't code your own algorithms.
- Data presentation - Whatever insights you may have gleaned through sweat and tears, it's the presentation that matters a lot. You need to showcase your results and your effort and amaze everyone with the power of machine learning that enamored you in college. Don't forget to use ggplot for some beautiful bars and graphs (no pie charts!) that show how much profit this will bring.
So there you have it, yet another list of competences a data scientist is supposed to have, but we are probably no closer to finding the pithy definition. Still, it’s probably useful to be versed in these categories. Whatever I’ve omitted is probably either inapplicable to most data scientists, or is easy and quick to learn (e.g. plotting and visualization).