Hacker News
Data Science Is America’s Hottest Job (bloomberg.com)
237 points by dsgerard on May 18, 2018 | hide | past | favorite | 211 comments


I recently quit a job and spent a few months looking for a data science/machine learning gig. I was surprised by just how gatekept these positions were. Everyone wanted a PhD or a master's degree. I have 15 years of experience in software. I've done everything from low-level game programming and graphics programming to web development to AI (but not under that specific position title).

The gatekeeping in this field surprised me because in my study of data science and machine learning I did not find the practical use of these techniques that hard. The math isn't even that hard, even if you had to implement these algorithms from scratch. It's just linear algebra and calculus, which anyone with an engineering degree is going to at least have exposure to. I couldn't get the time of day from anyone. Not even a call back to prove that I knew, or could learn, what was needed to be effective. It was incredibly frustrating and disappointing.

Data science / machine learning is not that hard, and you are turning away good candidates for bullshit reasons. Stop it. At least bring them in and talk to them. Jesus.


When my company posted a data science job we received something like 300 applications on the first day. Looking through them, most were people with no experience beyond a fresh degree or bootcamp, plus all the other people like you who want to change careers from software, with portfolios of data science projects.

The field is absolutely saturated with people who want to be a data scientist but have no experience. This is where some of that gate keeping comes from.

The people who have it easy are the ones with an MS or PhD and years of experience doing data science work at companies under their belt. There are very few of these people right now.

There is this idea that data science is needed everywhere and there is a HUGE supply of jobs. As an example if you search for data scientist jobs at glassdoor.com in San Francisco there are ~2000 jobs. If you search for software engineer in San Francisco there are ~9000. Similar ratios can be found in any major tech city. Data science does not scale like software engineering in companies but the narrative out there is that this is the job to be in and there is this huge unmet need. It is all hype.


> The field is absolutely saturated with people who want to be a data scientist but have no experience. This is where some of that gate keeping comes from.

What are the data science "gotchas?" A lot of people can pick up basic programming in a weekend, but they wouldn't necessarily know what they don't know and might well get deeply mired in problems with concurrency or algorithmic complexity.

It's such "gotchas" which justify gatekeeping. Otherwise, gatekeeping is just unproductive manipulation of the market.


There are largely three branches of data science jobs, each with their own typical gotchas.

1) Data engineering. I suck at this, don't ask me.

2) Inference. One big gotcha is often of the form of not accounting for all the sources of variation in your estimator and thinking you have something when you don't (often coming from unaccounted sources of correlation in time or space or repeated measures). Another is that correlation isn't causation. This pops up in surprising ways. Or things not being as independent as you thought.

3) Prediction/classification. Gotchas take as many forms as the things you look at, but the birds eye view is that you apply a method and it works ok, but either not well enough, or you then try it in the real world and it doesn't generalize as well as it did on your test set. The ways models break down depend heavily on the model and the data, so the way to diagnose and fix the issue depends on both understanding your toolkit really well and understanding the context of the data (business logic, etc). Another gotcha is in understanding uncertainties of your predictions. If I predict that this word is a noun, how sure am I of that? Many beginners skip those kinds of questions, but don't realize it.
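To make the time-correlation gotcha in (2) concrete, here's a toy sketch (all numbers made up): two series whose fluctuations are completely independent can still look strongly correlated just because they share a trend.

```python
import math
import random

random.seed(0)

# Two series driven by independent noise, but sharing the same upward
# trend (think: two business metrics that both simply grow over time).
t = list(range(100))
x = [ti + random.uniform(-0.5, 0.5) for ti in t]
y = [ti + random.uniform(-0.5, 0.5) for ti in t]

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    sa = math.sqrt(sum((ai - ma) ** 2 for ai in a))
    sb = math.sqrt(sum((bi - mb) ** 2 for bi in b))
    return cov / (sa * sb)

# The raw series look almost perfectly "related"...
print(round(pearson(x, y), 3))

# ...but once the shared trend is removed by differencing, the
# apparent relationship largely disappears.
dx = [x[i + 1] - x[i] for i in range(len(x) - 1)]
dy = [y[i + 1] - y[i] for i in range(len(y) - 1)]
print(round(pearson(dx, dy), 3))
```

An estimator built on the raw correlation here would "find something" that isn't there, which is exactly the failure mode described above.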

I'm a data scientist with (barely) a bachelor's in physics working with mostly PhDs and, while the academic-degree-based gatekeeping is bad and frustrates the shit out of me, I get why it's there. The investment to learn the basics is dwarfed by the investment to be able to flexibly apply the right things at the right times and tweak/fix them as appropriate.
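To put the "how sure am I" gotcha from (3) in toy form, a sketch where the scores and the noun/not-noun task are invented: a thresholded prediction hides how uncertain the model actually is.

```python
import math

def sigmoid(z):
    # Map a raw model score to a probability.
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical linear scores from some trained tagger for three words.
scores = [4.0, 0.2, -3.0]

for s in scores:
    p = sigmoid(s)
    label = "noun" if p >= 0.5 else "not-noun"
    print(f"score={s:+.1f}  P(noun)={p:.2f}  predicted: {label}")

# The middle case is predicted "noun" just like the first one, but at
# P(noun) ~ 0.55 it is barely better than a coin flip. Reporting only
# the argmax label throws that information away.
```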


I mean, in my graduate ML classes the first "homework" assignment was always just a problem set of probability and distribution questions. They were simple questions, but you'd be surprised how many people dropped the class after that first assignment.

There are a lot of people out there who have memorized how to implement k-means and PCA but would absolutely struggle to explain what they're actually doing or interpret the results in a meaningful way. A HUGE part of being a data scientist is presenting information in a useful way. That's why PhDs are favored: with their experience writing grants to fund their research, they're exactly the type of people who can take a naive problem, work it to a result, and then sit in front of a boardroom of non-technical people and explain why their result was worth the money that was given to them.
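As a concrete instance of the implement-vs-interpret gap (made-up data, using the closed-form eigenvalues of a 2x2 covariance matrix rather than a library call): running PCA is the mechanical part; saying what the explained-variance ratio means is the part people struggle with.

```python
import math

# Made-up 2-D dataset with strongly correlated columns.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.0]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Covariance matrix [[a, b], [b, c]].
a = sum((x - mx) ** 2 for x in xs) / n
c = sum((y - my) ** 2 for y in ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# For two dimensions, PCA reduces to the eigenvalues of a symmetric
# 2x2 matrix, which have a closed form.
d = math.sqrt((a - c) ** 2 + 4 * b ** 2)
l1, l2 = (a + c + d) / 2, (a + c - d) / 2

# The interpretation step: the first principal component explains
# l1 / (l1 + l2) of the total variance - i.e., these two columns are
# nearly redundant, which is the finding actually worth presenting.
print(round(l1 / (l1 + l2), 3))
```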


> That's why PhDs are favored because with their experience having to write grants to get funding for their research they're exactly the type of people that can take a naive problem, work it to a result, and then sit in front of a board room of non-technical people and explain why their result was worth the money that was given to them.

My wife has a PhD in comparative lit, but she now works in banking. Her PhD gave her a superpower: Reading. She can quickly absorb large amounts of text with abstruse, complex, and subtle distinctions and tell you very nitpicky things about it. She thinks financial/banking regulations are light reading in comparison to the stuff she waded through to get her PhD. (It's also quite surprising: the number of C-level people in banking who have little patience for reading. She's won a number of boardroom battles because she has actually read things.)


I think training in literary analysis is a secret superpower for life. It gives people tools to make reliable inferences from text about motivations, assumptions, biases, etc., and to spot attempts to use phrasing to hide or obscure things. Great for reading emails, contracts, reports, etc.--and for editing your own writing for clarity (or not, if that's what you're going for...).


> I think training in literary analysis is a secret superpower for life

What exactly do you mean by literary analysis here? I have (in my opinion) extremely good reading comprehension skills, in that I can read and understand the literal meaning of almost any text (provided I understand the context), and I got an 800 on the critical reading section of the SAT. On the other hand, I can't for the life of me read a book and pick out any of the major themes without having them spoon-fed to me. I was always terrified when I was expected to have my own opinion about a text to use as the topic for a paper.


The surface area of "gotchas" in data science is much larger than in software engineering, because you have all the "gotchas" from programming/software engineering AND all the new "gotchas" from data science.

For data science the big one is overfitting, which everyone talks about, but which can happen in really insidious ways in production. You have to be very disciplined and careful with the data to prevent it.

Another big one is productionizing data science, which in my opinion most data scientists don't have a ton of experience with.

The actual training-the-models part of data science isn't that hard; it's making it work with the crappy data that exists in the real world, and putting it into production, that are the really hard parts.
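A toy sketch of how badly memorization can masquerade as learning (random data, so there is genuinely nothing to learn): a 1-nearest-neighbor "model" looks perfect on its training data and is a coin flip on anything new.

```python
import random

random.seed(1)

# Labels are pure noise: no feature genuinely predicts them.
data = [(random.random(), random.choice([0, 1])) for _ in range(200)]
train, test = data[:100], data[100:]

def predict(x):
    # A 1-nearest-neighbor "model": pure memorization of the training set.
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

train_acc = sum(predict(x) == y for x, y in train) / len(train)
test_acc = sum(predict(x) == y for x, y in test) / len(test)

print(train_acc)  # 1.0: it "learned" the noise perfectly
print(test_acc)   # roughly 0.5: a coin flip on unseen data
```

The discipline comes in because a held-out evaluation only catches this if the held-out data is genuinely held out; leak it anywhere into the training process and the test number lies too.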


I think the obvious "gotchas" are problem definition (am I formulating the problem in a way that will allow me to create value? A concrete example: am I modeling churn correctly?), overfitting, target leaks, and model troubleshooting/improvement (i.e. the model is doing OK, but can it do better? How much better? How do we get there? Remembering that small performance gains can mean big $ at scale). On the reporting side: how confident am I that what I'm reporting is real? This is where the "science" training is helpful. Programming experience is relevant in the sense that implementation is important, i.e. it's far too easy to introduce critical target-leak bugs when engineering features.
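A minimal, hypothetical illustration of a target leak of the kind described above (field names invented): a feature that is only populated after the outcome happens "predicts" churn perfectly on historical data and is worthless at prediction time.

```python
# Hypothetical churn data: "days_since_cancellation" is only populated
# for customers who already churned, so it quietly encodes the label.
customers = [
    {"usage": 10, "days_since_cancellation": 30,   "churned": 1},
    {"usage": 50, "days_since_cancellation": None, "churned": 0},
    {"usage": 12, "days_since_cancellation": 5,    "churned": 1},
    {"usage": 45, "days_since_cancellation": None, "churned": 0},
]

def leaky_predict(c):
    # "Model" built on the leaky feature.
    return 1 if c["days_since_cancellation"] is not None else 0

accuracy = sum(leaky_predict(c) == c["churned"] for c in customers) / len(customers)
print(accuracy)  # 1.0 on historical data

# But for a *current* customer the field is always None (they haven't
# cancelled yet), so in production the model predicts 0 for everyone.
print(leaky_predict({"usage": 11, "days_since_cancellation": None}))  # 0
```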

Of course we can abstract the root argument; for a given job, among those qualified to fill that job, there exists at least one person who has auto-learned the skills required to perform the job. This is probably true.


I was recently backfilling a data science position on my team and experienced the same thing. I was getting on average 5 new resumes a day, and for the most part it was individuals who were currently in a data science MS program somewhere with no experience.


And you thought, "what a great opportunity to hire someone who has some fundamentals and a lot of potential for growth!", and hired the one that seemed most promising?

If I'm reading between the lines of your comment correctly, I don't think that's what you mean, I think you mean that this was a problematic experience. If that's right, I don't really understand that reaction: this is a nascent field, the ready-made experienced work-force is expected to be much smaller than the demand, with most positions filled by new entrants gaining experience rather than folks who already have it. This is the fundamental challenge and simultaneously the great opportunity of running a business in a nascent field!


> The people who have it easy are the ones with an MS or PhD and years of experience doing data science work at companies under their belt. There are very few of these people right now.

Are there enough of those people to meet the demand, even recognizing that it is smaller than the overall software engineering demand? If not, it seems like the parent's point still holds.

Edit: But thanks for the perspective on the size of the market and level of hype, that's a useful data point.


Every data scientist started out with no experience. This gatekeeping will have to come down if the industry actually wants data scientists who have experience.


Of the skills necessary to succeed in data science, the ability to program is actually the least important. Perhaps you were over-emphasising your software skills at the expense of really demonstrating proper understanding of the science of data analysis. A data science position is not going to require your experience in low-level game programming, for example. In fact, being able to program at all is secondary, since many can pick up some basic skills in languages like Octave, R or even Python to support their mastery of the math. And it's not just math implementation but also a genuine scientist's eye for how to approach a problem. There are no cookie-cutter formulas you can throw at a non-trivial problem: you have to really understand the field and the art of analysis to do it well.


This right here is a perfect example of why this position is gatekept so much. Having a PhD doesn't automatically mean you think like a scientist or a mathematician, and having a bachelor's degree and 15 years of experience writing code, then putting in the effort to read Elements of Statistical Learning, completing several online courses in machine learning and data science, and studying probability, combinatorics, and graph theory, does not mean you don't have a scientific mindset. A scientific mindset can be learned. You are not special because you have a PhD and I don't. The only difference between you and me is that I had the ability to learn for free what you paid for. But when you go "oh hah, he's just a coder and only has a bachelor's degree" and you won't even call me to talk to me, that means you are missing out, and you are gatekeeping. You are not special and you are not smarter than me. You've just read a book I haven't and written a white paper. I can read that book too. I can write a paper too.

I can do everything you can do.


This comment reads as full of hubris. It is easy to say you can do what someone else does, but the fact is you haven't done it, don't have the evidence to show that you can, and are exasperated that others don't see in you what you see in yourself. Having a PhD in a relevant field is an objective signal - not proof - of capability.

When these jobs are hot and candidates are plentiful, using signals to narrow the field to a group you can more rigorously interview is typically a more effective use of time than buying into everyone's self-belief. Candidly, I find most individuals from a programming background vastly underestimate the skillset required in this space. I know that is not an uncommon perception and you are likely being penalized for it, fairly or unfairly. I'll say this: anyone who refers to data science as "just linear algebra and calculus" would be immediately removed from any candidate pool I was managing.

Others have evidence of capability; you do not. Programming experience is not evidence enough to elevate you above candidates with more reliable and relevant credentials. A shelf full of books is not evidence either. You either need to find a version of this job created by people that don't really know what it is they want (hint: if the data science JD says "Excel," that's an indicator; it's not too uncommon) to create a work history, find a way to create a portfolio that you can use as evidence (e.g., Kaggle competitions, hobbyist projects with available datasets), or network with others in the industry and academia such that they will vouch for you.


This is all very sensible as long as it's a buyers' market. What I get frustrated by is a sellers' market where the buyers gripe about how there are not enough qualified candidates. At that point, employers should be figuring out how to identify and cultivate inexperienced potential. I thought I had sensed some of this "there aren't enough qualified candidates!" hype in data science (for instance, that's the vibe I got from the article we're discussing), but the first-hand reports from people on this thread are making me think that probably isn't the case; that is, that people in the field feel they can be very choosy and still fill their roles. If so, it seems to me that this is all working as intended.


I don't think most organizations possess the ability to cultivate these skills. I know mine doesn't, and from the outside-looking-in we should be more likely than most.


> I can do everything you can do.

Sorry but how would you even know?

I understand your frustrations well since I don't have a BS.

Still, as much as you seem to want to talk about how capable you are, you can't seem to understand the perspective of employers. Given what you've said about your history I suspect this is not due to a lack of intelligence but empathy.

Employers have to go through many candidates, each of which has some true capability but of which the employer can only see some signals. Signals have varying degrees of quality, and interviewing candidates costs time and money. That being the case, it is only natural that they try to use the strongest signals they have.

Nobody believes that there are not capable people who do not have an MS or PhD as you seem to be suggesting. The reality is that the proportion of people who have a BS and can do the job is much less than the number with MS or PhDs, and so it's one of the more effective filters they have at their limited disposal.

I'm sure companies are not happy about skipping great candidates like yourself, but they have not figured out a way to do so that is scalable and cost efficient. It's a difficult problem but maybe you can figure it out.

> You are not special because you have a PhD and I don’t. The only difference between you and me is that I had the ability to learn for free what you paid for.

How much do you even know about PhD programs? It seems like not much because PhD candidates, at least in the US, get paid. It's not much but they certainly are not paying for their education.

I apologize for this unsolicited advice but your lack of humility is frankly very off putting. You sound like a very hard working person. PhD programs are extremely difficult to both be admitted into and to finish. I'd expect that you would respect others like yourself who are very hard working.


I commend you for writing this levelheaded reply. I was about to write one with essentially the same arguments but a much more aggressive tone.


I think you are seriously underestimating the value of first-hand experience conducting scientific research. It beats into you a mindset you can't get anywhere else. There are no books or manuals or docs to read to master it. There's only the scientific method. You start with a question, conduct experiments, analyze results, adjust your hypothesis, and repeat the process. You get very comfortable with saying "I don't know". You're right at the border of the unknown, trying to navigate further.

Companies are using a PhD as a proxy for having research experience because it's the only qualification like it out there. It's a poor proxy because not all PhDs are created equal.


> You start with a question, conduct experiments, analyze results, adjust your hypothesis, and repeat the process.

This is missing my pet step: doing the literature review.

I’m pretty ambidextrous when it comes to Python and R, so I’m not typically a combatant in the data science language flamewars.

But... for as much as the Python community likes to assert their superior coding chops, I’ve observed that the R community does a much better job of reading about prior art.


> This is missing my pet step: doing the literature review.

One of the earliest, most important, and most useful lessons I learned from a senior grad student: "a day in the library can be worth a week at the bench."


Or this: you spend 4 months on a project and belatedly notice Schmidhuber proved a much better result in 1985 and yells at you and you get kicked out of academia


You don't get kicked out of academia for that... Well, maybe if you're a student and you lose your funding. It's really embarrassing and I'm sure frustrating too. I've reviewed a couple of papers where they didn't do their literature review, there was lots of obvious (to people who read) prior work, and I've had to straight reject their papers. I feel bad for them, but do your literature review, folks!


I was kidding!!! But people do take literature review seriously (which on the balance is good)


What sorts of positions are we actually talking about here? How many companies actually need a research lab? I don't think anyone contests that a PhD is very useful if your job is to actually write and publish papers, attend conferences, present at conferences, etc. But many data scientists and machine learning engineers working in industry don't do any of that, and many companies have no need to have anyone on staff doing any of that either.


Of course it's possible that there are competent data scientists without graduate degrees; the parent was simply suggesting possible reasons why you might have been passed over for those positions, such as positioning your previous positions as more engineering-focused and less research-focused. I doubt the intent was to "gatekeep" you, but to give you suggestions on how to improve your search in the future.

This is a grossly defensive overreaction to the parent reply.


I don't think anyone doubts that the skills can be learned outside of school or fancy degrees.

The problem is that there are hundreds of applicants in your situation WITHOUT experience. There are usually a couple of PhD or MS applicants WITH experience for every job. Who do you think the company would give preference to?


You are input to a classification routine. The job of those designing the routine is to reduce the false positive rate to as close to 0 as they can without incurring too high a false negative rate.

So this sucks if you don't fit the model well, in a way that often has you end up as a false negative - but that doesn't mean the model is broken.


No. The model has a problem with overfitting. That’s my point.


Saying it doesn't make it true.

You are claiming there is a generalization problem that causes extra error in practice. Another perfectly viable hypothesis is that the classifier is working fine; it's just tuned for a low false positive rate and accepts a higher false negative rate to get it. Specificity vs. sensitivity is a fundamental trade-off, not a training issue (though training can make both worse).
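The trade-off in a toy sketch (made-up candidate scores): the same classifier, with a stricter threshold, swaps false positives for false negatives. Neither setting is "broken"; they're tuned for different costs.

```python
# Made-up scores for truly qualified (pos) and unqualified (neg) candidates.
pos = [0.9, 0.8, 0.7, 0.55, 0.4]
neg = [0.6, 0.45, 0.3, 0.2, 0.1]

def rates(threshold):
    fn = sum(s < threshold for s in pos) / len(pos)   # good candidates rejected
    fp = sum(s >= threshold for s in neg) / len(neg)  # bad candidates accepted
    return fp, fn

# A lenient threshold misses nobody good but lets bad candidates through;
# a strict one does the reverse. Same classifier, different tuning.
for t in (0.35, 0.65):
    fp, fn = rates(t)
    print(f"threshold={t}: false positive rate={fp:.1f}, false negative rate={fn:.1f}")
```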


I feel you bro. I've been there and done that. The thing is that they are completely happy losing out on the few good candidates mixed in among the whole pool of non-PhDs (who are 85%+ posers) when they can select among the PhDs (who are 85%+ legit). It's a simple risk-aversion tactic among companies and has nada to do with you personally. Like the man says, "nothin' personal, it's just business".


I'm sure you're extremely smart and dedicated -- I've only dreamed of putting in that much work -- but have you considered your mindset might be holding you back? Would you hire someone who had never worked in low-level game programming and didn't have a bachelors but told you "I can do everything you can do"?

If you're serious about getting a PhD-level job without a PhD, getting someone to recommend and vouch for you is even more important than usual. Since you're up to date with papers, why not email researchers you admire with questions that demonstrate you deeply understand their work? Many will be too busy, but some will probably be impressed by your determination. Once you have a relationship, see if you can assist with their research, even if initially it's just grunt work. It will take time, but integrating yourself into the academic "web of trust" and maybe getting your name on some papers is the only plausible way you can expect a company that doesn't know you to take you seriously.


Data science is not a PhD-level job. PhDs may be well suited for it, certainly, but 80% of data science can be done by people with a bachelor's degree in stats or a math-heavy science.

There's always a domain specificity that sometimes comes from grad school, but data is data and industries are filled with SMEs who understand the domain.

Source: I hire data scientists for Fortune 500 companies.


>You are not special because you have a PhD and I don’t.

>I can do everything you can do.

To demonstrate that to an employer wanting PhD workers, go get a PhD like the other PhDs did. Claiming you can do what they do when you haven't done what they have done is not going to cut it.

>you won’t even call me to talk to me that means you are missing out, and you are gatekeeping

Gatekeeping = not spending unnecessary money and not wasting unnecessary time.

An employer saves significant money and time by not having to interview everyone claiming they can do what PhD can do but didn't bother to get one. Your skilled workers don't have to stop producing and do interviews, your HR people don't need to spend time and money booking flights, hotels, and such for candidates. You don't have to work through 500 resumes with 30 PhDs in the pool - you sift through 30 resumes.


If you're in NYC, Chicago, or Boston we'll give you a shot! Email me sandy.vanderbleek@publicismedia.com. Thanks.


Every job is gatekeeping. You chose a path in life and it wasn't data analysis. Why would anyone interview you when there are hundreds of other people who have the actual skill-set they are looking for?


Sure. Unjust as it may seem though, requiring a PhD in science is a free and simple way for employers to narrow down the field of candidates to those who are good at data science. It's the data science equivalent of requiring a CS degree when looking for a programmer. It's natural for employers to use that tool.


But if data scientist is the sexiest job of the 21st century, and everyone's going to need one or more, there won't be enough PhDs to go around. Someone's going to have to "take a chance" on the lesser-degreed or risk falling behind competitors that have data scientists of some effectiveness.


There are not as many data scientist jobs as the hype makes it seem. Maybe this will change in the future, but right now there is no shortage of qualified candidates severe enough to justify taking chances on lesser-degreed candidates.

All this hype makes it so everyone wants to be a data scientist. You get people who change careers to go into this new hot career. You also have a pool of people who have been working with data well before the hype with experience. The people trying to break into data science will have a very hard time competing with the people with experience over the pool of jobs out there.


What value does an advanced degree, in your opinion, bring to a data science position then? To hear you say it, it brings no value.


What value does a bachelor's degree have if some kid can just learn Ruby and JavaScript on his own and go make $100k/yr at some startup? Of course a degree has value, but it is possible to provide value without one. This is especially true in 2018, when anyone can learn pretty much anything online, and download whatever papers they want and read them. Your mistake is in thinking that somehow a degree makes you special. It doesn't. It just means you paid for a head start. Anyone can learn anything you already know and surpass your skill, whether or not you have a piece of paper certifying your knowledge.


That's a pretty ignorant statement. I'd wager that there aren't many people who could or would put the same effort into self-study that would be required to pass a CS degree with decent grades. They also would probably not learn a lot of the stuff that isn't interesting to them, while CS students have no choice but to go through the material. For an employer this also tells a story about who is taking the easy route vs. working through a complete program over the span of years. Obviously there are exceptions, but when there are a lot of applicants for a position, it's just an easy filter for employers.


You assume that everyone can teach themselves with the same efficiency and effectiveness as quality guided instruction from real teachers they can directly interact with. Some can, but most cannot. Education is not a waste.


Prestige and artificially pumping salaries. It’s all about brand building.

I don’t have an advanced degree, just a bachelor’s in math from 1995, and I have been breadboarding (and more) and coding since the 80s.

I can follow along with ML and have implemented toys with the ML algorithms in a couple days.

It’s bourgeois intellectualism. Like a law firm only hiring from Harvard

ML is automated schema design. And the current methodology has known limits of applicability

This is “Mongo DB”, “devops” like hype all over again.


Well, to be more precise, it's like a law firm hiring only those with law degrees.


Just FYI many (most?) people with STEM PhDs don't pay for it, but in fact get a stipend.


To be fair, I didn't pay for my PhD, but I gave up the opportunity to earn more than a grad student's stipend for six years.


You know PhD candidates have their tuition paid for and get a stipend. They don't pay for it the way you seem to think.


Actually no, a lot of PhDs are a bit smarter than you; some PhDs are a lot smarter than you.


> the ability to program is actually the least important

The days of a researcher producing a model to be re-implemented for production by a programmer are over, or very nearly so. A working data scientist now is expected to produce something that can run in production. That’s something a PhD doesn’t teach and that many PhDs find an uphill struggle.


What would be examples of such production-ready models? Things written using TensorFlow et al?


If you're being paid to research a predictor or a classifier, then as soon as you have a model demonstrably better than what's running in prod, you should be able to just drop it straight into the CI/CD pipeline and off it goes to make money for your employer - or for you. No more chucking a prototype over the fence to a programming team to rewrite in C++ or Java to make it "prod grade".

Right now doing this is somewhere between “state of the art” and “new normal” depending on where you sit.
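A minimal sketch of the contract that makes this possible (the artifact format and numbers are invented; a real setup would use a model registry and a serialization format suited to the model class): the research side exports a versioned artifact, and the serving side consumes only that artifact.

```python
import json
import math

# "Research" side: export a versioned artifact after an offline training run.
artifact = {"weights": [1.5, -2.0], "bias": 0.3, "version": "2018-05-18"}
blob = json.dumps(artifact)

# "Prod" side: the serving code knows only the artifact contract, so a
# better model is just a new blob dropped into the same pipeline.
model = json.loads(blob)

def predict(features):
    z = model["bias"] + sum(w * f for w, f in zip(model["weights"], features))
    return 1.0 / (1.0 + math.exp(-z))  # probability

print(round(predict([1.0, 0.5]), 2))  # 0.69
```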


This answer belies the reality that the knowledge needed to conduct actual data science, short of genuine R&D (which is itself at best 10% of data science positions, if that), can be acquired by a sufficiently intelligent person within 6-12 months of applied training.

We have developed tools to deliberately abstract away the complexity of the underlying math, and they work well. They work exceedingly well. I once took a semester-long class in data science where untrained, mathphobe business students were running various kinds of regression models on cleaned up data in WEKA (poor choice of software, yes) by the end of it.

Most data scientists can and should treat the algorithms themselves as black boxes, the same way software engineers treat red-black trees as black boxes, without necessarily having to know how they work off the top of their heads.

I once worked with a Stanford "data scientist" at a top tech company who couldn't immediately recall Bayes' Rule. He wasn't stupid. His knowledge was just structured in a way that reflected the reality of his day-to-day; this is how it works for all of us.
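For what it's worth, the Bayes' Rule point in one line of arithmetic (hypothetical numbers): even an accurate test for a rare condition yields mostly false positives, which is exactly the base-rate reasoning the day-to-day job needs more than formula recall.

```python
prior = 0.01        # P(condition): 1% of the population
sensitivity = 0.99  # P(test positive | condition)
specificity = 0.95  # P(test negative | no condition)

# Bayes' Rule: P(condition | test positive)
p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
posterior = sensitivity * prior / p_positive
print(round(posterior, 3))  # 0.167: most positives are false alarms
```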

The primary skills needed are: munging, featurization, analysis (basic stats and then a few other things like ROC, etc), and perhaps most importantly (and the thing I see PhDs in particular chronically fail at) operationalization. You do not need to know heavy math to run a model over data, which is the maximum level of sophistication required for most applications of data science that generate real business value.
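Of the analysis items listed here, ROC AUC is a good example of how far "simple math" goes; a sketch with made-up scores, using the rank interpretation of AUC (the probability that a random positive outranks a random negative):

```python
# AUC via its rank interpretation: the probability that a randomly
# chosen positive example scores higher than a randomly chosen negative.
pos = [0.9, 0.7, 0.55, 0.3]  # model scores for true positives (made up)
neg = [0.6, 0.4, 0.2, 0.1]   # model scores for true negatives (made up)

wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
print(auc)  # 0.8125
```

Nothing here is beyond first-year probability, which is the point: the value comes from knowing when and why to look at this number, not from heavy math.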

I see similar foolishness in data science as I do blockchain, to be quite honest: people hype up and gravitate towards the cutting edge, while forgetting, ignoring, or blatantly obscuring the power of simple math on big data. I guess it's important that people have an inflated view of the complexity of what most data scientists are doing because having the role at least somewhat cloaked in mystique boosts salaries in the long run.

Any argument that a reasonably intelligent person can't be trained to be a legitimately effective data scientist is a counterproductive lie.


Judging from the replication crisis in many fields, maybe getting a PhD doesn't teach the scientific mindset as much as we'd hoped.


When a man's salary depends on him doing his job poorly, is it really all that surprising that he will cut corners?

If you don't reward (By issuing grants for) negative results, or null results, or replicating prior studies, why do you expect that scientists will aim for any of those outcomes?


It seems to me since we don’t yet know the truth around large swaths of reality (dark matter and such topics), a replication crisis is exactly what we should expect

You have to eliminate a lot of possibilities in a universe with this much detail

The real issue, again IMO, is more of an “expectation crisis”. We expected to repeat this and failed to. Because we’re still a ridiculously ignorant species lacking conscious awareness of many aspects of reality


This is all true, yet I think I would say the same thing about implementing most software projects in the modern world. At the top of most projects there is a head architect whose time coding is bordering on counterproductive. Then there are layers where only the middle is primarily concerned with programming well, followed by ones who think more about oddities and integration with the environment it will run in.

Is Data Science really fundamentally different? Or is this PhD who barely programs going to either do tasks like cleaning up terabytes of data, or risk that a coder with no idea will introduce a bias into the data during that process?

I find the whole emergence of the field fascinating, but I kind of feel like it is just tech recreating actuaries with what is actually a less specific education.

+ (And a worse academic style career path of going through an education you may never get to use instead of going up from apprentice to master)


In my experience working as a data scientist in the health tech field for the past decade (as an "epidemiologist" and "health informaticist" before "data scientist" was a title), I've found folks who just pick up data science from an engineering background often have a simplistic view of working with data generated from human subjects.

My training heavily stressed bias and confounding, study design, problem specification, validation, and interpretation of results. I've seen a lot of software engineers dabbling in machine learning jump to training a deep learning model for a problem where a regex would suffice. I've also seen multiple people build models that reflect the data collection instead of the biologic/medical process, and present them without even realizing how wrong their results are. The problem in these cases is often that their "objective" measures of performance (e.g., precision, recall, accuracy, AUC, whatever) look pretty good, but they don't see that it's because the model AND the data are both heavily influenced by this larger problem. For example, is an increase in complexity of a particular disease due to people actually being sicker, or because some payer changed a reimbursement program so now billing departments are using higher-acuity diagnosis codes for their patients?
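That reimbursement example can be made concrete with a toy simulation (all numbers invented): true severity is flat, but a billing-practice change mid-series makes a naive trend analysis report that patients are getting sicker.

```python
import random

random.seed(42)
years = list(range(2012, 2020))
recorded = []
for yr in years:
    true_severity = 2.0                    # constant: nobody is actually sicker
    upcoding = 0.8 if yr >= 2016 else 0.0  # payer changes reimbursement rules
    mean_acuity = sum(true_severity + upcoding + random.gauss(0, 0.1)
                      for _ in range(500)) / 500
    recorded.append(mean_acuity)

# Naive OLS trend on the recorded acuity codes:
mx = sum(years) / len(years)
my = sum(recorded) / len(recorded)
slope = (sum((x - mx) * (y - my) for x, y in zip(years, recorded))
         / sum((x - mx) ** 2 for x in years))
print(f"acuity trend: {slope:+.3f} per year")  # positive, despite flat severity
```

The model's fit statistics look fine; only domain knowledge tells you the "trend" is an artifact of the data collection.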

That said, the best engineer I've ever met didn't have a college education. I know a bunch of awesome data scientists who have taken pretty circuitous journeys to their current career. So "PhD required" seems like it'll lead to a lot of false negatives. Acknowledging that, my main point is that doing data science in a meaningful and ethical way—particularly when it involves human subjects—requires a lot more thought than just being able to implement some machine learning algorithm.


Thank you for this. I couldn’t agree with you more. I see the exact same kind of risk aversion hiring with data scientists that I’ve seen in the past with engineers hiring other people without a specific engineering background because of the assumption that those people don’t have the right mindset or skills. My entire point is that this kind of hiring process weeds out potentially great candidates based on the false assumption that not having a PhD or some other hard to get qualification somehow makes you unqualified. I think this is the result of this “field” being relatively young in industry and it will change as it gets demystified over the next few years.


I've also seen multiple people build models that reflect the data collection instead of the biologic/medical process, and present them without even realizing how wrong their results are. The problem in these cases is often that their "objective" measures of performance (e.g., precision, recall, accuracy, AUC, whatever) look pretty good, but they don't see that it's because the model AND the data are both heavily influenced by this larger problem.

This is great. It's something I notice I have to mentor my junior data scientists and ML practitioners on with some regularity. Data science and machine learning aren't necessarily useful without some additional domain knowledge and discipline to recognize the need to view the data from many different perspectives and with many different relationships highlighted. Too often they see high P/R numbers and call the job done, or something similar.


I work as a Data Scientist, but my grad degree is in Political Economy. On the math/stats side I'm definitely weaker than some of my colleagues (and I appreciate their skill, and learning from them). But I often see people with much stronger math skills than myself make very critical errors when trying to make inference on data generated by humans.

I think a nice team can have some balance, with enough overlap that we can find common ground. I'd be uncomfortable without knowing I can rely on some of my peers for the heavier algorithmic or SDE stuff.


If your job applications emphasize the same material as this comment, it's no wonder you aren't getting callbacks. You say you've got a wide range of programming experience. Great, that's usually evidence of intelligence. You say you've done AI in some fashion. Also good.

Then you go off on a jag about how simple the math is and how easy to implement the algorithms are. Totally true, but this isn't what data scientists are paid for. Data science/machine learning positions are more about understanding the limitations and pathologies of the algorithms, the data, and their interactions. This isn't necessarily hard, but it can be -- and the pay tends to scale in proportion to the difficulty. Since theory doesn't provide much guidance -- you can learn everything that anyone knows in a year -- employers will necessarily prefer someone who can demonstrate practical experience with data. Selecting for PhDs is one way of filtering for that.

If you've got experience with data, emphasize that and you should get callbacks.


This is spot on. As a hiring manager for these types of positions, the primary thing I look for is a track record of using data to achieve business results. This of course creates a barrier to entering this field and makes those with this experience even more highly sought after.


I actually think that machine learning can be much harder than you are claiming here, and that your inexperience is showing when you make such claims. Sure, a toy problem from some dataset in a book or pulled off the internet is easy. So is implementing a k-means algorithm. But go to a large corporation with 100 datasets, each with 100 fields, each with significant seasonal bias (among other biases), and build something that lasts, works, is clean, and is better than what someone else can build. You need to convince them to trust you: maybe you can write an algorithm that works, but the business people can't understand that algorithm, so they have to take your word for it.


Often this comes down to the ability to communicate, work with people, and understand real-world business. In addition, often the simpler the learning method, the better. I actually find most business needs can be solved by straightforward analytical methods such as traditional statistics. Very few situations require actual machine learning, and it is often misapplied.

In fact, having a PhD does not tell you a whole lot about these skills.

More advanced modern techniques such as deep neural networks, reinforcement learning, etc. are extremely effective at certain niche problems, but these do not come up nearly as often in a business context.

This is why I don't advertise myself as a machine learning engineer. Rather, I am a business consultant who knows when, and when not, to utilize machine learning methods.


So true. Missing from university textbooks is the fact that real-world data is often dirty as hell: inconsistent, distorted, biased, incomplete, and/or just plain invalid (measuring the wrong source, or too imprecise to be useful).

Unless your DS group is big enough to warrant hiring data engineers / cleaners, as a data scientist it'll be your job not only to eventually choose the algorithm, but foremost, to confirm that the data is sufficient to serve the intended purpose of mining it, ideally before you waste a lot of time curating it or paying for a raw data dump you can't use.


Saying ML is "just linear algebra and calculus" is a gross simplification.

As a dev, you can pick up some popular ML frameworks and learn the basics relatively quickly. The difficulty in this field comes from the amount of theoretical knowledge you need to interpret your results.

All these data science bootcamps/learn quick schemes are like teaching a blind person to drive a racecar. He can work the pedals and the steering wheel - but has no idea where he's going.


Exactly this. Implementing algorithms that work on well-defined, clean data isn't difficult. Working with novel problems, where the right algorithm isn't clear or can't be directly implemented, is difficult. In many cases the cost of implementation is significant. You can't just apply a model to millions of data points and blindly expect to get anywhere without first fundamentally understanding the data you're working with. That's what separates data scientists from data engineers. The ability to take a novel problem and work it to a conclusion that's understandable from both a technical and a business point of view is exactly why a PhD is typically a barrier to entry: that's all academics in data science do all day!


Data science is like security: it's very easy to come up with something that looks really good but is horribly broken in subtle ways that you won't find until it's too late. I think there's a lot of value in having lots of people be literate at it, but part of that literacy is understanding that it's a hard problem and that it does need attention from dedicated specialists.


Along the same line, I find the hardest thing to get good at for junior data-sciency people is how to tell if your model sucks. Too easy to trick yourself into thinking you have a good model.

I think this is an essential, foundational skill for modern data science, not unlike some knowledge of R or Python, but an order of magnitude or two harder to learn. A lot of the difficulty is in lack of good educational material. I also think this is significantly more important than a deep understanding of various modeling approaches; you can learn that as you go, so long as you know if what you have now is "good enough" or if you need a fancier approach.

Unfortunately you can't just throw k-fold CV around, average some fit metrics, and call it a day.
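To illustrate (synthetic data, toy single-feature classifier, all parameters invented): the labels below are pure noise, yet because the "best" feature is selected using the full dataset before cross-validating, k-fold CV happily reports accuracy well above chance.

```python
import random

random.seed(0)
n, d, k = 40, 2000, 5
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [random.randint(0, 1) for _ in range(n)]   # labels carry no signal at all

def class_gap(col, rows):
    """Absolute difference of per-class feature means on the given rows."""
    g = {0: [], 1: []}
    for i in rows:
        g[y[i]].append(X[i][col])
    if not g[0] or not g[1]:
        return 0.0
    return abs(sum(g[1]) / len(g[1]) - sum(g[0]) / len(g[0]))

# THE LEAK: feature selection sees every row, including future test folds.
best = max(range(d), key=lambda c: class_gap(c, range(n)))

hits = 0
for fold in range(k):
    train = [i for i in range(n) if i % k != fold]
    test = [i for i in range(n) if i % k == fold]
    v0 = [X[i][best] for i in train if y[i] == 0]
    v1 = [X[i][best] for i in train if y[i] == 1]
    m0, m1 = sum(v0) / len(v0), sum(v1) / len(v1)
    for i in test:
        pred = 1 if abs(X[i][best] - m1) < abs(X[i][best] - m0) else 0
        hits += pred == y[i]

print(f"CV accuracy on pure noise: {hits / n:.2f}")
```

The fix is to nest the selection inside each training fold; the point is only that averaging fit metrics can't save you from leakage introduced upstream.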


I would certainly agree with you, but I don't think someone with a master's degree is necessarily an expert. What's important, IMO, is having practical experience in data science, which may or may not come with a graduate degree.


No, it takes a lot more than a master's degree. Even a PhD is only really entry-level. There's no substitute for practical experience and talking to someone very experienced. Shame there aren't a lot more very experienced statisticians and data scientists out there.


Shame there aren't more concert pianists out there... but given the leverage big data can give (a small percentage change in what a big organization does can still save or earn big bucks) this is a competitive position, just as playing concerts is only for the few.

You want a better employee than your competitors have. You'll take all the expertise you can get, plus the ability to invent new techniques as needed. Everybody wants the very best 'cause that last 1% of skill can still be worth a lot of money. Everybody wants someone who can hit a lot of long balls for their team, as it were. There might be a floor on the skill level you could make some use of, but there's no ceiling.


> it's very easy to come up with something that looks really good but is horribly broken in subtle ways that you won't find until it's too late.

My experience has been that this doesn't matter in a lot of cases. A product can be both highly useful and subtly broken.

One of the biggest boons to my productivity was realizing when something is good enough and to move on. You can waste so much time tuning for precision and recall.


Well - there _is_ a bit of Student's Paradox involved in requirements-gathering for machine learning or analytics, and it's not always apparent that you're using the right algorithm for the problem or what specific metrics you actually need to be optimizing. Most of the time when I've gone through User Story collection processes, the actual end state users describe is vague, not based on hypothesis testing, and not supportable by the data available on hand. A big part of this process is discovering what the data _can_ tell you, and if necessary rewriting the technical requirements entirely to align with the actual business need.

Even when a customer's given me a "clean" dataset, I've had to write 400+ lines of code to do the feature engineering on a relatively straightforward logistic regression. Then there are all the other times when a customer asks me to deploy one type of algorithm, and their business problem is actually solved by an entirely different class of algorithm.

Zayd over at Stanford has a nice blog post [1] describing why machine learning involves several more dimensions of complexity than traditional software development. There _is_ a specific set of data-first skills that is complemented by dev and CS experience, but a fundamental reason ML projects fail is a lack of appreciation for the many different skillsets needed to succeed.

[1] http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.ht...


Your article eloquently sums up my thoughts on the matter. Data science is like software engineering, but with bugs whose effects tend to surface far from their cause. You don't get a clean, orderly traceback; instead you get vague symptoms, where the model behaves "weird" for no obvious reason.

I've found encapsulating the data preprocessing steps in pure functions helps ensure that the data cleaning can easily/quickly be debugged. When it comes to the actual model, there is no substitute for thoroughly understanding the characteristics of your dataset. Finally when it comes to model selection, a good scoring metric is absolutely necessary; this is entirely dependent on what you're actually trying to accomplish with the model. So there is little universal advice.
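A small sketch of what I mean by pure preprocessing functions (the field names are hypothetical): each step takes data in and returns new data out, so each can be unit-tested in isolation and debugged without touching the model.

```python
def parse_age(record):
    """Pure: returns a new record, never mutates its input."""
    out = dict(record)
    try:
        out["age"] = int(record["age"])
    except (KeyError, TypeError, ValueError):
        out["age"] = None   # make missingness explicit instead of crashing later
    return out

def drop_impossible(records):
    """Keep unknown ages, drop physically implausible ones."""
    return [r for r in records if r["age"] is None or 0 <= r["age"] <= 120]

raw = [{"age": "34"}, {"age": "unknown"}, {"age": "999"}]
clean = drop_impossible([parse_age(r) for r in raw])
print(clean)  # [{'age': 34}, {'age': None}]
```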

When it comes to the long iteration cycle, the only bandaid I've been able to find is a solid test suite, and thorough code review. This makes it less likely you'll introduce unintended problems. Basically you have to move slow and deliberately, instead of "move fast and break things."

Far from impossible, but certainly more difficult than "traditional" software engineering. It's like transitioning from debugging interpreter tracebacks generated by a simple toy script, to debugging a >40k LOC application written in a dynamic language, which happens to be intermittently segfaulting in production.


In real life, problems are hard and require theoretical investigation to do well.

That's the actual job! The theoretical investigation, not the programming. The programming is really easy, which is part of why having 15 years in software doesn't carry much weight.


How many data scientists really do theoretical investigation? From what I see not too many. It feels a little like looking only for devs who can write complete operating systems and then having them write CRUD apps.


That's basically the job - people who can get, transform and interpret the data correctly.


90% of data science time is spent sourcing and then cleaning the data. 10% is spent preparing and delivering reports to stakeholders. The ML time is a rounding error.

The job in which you are magically given a pristine dataset and can just tweak ML models all day doesn’t exist.


The running assumption on HN seems to be that ML/data science jobs are easy because you can just "plug in" already implemented algorithms and just expect things to work. I think the barrier for most of these jobs is a proper statistics education--one that involves active exploration of real world data that's imperfect and has to be cleaned. Very rarely is there an obvious solution to big data problems and much of the work is tedious parameter tuning that absolutely does require knowledge of the math and domain and not just how to write k-Means.


I think the barrier for most of these jobs is a proper statistics education--one that involves active exploration of real world data that's imperfect and has to be cleaned.

It's a bit of both really. If the data science job is in the field of predictive maintenance for example, then a (relatively) simple model may be sufficient to add business value straight away, and the hard part is a deep understanding of the potential failure modes of the machinery you're predicting on and the kinds of sensors used to gather the raw data. There's no "one size fits all".

This is one of my favourite papers on the subject, written by one of the lecturers on the DS course I did: https://users.cs.duke.edu/~cynthia/docs/RudinETAL2010.pdf


This, 1000 times. I work in data warehousing and Business Intelligence. The closest I have gotten to data scientists are actuaries, most of whom don't have a clue about data management: the process of getting your data consistent and clean so you can do your analysis. Instead of creating standardised processes and tables, each analysis is started from scratch, leading to a maintenance nightmare and an inability to cross-reference the results with any other data sets.

Whoever is running data scientist units needs to realise you need IT/DBA/programming people in the mix. Statisticians in my experience cannot design databases; concepts of code reuse and normalisation do not exist in their vocabulary. A great deal of time is wasted on repetitive tasks that someone with a programming background could have automated.


If you're a programmer by formation, then indeed you will be given a programming intensive, theoretically non-intensive task (or be found to be qualified for such a job). If you are more theoretically inclined, unsurprisingly, you end up with a more theoretically heavy job.


I would say the non-programming part, the theoretical investigation, is really easy. Plus, the majority of data scientists spend 80% of their time cleaning data.


I agree it got quite crazy. I had several years experience as a data scientist and 2 years of PhD work in machine learning and after a few years not in software (running a non profit) it was incredibly hard to get call backs. I ended up getting totally burnt out on the search and the field in the end.


I think it might be an easier path if you apply for "data engineering" jobs and lobby internally for a chance to do data science. Get data science on your resume and look for the next job.


Agreed. A DE job is based on more traditional software skills, expecting you to understand the priorities and purposes of a given ML technique without having to choose, architect, or defend the choice of that technique over alternatives. I can see most of the OP's objections as being more appropriate to the superciliousness of hiring software PhDs into DE jobs rather than DS. The DS role is more quantitative and theoretical (befitting a PhD), while the DE is more an engineer and experimentalist (befitting a less academic mindset).


I think you're wrong about the math requirements. The math is not very abstract, but requires a much deeper understanding of statistics than your post suggests. That's probably why you struggled to get hired.

My partner has a successful career in data science, but she has multiple degrees in mathematics - significantly more than your typical engineer is exposed to.


Well, to put it in data science terms: the employers are optimizing for precision at the expense of recall. They don't care that they miss some of the good data scientists, so long as the people they do find are good data scientists.

I actually think that requiring an advanced degree doesn't help toward that end.

In the finance world we call these people quants. And actually in my experience having a phd does very little for someone; the critical skill required is software engineering.

Thinking about investment strategies is about testing hypotheses, and you can't test things properly if you don't know a few things about how to organize code. This is an insidious problem, because there's nobody telling you how to actually build an alpha-generating strategy. And if you can't investigate properly, you fall into all the traps (you make excuses to do the following): too many features, choosing too small a sample that happens to do well, filtering the data in so many ways that one of them is bound to "work", and so on.
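The "one of them is bound to work" trap is easy to demonstrate with a toy backtest (pure noise, invented scale): pick the best of 200 random coin-flip strategies and it will always look like alpha in-sample.

```python
import random

random.seed(1)
days = 250
returns = [random.gauss(0, 0.01) for _ in range(days)]   # market is pure noise

def pnl(strategy, rets):
    """P&L of a daily long/short (+1/-1) position sequence."""
    return sum(s * r for s, r in zip(strategy, rets))

strategies = [[random.choice([-1, 1]) for _ in range(days)] for _ in range(200)]
best = max(strategies, key=lambda s: pnl(s, returns))
print(f"best of 200 in-sample: {pnl(best, returns):+.3f}")   # looks like skill

fresh = [random.gauss(0, 0.01) for _ in range(days)]         # out of sample
print(f"same strategy, new data: {pnl(fresh and best, fresh):+.3f}")
```

Avoiding this doesn't take deep math; it takes the engineering discipline to hold out data properly, which is exactly the point.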

Imagine that you're a chef, but you can't chop stuff effectively. You would then work around that limitation, maybe work on dishes where it's not needed, or perhaps get a junior guy to do the chopping. You might think this solves the problem, but actually it just swaps one problem for another, because now you need to communicate and coordinate with this other person. Or you don't explore that whole area of food with chopped stuff in it.

CI pipelines, version control (branching, diffs, etc), database maintenance, a bit of OS basics. All things that tended to differentiate the productive quants from those who merely thought they were useful. I've seen things done in a few weeks that others had spent years not achieving.

As for the ML skills themselves, you do need a bit of math, and the math is relatively easy to learn. There's loads of material to help you as well. What's not explained so much are certain philosophical issues around what is being examined. A course in economics has examples of these things: endogeneity, the Lucas critique (which is Hume rehashed), experiment design (do the observations mean what you think they mean?).


That's very valid, because recruiters are confused. I'm an undergraduate engineer with decent experience in analytics, and my resume doesn't even get into some companies because I don't have a Master's, while my friend who did an MS in data science in the UK is also not getting a job, because she pivoted her career from mainframe to data science with that MS and they don't want a fresher. Hence you are left with a minimal talent pool that gets to hop here and there, increasing their market value while the demand doesn't shrink. At the same time, with this fancy word "PhD", every company wants one, or at least an MS, for God knows what reason. In an ideal scenario you are just exploring data and getting actionable insights (I'm not talking about publishing papers or programming self-driving cars). I'm an average guy who knows R and Python with pandas and scikit-learn. Anyone who can connect the dots can learn these tools and become a good data scientist, but the blind beliefs of these recruiters kill that hope as they chase the scarce job hoppers. Such a paradox, self-created by them.


This is a bizarre comment. You have 15 years of software development experience and you are surprised this experience doesn't carry over to data science or ML directly?

Why should someone bring you in? Do you have a Kaggle profile with notebooks that people can look at? Have you replicated research papers? Do you have any independent results that you can share, publish, or talk about?


There was a popular saying in 1970s and '80s corporate America, "nobody ever got fired for buying IBM," and that had more to do with branding than anything else. It's the same reason certain law firms hire only from Harvard: you are not second-guessing the candidate as far as the qualifications are concerned. It certainly does not guarantee that the candidate will be good, but they do fit the requirements.

I think this is a problem for candidates who don't meet the set requirements, but this can also be an opportunity to differentiate from rest of the pack.


I had the same experience recommending private sector jobs to a scientist friend of mine. He just finished a Ph.D. in chemistry at a very prestigious program. This (inevitably) involved doing a lot of analysis of experimental results in Python. "You're not a real data scientist" was the over-arching conclusion of his effort to get a job doing data analysis, even for comparatively lame old-school BI type analyses. This is very silly and unbelievably short-sighted.


For what it's worth, I do have a PhD in math with research in machine learning, and I still had a hard time getting calls back. To be fair to myself, I'm speaking from my experience of the grad-school-to-industry transition, and the lack of industry experience is likely a big reason for not getting as many calls back as I had anticipated.


You might be a great candidate but how would anyone know? Something on your application needs to stand out. If you don't have a degree that's fine but recruiters need some way to filter through resumes. I'm generally not a fan of cover letters but write one and make a case if you have one.


A data scientist is a statistician who programs. :) That's the real issue. The gatekeepers will be people with said advanced degree who are filtering for people like them. Ironically, that's a bad way to filter, but it happens.


I had the same experience in my last job search. I only did a bachelors. Even with many projects on my resume that were completely data science focused, my response rate was almost zero for those types of positions. My response rate was very good otherwise. It's frustrating. Based on my conversations at data science/analytics meetups, my knowledge is greater than that of the average employed data scientist.

My suggestion is to get a job somewhere with a data science team, do a data science side project with company data, and demo it to the data scientists. That is probably your best bet to get a foot in the door.


> Based on my conversations at data science/analytics meetups, my knowledge is greater than that of the average employed data scientist.

Sounds like you may already be too good at data science to ever get any actual experience of it.


Not sure if you're intending to be snarky or not, but I mentioned the fact that I already had plenty of professional experience doing data science.


Hi, if you're still looking for a data science role and are based in NYC get in touch. sandy.vanderbleek@publicismedia.com Thanks.


The Data Scientist "shortage" is another scam by the big tech companies to drive down wages. My boss had hundreds of applications for the position he was looking to fill and the majority were so well qualified that it was basically a coin flip between multiple PhDs with great resumes.

Tech workers' arrogance reminds me of "made in the USA" factory workers who thought their jobs were safe. If people don't start pushing back, wages will be pushed down to the global equilibrium. Based on how things are going, I'm predicting neo-feudalism, thanks to the joys of open-borders globalism putting all power into the hands of corporations.


Wouldn't a real or fake shortage of talent drive wages up, not down?


A fake shortage drives wages down by increasing supply: the hype draws more entrants into the field.

Also, you create a fake shortage by driving wages down: "we have so many positions that we cannot fill (at the price that we are willing to pay)"


Yes, no.

Yes if it's real. No if it's fake.

This still ignores the question of whether entry requirements for jobs are set reasonably, though. You can create a real shortage by limiting yourself needlessly, at which point those who think the limits are needless will call the shortage "fake". So which word you use also depends on opinion and interpretation of the data :-)


The average CS student didn't do that great in linear algebra or calculus. Most CS programs don't require vector calc, and at an average uni the vector calc and linear algebra classes will be extremely forgiving; you really don't need to understand the material. I don't think any undergrad CS program in the country teaches enough statistics and probability to understand some of the harder latent models or generative modeling. The math can get arbitrarily hard, to the point where a lot of decently ranked universities don't even teach the classes required to understand some papers, but that's overkill. An Udacity gold star isn't really a replacement for that either.

So basically, people like you are a huge risk for a company. You really can't prove you have anything more than a practical ability to use the algorithms. Most algorithms are hard to implement from scratch (robustly, let alone correctly), and you would be doing so by reading papers you have a scant understanding of.

I'd bet a large sum of money 2/3 of the candidates applying for a machine learning role without a PhD could not even provide any derivation of OLS or something like that if it came up in an interview.
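For the curious, the derivation in question really is only a few lines, assuming X has full column rank:

```latex
% Minimize the residual sum of squares
L(\beta) = \lVert y - X\beta \rVert^2
         = y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X\,\beta

% Set the gradient with respect to beta to zero
\nabla_\beta L = -2X^\top y + 2X^\top X\,\beta = 0

% Normal equations; since X^T X is invertible under full column rank,
\hat{\beta} = (X^\top X)^{-1} X^\top y
```

Which is exactly the kind of thing an interviewer can reasonably probe for, degree or no degree.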

Incoming downvote train, choo choo


What're some examples of a "harder latent model"?


Here’s a really hyperbolic example: https://arxiv.org/pdf/0908.4425.pdf


This is commonplace. I think it's due to the Dilbert effect: if you can't judge candidates based on ability, you have to judge them based on accreditation. Larger organizations tend to have more layers of non-technical people evaluating technical candidates.

It's not particular to ML, though ML is worse than others. CS degrees are luckily becoming less needed for basic programming jobs.


That's why I don't bother with the ML/data tracks at conferences. What's the point of these "ML for everybody" tracks?


You're right, it's not that hard. A lot of the gatekeeping is because the departments are run by nontechnical people who just don't understand your skill set. They also need to guard the bullshit and not let anyone realise it's just really shoddy programming at the end of the day.


I love the sarcasm, but I'd advise you to mark it as such.

HN readers are very literal on the whole.


this is 100% wrong


That's not a rebuttal. It's absolutely non-technical gatekeepers looking for certificates.

Graduates get mad because they realize they've wasted a bunch of money on something they can learn for free, so they reject the notion and make others go through the same abusive system.


Do you have any proof that non-technical managers are setting the hiring bar and writing the requirements for data science positions at most companies? That's what it seriously sounds like you're implying. The guy above you was joking.


I'll be completely honest, as your standard "data scientist" who hopped on the bandwagon and came from having a PhD in academia in an unrelated field, I cringe at these articles. I'm not entirely sure why. I think it may be two-fold:

1. A little bit of the selfish "oh no, the secret's out, at what point is my salary going to drop when the demand is met by the dedicated Master's degrees and bootcamps?"

and

2. These articles seem so incredibly corny, it's almost embarrassing. The "hottest job"? Ahhhh, stop it. But these things go in and out of phase, similar to back in the day when "anesthesiologist assistants" (CRNAs, AAs) were the hottest thing for Bloomberg to talk about. It will not last forever.

The irony is that I probably only knew "data science" (always in quotes) existed because I read one of these cheesy articles. I mean, we all know that statistics have been around forever, but that there were dedicated positions where you could run stats, build models, and then deploy them all in a single role was foreign to me.

So it's a combination of a potentially irrational fear of self-preservation, and laughing at the state of affairs where some basic stats work will pull in that kind of money.

I tend to have fears about the future, always wanting to hedge myself so I don't become outdated. In the data science sense, I see the field becoming super super broad and eventually saturated with new supply, so I debate on whether I should pivot into management of analytics in general or not, i.e., getting my hands off the keyboard. Ultimate goal would be to help define, strategically, how statistics/data mining/machine learning/yada/yada/yada are used at a company.


My goodness, are you me? I've been having exactly the same thoughts. Provided one finds a data science role that roughly aligns with one's training / interests, the actual work is comically easy compared to what you go through in academia.

A boot camp can easily teach someone to, say, estimate a linear model or run k-means. I dread the future when the industry decides the right way to put up barriers is by creating ever less-realistic interview loops that are even more coin-flippier, dice-rollier, card-shufflier.


A boot camp probably won't teach you the process for making sure a linear model is the right choice. It probably also won't tell you when you should and shouldn't use k-means. And in most cases you probably won't have very good answers as to the certainty of your models.

Imagine you're hiring someone to build a house for you. Would you feel comfortable with someone who's just been drilled on how to use individual tools? I would want someone who had been taught a step by step process for how to put together a house.
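To make that concrete, here's a toy sketch (invented data, and only one of many possible diagnostics): a straight line fit to curved data can post a respectable R², and it's the residual pattern, not the headline score, that reveals the model was the wrong choice.

```python
# Toy illustration: a straight line fit to curved data can score
# deceptively well, but the residuals give the game away.
# (Made-up data; the point is the diagnostic, not the numbers.)

def fit_line(xs, ys):
    """Closed-form least squares for y = m*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - m * mean_x
    return m, b

xs = list(range(10))
ys = [x * x for x in xs]          # clearly nonlinear data

m, b = fit_line(xs, ys)
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]

# R^2 looks respectable...
ss_res = sum(r * r for r in residuals)
mean_y = sum(ys) / len(ys)
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r2 = 1 - ss_res / ss_tot
print(round(r2, 2))               # ~0.93: "great fit!"

# ...but the residuals are U-shaped: positive at the ends,
# negative in the middle. That pattern, not the score, is what
# tells you a linear model was the wrong choice.
print(residuals[0] > 0, residuals[5] < 0, residuals[-1] > 0)
```

That residual check is exactly the kind of step-by-step process a tools-only drill tends to skip.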


> A boot camp probably won't tell the process to make sure a linear model is the right choice.

Very true. On the other hand, it's a pleasant rarity when I see positions that appear to index more heavily on, "how well is this person able to conceptualize the problem and choose an appropriate method?" than "can this person do X?"

Lots of folks can do X; fewer can conceptualize a research question and choose the appropriate X; even fewer can carry out the X and communicate robustly what it means.

The latter two start to get into squishy territory, but also are where the value is. They also seem to get the least focus in advertising / recruiting / interviewing data scientists.

It reminds me of studying evaluation methods in planning. One that people are really familiar with (at least anecdotally) is cost-benefit analysis. Conceptually, it's very simple. The problem is that the costs and benefits that are hardest to measure are very often NOT measured. And they're very often the sorts of things that people find the most important. So, you end up with an answer that encodes a ratio of easily measured things rather than important things.

So too with data science. Easier to check whether someone can remember basic probability rules and carry out a linear regression than it is to diagnose whether someone can reason carefully about an amorphous business problem.


If the example of coding bootcamps is indicative, then there will be a period of hiring of Data Scientists with minimal background, and an institutional learning period of "oh they are cheaper but don't deliver results we can use", and thus the job will still remain one with more openings than qualified candidates for some time to come.


Just add a grain of AI or blockchain and you’ll be fine!


Do a linear regression, call it ML with AI and you'll be running your own team in a week


Just the other day, I heard someone talking about a "single layer neural network with no activation"...


y=mx+b :D I won


with just a single neuron!


What does that look like?
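

Something like this: the "single layer neural network with no activation" is literally y = m*x + b, and training it by gradient descent just recovers the least-squares line. (A toy sketch; the function name and data are made up.)

```python
# A single "neuron" with no activation is just y = m*x + b,
# i.e. ordinary linear regression. Training it by gradient
# descent converges to the least-squares fit.

def train_single_neuron(xs, ys, lr=0.01, epochs=5000):
    m, b = 0.0, 0.0  # one weight, one bias: the whole "network"
    n = len(xs)
    for _ in range(epochs):
        # gradient of mean squared error with respect to m and b
        grad_m = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / n
        m -= lr * grad_m
        b -= lr * grad_b
    return m, b

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]              # exactly y = 2x + 1
m, b = train_single_neuron(xs, ys)
print(round(m, 2), round(b, 2))   # converges toward 2.0 and 1.0
```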


Genius.


Hmm... do you offer career counseling services? How can I sign up? ;-)


Stop worrying & go vertical, aka use these (and any other momentarily fashionable) tools in the niche domain in which you are a PhD expert, so that you will never become outdated.


I don't think it's irrational. As somebody with just a masters in a tangentially related field (economics), I find it surprising how low the barrier to entry is for this job (I got in, after all). Anybody with a decent undergraduate level understanding of stats could learn to do most of the production stuff that companies generally do in a few weeks, at most.


I agree completely with the above, and I feel like I'm in a similar position. Mind if I contact you to discuss a bit more?


Sure! My screen name @protonmail.com


Thanks! Just wrote you.


My wife is a data scientist and this is one of her favorite quotes, from Dan Ariely:

Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it...

https://www.facebook.com/dan.ariely/posts/904383595868


"Deep learning is like big data: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it..."

Tim Hopper https://twitter.com/tdhopper/status/916383020835368960


That's a pretty good analogy. Substitute some other popular trend ("Blockchain", "AI", "ML", "Deep learning", "IoT", "Agile") for "Big Data" and it still holds true.


This is beautiful


Big data's also like a big dick. Everyone's running around telling their clients they've got an 8 incher if they buy.


Recruiter here. Data Science is one of ~3 roles I generally don't recruit for. Not because Data Scientists are super hard to find, but more because companies seem to all want data scientists and all want them to do completely different things and none of them really understand those things. Makes for a pretty lame hiring experience unfortunately. :(


> all want them to do completely different things and none of them really understand those things

It's the HR equivalent of deciding to rebuild your solidly performing website with some hot new v0.1 frontend javascript framework purely because it's the new hot stuff. Perpetuated and reinforced by C-Suite level desires to appear trendy and cutting edge.

When I ran an Analytics/BI team, I was well aware of what I needed and wanted. But constantly had to fight to not have HR label open positions as Data Science roles.

Once I got them to start posting the position with Data Analyst and Data Strategist titles instead, the team satisfaction scores and attrition rate markedly improved. The (very competitive) pay rate and actual work was the same as it was under Data Scientist titles. But it better scoped the candidate pool and aligned expectations up front, rather than hiring a bright-eyed and overqualified Masters/PhD graduate and being the company that disillusions them to the reality of most "data science" roles.

Although I can understand the desire to ride the hype train for those executives. Being pragmatic improved my team and benefited my company, but cost me the ability to add Data Science as a buzzword to my resume, and "Created, managed, and scaled a Data Management and Operational Analytics/BI organization" doesn't have as much market value.


I know a mid-level manager at a Fortune 100 company. Somewhere up the management chain, someone read an article about how data science is the hot new thing that your company has to be doing or it will have missed the boat. So word was sent down that their next project was to do data science. So now they are launching a data science project to analyze their business data, which this manager is overseeing. They're hiring data scientists and all that entails. It doesn't come from a business requirement or some organically generated need, top management just wants to be able to say their company is using data science if they're asked about it.


That's unfair. "Not missing the boat" is a natural need for a company. Everyone with an MBA has studied companies that missed the boat. Throwing some money at various new fads isn't necessarily irrational.


Nevertheless, it does make one reconsider million dollar compensation packages if all those guys can come up with is to deploy randomness. Either they have foresight and a plan or they don't. Either way is fine, but only the first option supports the million dollar salaries for executives. If they are just like any other mortal... they didn't even do anything to identify new trends, they just followed what was in the major business magazines.


I actually see this as a smart move in the context of a Fortune 100 company. It's often a good idea to build something (in this case, a data science capability) before you need it, so that when you do need it, it's already up and running. Also: there's value in giving any new technology a test drive.


That's normal. The company gets to experiment with data science and understand what can be done with it on a low-risk project.


For every 'real' data science job which requires deep knowledge of statistics and machine learning, there are 50 DBA positions where they want you to know some python.


I was hired as a "technical game data scientist" and all I do is fix stuff with python, celery, redis, and if I'm lucky Redshift. To be fair, they know this isn't really "data science" but it's always what needs to happen in the next sprint so anything interesting is shoved out.

I think a lot of companies are hiring "data engineers" but don't know it. The guy who started making Superset (https://github.com/apache/incubator-superset) wrote this and I think it's apt.

https://medium.freecodecamp.org/the-rise-of-the-data-enginee...


And for every company which wants to have a DBA there are thousands which just don't care, or claim having a DBA is just too expensive as the programmers are enough. The results are quite terrible then... but almost nobody cares. Just my small remark after a long time of looking for a job as a Postgres DBA :).


Call them Quant roles like in finance, everyone understands and it makes life easier.


You think you've got it bad, imagine how it is for us candidates.


I have a love / hate relationship with bucketing a whole lot of formerly separate disciplines into the data science label.

In my case, I'm a reasonably solid R / Python programmer (who occasionally dabbles in racket, clojure, and others). I've got the sort of applied statistical training of someone who took a quant-heavy course load in a PhD program. I've even (by title) been a data scientist a couple times now.

Having recently decided to reenter the job market, I'm reminded that finding the RIGHT data science role is a major challenge. When someone wants a data scientist, they may be looking for someone with a lot of specialist depth in operations research, financial forecasting, machine learning, data-focused software engineering, or some other not-at-all-universal area of expertise.

In some ways, I almost think that someone with a bootcamp level understanding of stats may be at an advantage. Whereas I'm very inclined to be, "Oh! You want this other kind of person. Would you like me to put you in touch with one?" I think someone more junior is inclined to be "How hard can [X] be?"


As a "junior" applying to data science positions, I have a 50/50 mix of "no way I'm qualified for that", and "that sounds like a really fun problem."


Welcome to impostor syndrome. It NEVER goes away. The worst part of impostor syndrome is that sometimes, you really aren't qualified, and so you can never fully move past the self-doubt.

You don't sound like the Dunning-Kruger sort, so I'd chase the fun sounding problems in organizations where you can use more senior people that you respect as sounding boards / mentors.

Good luck (to us all)!


To be fair though, can't you say the same about the title "software engineer"? Without further qualifiers such a title can run quite the gamut of skill-sets.


I see more segmentation in software roles: front-end, back-end, full-stack, embedded, devops, realtime systems are all labels that get you in roughly the right place.

Beyond that, there are labels that refer to specific technologies that get you even closer: Rails, Elixir, Django, React, QT, C#/Winforms, Verilog.

There really aren’t the equivalent shorthand filters in data science.


> There really aren’t the equivalent shorthand filters in data science.

There are, though, right? The catch-all "Data Scientist" title is being split into Data Engineer, Machine Learning Engineer, Data/BI Analyst, etc. Each of those, in my experience, have clear definitions.


FTA (examples of data science):

  targeting health-care customers for hospitals 
  people who can turn social-media clicks and user-posted 
    photos into monetizable binary code is among the biggest 
    challenges facing U.S. industry
  “sentiment analysis,” or finding a way to quantify how
    many tweets are trashing your company or praising it
  determine how customers prioritized paying bills
  “recommendation engines” - those programs that predict
    what you may want to buy next
  advertising
My background is traditional business intelligence, finding actionable data for high level leaders.

A common response on this thread is that little data science is actually occurring in the business world. It would be incredibly useful to me (and, I suspect, other readers of HN) to hear from other participants on the thread what data science and methods they are using.

I'll kick it off with data analysis examples from my workplace:

  1 - Analysis of patient accruals to various clinical trials.  
    Mainly tracking against goal numbers.  No statistics or ML.
  2 - Analysis of tissue collection opportunities to answer the question:  
    Are we obtaining samples useful to future research opportunities?  No statistics or ML.
  3 - Creating models to accurately predict patient accrual rates for individual studies across various 
    different variables (race, ethnicity, gender, age).
    Simple statistics, probably just a linear regression model.  This is a new effort.
  4 - a long, on-going, and currently unsuccessful attempt to extract useful data from pathology 
    reports (free text descriptions from pathologists examining various collected tissues for 
    both medical treatment and research purposes)
In addition, I know of a few NLP motivated efforts to train classifiers - say, given a set of 500 manually labeled papers, can a classifier be built that would be effective in bucketing an additional 8,000 papers?
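For flavor, that kind of bucketing task can be prototyped with something as small as a bag-of-words Naive Bayes classifier. (A toy sketch with invented miniature "papers"; a real pipeline would use scikit-learn or similar, plus far more careful preprocessing.)

```python
# Toy bag-of-words Naive Bayes: learn word counts per label from a
# handful of labeled documents, then bucket an unseen one.
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (text, label). Returns per-label word counts and label counts."""
    counts = defaultdict(Counter)
    labels = Counter()
    for text, label in docs:
        labels[label] += 1
        counts[label].update(text.lower().split())
    return counts, labels

def classify(text, counts, labels):
    words = text.lower().split()
    vocab = {w for c in counts.values() for w in c}
    best, best_score = None, float("-inf")
    for label, n_docs in labels.items():
        total = sum(counts[label].values())
        # log prior + log likelihood with add-one smoothing
        score = math.log(n_docs / sum(labels.values()))
        for w in words:
            score += math.log((counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

docs = [
    ("tumor tissue biopsy pathology", "oncology"),
    ("biopsy sample malignant tumor", "oncology"),
    ("neuron synapse cortex imaging", "neuroscience"),
    ("cortex neuron brain imaging", "neuroscience"),
]
counts, labels = train(docs)
print(classify("malignant tumor biopsy", counts, labels))   # -> oncology
```

With 500 labeled papers instead of 4, the same shape of model is a plausible first baseline before reaching for anything fancier.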


We have two closely related groups in our company: Mathematical Optimization and Data Science. I am in the optimization group, but when I query the data scientists, they seem to be doing similar things - simulation, network optimization, resource allocation and planning. I am not sure why management wants to call them data scientists.


I'd guess that historically the Mathematical Optimization department has comprised "legacy" operations research people (spreadsheet modeling and decision analysis, linear and nonlinear modeling with AMPL-type tools, SAS statistics, etc) and the Data Science group is newer and more software-ish.

Am I right?


I guess so. They work with SAS and Python, while we work with C++ and CPLEX in the vi editor in a Linux terminal.


Maybe they are just proxies for other labels, like "the good group" and "the one we can drop if our numbers don't hit expectations." I'm offering this theory as a Bayesian.


Recently I wrote a tweetstorm of my not-so-great experience hunting for a data science job last summer: https://twitter.com/minimaxir/status/951117788835278848

Despite data science being a hot job, the sheer, growing number of MOOCs available will cause the high amount of gatekeeping from many prospective employers to get even worse.

I am very happy as a data scientist now, and yes, it's more complicated than doing Excel VLOOKUPs! Although I did get rejected from those VLOOKUP positions many times in my job search...


This article makes it sound like you can just walk into a DS position. No. Expect a months-long job search if you don't have a PhD.


Maybe but once you are in you have the golden ticket. Source: am a current DS with an MS and like the guy in the article I get contacted by recruiters weekly.


Ha yea I lucked into one 0:) Came from the DE side and they snuck me in so we could get things done.


sheer, growing number of MOOCs available will cause the high amount gatekeeping from many prospective employers

I did a MOOC, nothing wrong with them. A lot of it turned out to be just revising things I had originally learned in my BEng or MSc, but I learned some things too. And a bit of revision never hurt. I don’t think I had done much calculus “in anger” since graduating for example, so that was rusty.

In 5 years or 10 years there will be no data scientists as a full time job - the skills will just be folded into regular programming jobs or accounting jobs or whatever. If you’re in that job now, make hay while the sun shines


"In ten years all programming will be either outsourced or automated..."

-everyone in the 90s


When I started in this game, just being able to put up a website made you hot stuff. HTML and CGI were cutting edge. But for a long time now, those skills are totally commoditized. That’s what I mean.


"Full Stack Developer" will mean everything from heavy front-end dev work to scaling out a monolithic in-house back-end framework into microservices to cobbling together Android/iOS apps to slapping ML on top of it all and calling it a day.

Oh and it will all run on JS.



Interesting. Like others point out in the article, I do field a lot of recruiters as an economist/data scientist, but most of the initial offer discussions have silly attributes: base range max lower than current base salary, contract-to-hire, etc. Having been doing this type of work for almost a decade now, I think this kind of article comes across as recruiter uncertainty rather than identifying the real value add of data science: extracting valid insights from data.

If you want data scientists, pay a good salary and good equity.


Looks like this article is well timed for the first batches of MS in Data Science grads who'll start looking for jobs soon. I'm not a data scientist, but I feel the data science field is still quite vague, and the terms ML, AI, Deep Learning etc. are thrown around so much, it's hard to understand or make a guess as to how much real demand there is for such jobs.


One of the most ill-defined too.


I'm actually not that impressed by $160k for "the hottest job" in a major metro, and it sounds like their example with a master's in Stat might be using something more than Excel plots.

Kind of disgusted though, that Equifax "is shortening the hiring process to keep anyone from slipping away." They could fix their security practices before hiring anyone with Scientist on their resume.


I’m sure the people implementing their security are the same people recruiting and hiring new employees in an unrelated field.


I don't think data science would be a hot job if advances in data engineering (arguably, another ill-defined term) hadn't been achieved. Because of this I like this definition of data scientist, although I can't remember where I read it:

"A data scientist is someone who knows more about statistics than your average software developer, but more about software development than your average statistician".


Or: "A data scientist is someone who knows less about statistics than your average statistician, and less about software development than your average software developer".


Also valid.



A more correct version would be:

"Is America's current fad job, like tons of similar jobs before it" (have lived through 4-5 of those "hot jobs").


What were the fad jobs in USA before?


E.g. around the dot-com bubble it was "web designers" -- now they're a dime a dozen.

In the 90s/early 00s there was the biology/bio-engineering/bioinformatics trend -- everybody was thinking of going into biology when I went to university -- that dried up soon as well [1]

[1] https://iubmb.onlinelibrary.wiley.com/doi/pdf/10.1002/bmb.20...

Here's a good list written by another HN member:

https://news.ycombinator.com/item?id=8864737


I've had a surprising experience with hiring data scientists in the last year. I have hired two (one full-time and one part-time) and am in the process of hiring a third. During this time I have also been looking for a SugarCRM or SuiteCRM developer.

I have been flooded with candidates for the data science position. There seems to be a glut of well qualified [1] data scientist applicants. I have turned down many candidates who I think would also have done well in the position. The CRM developer position has been extremely difficult to fill. This has had the effect of bringing the salary of those two positions in our company to near equivalence. My initial expectations for the salaries were that the data scientist would be paid substantially more than the CRM developer. I know that Sugar/SuiteCRM aren't super popular but it also could be that I have been affected by all the attention that ML and AI have gotten over recent years.

[1] well qualified for us has generally meant masters level or above with real work experience. maybe this definition is the source of the 'glut' of applicants?


I’ll be blunt - working with data is a lot more fun than working with line of business applications. That’s why so many more people want to do it.

People compete in machine learning for fun (e.g. kaggle.com). I doubt anyone works on CRM systems for fun.


Unfortunately a lot of the "Data Scientists" I meet are nothing more than Excel / Sheets gurus. Most of them have never written a line of code or syntax that is more complex than a nested VLOOKUP.

I hope the tide changes here and I think it will, but this has been my experience so far.


I had a feeling that in Europe, a lot of those jobs run under Data Analyst. Data Scientist roles at small companies can go in that direction but most that I've interacted with strictly don't do Excel anymore where possible. Mostly because they found out that turnover is very high if you hire Data Scientists for VLOOKUPS, so you'd better give talented people other tasks to keep them.


If I were European I'd scoop up one of those boring jobs and then do another job while at work. Those job protection laws (at least in France) are free money if you work it right.


If it makes you happy, sure. Most people I know prefer an interesting job they like, rather than 8 hours of boredom. Remember, the employer can very well deny you any personal internet access. Personal internet usage at work is grounds for dismissal even with strict labour laws.


Interesting take, I never thought of that


That's what I see too with the data scientists at my company. They run a few SQL queries and create spreadsheets. Nothing really groundbreaking.


Unfortunately a lot of the "Data Scientist" I meet are nothing more than excel / sheets gurus. Most of them have never written a line of code or syntax that is more complex than a nested vlookup.

Those kinds of jobs are advertised as “data science” jobs. In fact any job that involves data manipulation at all is getting re-branded as “data science”. Partly this is employers doing a bait-and-switch to attract better candidates and partly it's just jumping on the bandwagon.

But candidates aren’t entirely innocent either. How many programmers - an honest title for honest work - call themselves Senior Certified Enterprise Solution Architect Team Thought Leaders or some such nonsense? We’ve all met a few. LinkedIn is crawling with them.


The White Walkers of the data science field are out-of-the-box enterprise solutions. These are enterprise software, data science consulting, and ops solutions in a package. The corporate customer need not keep an in-house data science team. The ambition of these enterprise solutions (think DataRobot, H2O, etc.) is to effectively bring one-click production-ready solutions that even the C-level can participate in.

I see this as the greatest threat to the demand for the "in house" data scientist.

If this turns out to be the case, I see the greatest demand for those who can write production grade code (i.e. software engineers) and those who are effectively trained data scientists. We see this job often called research scientist or research engineer.


Employers rediscovering statistics.


Bingo.


It is hot in this part of the world as well. I have a few friends who wouldn't pass as programmers (not because they don't know frameworks, but because they're weak in logic), but earn handsomely when they claim to be "data science" guys.


Data Janitoring is America’s Hottest Job

FTFY.


Ill-defined though it may be, there's still an understandable difference between data science and data engineering.


You'll also need a strong background in statistics if you wanna bring value to a company other than "machine learn all the things".

I'd even argue that - for data science - it's more important than any amount of years doing engineering.


My wife will be attending a university-based data science boot camp. She has spent 12 years as a microelectronics process engineer. It seems a dead-end job and she can't break the $100k mark. Hopefully it will be worth it.


But is that something she will like?


Yes, as she already does lots of statistical process control, design of experiments, etc.


at a certain company in San Francisco, they had Data Science meetups very publicly, for a year or more.. very well attended.. The company made an announcement they were 'hiring' at many of them ..

It appears to me that about two or three people were hired that way over more than a year.. at the same time, the company was being sold to a larger company in Seattle.. the founders made at least seven figures in the sale, I would guess... draw your own conclusions..


Where are these $300k salaries? I have a PhD and work in Atlanta. I never got the memo on these sky high salaries. Don't know anyone else either who earn this much.


I'm currently pursuing my data science degree after my last burnout as a senior sysadmin. I'm hoping my practical experience will be a perfect augmentation of the "data science" stuff so that I can be more confident when bringing the board or the execs proposals about how to fix things.

Let me tell y'all a little secret. Execs have been mismanaging infrastructure... and it's all so close to crumbling at the first puff-o'-wind.


Definitely could use a high paying gig, but crunching surveillance data without opt-in has always felt unsavory and wrong to me. Their examples including Equifax made me cringe. Any perspectives that might help me feel better about doing it?


After the Facebook data leaks and such, I suppose the field got a bad rep. So not sure if such jobs are really hot. Perhaps the author is trying to influence public perception?


What about Firefighters???



