Part 17 of the series where I interview my heroes.

Index and about the series: “Interviews with ML Heroes”
You can find me on Twitter: @bhutanisanyam1

Today, I’m very excited to be talking to someone from the Kaggle team: Dr. Rachael Tatman, Data Scientist at Kaggle.

Rachael holds a Ph.D. and a Master’s in Linguistics, both from the University of Washington.

She is currently working as a Data Scientist at Kaggle, where she also hosts weekly Kaggle Live coding streams on YouTube (which, I have to say, are amazing!).

She is also a Kaggle Kernels Master and Discussions Expert.


About the Series:

I have very recently started making some progress in my self-taught Machine Learning journey. To be honest, it wouldn’t have been possible at all without the amazing online community and the great people who have helped me.

In this series of blog posts, I talk with people who have really inspired me and whom I look up to as role models.

The motivation behind doing this is that you might notice some patterns and, hopefully, be able to learn from the amazing people I have had the chance of learning from.


Sanyam Bhutani: Hello Rachael, thank you for taking the time to do this.

Dr. Rachael Tatman: Of course! Thank you for the invitation.

Sanyam Bhutani: You’re currently working as a Data Scientist at Kaggle, and you have a background in Linguistics. Could you tell us how you got interested in NLP and Data Science?

Dr. Rachael Tatman: I definitely got into it from the “science” side. When I started grad school I had very little programming experience, just a couple of intro CS courses in undergrad. My main research interests at the time were the effects of different elicitation tasks when collecting voice data (like reading text vs. holding a conversation) on the speech produced. Since these effects were pretty hard to tease out, I took some graduate-level statistics courses to learn more about how to model them. This is where I was introduced to R. I kept using R for different research projects, learned a bit of MATLAB for signal processing, and played around a bit with Python because I was using Python software to run my experiments and had some pretty specific needs. With practice, I became more confident in my ability to write code to solve problems. Since my problems were generally around collecting, transforming, and analyzing data, this is probably the point at which you could have started calling me a “data scientist”.

As for getting into NLP, as my research slowly changed over time I started working on problems that were more and more relevant to NLP. One of my projects, for example, looked at how people online use different spellings to show different dialects. Eventually, however, I realized that NLP researchers really don’t read linguistics papers; in order to join in the conversations going on, I started going to NLP conferences. Between the machine learning results that were being presented at conferences and the statistics courses I was still taking, I got up to the point where I could start reading and understanding machine learning papers within a year or two. By the time I graduated, I felt pretty comfortable calling myself an NLP researcher.

All in all, it was not a very efficient way to go about it. A lot of what I did in my degree was not at all relevant to what I do now. (I took several years of American Sign Language, for example, and wrote several research papers on sign language phonology.) To be fair, though, I had no idea I was going to be a data scientist when I went into grad school. In fact, the career didn’t even exist when I started my Ph.D.!

Sanyam Bhutani: Kaggle is no doubt the home of Data Science.

Could you tell us more about your work at Kaggle as a Data Scientist? What does your day at Kaggle look like?

Dr. Rachael Tatman: It really depends on the day! I might be creating helpful content for other data scientists, working with the different engineering teams on new features or bug fixes, or analyzing our own data. Pretty much the only thing that I do the same every day is make sure I’m listening to our community — reading the forums, seeing what folks are saying on Twitter, keeping an eye on different Slack channels, going to meetups and conferences, that sort of thing. A big part of my job is keeping track of what’s important to Kagglers and making sure the rest of the Kaggle team knows about it.

Sanyam Bhutani: As part of the team, I understand you’re not allowed to compete in the competitions.

Could you name a few competitions that you found particularly interesting and tempting to compete in anyway (both in terms of the challenge and the winning solutions)?

Dr. Rachael Tatman: I can compete, I just can’t win anything. ;) That, plus being busy with my other work, has meant that I haven’t really done much with competitions. It’s a new year, though, so who knows?

Some of the competitions I’ve been most interested in are the Jigsaw Toxic Comment Classification competitions and the Quora Insincere Questions Classification competition (which is ongoing). Abuse/bad-actor detection is such a hard problem, even for humans working in their own native language, that it’s been fascinating to see what people are trying. (Although I will admit that, personally, I find the annotation part of the task the most interesting, and for Kaggle competitions that’s obviously done for you.)

I was also just tickled pink at the results of this year’s Santa competition. Folks were doing a pretty good job… and then Dr. Bill Cook, a very well-known optimization researcher, came in and absolutely changed the game. I’m always delighted to see folks with a depth of domain knowledge succeeding at competitions.

Sanyam Bhutani: Natural Language Processing has arguably lagged behind Computer Vision. What are your thoughts on the current state of the field? Is it a good time to get started as an NLP practitioner?

Dr. Rachael Tatman: It’s a really good time to get started in NLP! I don’t think people should be surprised that NLP is a little “behind” computer vision: human language is extremely complex. If we think about it in terms of the complexity of the biological systems that do the same job, even something like a fruit fly, which has only about a quarter of a million neurons in its whole nervous system, can do pretty sophisticated visual processing. In contrast, the only species capable of using pronouns is us, and we have sixteen billion neurons in the cerebral cortex alone.

I did spend just under a decade studying language exclusively, so I may be a wee bit biased here. But NLP is such an exciting field in part because linguistics is exciting: there’s so much we don’t yet know about how language works.

Sanyam Bhutani: For the readers and beginners who are interested in working on Natural Language Processing, what would be your best advice?

Dr. Rachael Tatman: One of the biggest challenges facing NLP beginners right now is that there’s actually too much information out there. It’s easy to get overwhelmed, especially if you start by trying to read research papers. I’d recommend starting by reading a textbook (Speech and Language Processing is a classic, and the latest edition is available for free online) or finding a course you like. This will give you a good idea of where the field is now and, even more important, what you don’t have to build from scratch when you start working on your own projects. There’s a lot of work that’s already been done in the field, and I encourage beginners to start by building off an existing project rather than trying to start from zero.

Once you’ve got a high-level understanding of what we can do in NLP, I’d try to come up with a project you’re genuinely excited about to get started. Kaggle competitions are one option, of course, but because we use language every day, there are also probably a lot of things you could build to make your day-to-day life better. A spell checker for one of your languages that doesn’t have one, a system that does semantic clustering and suggests emails you could bundle together, a chatbot to help an elderly relative find out what classes are happening at a local community center… If you’re working on creating something you genuinely want to exist, then you’ll have the motivation to help you push through when you run into bugs or other problems. And you will run into bugs — that’s just a normal part of the process. :)

Sanyam Bhutani: Many job boards (for DL/ML) require applicants to be post-grads or have research experience.

For the readers who want to take up Machine Learning as a career path: as someone with a Ph.D. in the domain, do you feel having research experience is a necessity?
What are your thoughts on Kaggle as an experience factor?

Dr. Rachael Tatman: My universal advice is to not get a Ph.D. I even wrote a blog post about it a while ago. The blog’s about linguistics specifically, but most of it applies to machine learning as well. I think that having a Ph.D. can be an advantage when you’re looking for data science jobs, but unless you really want to 1) do research or 2) be a professor, there’s really no benefit to getting a Ph.D. that you can’t get more quickly doing something else.

I think that Kaggle, or other practical experience, will get you to the point where you can apply for jobs much more quickly. I probably wouldn’t recommend only doing Kaggle competitions, though. You’ll learn a lot about algorithms that way, but you won’t get as much practice with things like cleaning data or designing metrics. That’s part of the reason I suggest that people work on their own projects as well. That shows off your ability to come up with interesting questions, source and annotate data, clean your data and think about what users want.

Sanyam Bhutani: I’m also a big fan of your live streams and kernels. 
Could you share a few tips on writing good kernels and becoming a better technical speaker?

Dr. Rachael Tatman: Hmm, what makes a kernel “good” is subjective, but the ones that really stick out for me are the ones that make me go “oh my gosh, I wish I’d thought of that!”. I really like to see people come up with new approaches for interesting problems, like this kernel that uses topic modeling, an NLP technique, to cluster LEGO sets based on their color.
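For readers curious what that approach looks like in code, here’s a minimal toy sketch of the idea (not the actual kernel’s code, and the LEGO data below is invented for illustration): treat each set as a “document” whose “words” are its brick colors, then let scikit-learn’s LDA find recurring color “topics”.

```python
# Toy illustration of topic modeling applied to non-text data: each LEGO set
# is a "document" and each brick color is a "word". The data is made up.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

sets = [
    "red red blue yellow green",            # bright, classic-town palette
    "grey grey dark_grey black white",      # muted, spaceship-like palette
    "red blue yellow yellow green",
    "black grey white dark_grey dark_grey",
]

X = CountVectorizer().fit_transform(sets)   # set-by-color count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0)
mixtures = lda.fit_transform(X)             # each row: that set's mixture over the color "topics"
print(mixtures.round(2))
```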

As for technical speaking, the best two pieces of advice I can give are, first, to practice as much as possible. Ask if you can give talks at local events or to relevant clubs. The more talks you give, the less nerve-wracking they are and the more you learn about what is effective for you. Practice is doubly important when you’re prepping a talk. I usually try to run through the talk at least twice a day in the week leading up to it, making little adjustments when I come across awkward places. Of course, I don’t do that with live streams. I pretty much treat live streams like technical interviews; it doesn’t matter if I make mistakes as long as I’m telling you what I’m thinking so you can follow my thought process.

My second piece of advice is to be as specific as possible. One of my personal pet peeves is talks about how “data science is revolutionizing something” that stay super vague. I want information I can actually apply! If you built a model that does X, talk about why X is important, how you built the model, what makes your model different from other models, and how it performed in various situations. Tell me about what specifically you did that didn’t work, so I know not to try it. Think about what you wanted to know a year ago about whatever you’re talking about, and then tell me those things.

Sanyam Bhutani: Being a follower of your kernels, I know that you’re an expert at both R and Python, and that you also taught R during your Master’s.
For the endless question asked by beginners, could you give us your opinion on “Should I start by practicing R or Python? Why?”

Dr. Rachael Tatman: It depends. I would say R if what you want to do is data analysis, like if you’re looking for something to use instead of Excel. R is built for that, and it’s extremely high level, so it’s very quick to get started; to plot a data frame in R, all you need to do is call `plot(dataframe)` and it will automatically generate a reasonable plot based on what’s in the data. You don’t even need to load a library! To get to that point in Python you need to do a whoooole lot more work.
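To make the contrast concrete, here is roughly what that one R line costs you in Python; a sketch using pandas and matplotlib, with an invented data frame:

```python
# Roughly the Python equivalent of R's one-liner `plot(dataframe)`.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"height": [1.6, 1.7, 1.8, 1.5],
                   "weight": [55, 70, 80, 52]})
pd.plotting.scatter_matrix(df)  # pairwise scatter plots, close to R's default for data frames
plt.show()
```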

On the other hand, if you’re a software engineer or have experience with other programming languages, I’d probably suggest you start with Python. Python was developed as a teaching language for software engineers and is also much younger than R, so it’s a lot less idiosyncratic. Just as a for-instance, there’s no native hashed data structure in R. It also doesn’t have pointers or references. If you’ve come to expect languages to have those things, then R can be a little frustrating.
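As a quick illustration of that point (my example, not one from the interview): Python ships with a hash map, `dict`, as a core type, whereas base R programmers typically approximate one with named lists or environments.

```python
# Python's dict is a native hash map; base R has no direct equivalent.
counts = {}
for word in "the cat sat on the mat".split():
    counts[word] = counts.get(word, 0) + 1  # O(1) average-case lookup and insert
print(counts)  # {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
```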

Sanyam Bhutani: Given the explosive growth rate of ML, how do you stay updated on recent developments?

Dr. Rachael Tatman: Twitter! I find it especially helpful to follow conference hashtags and live tweeting since conferences are the most prestigious publication venues for machine learning research. I also try to follow people from a variety of fields and backgrounds so I get a pretty diverse sampling of what people are interested in.

I did stop trying to follow arXiv a while ago after that really awful paper about predicting how “criminal” people were from their faces came out. Because there’s no peer review, you really can’t trust the quality of papers on there even though there is sometimes interesting work. I figure if something really amazing gets posted, I’ll eventually find out about it on Twitter.

Sanyam Bhutani: What developments in the field do you find to be the most exciting?

Dr. Rachael Tatman: Ooo, good question. I think the papers that I’m most excited about are the ones that offer theory-based explanations for why certain model architectures work better for certain problems. Empirical results, like “we tried x and it worked better than y”, are great, but I want to know more about why x and y are performing differently.

Sanyam Bhutani: What are your thoughts on Machine Learning as a field? Do you think it’s overhyped?

Dr. Rachael Tatman: Yep. The biggest thing that worries me about the hype is that I think it leads to folks not having a realistic understanding of how and when machine learning systems fail and what their limitations are. Someone who is just learning about AI might read about a system that can identify cars and assume, based on their own experience of learning to recognize cars, that the system has an understanding of the qualities that make something a car, like having four wheels, an engine, and a steering wheel. This might lead them to assume that the system would recognize a car even if it were, say, upside down, or made before 1930. But unless images like that were included in its training data, it probably won’t. My worry is that this lack of understanding will lead to people over-relying on ML systems with systemic flaws because they heard they were very accurate. (I believe Linda Skitka calls this “automation bias”. She’s done a bunch of research showing that, when an automatic system is available, people tend to rely on it even in situations where they shouldn’t.)

Sanyam Bhutani: Before we conclude, any tips for beginners who aspire to become Data Scientists and Kagglers but feel too overwhelmed to even start competing?

Dr. Rachael Tatman: Celebrate failure! If you fail at things it’s because you’re pushing yourself and growing, and that’s a wonderful thing. If you try things and they don’t work, then you’re just getting closer to finding out what will work, whether that’s picking a better model architecture or just figuring out how to get this error message to stop showing up.

I also think we all, including me, compare ourselves to this shadowy “machine learning expert” who knows everything and always gets things right, but in reality everyone knows only the tiniest little slice of everything there is to know. Don’t be afraid to ask questions and look things up if you don’t know them. (I search for things all the time while I’m coding!) But also don’t forget that you’ve got a lot of knowledge already. You’re bringing all your life experiences with you to learning, and you never know what will end up leading to the next big breakthrough.

Sanyam Bhutani: Thank you so much for doing this interview!


You can find me on Twitter: @bhutanisanyam1
Subscribe to my newsletter for updates on my new posts, interviews with my Machine Learning heroes, and Chai Time Data Science.