Explore a Career in Informatics - Interview with Tiffany Chen,
Director of Informatics @Cytobank
In this edition of Learn Educate Discover, we introduce the field of Informatics, a high-growth field focused on data storage and analysis (think huge, massive amounts of data).
To help us understand this area, we met up with Tiffany Chen, Director of Informatics at a very interesting startup, Cytobank (http://www.cytobank.org/).
Here are our notes from the discussion with Tiffany. Check out the podcast below if you prefer listening (like we do!).
Informatics is the science of information systems: systems that help interpret and process information. Tiffany Chen is the Director of Informatics at Cytobank (www.cytobank.org), a California-based startup. Cytobank builds products for researchers who are trying to understand biology at the single-cell level, helping them conduct their research and analysis. Tiffany has a PhD in Biomedical Informatics from Stanford University. She has been working as the Director of Informatics at Cytobank for 3 years.
How old is Cytobank?
Cytobank grew out of projects started by its 3 founders. It was incubated at Stanford and started getting its first real customers in 2010-2011. I joined around 2.5 years ago. It has been around in some shape or form for around 5-6 years or so, before which it existed as a Stanford-only model. In fact, one of Cytobank’s earliest customers was me!
Could you tell us a little bit about what Cytobank does?
Cytobank is a platform for single-cell data analysis in the cloud. The people using Cytobank are generally biologists and clinicians who are interested in investigating the biology of the immune system, cancer and other cell types. During my PhD, I used Cytobank to communicate with other researchers, get access to their data and analyses, and so on. I found it an extremely useful and different product at that time.
Could you tell us a little about yourself and your journey?
My undergraduate degree was at Duke University, in Computer Science and Biology. I really love Computer Science. Nothing really overlapped between these 2 until I took a few classes. One was a lab course where we were physically making microarrays, like printing them out. In another class we were doing gene expression analysis.
What are microarrays?
This is a technology that helps analyze gene expression. Central to biology are DNA, RNA and genes. RNA is the expression of these genes. Microarrays were developed in the early 2000s; a microarray is a chip that can be used to measure the RNA across a ton of genes. It acts as a fingerprint for understanding a number of different things, from disease to inheritance and so on. It lets you see what your cells are actually doing.
What got you interested in the field?
I was always interested in Biology, but in college I also took Computer Science. Informatics is a mix of statistics and CS. In some classes we printed microarrays and in some we did data analysis, so I thought at that time that I would try out research. I did mathematical modelling of folic acid in the body. I also did caterpillar research: literally feeding and growing caterpillars and gathering data about their weight. All this made me realize that my computational skills actually had some use in the biological domain.
Could you describe for us what Informatics means?
I think of it as the merger of how to store and process data and how to analyze it. It’s a fairly broad field, which is why people in it come from such different backgrounds: you have people from computer science and computer engineering as well as people from hardcore statistics. This also makes it a little hard to hire informaticians. In the last 3-5 years they’ve started segmenting these roles more. Nowadays you hear more about data scientists, data engineers and so on. These are all segments of informatics. It’s constantly evolving. The term data scientist is also used to describe statisticians who know how to code. A data engineer is quite different: this role is more focused on building the infrastructure that can handle the hypotheses being asked. Data scientists are the ones who actually ask or make that hypothesis, test it and then sell it.
If you wanted to explain this evolution, how would you do so?
Historically, information science or informatics is closely related to library science, where you basically figure out how to catalogue and store information. In clinical informatics it’s been mostly about how to structure and store medical data. More recently, it has been about getting results and information from unstructured data, as well as making sure that the signal you’re getting from the data is real and trustworthy, because if you lump enough data together you’re bound to get some kind of signal. It’s actually pretty similar to big data, basically the interface of big data.
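To make the point about lumping data together concrete, here is a minimal sketch (our own illustration, not something Tiffany showed us). It compares 500 columns of pure random noise against a random "outcome" measured on 20 samples and counts how many columns look correlated anyway. The function names, the sample sizes and the 0.444 cutoff (roughly the p < 0.05 level for 20 samples) are all assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spurious_hits(n_samples=20, n_features=500, threshold=0.444, seed=0):
    """Count noise features whose |correlation| with a noise outcome exceeds threshold.

    Every value is drawn independently, so any "hit" is a false signal.
    Returns (number of hits, strongest absolute correlation seen).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    correlations = []
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        correlations.append(abs(pearson(feature, outcome)))
    hits = sum(1 for r in correlations if r > threshold)
    return hits, max(correlations)


if __name__ == "__main__":
    hits, strongest = spurious_hits()
    print(f"{hits} of 500 noise features look 'significant'; strongest |r| = {strongest:.2f}")
```

Since about 5% of unrelated columns are expected to clear that cutoff by chance, the more columns you test, the more false signals you find. This is why checking that a signal is real (for example with a multiple-testing correction or a held-out data set) matters so much.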
When and why do people use words like Informatics, data scientists, big data?
One way to describe it is that previously you could fit an entire data set on your laptop and analyze it directly. Over time, people started doing this analysis in parallel, on compute clusters specifically built for analyzing data. Now, if you have so much data that it is impossible to compute on it directly because it’s so slow, you need newer technologies like Spark and Hadoop, which help you analyze this data faster. Statisticians nowadays can actually do real statistics because the data is larger. I don’t hear the word informatics as much as I used to, but rather hear data scientist a lot more. It’s important to remember that data scientists working in a corporate setting versus an academic setting might be using similar tools, but they may have different resources and be solving different problems. The overall nature of the job will be similar, though.
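The idea of analyzing data in parallel can be sketched in a few lines. This is our own toy illustration, not how Spark or Hadoop are implemented: it splits a list into chunks, lets several workers summarize each chunk, and then combines the partial results into one answer. The function names and chunk size are made up for the example, and it uses threads on a single machine purely to show the pattern; cluster tools apply the same split-then-combine idea across many machines.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, chunk_size):
    """Yield consecutive slices of `values`, each at most `chunk_size` long."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def partial_sum(chunk):
    """Summarize one chunk as (sum, count); this is the per-worker step."""
    return sum(chunk), len(chunk)


def parallel_mean(values, chunk_size=1000, workers=4):
    """Compute the mean of a non-empty list by combining per-chunk summaries."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunked(values, chunk_size)))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


if __name__ == "__main__":
    data = list(range(10_001))
    print(parallel_mean(data))
```

The key design choice is that each worker returns a small summary (a sum and a count) rather than the raw data, so combining the results stays cheap no matter how large the original data set is.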
What do you think are the applications of this field?
It’s grown a lot in the last 5-10 years. From a computational biology standpoint, it can range from impact on cancer research to automating standard quality analysis for different data types. Also, my father is a professor in finance, and he’s asked me to run models for him in R because he didn’t know how to do it himself. There are a number of different software packages people use, like Stata, and if you have more of a programming background you use R or Python and so on. It spans everything from biology to finance to social network analysis: every single social network today has some of the largest groups of informatics folks. It’s very good for their business, because a company whose product is the social network needs to know what people are clicking on. Basically, any field that deals with massive amounts of data would have someone from this background.
Going back to the three tenets of storage, processing and analysis, can you talk a little more about processing and analysis?
Let’s take the field of genome sequencing. The data generated per patient is quite large. Once that data is stored, there are a number of pre-processing steps that need to be completed before you even get to the point where the data can be analyzed. Files might be 20 times the size they need to be before you can even start analyzing them. In genomics, there’s a lot of data cleanup required. So this is generally the intermediate step involved before you can start the analysis and apply tools like machine learning and so on.
To listen to the full discussion, check out the podcast below:
Correction: Tiffany later corrected the 70% data growth statistic; it’s 30% annually. Here’s the infographic:
direct image link: https://projects.ac/blog/wp-content/uploads/2014/01/Love-your-data_Projects_s.jpg
Tiffany was generous enough to share a number of resources for interested folks to check out! Here you go:
Where can people learn and practice coding or solving problems related to Informatics online?
https://www.coursera.org/learn/machine-learning (light version of the Stanford course)
Stanford’s CS “how to practice for a technical interview” site: http://web.stanford.edu/class/cs9/
Udacity and Coursera
ImageNet Challenge (http://image-net.org/challenges/LSVRC/)
Nolan Lab and Batzoglou Labs at Stanford:
Difference between data scientist/engineer — http://blog.udacity.com/2014/12/data-analyst-vs-data-scientist-vs-data-engineer.html
Infographic on this as well: https://i1.wp.com/www.analyticsvidhya.com/wp-content/uploads/2015/10/infographic.jpg
Biomedical Data Science Department at Stanford University: https://med.stanford.edu/dbds.html
Companies that make solutions for Big Data besides the usual “Facebook, Google, etc” — http://bigdataanalyticsnews.com/10-hadoop-hardware-leaders/
This infographic is a little outdated but could be useful:
Tiffany also gave an interview recently which covers some of her work along with more of her research:
Thank you for listening and we hope you found this episode useful. If you have any questions for Tiffany or for Team Learn Educate Discover, email us at email@example.com. We will reply!! :)