
Several people have asked for my slides from Data Day Texas last week. Here they are!
Impressions
This was my first time at the conference. I found it a fun, high-energy crowd of real, in-the-trenches data people. In other words, I felt very much at home.
Data Day has an unusually strong contingent of graph database aficionados. In years past, graph DBs have sometimes been a separate conference and sometimes just a track within the main conference. Attendees seemed split on this aspect of the conference: some were really excited about graph DBs; others said “yeah, I don’t really go to those talks.” I didn’t hear a lot of people in the middle.
Because Data Day is only one day, and I had my talk in the morning plus office hours in the afternoon, I missed a pretty big slice of the other presentations. Regrets and apologies to all the awesome people whose talks I can’t say anything specific about.
For example, I was told that Heidi Waterhouse’s talk on the Death of Data was amazing, though I missed it myself. I also heard that the Cassandra keynote on the Next Five Years in Databases did a good job tying a bunch of different trends together.
The human-in-the-loop track
I caught the most talks from the human-in-the-loop (HITL) ML track. There’s a ton of really interesting thinking going into workflows where humans and machines learn together. We’re rapidly moving past “humans provide labels; machines optimize fit” to a whole bunch of fascinating and specific questions about how:
- Which labels?
- In which order?
- How much trust should be assigned to labels?
- At what cost?
- Exactly what is a label? Can they be linked/nested?
- When can computers suggest or hint at labels?
- How can we bring external data sources to bear?
- How do labels and models travel across domains?
- etc. etc.
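To make the “which labels, in which order?” questions concrete, here’s a minimal sketch of one common HITL pattern, uncertainty sampling: instead of labeling data in random order, the loop repeatedly asks the human for a label on the example the current model is least sure about. Everything here (the toy 1-D data, the threshold “model,” the scripted oracle standing in for a human) is an illustrative assumption, not any particular speaker’s system.

```python
import random

# Toy pool of unlabeled 1-D points; the true concept is x >= 0.5.
# The "oracle" plays the role of the human annotator in the loop.
random.seed(0)
pool = [random.random() for _ in range(200)]

def oracle(x):
    return int(x >= 0.5)

labeled = []  # (x, label) pairs the "human" has provided so far

def fit_threshold(examples):
    """Tiny model: midpoint between the largest negative and smallest
    positive labeled example (0.0 if only one class seen so far)."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    if not pos or not neg:
        return 0.0
    return (max(neg) + min(pos)) / 2

# Seed with two random labels, then always query the *most uncertain*
# point: the unlabeled pool item closest to the current threshold.
for x in random.sample(pool, 2):
    labeled.append((x, oracle(x)))

for _ in range(10):
    t = fit_threshold(labeled)
    seen = {lx for lx, _ in labeled}
    query = min((x for x in pool if x not in seen),
                key=lambda x: abs(x - t))
    labeled.append((query, oracle(query)))  # the human-in-the-loop step

print(round(fit_threshold(labeled), 3))
```

With only a dozen labels, the learned threshold lands near the true boundary at 0.5, because each query is spent where the model is most confused rather than on easy, redundant examples. Each of the questions above tweaks a knob in a loop like this: the query-selection rule, the trust placed in each answer, the cost per query, and so on.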
Since each of these questions also implies a set of UX and infrastructure decisions, there turns out to be a lot of space for innovation. I left wondering how general the solutions will turn out to be: will we settle into a handful of widely adopted patterns for training HITL models? Or will the right solutions be different for every organization?
IMO, the jury’s out, and it’s going to have a big impact on what data science looks like 5 years from now.