Big Data and Me

Summary of Characteristics and Attributes of Machine Learning

I have to disagree with Mr. Kobielus's analogy of log data to dark matter. The three V's of log data (volume, velocity, and variety) may be beyond the scope of human comprehension, but with machine learning, log data becomes palatable, whereas dark matter remains incomprehensible despite our best technological efforts to quantify it. This is not to say that, at face value, the two are not equally mysterious. Much like the mysteries of the cosmos, predictive data lost in a sea of data seems elusive to the human eye.
Despite its shortcomings, human ability has provided the tools with which we can corral this seemingly infinite data. By way of algorithms, computers can not only detect important data sets missed by the human eye, but they can also learn to become more proficient at finding those sets. Perhaps the best way to go about this process is unsupervised learning, in which a computer clusters, compresses, and summarizes data so that a human mind can begin to comprehend it.
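
To make the idea of clustering and summarizing a little more concrete, here is a minimal sketch of my own, not anything taken from Mr. Kobielus. It assumes a made-up list of response times pulled from a log file and uses a bare-bones one-dimensional k-means, which is only one of many unsupervised methods. The function name, the starting centers, and the numbers are all hypothetical.

```python
from statistics import mean

def kmeans_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move each center
    to the mean of the values assigned to it. Repeat a fixed number of times."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical response times (in ms) from a log file; not real data.
response_times = [12, 15, 11, 14, 13, 480, 510, 495, 16, 12]

# The starting centers are an arbitrary guess chosen for this illustration.
centers, clusters = kmeans_1d(response_times, centers=[0, 100])

# The "summary" a person would actually read: one line per group.
for center, members in zip(centers, clusters):
    print(f"about {center:.0f} ms: {len(members)} requests")
```

Ten raw numbers become two groups, a large one of fast requests and a small one of very slow requests, which is the kind of compression that lets a person start asking why the slow ones happened.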

Despite the cold and inhuman sound of the name "machine learning," Hanna Wallach outlines how it can provide some very human results. She also attempts to clarify what exactly big data is. For example, how does big data as we know it differ from the large amounts of data that come from the field of particle physics? Hanna sources a few quotes for the definition of big data, the best of which is: the amassing of huge amounts of statistical information on social and economic trends and human behavior. In other words, as she states, “unlike data sets arising in physics, the data sets that typically fall under the big data umbrella are about people…”. I think this is the crux of big data, and many people may miss it.

One way that machine learning can, in some ways, be more human than human is in how it can find smaller subsets within large amounts of data. For example, minority members of a population might not be well represented in a data set, but people are still interested in learning what their statistics have to tell us. Machine learning can drill down into the data to discover and analyze these small sets for that purpose, whereas a human eye, or even recent technologies, would find it difficult to sift through and find the data that represents minorities. Hanna describes this in terms of the granularity of the data, meaning that it is looked at from a micro as well as a macro level.
The overarching theme of Hanna's message is that, in order to provide real improvement to our society via data mining, the social sciences will need to be combined with computer science, and social questions will need to be prioritized over data availability. She suggests a question-driven approach instead of working inductively or deductively from the data itself. She feels that this is the best way to avoid missing information of small granular size.
Perhaps the best models for reaching the goal of social improvement through data mining are exploratory and explanatory ones. It is through these models that the more typical goal of prediction can be achieved when dealing with highly granular data, such as that of minorities in a large population. Unfortunately, human biases will play into these as well.
Another important aspect of Hanna's message is the uncertainty that comes with the minorities of a data set. She feels that uncertainty should be expressed in order to properly account for those smaller data sets within a large one. Luckily, many machine learning methods do just that.
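
As a rough illustration of why expressing uncertainty matters for small groups, here is a sketch of my own rather than anything from Hanna's talk. It assumes two made-up groups that both answer "yes" 60% of the time, one with 1,000 people and one with only 20, and compares a standard Wilson score interval for each. The group sizes and counts are hypothetical.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# Hypothetical survey counts; both groups have the same 60% "yes" rate.
majority = wilson_interval(successes=600, n=1000)
minority = wilson_interval(successes=12, n=20)

for name, (low, high) in [("majority", majority), ("minority", minority)]:
    print(f"{name}: between {low:.0%} and {high:.0%}")
```

Both groups have the same headline number, but the interval for the smaller group is several times wider. Reporting only "60%" for each would hide how much less we actually know about the minority.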

Thursday, March 12, 2015
