I think the first reason that visualization is so important to researchers has to do with just how big “big data” is. A number standing by itself says very little. Regardless of how big that number is, without anything to compare it to, it is hard to wrap your brain around it. So, if someone were to provide a number to represent the number of stars in the sky, that number would quickly be forgotten, and more importantly, it would not have been fully understood in the first place. Now, if you were to show a graph comparing the number of stars in the sky to the number of grains of sand on earth and the number of people on earth, this would provide context. That is one thing visualization can do quickly and in a way that is easily remembered.
The human brain works best as a pattern recognizer and trend spotter. This is a tendency that has been honed over many generations, since our days as hunter-gatherers. Despite being a species that rarely hunts or gathers anymore, the tendency persists in our cognitive functions and is a big part of why we respond so well to visualization: it is just easier to spot patterns when information is given in visual form than when we hear it or read it as text.
Properly visualizing data (and thus properly understanding it) is imperative to the process of collecting and refining data. Without visual aids such as a plot of a statistical distribution, a data scientist would find it difficult to understand the data they are working with in a timely manner. With these tools, one can, for example, quickly see statistical properties of the data set such as its center, spread and skew.
Additionally, visualization allows people who may not be savvy about a particular process to obtain a reasonable understanding of it in a relatively short amount of time. For example, one may not know the ins and outs of a financial report, but if they see a bar labeled expenses that is taller than the bar labeled income, even a novice will be able to understand that the business is in the red.
Now, I have little doubt that visualization is one of the best ways to convey information in a digestible form, but that is not to say that the visualization process is in itself easy. To maximize its benefits, a whole array of professionals, including psychologists, statisticians and computer scientists, have contributed to the art of visualization over the years. One important factor is aesthetics. The colors, shapes and placement of objects must be chosen so that the graphic is easy to read and draws the viewer to the areas where the information ties to metrics or other information, so that the viewer can decode it quickly.
In addition to aesthetics, the type of graph chosen for particular data is also very important. A data scientist could not present time series information as well with a pie chart as with an index chart. Some types of information can be aggregated together on one chart, while other information is best interpreted when it stands alone, as in a histogram.
And that is why visualization is so important to people who work in big data. They use it not only to do their jobs better, but also to better articulate their findings to the main audience of that data. Data scientists need a way to bridge the gap in knowledge when it comes to big data, whether internally or externally, and visualization tools are some of the best tools for that job.
ggplot in R
For the purpose of this project, I will be using the diamonds data from the ggplot2 library. I’ve decided to just use the ordered factor variable cut (Fair, Good, Very Good, Premium and Ideal, from lowest to highest) for my graphs.
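A minimal sketch of how the cut variable could be inspected before plotting, assuming the ggplot2 package is installed. The variable names (cut_levels and so on) are only illustrative and are not from the original post.

```r
library(ggplot2)  # the diamonds data set ships with ggplot2

# Level names of cut, listed from lowest to highest quality
cut_levels <- levels(diamonds$cut)

# TRUE when the factor carries an ordering, not just labels
cut_is_ordered <- is.ordered(diamonds$cut)

# Number of diamonds recorded for each cut
cut_counts <- table(diamonds$cut)

print(cut_levels)
print(cut_is_ordered)
print(cut_counts)
```

Looking at the levels first makes it easier to check later that a plot puts the cuts in the expected order.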

I then realized that I really don’t have a strong knowledge of what exactly a factor is in R. So, I took advantage of the help function in R to read up on it.
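For readers in the same spot, here is a small sketch of what a factor is, using a made-up vector of sizes (the data and names are hypothetical, not from the diamonds set). In an interactive session, ?factor or help("factor") opens the help page mentioned above.

```r
# ?factor   # opens the help page in an interactive R session

# A factor stores categories as integer codes plus a set of labels (levels).
# ordered = TRUE additionally records that small < medium < large.
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

size_levels <- levels(sizes)      # the labels, in their declared order
size_codes  <- as.integer(sizes)  # the underlying integer codes
largest     <- max(sizes)         # max() is defined because the factor is ordered

print(size_levels)
print(size_codes)
print(largest)
```

The cut column of diamonds is the same kind of object: an ordered factor with five levels.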


Ok, so now that I am a little wiser about the data I am working with, I am going to attempt to plot the data that I have selected. I have used the head function to return the top six objects (diamonds) from the data set “diamonds”, and I will map the cut of each diamond as the aesthetic for my plots.
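The original screenshot is not available, so the following is only a sketch of what the head call might look like, assuming ggplot2 is installed. The name first_rows is illustrative.

```r
library(ggplot2)

# head() returns the first six rows of a data set by default
first_rows <- head(diamonds)

# Show a few of the columns, including the cut factor used in the plots
print(first_rows[, c("carat", "cut", "color", "clarity", "price")])

# The cut of each of those six diamonds, as plain text
cut_of_first_rows <- as.character(first_rows$cut)
print(cut_of_first_rows)
```

Note that head only previews the data. A plot built from first_rows would cover just those six diamonds, so passing the full diamonds data set to ggplot is the usual choice when all cuts should be counted.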

Now, I have to decide what type of plot to use. For this, I am going to reference the suggested website, http://docs.ggplot2.org. I initially tried using “geom_area” to get an area plot, but R gave me a blank page in return.
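The exact call that produced the blank page is not shown, so the cause can only be guessed at. One possibility is that geom_area needs a y value as well as an x value, and cut on its own supplies only x. A hedged sketch that works around this by counting the diamonds per cut first (cut_counts and area_plot are illustrative names):

```r
library(ggplot2)

# Count diamonds per cut; the result has the columns cut and Freq
cut_counts <- as.data.frame(table(cut = diamonds$cut))

# group = 1 tells ggplot2 to connect the five discrete cuts into one area
area_plot <- ggplot(cut_counts, aes(x = cut, y = Freq, group = 1)) +
  geom_area(stat = "identity")

print(area_plot)
```

An area plot suggests a continuous progression along the x-axis, so for a categorical variable like cut a bar chart is usually the clearer choice.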


So, now I am going to try a histogram. I’m not sure if I have to re-load the “head(diamonds)”, but I did it anyway and got the result I was looking for.
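A sketch of how the counts per cut could be drawn, assuming ggplot2 is installed. The original call is not shown. Because cut is a categorical variable, this sketch uses geom_bar, which counts rows per category; current versions of ggplot2 expect a continuous variable for geom_histogram, so this is a swapped-in layer rather than necessarily the one used in the post.

```r
library(ggplot2)

# One bar per cut, with bar height equal to the number of diamonds of that cut
cut_bars <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()

print(cut_bars)
```

The bars appear in the order of the factor levels, which is why the ordering of cut carries over to the x-axis.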


Alright, it looks like I am on the right track here, as I see that all five levels of cut are represented on the histogram, and they are ordered as I expected them to be.
Despite the area plot not working, I want to try at least one other type before I move on to trying out different colors. I first tried a bar plot, but that looked just like the histogram. So, I got the density graph (whatever that is) to work also. And again, everything appears to be in order.
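For context, a density graph is a smoothed version of a histogram: it estimates how a continuous variable is distributed. The call used in the post is not shown, so the sketch below is an assumption. It uses the price column of diamonds, which the post does not plot, because price is continuous and makes the idea easier to see. One curve is drawn per cut.

```r
library(ggplot2)

# Smoothed distribution of price, with a separate colored curve for each cut
price_density <- ggplot(diamonds, aes(x = price, colour = cut)) +
  geom_density()

print(price_density)
```

The area under each curve is 1, so the curves compare the shape of the price distribution across cuts rather than the number of diamonds in each cut.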


Alright, I’m going back to the histogram now. But I am going to add the variable clarity alongside cut, along with a color fill, to (hopefully) present a histogram that shows both cut and clarity on the same graph.
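A sketch of one way to combine the two variables, assuming ggplot2 is installed: clarity goes on the x-axis and cut is mapped to the fill color. The exact call from the post is not shown. With geom_bar, the colored segments are stacked within each bar by default rather than drawn on top of one another.

```r
library(ggplot2)

# Cross-tabulation of clarity (rows) by cut (columns), to check the counts
clarity_by_cut <- table(diamonds$clarity, diamonds$cut)
print(clarity_by_cut)

# One bar per clarity grade, split into colored segments by cut
clarity_cut_plot <- ggplot(diamonds, aes(x = clarity, fill = cut)) +
  geom_bar()

print(clarity_cut_plot)
```

Mapping cut to fill is also what produces the legend, since ggplot2 adds a legend for any aesthetic that is mapped to a variable.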


I’m pretty happy with the above graph. It shows the diamonds grouped by clarity along the x-axis and has the cuts for each of those clarities represented by the colors, with a legend on the right-hand side.
 