Monday, February 23, 2015

Summary

Everything You Wanted to Know About Data Mining but Were Afraid to Ask

In much the same way that technology in general has shrunk the world, its outgrowth known as data mining has expanded our ability to discern information about people based on their actions. Whereas in the past it took a one-on-one personal relationship to know your customer, today we know plenty about our customers without even having to know their names. It is this monetization of data that is truly driving its progress.
When working with very small amounts of data, anomalies and patterns are relatively apparent. But applying those same intuitive processes to big data would be like looking for a needle in the proverbial haystack. Enter data mining: the process of drilling down through large amounts of data in order to find a pattern, cluster, relationship, or anything else that has predictive capabilities. This is where the money is. Once one can make a reasonable prediction, the rest is just directing marketing efforts in the same direction.
The methods vary; whether it is association learning, cluster detection, or classification, one thing is for sure: with so much value to be had, there is no end in sight for big data analysis.
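As a rough illustration of one of these methods, cluster detection, here is a minimal k-means sketch in plain Python. The customer data (visits per month, average spend) and the two-cluster split are hypothetical, made up purely for the example.

```python
from math import dist
from statistics import mean

def kmeans(points, k, iters=20):
    """Cluster 2-D points into k groups by repeatedly assigning each
    point to its nearest centroid and recomputing the centroids."""
    centroids = points[:k]  # naive initialization: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster;
        # keep the old centroid if a cluster came up empty.
        centroids = [
            (mean(x for x, _ in c), mean(y for _, y in c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Hypothetical customer data: (visits per month, average spend)
customers = [(1, 10), (2, 12), (1, 11), (9, 90), (10, 95), (11, 88)]
low, high = kmeans(customers, 2)
```

Here the algorithm separates the occasional low-spend customers from the frequent high-spend ones, which is exactly the kind of grouping a marketing effort could then be directed at.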

Big data blues: the dangers of data mining

It is quite possible that data mining will start to see diminishing returns at the point where customers become uncomfortable with their personal information being collected. We certainly aren’t there yet, as the majority of people consent (although many times indirectly) to the use of their data for business purposes. This may not, however, always be the case.
The best way to avoid any potential customer blowback is for a company to be honest about what data it is collecting and what it is doing with it. Some companies have gone so far as to make this information publicly available online. The inverse would be a company collecting and using data in a way that would offend a customer while hiding that fact. This can have, and has had, negative effects on business/customer relations.
Another issue that arises is the complexity of the policies that many companies have. Does anyone really read those policy disclosures, or do we all just skip to the end and accept? Even if one did read the entire policy, would they understand it? Chances are that unless that person has some professional expertise, no. This raises a question of fairness, as a concerned consumer may not even have the ability to be an informed consumer.
Despite these shortcomings, data mining is a mutually beneficial practice that streamlines both the consumer and business experience. For these reasons, many are confident that an agreement can be reached between business and consumer that will allow both parties to be comfortable. This will most likely come in the form of standardized codes of conduct for businesses to follow. Standardized codes would be good because instead of consumers having to understand the policies of each and every business they deal with, they would just need to be familiar with the code of conduct. And as long as a business lives up to these codes, all will be well. This will require action on the part of the consumer in the event of a breach of these codes, because if businesses can get away with breaking them, they will.

Sunday, February 15, 2015

What is Data Mining? / How can data mining help any business?

According to Williams, “Data mining is the art and science of intelligent data analysis,” and its aim is to discover meaningful insights and knowledge from data. Though accurate, the previous description hardly sums up the effect that data mining is having on the world around us. These effects certainly appear to be positive on the surface, but is big data minding our privacy while mining our data? Or can the two realistically even be separated?
Let’s start off with the naysayers of the data mining community. Crawford writes, “We are now faced with large-scale experiments on city streets in which people are in a state of forced participation, without any real ability to negotiate the terms and often without the knowledge that their data are being collected.” And Crawford is right. In the age of data mining, nearly every decision you make, whether it is what to eat for lunch or where you buy a shirt, is logged along with some information about yourself and later analyzed. This information will come full circle back to you by way of a direct mailer, a popup ad, a solicitation phone call, or one of many other formats. This gathering and then drilling through large amounts of information for a specific purpose (even if that purpose is not yet known) is called data mining.
Privacy issues aside, there are, from a business perspective, many great advantages to mining data. As Mascot puts it, “big data has leveraged big ROI.” What he means by this is that the time and money spent on gathering, analyzing, and making predictions based on large amounts of data is well worth the price of admission.

One such case is that of the Carolinas HealthCare System, which “purchases the data from brokers who cull public records, store loyalty program transactions, and credit card purchases.” “The idea is to use Big Data and predictive models to think about population health and drill down to the individual levels.” Once at the individual level, recommendations can be made by comparing one’s medical records with their personal spending and, thus, lifestyle choices. Though considered highly intrusive by some, one could hardly argue against the benefit of ensuring that someone with diabetes does not purchase too much candy. This will allow the hospital to streamline procedures and ultimately reduce costs.


Outline one data mining technique as discussed by Rijmenam (2014) and Williams (2011) and provide its benefits and negative aspects.

Williams highlights the idea of a data mining team. In this framework there are specialized players working together toward the goals of the overall project, such as the data miners, domain experts, and data experts. Together they mine usefulness out of data. The downside to this framework is that there typically isn’t any industry expertise among the data specialists. This issue can be remedied through a series of meetings but could prove fatal to a project if not addressed correctly.
Rijmenam outlines a classic statistical method used in data mining known as regression analysis. Regression analysis tries to define the dependency between variables, and the resulting models are highly useful for making predictions. The downside to regression analysis is that it assumes a one-way causal effect from one variable to the response of another. In other words, this type of analysis can show that one variable depends on another, but not vice versa.
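As a sketch of what regression analysis looks like in practice, here is a simple ordinary-least-squares line fit in plain Python. The advertising-spend and sales figures are invented for illustration; the point is only the one-way form of the model, predicting the response (sales) from the explanatory variable (spend).

```python
def fit_line(xs, ys):
    """Return slope and intercept minimizing squared error for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

spend = [1, 2, 3, 4, 5]             # hypothetical ad spend (thousands)
sales = [2.1, 4.0, 6.2, 7.9, 10.1]  # hypothetical observed sales (thousands)
a, b = fit_line(spend, sales)
predicted = a * 6 + b  # predict sales at a spend of 6
```

Note the asymmetry the text describes: the fitted line predicts sales from spend, but the same fit tells you nothing about whether spend responds to sales.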

Thursday, February 12, 2015

The State of Hadoop 2014: Who’s using it and why? By Mike Wheatley

Hadoop is crouched in the attack position, ready to take over the world of big data. Despite being over eight years old and having tremendous upside (due to its versatility), Hadoop has yet to meet its full potential and really hasn’t exploded onto the big data scene as some expected. There are conflicting reports of just how widely used it is today, but everyone seems to agree that it is the platform on which the future of data will be launched.
Perhaps the reason Hadoop has yet to surpass the likes of Microsoft SQL Server or Oracle has to do with a fear of change among the user base. But this fear is destined to give way to the versatility of Hadoop. The majority of early adopters have come from the analytics, advertising, and security sectors. Forecasts trend toward Hadoop being a major player in every field that deals with data by 2020, with projections having its market value growing from 2 billion to 50.2 billion dollars by then.

Facebook trapped in MySQL ‘fate worse than death’ by Derrick Harris

As a result of the rapid growth of Facebook (along with several other web-based companies), a problem has arisen: a company built on the technology of its day will most likely remain tied to that technology (to some degree) without a major overhaul. Major overhauls are typically not possible for a company whose product is web based and moves in real time. Despite many patch-type solutions, when your core program is outdated, problems are inevitable.
Though Facebook’s use of MySQL will eventually have to go the way of the dodo, there are possible solutions out there. One of those solutions is a platform known as NewSQL, which is designed for the next generation of Web 3.0 applications.
Of course, the same problem will occur in a potentially perpetual cycle. And this is what makes the big data game so exciting. There is no end, no throwing your hands up in victory and saying: we’ve won. As technology advances, so will our need to accommodate the data that it produces.

Monday, February 2, 2015

Summary: What is MySQL? MySQL is a DBMS developed, distributed, and supported by Oracle. As with all databases, it can function as a tool for everything from personal organization to data mining for a large corporation. Unlike a simple list used to keep track of information, MySQL uses tables (to separate the types of data) and relationships to link them back together in a way that provides maximum utility. This is done by way of Structured Query Language (SQL), the language used to define specific types of relationships between the data. Perhaps the best thing about MySQL is that it is an open source program, which means that anyone can access and adjust the program for free. Also, the ability to multi-thread different programs, libraries, tools, and backends increases the function of this program. This functionality is what makes it so widely used today. As a result of the unencumbered programming that comes from being open source, there are several sets of features available for MySQL.

Introduction to MySQL: In the broadest terms, a database is a way to store and retrieve data. This is most efficiently done by using a relational system. A relational database is a structured collection of tables, in which key fields and cross-reference tables link the tables together. It is important for each table to hold a specific set of information and to let the tables join back together in a report or query. MySQL is widely used by everyone from Facebook to corporations to individuals. MySQL runs on commands entered at a prompt (e.g., “-u” specifies the user you are logging in as, and “;” ends a command). To create a table, you would use CREATE TABLE, followed by the pertinent information. One thing to keep in mind is that parts of the code are case sensitive.
To query data in an existing database, you would use the SELECT statement, followed by the pertinent information (e.g., SELECT * FROM students WHERE name = 'Joe';). Remember that the “;” is important, as it signifies the end of the command. You can also pull records based on numeric value (e.g., SELECT sid, name FROM students WHERE GPA >= 3.5;). The functions of the SQL language are too many to name, but it is important to note that the language will enable you to create a database suited to your needs.

Big Data: New Tricks for Econometrics: As computing power changes, so does our ability to analyze data. Classical methods of data analysis, such as the multiple linear regression model, may not always be sufficient for the amount of data and the number of variables that we are able to capture today. Not only are classical methods becoming antiquated, but so are the relatively new DBMSs. A solution to this is NoSQL. NoSQL is a system that focuses less on the ability to manipulate data but is able to handle large amounts of it. This fits the needs of many modern-day companies such as Google and Facebook. With this much data available, the sheer amount becomes an issue. There are several programs that work in conjunction with the DBMS to “cleanse” the data, making it more palatable for producing reports for predictions, summarizations, estimations, and hypothesis testing. Alongside these programs, there are several new methods for analyzing data that differ from the classical linear models, among them CART, random forests, and LASSO. Considering that the NoSQL model has little data manipulation ability and that there is so much data to analyze, this fits perfectly with the technology of machine learning, in which extensive testing is done to ensure models that perform well outside of the test data sample. Additionally, all of this works well with the discipline of economics, as it has so much data and so many variables.
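The SQL statements discussed above can be tried end to end. The sketch below uses Python’s built-in sqlite3 module, an in-memory SQLite database rather than a MySQL server, so minor syntax details may differ; the students table and its sid, name, and GPA columns are illustrative, matching the examples in the text.

```python
import sqlite3

# In-memory SQLite database standing in for a MySQL server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE, followed by the pertinent information.
cur.execute("CREATE TABLE students (sid INTEGER PRIMARY KEY, name TEXT, GPA REAL)")
cur.executemany(
    "INSERT INTO students (sid, name, GPA) VALUES (?, ?, ?)",
    [(1, "Joe", 3.9), (2, "Ann", 3.2), (3, "Max", 3.7)],
)

# Pull records by name, then by numeric value, as in the examples above.
joe = cur.execute("SELECT * FROM students WHERE name = 'Joe'").fetchall()
honors = cur.execute("SELECT sid, name FROM students WHERE GPA >= 3.5").fetchall()
```

The first query returns Joe’s full row; the second returns the sid and name of every student with a GPA of at least 3.5.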
The future is bright for big data. As computer-generated representations of our lives continue to grow exponentially, so will the need for a solid understanding of big data. Whether business, personal, economic, or geopolitical, big data affects our work and will continue to do so.