Using Hive with Google fusion maps

Representing private donations to political campaigns in the USA:

I use public data on donations to political campaigns to paint a map. Red (Republicans), blue (Democrats), and grey (neither of the two) represent the party that received the most money in each state.

The donations data is available online.

Description of the data:

 http://data.influenceexplorer.com/bulk/?r

Download link:

http://datacommons.s3.amazonaws.com/subsets/td-20110812/contributions.nimsp.2010.csv.gz

A serde to parse CSV files can be found:

The following Hive CREATE TABLE statement prepares the data:
add jar csv-serde-1.0.jar;
CREATE TABLE contributions
(cycle                      INT,
transaction_namespace       STRING,
transaction_id              STRING,
transaction_type            STRING,
filing_id                   BIGINT,
is_amendment                BOOLEAN,
amount                      FLOAT,
contribution_date           STRING,
contributor_name            STRING,
contributor_ext_id          STRING,
contributor_type            STRING,
contributor_occupation      STRING,
contributor_employer        STRING,
contributor_gender          STRING,
contributor_address         STRING,
contributor_city            STRING,
contributor_state           STRING,
contributor_zipcode         STRING,
contributor_category        STRING,
organization_name           STRING,
organization_ext_id         STRING,
parent_organization_name    STRING,
parent_organization_ext_id  STRING,
recipient_name              STRING,
recipient_ext_id            STRING,
recipient_party             STRING,
recipient_type              STRING,
recipient_state             STRING,
recipient_state_held        STRING,
recipient_category          STRING,
committee_name              STRING,
committee_ext_id            STRING,
committee_party             STRING,
candidacy_status            BOOLEAN,
district                    STRING,
district_held               STRING,
seat                        STRING,
seat_held                   STRING,
seat_status                 STRING,
seat_result                 STRING)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
STORED AS TEXTFILE;

What is left to do:

  • Querying the data in Hive and generating a few CSV files
  • Creating the fusion map (Google has nice and easy tutorials for this)
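
Until the Hive query is written up, the aggregation it has to perform can be sketched in plain Python. This is only an illustration of the logic, not the actual query: the sample rows are made up, and the party codes "R" and "D" are an assumption about how recipient_party is encoded in the data.

```python
from collections import defaultdict

# Hypothetical sample rows: (recipient_state, recipient_party, amount).
# The party codes "R" and "D" are an assumption about the data encoding.
SAMPLE_ROWS = [
    ("CA", "D", 500.0),
    ("CA", "R", 200.0),
    ("TX", "R", 750.0),
    ("TX", "D", 100.0),
    ("VT", "I", 300.0),
    ("VT", "D", 50.0),
]

PARTY_COLORS = {"R": "red", "D": "blue"}


def color_by_state(rows):
    """Sum donations per (state, party) and color each state by its top party."""
    totals = defaultdict(lambda: defaultdict(float))
    for state, party, amount in rows:
        totals[state][party] += amount

    colors = {}
    for state, by_party in totals.items():
        top_party = max(by_party, key=by_party.get)
        # Any party other than the two main ones is painted grey.
        colors[state] = PARTY_COLORS.get(top_party, "grey")
    return colors


state_colors = color_by_state(SAMPLE_ROWS)
print(state_colors)
```

In Hive the same idea would be a GROUP BY on state and party followed by picking the largest sum per state.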

 

I will try to write that step by step next time.

 

 

Introduction to Hadoop

In 2004 Google published an academic paper describing MapReduce, “a programming model and an associated implementation for processing and generating large data sets”. The model was inspired by the “map” and “reduce” functions from functional programming.

 
The idea behind MapReduce is to split a task into many smaller subtasks and distribute them among multiple processing machines (a cluster of commodity hardware). In this model, map and reduce represent two different parts of the process.
 
What are a map and a reduce?

In functional programming, a map takes a list of values and a function, and applies the function to each value in the list. In pseudocode:

L = [1, 2, 3]
F = function multiply_by_ten(a) { return a * 10 }

A map is similar to:
B = []
foreach (i in L) B.append(multiply_by_ten(i))

So a map call could look like:
B = map(L, F)

And after executing the map we have: B = [10, 20, 30]
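
For comparison, here is a minimal runnable version of the same idea in Python, using the built-in map. Note that Python's map takes the function first and the list second, the reverse of the pseudocode.

```python
def multiply_by_ten(a):
    return a * 10


L = [1, 2, 3]
# map returns a lazy iterator in Python 3, so wrap it in list() to see the values.
B = list(map(multiply_by_ten, L))
print(B)  # [10, 20, 30]
```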

A reduce processes the elements of the list and returns a single value.

L = [1, 2, 3]
F = function sum(a, b) { return a + b }

A reduce is similar to:
total = 0
foreach (i in L) total = sum(total, i)

So a reduce call could look like:
B = reduce(0, L, F)
and the reduce will return 6.

The first parameter, zero, is the initial value passed to the reduce.
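
The same reduce can be tried in Python with functools.reduce. The argument order differs from the pseudocode: the function comes first, then the list, and the initial value last. The function is named add here only to avoid shadowing Python's built-in sum.

```python
from functools import reduce


def add(a, b):
    return a + b


L = [1, 2, 3]
# 0 is the initial value; reduce computes add(add(add(0, 1), 2), 3).
total = reduce(add, L, 0)
print(total)  # 6
```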

What is MapReduce?

MapReduce is inspired by the map and reduce ideas explained above. From Google’s paper:

“The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.”

They also included an example: a map and a reduce for counting the number of occurrences of each word in a large collection of documents. Map and Reduce in pseudocode would look like:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
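
To see the whole flow end to end, the word count can be simulated in a single Python process. This is only a sketch: run_mapreduce is a hypothetical stand-in for what the MapReduce library does (grouping intermediate values by key), and the input documents are made up. Nothing here is distributed.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: document name, value: document contents."""
    for word in value.split():
        yield word, "1"


def reduce_fn(key, values):
    """key: a word, values: an iterator over counts (as strings)."""
    result = 0
    for v in values:
        result += int(v)
    return str(result)


def run_mapreduce(documents):
    """Single-process stand-in for the library: map, group by key, reduce."""
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            grouped[word].append(count)
    return {word: reduce_fn(word, iter(counts)) for word, counts in grouped.items()}


# Hypothetical input documents.
docs = {"a.txt": "hello world", "b.txt": "hello hadoop"}
word_counts = run_mapreduce(docs)
print(word_counts)
```

In a real cluster the map calls run on many machines, and the grouping step moves the intermediate pairs across the network before the reduce calls start.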


HDFS and GFS

In 2003 Google published another paper that influenced the birth of Hadoop, “The Google File System” (GFS). In this paper they describe “a distributed file system for large distributed data-intensive applications”. Files are divided into chunks of 64 megabytes, and the chunks are replicated across multiple nodes for redundancy. The equivalent in Hadoop is HDFS.
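
The chunking idea can be illustrated with a small Python sketch. It only does the arithmetic and is not how GFS or HDFS is implemented: the replica count of three, the round-robin placement, the node names, and the 200 MB file are all assumptions chosen for the example.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 megabytes, as described above


def split_into_chunks(file_size):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    return [
        (offset, min(CHUNK_SIZE, file_size - offset))
        for offset in range(0, file_size, CHUNK_SIZE)
    ]


def place_replicas(num_chunks, nodes, replicas=3):
    """Assign each chunk to `replicas` nodes, round-robin (an assumption).

    The nodes are distinct as long as replicas <= len(nodes).
    """
    return {
        chunk: [nodes[(chunk + r) % len(nodes)] for r in range(replicas)]
        for chunk in range(num_chunks)
    }


chunks = split_into_chunks(200 * 1024 * 1024)  # a hypothetical 200 MB file
placement = place_replicas(len(chunks), ["node1", "node2", "node3", "node4"])
print(len(chunks), placement[0])
```

With copies of each chunk on several nodes, losing one machine does not lose any data.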

Hadoop

Hadoop is the best-known implementation of MapReduce. It is an open source project created by Doug Cutting, who named it after his son's toy elephant. Cutting was an employee of Yahoo!, where he led the Hadoop project full-time.

Although MapReduce and HDFS were initially the core of Hadoop, the project now includes a few other subprojects. Some of these subprojects are contributions from companies that, while using Hadoop, found that it could be a bit better if ... These subprojects are sometimes known as the Hadoop ecosystem.

The project includes these subprojects:
Hadoop Common: A set of components that support the other Hadoop subprojects.
HDFS: A distributed file system that provides high-throughput access to application data.
MapReduce: As we have described, a software framework for distributed processing of large data sets on compute clusters.

Other Hadoop-related projects at Apache include:
Avro: A data serialization system.
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation.
ZooKeeper: A high-performance coordination service for distributed applications.

 

 

 

 

 

 

 

 

"Introduction to Artificial Intelligence" and free online education.

The course that may have changed the world of education is over.

That may be an overstatement. After all, MIT's OCW has been there for a while, and Khan Academy deserves an incredible amount of credit.

But AI-CLASS has been a bit different: it let students feel more committed, with homework, a midterm, and a final exam. An overwhelming number of students registered, more than a hundred thousand, and the course was covered by the global media (see the New York Times).

It is also worth mentioning the video with Sal Khan, founder of Khan Academy, and Stanford professors Peter Norvig and Sebastian Thrun (teachers of the Introduction to Artificial Intelligence course). It is worth watching on YouTube.

And now that it has been proved that it can work, a few more free online courses have been made available by Stanford.

[Screenshot of the Stanford course list, 2011-12-22]

Note that the Machine Learning and the Introduction to Databases courses were already taught last term, at the same time as the AI course.

And now MIT has announced MITx.

Forbes has published an article with an interesting title: "Is Education the Next Industry That Will Be Killed by the Internet?"

What about my personal performance in the AI course? I didn't do very well compared with the rest of the students. I got a little over 95%, which places me in the top 25%. The grades have been impressive: it is estimated that around 20 thousand students finished the course, distributed as follows:

low   high   number   percentile
====   ====   ======   ==========
 0.0 - 86.7   10,000   bottom 50%
87.0 - 93.5    5,000   top 50%
93.6 - 97.6    3,000   top 25%
97.7 - 98.8    1,000   top 10%
98.9 - 99.9      800   top 5%
      100.0      200   top 1%

That means that people who got an impressive 93% only made it into the top 50%.

For readers who may not know how these grades are calculated, they use three scores:

- 30% from the top 6 homeworks. We had one homework per week (8 in total), and the best 6 count toward the grade.

- 30% of the midterm exam.

- 40% of the final exam.
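
As a worked example, the weighting can be written as a small Python function. The scores below are hypothetical, and treating the homework component as the average of the six best homeworks is my assumption about how the 30% is computed.

```python
def final_grade(homeworks, midterm, final_exam):
    """Combine scores given as percentages (0-100) using the 30/30/40 weights."""
    # Keep only the six best homework scores (assumed to be averaged).
    top_six = sorted(homeworks, reverse=True)[:6]
    homework_avg = sum(top_six) / len(top_six)
    return 0.30 * homework_avg + 0.30 * midterm + 0.40 * final_exam


# Hypothetical scores: eight homeworks, a midterm, and a final exam.
grade = final_grade([100, 90, 100, 80, 100, 100, 100, 100], midterm=90, final_exam=95)
print(round(grade, 2))
```

Here the two weakest homeworks (80 and 90) are dropped, so the homework component is 100 and the grade comes out at about 95.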

 

One more article on online education: 11 Tech factors that Changed education in 2011.

Installing ESXi 5.0 on an HP ProLiant Turion II N40L MicroServer

Or how to get ESXi 5.0 installed using a USB stick.

I used a Mac to prepare my USB stick, following the instructions given in this article. However, after following them I got this error when trying to boot the server from the stick:

“mboot.c32: not a COM32R image”

So the missing step is to create a file named "ks.cfg" on the stick with the following contents:

vmaccepteula
rootpw password
autopart --firstdisk --overwritevmfs
install usb
network --bootproto=dhcp --device=vmnic0

 

Now it works: you have a USB stick that can install ESXi 5.0 on your MicroServer.