Data Science is the science of getting the most out of the data within an organisation.
It uses elements of Statistical Sciences (especially Data Mining), Machine Learning, and Database Science as well as Web technologies and Business Development. In a sense it pulls the whole lot together to get knowledge out of data. It also includes the over-hyped term "Big Data" which we take to mean "all sorts of data and maybe lots of it."
We cover most of those topics elsewhere on the website; here we focus mainly on the database science aspects. Data Science does seem to be more connected with databases and Computer Science than with Statistics.
Perhaps the best way to define "all sorts of data" is to define the term "NoSQL". It is usually read as "not only SQL", which in practice means "not just numbers in a table": it can include (especially) text, documents, images, graphs (network links), metadata, and so on.
More examples on this topic below when we talk about specific databases.
Our interest in Cloud Computing lies with using cloud Machine Learning services such as BigML.com. Because of the CPU requirements we also use cloud computing services such as Amazon Web Services (AWS); an example is running R on AWS.
The Revolution Analytics Enterprise version of R (see below) has extra features for "bigger" data sets. This includes the ability to analyse data sets of 1, 2 or even 16 terabytes (according to the specs; we haven't tested that ourselves).
We can also deploy MATLAB applications to computing clusters.
MapReduce is a specific example of distributed computing: a way of splitting up large computing jobs and distributing the workload. Hadoop is a useful way of doing this but not necessary - we have used MapReduce with Python.
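To show that MapReduce is an idea rather than a Hadoop feature, here is a minimal sketch of the pattern in plain Python: a map step that emits key-value pairs, a shuffle step that groups them by key, and a reduce step that aggregates each group (the classic word-count example).

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in document.lower().split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values (here, sum the counts)."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["the cat sat", "the dog sat"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(shuffle(pairs))
# counts == {"the": 2, "cat": 1, "sat": 2, "dog": 1}
```

In a real distributed job the map and reduce steps run on different machines and the shuffle happens over the network; frameworks like Hadoop handle that plumbing, but the programming model is exactly the three functions above.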
By "all that" we mean the Hadoop ecosystem and other technologies for doing similar things. For example, we like Apache Spark, which is an improvement on MapReduce.
Even though many company databases are not really large enough to require MapReduce technology, there are advantages to using the ecosystem. So much open source code and so many resources have been developed for machine learning that it can be easier to use the heavier machinery than to start at the lighter end.
MongoDB is a NoSQL database that has a comprehensive interface with R. It is good for storing documents.
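To illustrate why a document store suits this kind of data, here is a sketch of the shape of a MongoDB-style document, built as a Python dict and round-tripped through JSON (all field names and values below are invented for illustration):

```python
import json

# A MongoDB document is a nested, JSON-like structure. Unlike a row in a
# relational table, fields can vary from one document to the next, and
# values can themselves be lists or sub-documents.
report = {
    "title": "Quarterly sales review",
    "author": {"name": "A. Analyst", "team": "Data Science"},
    "tags": ["sales", "Q3"],
    "sections": [
        {"heading": "Summary", "text": "Sales were up."},
        {"heading": "Detail", "text": "See attached figures."},
    ],
}

# Round-trip through JSON to show the document is self-describing.
serialised = json.dumps(report)
restored = json.loads(serialised)
```

With the pymongo driver, storing such a dict is a one-liner along the lines of `collection.insert_one(report)`; MongoDB imposes no fixed schema, so a second report with different sections or extra fields can sit in the same collection.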
Another popular NoSQL database is Cassandra, which also has an R interface (though not as comprehensive yet). It was originally developed by Facebook. We like it but haven't used it yet.
We use the Revolution Analytics (http://revolutionanalytics.com) Enterprise version of R, which has extra features for "bigger" data sets. This can help get around one of R's limitations: it works best when all the data fits into the computer's RAM (or, even better, into half the RAM).
Compared to the CRAN version of R, the Revolution version uses the Intel Math Kernel Libraries and can run a bit faster - sometimes significantly so when multiple cores come into play.