Oslo Summer School in Comparative Social Science Studies 2016
Data Science for the Social Sciences
Pål Sundsøy, Senior Data Scientist at Telenor Research, Norway
Bjørn-Atle Reme, Research Economist at Telenor Research, Norway
Main disciplines: Economics, Political Science, Big Data Analytics
Dates: 1 - 5 August 2016
Course Credits: 10 pts (ECTS)
Limitation: 20 participants
NOTICE! This PhD course was cancelled!
The availability of large datasets (often referred to as Big Data) has opened the possibility to improve our understanding of society and human behavior.
Studies have shown that Big Data has the potential to improve health policies, help us understand large-scale social networks, improve the efficiency of poverty prediction, enable large-scale experiments and improve the understanding of urban development, for instance by looking at people’s mobility patterns. Data-driven methods have also outperformed traditional marketing approaches.
Big Data, data mining, predictive analytics and machine learning are buzzwords that are “everywhere”, now also in social science. In this course we will give an introduction to these topics, present interesting studies that use such methods, and discuss how they may influence social science research in the future. Some knowledge of statistical analysis is required to get the most out of the course. Some familiarity with statistical software and simple coding is an advantage, but not necessary.
Key skills in Data Science include asking the right questions, manipulating data sets, building models and creating visualizations to communicate results. This course covers the concepts and tools you’ll need throughout the entire data science pipeline, from asking the right kinds of questions, through getting and cleaning data, to building models and visualizing data. The primary focus of the course will be on applied analysis, and we will apply the skills learned to real-world data.
This course will address four main issues:
- What are “Big Data Analytics”, “machine learning” and “Data Science”?
- What's in it for the Social Sciences?
- What tools and methods are commonly used in Data Science?
- How do we practically train, test and evaluate models in Data Science?
Specific requirements for admission to the course
We recommend installing R, and specifically RStudio, beforehand. Python would work equally well, but we prefer to focus on one scripting language when going through examples.
- Liran Einav, Jonathan Levin, The Data Revolution and Economic Analysis, Chapter in NBER book
- Hal R. Varian, Big Data: New Tricks for Econometrics, Journal of Economic Perspectives, vol. 28, no. 2, 2014
- David Lazer et al., Computational Social Science, Science, 6 February 2009: 721-723
- Chapters 1 and 2 in “An Introduction to Statistical Learning with Applications in R” by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. (Particularly interested readers are encouraged to also read chapters 3, 4, 5 and 6, but these chapters are not essential for this course.)
Session one: What is Data Science, and is it relevant for Social Sciences?
In this session you will get an introduction to the concepts of “Data Science” and “Big Data Analytics”, how they have been applied historically, and future trends. We will go through some prominent papers that apply Data Science methods to various applications.
- Bengtsson, L. et al. “Improved Response to Disasters and Outbreaks by Tracking Population Movements with Mobile Phone Network Data: A Post-Earthquake Geospatial Study in Haiti”, PLOS Medicine, 2011.
- Blumenstock, J. et al. “Predicting poverty and wealth from mobile phone metadata”, Science, 2015.
- Palchykov, V. et al. “Sex differences in intimate relationships”, Scientific Reports, 2012.
- Wesolowski, A. et al. “Impact of human mobility on the emergence of dengue epidemics in Pakistan”, PNAS, 2015.
- Eagle, N. et al. “Network Diversity and Economic Development”, Science, 2010.
- Aral, S. et al. “Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks”, PNAS, 2009.
- Kramer, A. et al. “Experimental evidence of massive-scale emotional contagion through social networks”, PNAS, 2014.
- Gonzalez, M. et al. “Understanding individual human mobility patterns”, Nature, 2008.
- de Montjoye, Y.-A. et al. “Predicting personality using novel mobile phone-based metrics”, LNCS, 2013.
- Sundsøy, P. et al. “Big Data-Driven Marketing: How machine learning outperforms marketers' gut-feeling”, LNCS, 2014.
Session Two: The Data Scientist’s toolbox - software and hardware
In this session you will get an introduction to the main tools and ideas in the data scientist's toolbox. The session gives an overview of the data, questions, and tools (commercial and open-source) that data analysts and data scientists work with.
Session Three: R and SQL: an introduction
In this session you will learn how to program in R and how to use R for effective data analysis. The session covers practical issues in statistical computing, including programming in R, reading data into R, accessing R packages and writing R functions. We will work by looking at examples. You will also become familiar with using SQL for data preparation in R.
- R Resources: Download R, R-project site, RStudio
- Ref Cards: CRAN
- SQL in R: sqldf package
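To give a feel for SQL-based data preparation, here is a minimal sketch. It is written in Python with the standard-library sqlite3 module rather than in R with sqldf, purely for illustration; the table name `calls`, its columns and the rows are made-up assumptions, not course data. The query filters out missing values and aggregates per group, which is a typical preparation step.

```python
import sqlite3

# Hypothetical toy data: (city, plan, minutes); None marks a missing value.
rows = [
    ("Oslo", "prepaid", 120.0),
    ("Oslo", "postpaid", 80.0),
    ("Bergen", "prepaid", None),
    ("Bergen", "postpaid", 60.0),
    ("Bergen", "prepaid", 30.0),
]

# Load the rows into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (city TEXT, plan TEXT, minutes REAL)")
conn.executemany("INSERT INTO calls VALUES (?, ?, ?)", rows)

# Drop rows with missing minutes, then count and average per city.
summary = conn.execute(
    "SELECT city, COUNT(*) AS n, AVG(minutes) AS avg_minutes "
    "FROM calls WHERE minutes IS NOT NULL "
    "GROUP BY city ORDER BY city"
).fetchall()
conn.close()

for city, n, avg_minutes in summary:
    print(city, n, avg_minutes)
```

The same SELECT statement could in principle be run against a data frame with sqldf in R; the exact call is left out here because it depends on how the data is loaded.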
Session Four: Getting, exploring and cleaning data
This session will cover the basics needed for exploring, cleaning, and preprocessing data. As the saying goes: garbage in, garbage out. This step is often said to account for 80% of the work in a data mining process.
- Sections “Visualizations”, “Pre-Processing” and “Data Splitting”
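As an illustration of pre-processing and data splitting, the following is a minimal sketch in Python using only the standard library. The records, the helper names (`drop_missing`, `train_test_split`) and the 25% test fraction are assumptions chosen for the example. Note that the scaling parameters are estimated on the training part only and then applied to both parts, so that no information from the test data leaks into the pre-processing.

```python
import random
from statistics import mean, stdev

def drop_missing(rows):
    """Keep only the rows that have no missing (None) fields."""
    return [r for r in rows if None not in r]

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical raw records: (age, income); None marks a missing value.
raw = [
    (23, 310.0), (35, None), (41, 520.0), (29, 400.0), (None, 450.0),
    (52, 610.0), (38, 480.0), (47, 550.0), (31, 390.0), (26, 350.0),
]

clean = drop_missing(raw)
train, test = train_test_split(clean, test_fraction=0.25, seed=0)

# Estimate centering and scaling parameters on the training data only.
income_mean = mean(r[1] for r in train)
income_sd = stdev(r[1] for r in train)

# Apply the same transformation to both parts.
train_scaled = [(r[1] - income_mean) / income_sd for r in train]
test_scaled = [(r[1] - income_mean) / income_sd for r in test]

print(len(clean), len(train), len(test))
```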
Session Five: Data Mining: cross-validation, overfitting, regularization, variance/bias
Prediction and machine learning are among the most common tasks performed by data scientists and data analysts. This session will cover the basic components of building and applying prediction functions, with an emphasis on practical applications. It will provide basic grounding in concepts such as training and test sets, overfitting, cross-validation, regularization and the bias/variance trade-off.
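The idea of training and test sets can be made concrete with k-fold cross-validation. The sketch below is a minimal Python illustration using only the standard library; the synthetic data, the helper names and the choice of a simple least-squares line as the model are assumptions made for the example. Each fold is held out once, the model is fitted on the remaining data, and the error is measured on the held-out fold.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def cross_validate(xs, ys, k=5, seed=0):
    """Return the held-out mean squared error for each of the k folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        test = set(held_out)
        train = [i for i in range(len(xs)) if i not in test]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        mse = sum((ys[i] - (a + b * xs[i])) ** 2 for i in held_out) / len(held_out)
        scores.append(mse)
    return scores

# Synthetic data: a straight line plus Gaussian noise (assumed for illustration).
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 0.5) for x in xs]

scores = cross_validate(xs, ys, k=5)
print("mean held-out MSE:", sum(scores) / len(scores))
```

Averaging the held-out errors gives an estimate of how the model performs on data it has not seen, which is what guards against overfitting.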
Session Six: Overview of Algorithms
The session will introduce a range of model-based and algorithmic machine learning methods (regression and classification methods), including standard regression, classification trees (with boosting and bagging), random forests, support vector machines, neural networks and deep learning.
Session Seven: Practical Machine Learning 1: Regression
This session introduces machine learning in R, with a focus on regularized regression models.
- Glmnet Vignette by Trevor Hastie and Junyang Qian
- A short introduction to the caret Package
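To show what regularization does, here is a minimal sketch of ridge regression with a single feature, written in plain Python. It is not the glmnet or caret implementation; the data and the function name `ridge_slope` are assumptions for illustration. With centered data and an unpenalized intercept, minimizing sum((y - b*x)^2) + lam * b^2 gives the closed-form slope b = sxy / (sxx + lam), so a larger penalty shrinks the slope toward zero.

```python
def ridge_slope(xs, ys, lam):
    """Slope of a one-feature ridge regression with an unpenalized intercept.

    Uses the closed form b = sxy / (sxx + lam) on centered data;
    lam = 0 gives the ordinary least-squares slope.
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / (sxx + lam)

# Hypothetical toy data, roughly y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# The slope shrinks toward zero as the penalty grows.
penalties = [0.0, 1.0, 10.0, 100.0]
slopes = [ridge_slope(xs, ys, lam) for lam in penalties]
for lam, b in zip(penalties, slopes):
    print(lam, round(b, 3))
```

In practice the penalty is usually chosen by cross-validation rather than fixed in advance.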
Session Eight: Practical Machine Learning 2: Classification
This session introduces machine learning in R, with a focus on classification models.
Session Nine: Model evaluation
What is a good model? There are several ways to evaluate the performance of a model. Which method is “best” often depends on the application. In this lecture we discuss model evaluation in depth. Moreover, we will focus on how and why data scientists typically use other criteria for model performance than the ones traditionally used among social scientists.
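A small example of why the “best” evaluation method depends on the application: the sketch below, in plain Python, computes accuracy, precision and recall for a binary classifier. The labels are made up, and returning 0.0 when a denominator is zero is a convention assumed here for simplicity. On imbalanced data, a model that always predicts the majority class can reach high accuracy while finding none of the positive cases.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def evaluate(y_true, y_pred):
    """Accuracy, precision and recall; a zero denominator yields 0.0 (assumed convention)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Hypothetical imbalanced labels: 2 positives among 10 cases.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A model that always predicts the majority class.
always_negative = [0] * 10
# A model that finds one of the two positives and raises one false alarm.
some_model = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

baseline = evaluate(y_true, always_negative)
candidate = evaluate(y_true, some_model)
print(baseline)
print(candidate)
```

Both models have the same accuracy here, yet only the second one identifies any positive case, which is why accuracy alone can be a misleading criterion.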
Session Ten: Data Visualization and Communication
In this session you will learn how to communicate findings visually in ways that help the receiver grasp the essence of your findings. Good visualizations are also a key to “getting a feel for the data” in a deeper way than just looking at a matrix of numbers. Such insight can be a valuable part of creating good hypotheses.
Pål Sundsøy works as a Senior Data Scientist and Researcher in the Big Data Analytics group at Telenor Group Research. His work focuses on data mining, visualization and research on large-scale behavioral datasets. His academic background is in Physics and Mathematics.
Bjørn-Atle Reme works as an Economist and Researcher at Telenor Group Research. He holds a PhD in economics from the Norwegian School of Economics (NHH), specializing in applied game theory (industrial organization) and behavioral economics. His research ranges from theoretical modeling to economic experiments, both in the lab and in the field.