As part of the capacity development pillar of the Big Data for Development project, AIMS-NEI designed the Big Data for Development Short Course Program (BD4D-SCP). Training sessions are being delivered across the AIMS-NEI network: in Rwanda, Senegal, Cameroon and now Ghana.
The course targets people based in Accra, Ghana, with a passion for data science in general and big data analytics in particular. Participants should hold at least a four-year undergraduate degree or have a minimum of 2 to 3 years of professional experience in Statistics, Information Technology or any other data science related discipline.
A number of other short courses are in the pipeline to advance our BD4D objectives: increasing the number of data scientists in Africa and providing a platform for all practitioners to interact.
As people and devices become more and more connected, datasets are growing so large that traditional data processing software and techniques can no longer handle them. Working with data at this scale requires specialized frameworks and tools such as Apache Spark. This course teaches the essentials of processing large-scale datasets using Python. It also teaches you how to perform common data science tasks, such as data wrangling and building machine learning models, in Python. The course takes a practical approach, equipping participants with the most essential tools in the shortest possible time. It emphasizes learning by doing, so there are many exercises built into the course to give participants ample time to practice.
Summarized Objectives & Outcomes
- Understand intermediate to advanced concepts of the Python language: data structures, functions, classes and the Python package ecosystem.
- Perform data science tasks using Python: data ingestion, processing, visualization, web scraping, etc.
- Handle large-scale datasets (20 GB+) using Apache Spark: big data basics, the Hadoop ecosystem, cloud computing platforms and big data processing with Apache Spark.
- Be familiar with essential machine learning (ML) theory: the learning problem, types of learning, loss functions, linear models, deep learning and more.
- Be able to build and evaluate machine learning models: use scikit-learn and TensorFlow to build and evaluate models using Python.
- Appreciate real-world ML and big data use cases: object detection on Android devices and analysis of large-scale GPS data for a human mobility use case.
Day 1: Advanced Concepts in Python: on this first day, the course will focus on the Python language to build a strong foundation for the rest of the course materials. Participants will be introduced to intermediate to advanced practical techniques such as writing functions and classes, error handling, packaging Python code and more.
Day 2: Python for Data Science: during the second day, the focus is on performing common data science tasks using Python. We will go through data ingestion, processing, analysis, visualization, web scraping and more, and along the way introduce essential packages for these tasks (e.g., pandas, geopandas, numpy and matplotlib).
Day 3: Big Data Processing: on the third day, the course focuses on how to handle large datasets using Python. The following topics will be covered: introduction to big data, multiprocessing in Python, Apache Spark, how to use common cloud platforms and more.
Day 4: Machine Learning (ML) in Python: on this day, the course will first provide an introductory lecture on machine learning. The rest of the day will focus on how to perform various ML tasks (e.g., data preparation, model building, evaluation and interpretation) using the scikit-learn package in Python.
Day 5: Putting it All Together: during the final day, we will focus on using the skills gained in this course to solve real-life data science problems by looking at case studies. Potential case studies include: how to process night lights satellite images (geospatial), how to process massive call records from cellphones (mobile data) and how to build ML models to impute missing sensor data (sensor data).
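To give a flavor of the Day 1 topics (functions, classes and error handling), here is a minimal sketch. The names `SensorReading`, `InvalidReadingError` and `mean_value` are hypothetical, invented for illustration, and are not taken from the course materials.

```python
from dataclasses import dataclass


class InvalidReadingError(ValueError):
    """Raised when a reading cannot be parsed (hypothetical name)."""


@dataclass
class SensorReading:
    """One reading from a sensor (hypothetical example class)."""
    sensor_id: str
    value: float

    @classmethod
    def from_text(cls, line: str) -> "SensorReading":
        """Parse a line such as 'a1,20.5' into a SensorReading."""
        try:
            sensor_id, raw_value = line.split(",")
            return cls(sensor_id.strip(), float(raw_value))
        except ValueError as exc:
            # Covers both a wrong number of fields and a non-numeric value.
            raise InvalidReadingError(f"cannot parse {line!r}") from exc


def mean_value(lines):
    """Average the valid readings, skipping lines that fail to parse."""
    values = []
    for line in lines:
        try:
            values.append(SensorReading.from_text(line).value)
        except InvalidReadingError:
            continue  # ignore malformed lines
    if not values:
        raise InvalidReadingError("no valid readings")
    return sum(values) / len(values)
```

Wrapping the low-level `ValueError` in a custom exception is one possible design choice: callers can then catch parsing problems specifically without also catching unrelated errors.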
• Programming: ability to write a simple program in Python (basic Python level)
• Math and Statistics: a background in statistics, data science or any other quantitative science.
For an optimal student experience, we recommend the following hardware configuration:
1. OS: Windows 7 SP1 64-bit, Windows 8.1 64-bit or Windows 10 64-bit, Ubuntu Linux, or the latest version of OS X
2. Processor: Intel Core i5 or equivalent
3. Memory: 8 GB RAM preferred
4. Storage: at least 100 GB available space
5. Internet: the computer should preferably have internet access
You’ll also need the following software installed in advance:
1. Browser: the latest version of Google Chrome or Mozilla Firefox
2. Text editor: Atom or Sublime Text as an IDE (optional, as you can practice everything using Jupyter Notebook in your browser)
3. Anaconda: the Python 3 distribution
Python; Big Data Analytics; Apache Spark; Machine Learning; Data Science.