INFO 4871/5871
Web Data Science
INFO 4871/5871 “Web Data Science” is a semester-length course, cross-listed as an undergraduate elective and a graduate course. The internet makes many kinds of information easy to access, and the ability to retrieve, parse, and analyze this information is a valuable skill for data scientists. This course provides an overview of computational tools and practices for transforming web documents and APIs into data for common research designs.
Learning objectives
- Understand the legal and ethical contours of web data access
- Navigate and parse common web data formats like XML and JSON
- Retrieve and automate data extraction from HTML and PDF documents
- Access popular APIs to collect data for common research designs
- Understand the methods and research designs for using web tools to audit algorithmic behavior
Outline
| Module | Week | Skills |
|---|---|---|
| Fundamentals | 1 | Introductions |
| | 2 | XML & JSON |
| | 3 | Protocols |
| Structure | 4 | Static web pages |
| | 5 | Archived web pages |
| | 6 | Dynamic web pages |
| | 7 | PDFs |
| Dynamics | 8 | APIs |
| | 9 | Wikipedia |
| | 10 | Census |
| | 11 | Homophily and selection |
| | 12 | Automation |
| Applications | 12 | |
| | 13 | Fall Break |
| | 14 | Final Projects |
| | 15 | Final Projects |
| | 16 | Final Projects |