Software Engineering for Machine Learning
A Microsoft case study highlighting a nine-stage workflow for ML system development. The paper identifies key differences between software engineering for ML systems and traditional software engineering, emphasizing challenges in data management, versioning, and ongoing model monitoring.
https://www.microsoft.com/en-us/research/uploads/prod/2019/07/se4ml_icse19_final.pdf
DuckDB: An Embeddable Analytical Database
DuckDB is an in-process SQL OLAP database management system optimized for analytical workloads. The paper presents its architecture and benchmarks, showing performance comparable to or better than other analytical databases, with a focus on ease of use and integration.
https://www.cwi.nl/~hansjeffrey/DuckDB2019.pdf
Kafka: a Distributed Messaging System for Log Processing
Describes the design and implementation of Apache Kafka, a high-throughput distributed messaging system developed at LinkedIn. Kafka is optimized for log processing and stream data pipelines, providing durability, scalability, and fault tolerance for large-scale data integration.
https://dl.acm.org/doi/10.1145/2505515.2505666
Value of Data (Harvard Business School Working Paper)
Introduces a four-part framework (quality, scaling, scope, uniqueness) for assessing the value of data and its competitive advantage. The paper uses real-world examples like Netflix and Waymo to illustrate when and how data creates sustainable business value.
https://www.hbs.edu/ris/Publication Files/22-002submitted_835f63fd-d137-494d-bf37-6ba5695c5bd3.pdf
Billion-scale Similarity Search with GPUs
This paper presents efficient algorithms and a GPU-optimized library (FAISS) for similarity search in very large datasets. It enables real-time nearest-neighbor search on billion-scale collections, crucial for tasks like large-scale image or document retrieval.
https://arxiv.org/abs/1702.08734
Apache Spark: A Unified Engine for Big Data Processing