Papers (not mine, but ones you should read)

Software Engineering for Machine Learning

A Microsoft case study highlighting a nine-stage workflow for ML system development. The paper identifies key differences between software engineering for ML systems and traditional software engineering, emphasizing challenges in data management, versioning, and ongoing model monitoring.

https://www.microsoft.com/en-us/research/uploads/prod/2019/07/se4ml_icse19_final.pdf

DuckDB: An Embeddable Analytical Database

DuckDB is an in-process SQL OLAP database management system optimized for analytical workloads. The paper presents its architecture and benchmarks, showing performance comparable to or better than other analytical databases, with a focus on ease of use and integration.

https://www.cwi.nl/~hansjeffrey/DuckDB2019.pdf

Kafka: a Distributed Messaging System for Log Processing

Describes the design and implementation of Apache Kafka, a high-throughput distributed messaging system developed at LinkedIn. Kafka is optimized for log processing and stream data pipelines, providing durability, scalability, and fault tolerance for large-scale data integration.

https://dl.acm.org/doi/10.1145/2505515.2505666

Value of Data (Harvard Business School Working Paper)

Introduces a four-part framework (quality, scaling, scope, uniqueness) for assessing the value of data and its competitive advantage. The paper uses real-world examples such as Netflix and Waymo to illustrate when and how data creates sustainable business value.

https://www.hbs.edu/ris/Publication Files/22-002submitted_835f63fd-d137-494d-bf37-6ba5695c5bd3.pdf

Billion-scale Similarity Search with GPUs

This paper presents efficient algorithms and a GPU-optimized library (FAISS) for similarity search in very large datasets. It enables real-time nearest-neighbor search on billion-scale collections, crucial for tasks like large-scale image or document retrieval.

https://arxiv.org/abs/1702.08734
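
For readers new to the topic, here is a minimal sketch of what "nearest-neighbor search" computes. It is a brute-force, exact search in plain Python, not the FAISS API and not the paper's GPU algorithms; the function names (`squared_l2`, `knn_exact`) and the toy vectors are made up for illustration. Scanning every vector like this costs time proportional to the collection size per query, which is the cost that becomes impractical at billion scale.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_exact(query, vectors, k):
    """Return the k closest vectors as (distance, index) pairs, nearest first.

    Hypothetical helper for illustration: it compares the query against
    every stored vector, so each query scans the whole collection.
    """
    scored = ((squared_l2(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Toy "database" of 2-D vectors (assumed data, not from the paper).
database = [
    [0.0, 0.0],
    [1.0, 1.0],
    [5.0, 5.0],
    [0.9, 1.2],
]

neighbors = knn_exact([1.0, 1.0], database, k=2)
print(neighbors)  # (distance, index) pairs for the two nearest vectors
```

The result pairs each distance with the position of the matching vector in `database`, so the caller can look up the original item (an image, a document) by index.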

Apache Spark: A Unified Engine for Big Data Processing