Practice Questions Set 1
For You To Think
- If your single backend server crashes, your entire application crashes. Why do big companies never rely on one server, and why do beginners still start with one?
- Why can one computer not handle 1 TB of data efficiently, but 10 weak computers working together can? Explain the fundamental concept behind this.
- If a database can store millions of records, why do companies still need Hadoop, Spark, NoSQL, caching, and distributed storage systems?
- What actually happens internally when millions of users open the same website at the same time? Why doesn’t the server explode, and what architecture handles this?
- If microservices are so powerful, why do almost all startups begin with monoliths? What does this tell you about business vs engineering?
- Why does scaling a backend become useless if the database cannot scale? What makes databases the real bottleneck in large systems?
- Why is analyzing data more valuable than collecting it? What does a company lose when it only collects data but never analyzes it?
- Why does MapReduce break huge files into small chunks and send these chunks to different machines? Why not just load the full file on one strong machine?
- Why do companies track every click, scroll, search, and button press on a website? What is the real business reason behind collecting this data?
- Why is distributed architecture not just a technical choice but a necessity for any company serving millions of users? What problem does it solve that cannot be solved by a single powerful machine?
Big Data
- Why can’t traditional databases handle Big Data? Explain the core limitations.
- What does “Volume, Velocity, Variety” actually change for developers and companies?
- If storage is cheap now, why do companies still worry about “big data scaling”?
- What is the difference between “big data” and “large files”?
- Is Big Data always useful? Give scenarios where collecting too much data harms business.
- How do organizations decide which data to store and which to ignore?
- Why is parallel processing important for modern data?
- What happens if big data is collected but never analyzed?
- Can Big Data exist without cloud platforms?
- How does data governance become harder as data size increases?
Hadoop
- Why was Hadoop created when databases and servers already existed?
- Why does Hadoop use HDFS instead of normal file systems like NTFS/ext4?
- Explain why Hadoop prefers “moving computation to data” instead of moving data to computation.
- Why does HDFS use replication (3 copies by default)?
- Why doesn’t Hadoop allow random writes easily?
- What problem does YARN solve that the original Hadoop MapReduce (MRv1) could not?
- Why can a cheap cluster of machines perform better than one expensive machine in Hadoop?
- Why is the NameNode a single point of failure in a cluster without HDFS High Availability? How does this impact cluster design?
- Why do companies still teach Hadoop even though Spark is popular?
- What makes MapReduce slow compared to in-memory engines?
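To make the replication question above concrete, here is a minimal arithmetic sketch. It assumes a 128 MB block size (the default in Hadoop 2.x and later) and the replication factor of 3 mentioned above. The function name `hdfs_footprint` is hypothetical and is not part of any Hadoop API.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Estimate how a file is split and replicated (illustrative only).

    Assumes the final block occupies only the bytes it actually holds,
    so raw storage is simply file size times the replication factor.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)  # logical blocks
    raw_storage_mb = file_size_mb * replication       # bytes on disk, cluster-wide
    return blocks, raw_storage_mb

one_tb_mb = 1024 * 1024  # 1 TB expressed in MB
blocks, raw_mb = hdfs_footprint(one_tb_mb)
print(f"{blocks} blocks, {raw_mb / one_tb_mb:.0f} TB of raw disk")
```

Under these assumptions a 1 TB file becomes 8192 blocks that can be read in parallel, at the price of 3 TB of raw disk. That trade-off is worth keeping in mind when answering the replication and cheap-cluster questions.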
MapReduce
- Why does MapReduce require data to be in key-value format?
- Why is the Shuffle phase considered the costliest step of MapReduce?
- Why do mappers and reducers run on different machines?
- Why can’t reducers begin reducing until all mappers have finished, even though they may start copying map output earlier?
- Why does MapReduce write intermediate results to disk instead of RAM?
- How does MapReduce achieve fault tolerance during job execution?
- Why is WordCount the best example to learn MapReduce fundamentals?
- If MapReduce is slow, why do some companies still use it today?
- Can MapReduce work well for real-time data? Why or why not?
- Why does MapReduce fit batch processing but not interactive analytics?
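As a starting point for the questions above, the three phases can be simulated in a single Python process. This is only a sketch of the idea, not Hadoop's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for illustration, and a real job would run mappers and reducers on different machines and spill intermediate pairs to disk.

```python
from collections import defaultdict

def map_phase(chunk):
    """Mapper: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group values by key so each key ends up at exactly one reducer."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)

def word_count(chunks):
    mapped = []
    for chunk in chunks:              # on a cluster, each chunk goes to a different mapper
        mapped.extend(map_phase(chunk))
    grouped = shuffle_phase(mapped)   # needs the output of every mapper
    return dict(reduce_phase(k, v) for k, v in grouped.items())

chunks = ["big data needs big clusters", "data moves to clusters"]
print(word_count(chunks))
```

Note that `shuffle_phase` cannot produce complete groups until every mapper has emitted its pairs, which is a useful hint for the questions about shuffle cost and reducer start time.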
Distributed Systems
- Why do companies move from single-machine systems to distributed architectures?
- What problem does “horizontal scaling” solve that “vertical scaling” cannot?
- Why do distributed systems face problems like network latency and network partitions?
- What does the CAP theorem imply about every distributed system?
- Why can adding more servers sometimes reduce performance?
- Why is data consistency harder in distributed systems?
- Why is fault tolerance a critical design goal in modern architecture?
- Why are distributed systems harder to debug than monoliths?
- What problems arise when you store the same data across multiple machines?
- Why does communication overhead matter more than CPU power?
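One way to explore the questions about adding servers and communication overhead is a toy cost model. The sketch below assumes the work divides perfectly across workers and that coordination cost grows linearly with the number of workers. Both assumptions, the function name `job_time`, and the constants are invented for illustration and do not describe any real system.

```python
def job_time(workers, work_seconds=1000.0, overhead_per_worker=2.0):
    """Toy model: perfectly divisible work plus linear coordination cost."""
    compute = work_seconds / workers               # shrinks as workers are added
    communication = overhead_per_worker * workers  # grows as workers are added
    return compute + communication

for n in (1, 10, 22, 50, 100):
    print(f"{n:>3} workers -> {job_time(n):.1f} s")

# Worker count with the lowest total time under this model
best = min(range(1, 101), key=job_time)
print("best worker count:", best)
```

In this model the total time falls at first, reaches a minimum, and then rises again as communication dominates. Real systems have more complicated overhead curves, but the shape is a useful mental picture.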
Monolith Architecture
- Why did monolith architecture become the standard for early web applications?
- What problems appear when a monolith grows very large?
- Why do monoliths become harder to deploy over time?
- Can a monolith be scalable? Under what conditions?
- Why do companies rewrite old monolith systems instead of patching them?
- What risks exist when breaking a monolith into microservices?
- Why is a monolith simpler for beginners to understand?
- When is a monolith better than microservices?
Microservices
- Why do microservices improve developer productivity at scale?
- Why does communication between microservices become a challenge?
- What makes debugging microservices harder than debugging monoliths?
- Why do microservices require central monitoring and logging systems?
- Why do microservices need API gateways?
- What problems appear if microservices share the same database?
- Why do microservices require strict boundaries for responsibilities?
- When should a small startup avoid microservices?
Scaling
- What is the fundamental difference between vertical and horizontal scaling?
- Why is horizontal scaling more common for modern web apps?
- Why does caching improve system performance drastically?
- Why does a load balancer make systems more reliable?
- Why is distribution of load more important than raw server power?
- Is scaling always technical? Can business-level scaling be an issue too?
- Why does database scaling become the bottleneck before backend scaling?
- Why is replication not equal to scalability?
- What role does a CDN play in scaling?
- Why can't we scale infinitely even with cloud infrastructure?
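For the load balancer question above, a minimal round-robin sketch may help. It assumes the balancer is told which servers are down through a hypothetical `mark_down` call. Real load balancers detect failures with health checks and support many other strategies, so treat the class and method names here as illustrative only.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Hand out servers in rotation, skipping any that are marked down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        if not self.healthy:
            raise RuntimeError("no healthy servers available")
        # At most one full lap around the ring is needed to find a healthy server.
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")

lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
print([lb.next_server() for _ in range(3)])
lb.mark_down("app-2")  # simulate a crashed server
print([lb.next_server() for _ in range(4)])
```

After `app-2` is marked down, requests keep flowing to the remaining servers. This is the reliability benefit the question asks about: a single crash no longer takes the whole application offline.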
Big Data & Data Analytics
- Why is big data meaningless without analytics?
- How does analytics convert raw data into decision-making power?
- Why is visualization important in big data analytics?
- Why do analytics engines need data in structured or semi-structured form?
- Why is preprocessing a major part of any data analytics workflow?
- Why do data scientists need distributed storage to work effectively?
- Why do companies use predictive analytics on top of big data?
- How does the size of data affect algorithm choice?
- Why is data sampling sometimes necessary in big-data analytics?
- Why are SQL engines being redesigned to work on Big Data?
Big Data & Web Development
- Why must modern web applications be designed with the large volumes of data they produce in mind?
- How do user actions in web apps generate big data?
- Why do frontend interactions matter for analytics pipelines?
- Why do backend developers need to understand log data structures?
- What role do APIs play in generating or transporting analytics data?
- Why do large web apps store user event data separately from main databases?
- How does caching impact data collection and analytics accuracy?
- Why are distributed systems required for high-traffic web apps?
- How do analytics dashboards depend on big data technologies?
- Why must web developers understand eventual consistency in large systems?