Over the years, Google has developed a number of technologies that went on to inspire popular, widely used open source platforms. The Mountain View company is now bringing a broad range of big data services to market as part of its Google Cloud Platform.
InfoQ recently interviewed William Vambenepe, lead product manager for big data services at Google, on this subject. We are republishing the interview to understand where Google is heading with its big data services and why the pay-per-use model represents the company’s future.
Google opens its big data services to external users
The first question put to the Google product manager concerns Bigtable, the database Google developed for managing big data, which scales easily and supports petabyte-scale workloads. Google made the service available externally this year under the name Google Cloud Bigtable. InfoQ asked how much the service differs from the internal platform of a decade ago.
Vambenepe said that the database has evolved continuously throughout its history, driven by the ever-growing requirements of Google’s applications. Today’s Bigtable is significantly different from the technology first developed in 2004. For example, a great deal of work went into improving 99th percentile latency, which became an increasingly strict requirement once Google started serving traffic directly out of the database.
Bigtable is not the only service made available externally, though. Others are coming to the big data market as on-demand offerings. The product manager says Google is aware that many of its internal tools can be extremely useful outside the company as well. One example is BigQuery, which exposes Dremel as a service and lets users analyze enormous, petabyte-scale datasets with SQL queries that typically run in just a few seconds, with no cluster for the customer to manage.
Another example is Cloud Datastore, which currently relies on Megastore to provide a transactional NoSQL database. Other internal tools also underpin Google’s customer-facing services: Cloud Dataflow, for instance, builds on FlumeJava for batch processing and Millwheel for stream processing.
Google big data services: better, faster, and cheaper
According to Vambenepe, Google’s services deliver on all three. On “better”, he argues that BigQuery’s performance at scale is unparalleled: some customers run single queries that process several petabytes. Dataflow, for its part, offers the most advanced stream processing semantics in the industry. Bigtable vastly outperforms comparable databases on read and write latency.
On “faster”, Vambenepe says the fully managed big data services on Google Cloud let organizations get results very quickly, because no setup is required.
On “cheaper”, all the services let customers pay only for what they actually use, in both storage and processing. There is no need to over-provision in anticipation of future growth: capacity scales with current needs, which translates into substantial savings.
Google bets on the pay-per-use model
On this last point, Vambenepe stresses that companies do not have to bear the cost of buying and renewing infrastructure and do not have to spend money on licenses. The pay-per-use model will therefore continue to characterize Google’s services in the future.
Each company has a range of services at its disposal and can also combine them according to its needs. Dataflow, for example, is a high-performance service, but companies that need to work in languages Dataflow does not support can always fall back on lower-level options such as Spark, Flink, or Hadoop on GCE. Most large customers use a mix of pay-per-use services at various levels of management, so as to meet their needs not only computationally but also economically.
The full interview with William Vambenepe follows in the original English.
InfoQ: Hadoop, HDFS and HBase were inspired by Google’s MapReduce, GFS and Bigtable. How much does the service now called Bigtable differ from the internal platform of a decade ago?
William: Bigtable has gone through several major iterations in its lifetime at Google, driven by the evolving requirements of supporting Google’s major applications. In many ways, the Bigtable that is part of the bedrock at Google today is significantly different from the technology that was originally developed in 2004. For example, after its internal implementation, a significant amount of work was done on improving the 99th percentile latency, which became a stronger and stronger requirement when Google started serving traffic out of the database. This drove a lot of work into diagnosing and grinding away the tail latency.
Additionally, multi-tenancy within Google has been a significant challenge, and in offering the technology as an external service, a lot of work had to take place around isolation of all the different layers of resources which are utilized. One last note is that this service is offered through the completely open source HBase API/client, which is somewhat ironic, given that Bigtable was the original service, and there is a tremendously powerful client infrastructure internally. However, we think it was the right thing to do, since the HBase community is diverse and powerful and we want to continue to work together with this amazing ecosystem.
InfoQ: Over the last few years Google has talked about a bunch of internal data services such as Dremel, Megastore and Spanner; are they now finding their way into services that anybody can use on demand?
William: Definitely, many of them are. Many of the services we use inside Google are extraordinarily useful outside of Google. A clear example of this is BigQuery, which exposes Dremel as a service, allowing users to analyze potentially enormous (petabyte-sized) datasets with SQL queries which typically execute in just a few seconds and require no cluster management from the user.
Another example would be Cloud Datastore, which currently relies on Megastore to provide a NoSQL transactional database that can handle Google-scale data sets.
Beyond those, Google Cloud exposes other internal tools which have been described in published papers. For example, Cloud Dataflow unifies two internal tools, FlumeJava (for batch processing) and Millwheel (for stream processing), to provide a unified programming model and managed execution environment for both batch and stream.
And yes, we’re always looking at situations where other Google technology could be exposed as a service — and Spanner is definitely something that’s generated a lot of interest in this area.
InfoQ: In tech we very often talk about better, faster, cheaper – pick two. Is it possible that cloud based big data services will offer all three (versus do it yourself approaches using open source or products)?
William: Let’s see…
Better: Just a few examples: BigQuery performance at scale is unparalleled (we have customer queries which process several petabytes in a single query). Dataflow unifies batch and stream in one programming model and offers the most advanced semantics in the industry for stream processing (e.g. windowing by actual event time, not arrival time). Bigtable vastly outperforms comparable databases on read and write latency. Etc. And all those capabilities come as fully managed services, not at the cost of weeks of deployment/configuration/tuning.
Faster: Clearly faster in terms of product performance (as mentioned above), but in this context I think “faster” refers to the speed at which an organization is able to move (which is the most important aspect in the end). In that sense, the fully-managed data services on Google Cloud allow organizations to get results immediately. Because there is no setup required, what would normally be an “IT project” starting with capacity planning, provisioning, deployment, configuration, etc., can fast-forward straight to the productive part.
Cheaper: Google’s Big Data services allow users to pay only for what they consume, both in terms of storage and processing. So when you’re storing 16TB of data you pay for 16TB of data, you don’t have to provision (and pay for) extra storage to account for growth, you don’t have to double or triple this for redundancy (it happens under the cover), etc. Similarly for processing, you don’t need to pay to maintain idle resources when you’re not actively processing or querying your data. To this lower infrastructure cost you can add the savings of not having to deploy, manage, patch and generally administer data processing infrastructure.
So… yes. Pick three of three.
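The storage arithmetic in this answer can be sketched in a few lines of Python. Only the 16 TB figure comes from the interview. The growth headroom, the replica count, and the unit price are hypothetical placeholders chosen to show the shape of the comparison, not real Google Cloud pricing, and the sketch assumes the same unit price on both sides for simplicity.

```python
# Illustrative storage-cost arithmetic. Only the 16 TB figure comes from the
# interview; every other number below is a hypothetical placeholder.


def pay_per_use_tb(stored_tb: float) -> float:
    """Billable capacity when you pay only for the data actually stored."""
    return stored_tb


def self_managed_tb(stored_tb: float, growth_headroom: float, replicas: int) -> float:
    """Capacity to provision yourself: data plus growth headroom, times replicas."""
    return stored_tb * (1.0 + growth_headroom) * replicas


if __name__ == "__main__":
    stored = 16.0          # TB, from the interview
    headroom = 0.5         # hypothetical: 50% spare capacity for growth
    replicas = 3           # hypothetical: triple replication for redundancy
    price_per_tb = 20.0    # hypothetical price units per TB per month

    managed = pay_per_use_tb(stored)
    diy = self_managed_tb(stored, headroom, replicas)
    print(f"pay-per-use:  {managed:.0f} TB billed, {managed * price_per_tb:.0f} units/month")
    print(f"self-managed: {diy:.0f} TB provisioned, {diy * price_per_tb:.0f} units/month")
```

The gap between the two figures comes entirely from the headroom and replication multipliers, which is the point the answer makes about not provisioning for growth or redundancy yourself.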
InfoQ: From a performance perspective the launch of Bigtable concentrated on write throughput and end of tail latency. Why are these the key metrics that developers and their users should care about?
William: Fundamentally, wide-column stores (Bigtable, HBase, Cassandra) are scale-out databases which provide true linear scalability and are designed to be used both as an operational and analytical backend. Because their value proposition relies on making large amounts of data available very quickly, the fundamental metrics revolve around volume and speed.
One of our concerns was that the NoSQL industry in general is still tossing around benchmarks focusing on the 50th percentile of latency, both on the read and write sides. At Google we think this is a bad practice. Being focused on the 50th percentile of performance means that half of your requests (and by extension potentially half of your customers) are getting an unboundedly worse experience than you are testing for. Instead, Google focuses on the 99th or 99.9th percentile, meaning that we are characterizing the expected experience for a vast majority of our users. In this way we better understand what effects our architectural and configuration choices have.
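To make the difference between median and tail latency concrete, here is a minimal Python sketch. The workload is synthetic and hypothetical (a 2% slow path with made-up millisecond ranges), and the nearest-rank method is just one common way to define a percentile. It is not a description of how Google measures Bigtable.

```python
import math
import random


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


def synthetic_latencies(n, seed=0):
    """Hypothetical workload: mostly fast requests plus a small slow tail (milliseconds)."""
    rng = random.Random(seed)
    latencies = []
    for _ in range(n):
        if rng.random() < 0.02:  # assumed: 2% of requests hit a slow path
            latencies.append(rng.uniform(200.0, 800.0))
        else:
            latencies.append(rng.uniform(2.0, 10.0))
    return latencies


if __name__ == "__main__":
    data = synthetic_latencies(10_000)
    for p in (50, 99, 99.9):
        print(f"p{p}: {percentile(data, p):.1f} ms")
```

In this synthetic workload the median stays in the fast range while the 99th percentile lands in the slow tail, so a benchmark that reports only the 50th percentile would hide the slow path entirely.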
With regard to throughput, the industries that we think can take tremendous advantage of Bigtable are ones where the ability to collect and store more data means better decisions at the application layer. More throughput on small infrastructure makes applications easier to create and bring to market, and makes it easier to scale existing applications to drive better insights.
InfoQ: The disruption we’ve seen across the industry for the last few years seems to be (in economic terms) moving us from economic rents for (packaged software) suppliers to consumer surpluses for end users. Do you see the disrupted supplier gains accruing to service providers rather than say distribution support companies?
William: In addition to the savings from not purchasing (and renewing) commercial software licenses, users will see even more savings from a much more flexible consumption model (paying just for what they need), a much smaller administration overhead (fully managed services) and a generally more efficient infrastructure (leveraging the expertise and economies of scale of huge providers like Google). While some of it will be concretized as a displacement of revenue from packaged software licenses to Cloud providers, I think most of the gains from the move to Cloud will be accrued by users.
But even more than the cost savings, the benefits of Cloud for users will be about getting a lot more from IT, not just paying less for it. The Cloud model will give wider access to information within the company, it will revolutionize collaboration, and it will provide access to advanced tools (e.g. advanced machine learning) which would be very hard to implement on-premise but very easy to consume as a service.
InfoQ: Companies like Google seem to have put a lot more effort into building data services than underlying infrastructure. Does this mean that this shift to public cloud puts a lot more on the table for adopters than might be perceived by focusing just on IaaS (and PaaS/SaaS)? Have the NIST definitions for cloud channelled the conversation too much and blinded us to other aspects of ‘as a service’ delivery?
William: I wouldn’t say that Google has put more effort in data services than in underlying infrastructure. We’ve put huge efforts into infrastructure, even though they are less visible. Some of the greatness of the more-visible data services comes from the greatness of the underlying infrastructure. For example, we recently described some of our networking innovation. As I pointed out on Twitter, this phenomenal performance of the underlying infrastructure plays a large role in allowing higher-level services like BigQuery or Dataflow to shine (in both performance and cost).
The NIST definitions are useful for an initial pass and served well in categorizing providers in the early days of Cloud. But in practice, Cloud services provide a continuum of options and the IaaS/PaaS breakdown is a bit too simplistic. People who seriously consider their needs and options for Cloud usage have shown that they understand the value of consuming the highest-level services which are flexible enough for their needs. For example, Cloud Dataflow provides the most optimized fully managed environment for data processing pipelines. If it meets your functional needs, it’s the best operational choice but if it doesn’t (e.g. you want to use a language not supported on Dataflow) then you always have the option to go to a lower-level service and use Spark, Flink, or Hadoop on GCE. Most large-scale customers use a mix of services, at various levels of management, to optimally meet their various needs. It’s the job of the Cloud Platform to ensure those services are well integrated and can be combined seamlessly.
InfoQ: How has the state of the art evolved for large-scale processing since MapReduce started the category over ten years ago?
William: Quite a lot!
MapReduce opened the gate to a world of cost-efficient large-scale computing. Pretty quickly though, our internal usage of MapReduce at Google showed that writing optimized sequences of MapReduce steps for real-life use cases was complex, so we moved towards developing a higher-level API, which can be automatically transformed into optimized MapReduce. That was the original FlumeJava work (not related to Apache Flume). Then we moved on to skipping MapReduce altogether and running the pipeline as a DAG (Directed Acyclic Graph).
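The idea of a higher-level pipeline API that is optimized before it runs can be illustrated with a toy Python sketch. `LazyPipeline` and its methods are hypothetical names invented for this example and are not FlumeJava's API. The sketch assumes a simple model in which stages are only recorded when declared, and then executed together in a single fused pass instead of materializing one intermediate collection per step.

```python
from typing import Any, Callable, Iterable, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]  # ("map" | "filter", function)


class LazyPipeline:
    """Toy deferred pipeline: declaring a stage records it, nothing runs yet."""

    def __init__(self, stages: Tuple[Stage, ...] = ()) -> None:
        self._stages = stages

    def map(self, fn: Callable[[Any], Any]) -> "LazyPipeline":
        return LazyPipeline(self._stages + (("map", fn),))

    def filter(self, predicate: Callable[[Any], bool]) -> "LazyPipeline":
        return LazyPipeline(self._stages + (("filter", predicate),))

    def plan(self) -> List[str]:
        """Return the recorded stage kinds, i.e. the plan that run() will execute."""
        return [kind for kind, _ in self._stages]

    def run(self, data: Iterable[Any]) -> List[Any]:
        """Execute every recorded stage in one fused pass over the input."""
        results = []
        for item in data:
            keep = True
            for kind, fn in self._stages:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):  # filter stage rejected the item
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


if __name__ == "__main__":
    pipeline = LazyPipeline().map(str.strip).filter(bool).map(str.upper)
    print(pipeline.plan())
    print(pipeline.run([" a ", "", "b"]))
```

Because the whole plan is known before execution, the runner is free to choose how to execute it. Here it simply fuses three element-wise stages into one loop.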
The open source world followed more or less the same path with a delay. Hadoop brought MapReduce to the world, then tools like Apache Crunch and Cascading provided a FlumeJava-like pipeline API, and Spark and Tez brought a DAG-centric execution engine.
At the same time that these batch processing technologies were becoming more refined, the need emerged to process large data streams in near real-time. Google originally developed this model as Millwheel, a separate processing engine.
With Cloud Dataflow, we are taking the next step, merging batch and streaming processing into a unified programming model and isolating the definition of the processing from the choice of how to run it (which execution engine, and whether applied to historical data or to an on-going stream). But this time, we’re doing it for everyone, as an open source SDK and as a publicly-available service.
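As a rough illustration of separating what a pipeline computes from how its input arrives, the following Python sketch applies the same function to a historical list and to a generator standing in for a stream. It buckets records by event time, so arrival order does not change the result. The record shape and the names `count_per_window` and `as_stream` are assumptions made for this example, and this is not the Cloud Dataflow SDK. A real streaming engine also has to decide when a window is complete, which is out of scope here.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[float, str]  # (event_time_seconds, key); hypothetical record shape


def count_per_window(events: Iterable[Event], window_seconds: float) -> Dict[Tuple[float, str], int]:
    """Count events per (window_start, key), bucketing by event time rather than arrival order."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    counts: Counter = Counter()
    for event_time, key in events:
        window_start = (event_time // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


def as_stream(events: Iterable[Event]) -> Iterator[Event]:
    """Yield events one at a time, standing in for a live source."""
    for event in events:
        yield event


if __name__ == "__main__":
    historical = [(1.0, "a"), (2.0, "a"), (61.0, "a"), (62.0, "b")]
    # Same events, but arriving out of order, as they might on a live stream.
    out_of_order = [(61.0, "a"), (1.0, "a"), (62.0, "b"), (2.0, "a")]

    batch_result = count_per_window(historical, 60.0)
    stream_result = count_per_window(as_stream(out_of_order), 60.0)
    print(batch_result == stream_result)
```

The processing logic is written once. Whether it is fed a stored dataset or an ongoing stream is decided at the call site.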