|Name||MPP Data Virtualization|
|Description||Ben Szekely, Senior Vice President and Head of Field Operations, discusses how semantic metadata technology, powered by MPP capabilities, gives users complete control over how they elevate their Data Fabric for discovery and integration. He then offers a sneak peek at new functionality coming to Anzo later in 2020.|
Hi everyone, I'm Ben Szekely, Senior Vice President and Head of Field Operations here at Cambridge Semantics, and today I'm going to give you a quick walkthrough of a new capability coming soon to the Anzo platform. Massively parallel processing (MPP) data virtualization is a true breakthrough for data integration within the context of the enterprise data fabric.
The enterprise data fabric is an architecture for modern data management that anticipates the need to connect data across the business. It's really an overlay that sits on top of your existing data warehouses, data lakes, cloud data repositories, document repositories and relational database systems, and allows users to quickly blend and combine data from different data sources using business models. That data can then be consumed in any sort of application, BI or analytics tool, or through exploratory analytics directly in the underlying graph model itself. Now, this type of data integration has to be really fast, agile and flexible, and it has to work at real enterprise scale. Traditional approaches that move all the data into one place, like a data warehouse, just can't scale or keep up with these requirements, and so something new is required. Let's take a look.
Pure federation-based approaches have been looked at as a fairly seductive way to try to solve this problem. You simply put a federated query engine as an overlay on top of all your data sources, you run a SQL query, and it magically queries all the underlying sources and brings your data back. But this approach has had problems at enterprise scale. It has required things like SQL caches to achieve practical performance, and maintaining that cache ultimately narrows down the use cases you can solve. It's also a single-threaded OLTP architecture, so each query can have very long runtimes, and it can be very taxing on the source systems if not done properly.
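To make the federation pattern described above concrete, here is a minimal sketch in Python. It is not how any particular federation product works: the two in-memory SQLite databases, the table names and the `federated_totals` helper are all hypothetical stand-ins. It only illustrates the basic shape, where every row is pulled from each source and joined centrally, which is why the source systems and the single query thread end up doing so much work.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create an in-memory SQLite database standing in for one source system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Two hypothetical, independent source systems.
crm = make_source(
    "CREATE TABLE customers (id INTEGER, name TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex"), (3, "Initech")],
)
billing = make_source(
    "CREATE TABLE invoices (customer_id INTEGER, amount REAL)",
    "INSERT INTO invoices VALUES (?, ?)",
    [(1, 120.0), (1, 80.0), (3, 45.5)],
)


def federated_totals(crm_conn, billing_conn):
    """Naive federation: fetch every row from each source, then join centrally."""
    customers = crm_conn.execute("SELECT id, name FROM customers").fetchall()
    invoices = billing_conn.execute(
        "SELECT customer_id, amount FROM invoices"
    ).fetchall()

    # Aggregate invoice amounts per customer on the federation side.
    totals = {}
    for customer_id, amount in invoices:
        totals[customer_id] = totals.get(customer_id, 0.0) + amount

    # Join the aggregated totals back to customer names.
    return {name: totals.get(cid, 0.0) for cid, name in customers}


result = federated_totals(crm, billing)
print(result)
```

Every call re-reads both tables in full and runs on one thread, so the cost grows with the size of the sources rather than with the size of the answer.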
At Cambridge Semantics we looked at these approaches a long time ago and decided early on that they weren't going to be suitable for doing data integration at data fabric scale. So up through the current version, Anzo 5.0, we've taken a very practical approach to data integration at data fabric scale that makes use of our massively parallel graph engine, AnzoGraph. What we do is use Spark jobs to rapidly pre-position data from data sources in a lightweight fashion. We use metadata and an understanding of the data sources to automatically onboard data into cheap, compressed graph storage. The graph storage is compressed so it doesn't take up a lot of space, and the metadata cataloging gives administrators and users a lot of freedom and flexibility over the lifecycle, so you don't have to move all of your data, just the data that you need. And then we have our MPP engine, which can very quickly load the data from that graph storage up into memory and allow users to run MPP queries really fast against data that's been loaded from that pre-positioned storage. The net benefit is that users get the data they need when they need it, with very efficient pre-positioning of that data, and the admins have full control over the lifecycle.
This has worked incredibly well for our existing customers at Cambridge Semantics. But as we look to deliver data at true enterprise scale and provide that overlay across all of the data in the business, this pre-positioning will work but it can't keep up with what's required. Our engineering team has often wondered: can we apply the massively parallel AnzoGraph database to do data virtualization directly? And that's exactly what we've done. Coming later this year is our breakthrough MPP data virtualization capability, which I'm now going to tell you about.
So first off, the AnzoGraph engine is now capable of loading data into memory directly from any source system, all in parallel. Right within your data loading queries you can define connections to databases, apply lightweight mappings, pull from APIs, and read from JSON, XML and CSV formats, all in memory, all in parallel, directly in the graph engine. What we've observed is that you can load data really fast from these data sources, in some cases almost as fast as you can load it from pre-positioned local storage, and that's going to make a huge difference to customers that want to rapidly onboard data into the graph engine for use in data integration and all kinds of queries.
In addition to loading the data directly into memory, we can also do pushdown query planning that minimizes data movement. When a user issues a query against AnzoGraph, it can in turn apply views defined in the query to actually query that data directly in the databases in real time. So users benefit from the MPP in-memory query capabilities of Anzo and AnzoGraph without having to pre-position the data on disk first. This leads to much faster cycle times, faster deployment and more flexibility; all the things the data fabric can provide are now available even faster.
So to summarize, with these new capabilities Anzo can actually support a hybrid structure. The reality is that for some use cases you want to pre-position data. For some use cases you want to load the data into memory before querying it. Other times you'll want to query in memory but have that query go directly against the data source at query time. With Anzo's graphmart capability you'll be able to apply all three of these approaches to the same use case in a single graphmart.
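The two ideas above, loading from several sources in parallel and pushing a filter down to the source so less data moves, can be sketched generically in Python. This is an illustration only and does not use any Anzo or AnzoGraph API: the in-memory SQLite databases, the `events` table, and the `load_all` and `load_pushdown` helpers are assumptions made for the example, and a thread pool stands in for a real MPP engine.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def make_source(name, n_rows):
    """Build an in-memory SQLite database standing in for one source system."""
    # check_same_thread=False lets a worker thread use this connection.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE events (source TEXT, id INTEGER, region TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [(name, i, "EU" if i % 4 == 0 else "US") for i in range(n_rows)],
    )
    return conn


def load_all(conn):
    """Direct load: bring the whole table into memory and filter later."""
    return conn.execute("SELECT source, id, region FROM events").fetchall()


def load_pushdown(conn, region):
    """Pushdown: the source applies the filter, so only matching rows move."""
    return conn.execute(
        "SELECT source, id, region FROM events WHERE region = ?", (region,)
    ).fetchall()


# Three hypothetical sources with 100 rows each.
sources = [make_source(f"src{i}", 100) for i in range(3)]

# Each source is read by its own worker, so the loads run in parallel.
with ThreadPoolExecutor(max_workers=len(sources)) as pool:
    full = list(pool.map(load_all, sources))
    pushed = list(pool.map(lambda c: load_pushdown(c, "EU"), sources))

rows_moved_full = sum(len(rows) for rows in full)
rows_moved_pushed = sum(len(rows) for rows in pushed)

# Filtering in memory after a full load gives the same answer as pushdown.
in_memory_filtered = [row for rows in full for row in rows if row[2] == "EU"]
pushed_flat = [row for rows in pushed for row in rows]

print(rows_moved_full, rows_moved_pushed)
```

Both paths return the same rows for the `EU` filter, but the pushdown path moves only a quarter of the data in this toy setup. Which path is better in practice depends on how often the data is queried and how selective the queries are.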
You can have some data that's been pre-positioned and loaded into memory, some data that you load directly into memory from your sources, and other data that you're querying in real time. Depending on your use case, you can select the approach that works best for the data and the way it's queried, for the ultimate flexibility in data virtualization within the data fabric. So for more information please visit cambridgesemantics.com and have a look at some of our blogs and white papers that talk about these exciting new features. Thank you.