Name | MPP Data Virtualization |
Description | Ben Szekely, Senior Vice President and Head of Field Operations, discusses how semantic metadata technology, powered by MPP capabilities, gives users complete control over how they elevate their Data Fabric for discovery and integration. Then he offers a sneak peek at new functionality coming to Anzo later in 2020. |
Thumbnail URL | https://embed-ssl.wistia.com/deliveries/b006ef1724ff614a6... |
Embed URL | https://fast.wistia.net/embed/iframe/l2kvdiv2xy |
Duration | PT378S |
Upload Date | 2020-04-08T03:47:49+00:00 |
Transcript |
Hi everyone, I'm Ben Szekely, Senior Vice President and Head of Field Operations here at Cambridge Semantics, and today I'm going to give you a quick walkthrough of a new capability coming soon to the Anzo platform. Massively parallel processing (MPP) data virtualization is a true breakthrough for data integration within the context of the enterprise data fabric. The enterprise data fabric is an architecture for modern data management that anticipates the need to connect data across the business. It's an overlay that sits on top of your existing data warehouses, data lakes, cloud data repositories, document repositories, and relational database systems, and it lets users quickly blend and combine data from different sources using business models. That combined data can then be consumed in any sort of application, BI or analytics tool, or explored directly in the underlying graph model itself. This type of data integration has to be fast, agile, and flexible, and it has to work at real enterprise scale. Traditional approaches that move all the data into one place, like a data warehouse, just can't scale or keep up with this kind of demand, so something new is required. Let's take a look.
Pure federation-based approaches have been seen as a fairly seductive way to solve this problem. You simply put a federated query engine on top of all your data sources, run a SQL query, and it magically queries all the underlying sources and brings your data back. But this approach has had problems at enterprise scale. It requires things like SQL caches to achieve practical performance, which ultimately narrows the set of use cases you can solve while you're maintaining that cache. It's also a single-threaded OLTP architecture, so each query can have very long runtimes, it relies heavily on those SQL caches, and it can be very taxing on the source systems if not done properly.
At Cambridge Semantics we looked at these approaches a long time ago and decided early on that they weren't going to be suitable for data integration at data fabric scale.
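To make the federation pattern concrete, here is a minimal sketch of what a pure federated query looks like, using Trino as a stand-in federated SQL engine. Anzo is not built on Trino; the host, catalogs, and table names below are all hypothetical. A single SQL statement fans out to two different source systems and joins the results at query time:

```python
import trino  # stand-in federated SQL engine, for illustration only

# Hypothetical coordinator and credentials.
conn = trino.dbapi.connect(
    host="federation-gateway.example.com",
    port=8080,
    user="analyst",
)
cur = conn.cursor()

# One query spanning two catalogs: one backed by an RDBMS,
# one backed by a data lake. The engine federates the work live.
cur.execute("""
    SELECT c.region, SUM(o.total) AS revenue
    FROM postgresql.public.orders AS o
    JOIN hive.warehouse.customers AS c
      ON o.customer_id = c.id
    GROUP BY c.region
""")
print(cur.fetchall())
```

The appeal is that no data is moved up front; the drawbacks described above appear once queries like this have to run concurrently against busy source systems.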
So up through the current version, Anzo 5.0, we've taken a very practical approach to data integration at data fabric scale that makes use of our massively parallel graph engine, AnzoGraph. We use Spark jobs to rapidly pre-position data from data sources in a lightweight fashion. We use metadata, intelligence, and understanding of the data sources to automatically onboard data into cheap, compressed graph storage. Because the graph storage is compressed it doesn't take up a lot of space, and the metadata cataloging gives administrators and users a lot of freedom and flexibility over the lifecycle, so you don't have to move all of your data, just the data that you need. Then our MPP engine can very quickly load the data from that pre-positioned graph storage into memory and let users run MPP queries really fast over it. The net benefit is that users get the data they need when they need it, with very efficient pre-positioning of that data, and admins have full control over the lifecycle.
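As a rough illustration of the pre-positioning step, here is a minimal PySpark sketch that pulls one relational table into a staging area in parallel. Anzo's actual onboarding jobs are metadata-driven and write into its compressed graph storage; the connection details, table, and staging path below are hypothetical, and Parquet stands in for that storage format:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("preposition-orders").getOrCreate()

# Read just the table we need from the source system; Spark parallelizes
# the extract across partitions of the source table.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-db:5432/sales")  # hypothetical source
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "change-me")
    .option("partitionColumn", "id")
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "8")
    .load()
)

# Land the extract in cheap, compressed columnar storage, ready to be
# loaded into the in-memory MPP engine on demand.
orders.write.mode("overwrite").parquet("s3a://fabric-staging/orders/")  # hypothetical path
```

Only the data a use case actually needs gets staged, which is what gives administrators the lifecycle control described above.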
This has worked incredibly well for our existing customers at Cambridge Semantics. But as we look to deliver data at true enterprise scale and provide that overlay across all of the data in the business, pre-positioning still works, but on its own it can't keep up with what's required. Our engineering team has often wondered: can we apply the massively parallel AnzoGraph database to do data virtualization directly? That's exactly what we've done. Coming later this year is our breakthrough MPP data virtualization capability, which I'm now going to walk you through.
First off, the AnzoGraph engine is now capable of loading data into memory directly from any source system, all in parallel. Right within your data-loading queries you can define connections to databases, apply lightweight mappings, pull from APIs, and ingest JSON, XML, and CSV formats, all in memory, all in parallel, directly in the graph engine. What we've observed is that you can load data really fast from these sources, in some cases almost as fast as you can load it from pre-positioned local storage. That's going to make a huge difference for customers that want to rapidly onboard data into the graph engine for use in data integration and all kinds of queries.
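The sketch below illustrates the general idea of loading several heterogeneous sources into one in-memory graph in parallel. It uses rdflib and a thread pool rather than AnzoGraph itself; the namespace, CSV file, and API endpoint are hypothetical, and the hand-written mappings stand in for Anzo's lightweight mappings:

```python
import csv
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.com/")  # hypothetical business model namespace


def load_csv(path):
    # Lightweight mapping: each CSV row becomes a handful of triples.
    g = Graph()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            subject = EX[f"customer/{row['id']}"]
            g.add((subject, EX.name, Literal(row["name"])))
    return g


def load_json_api(url):
    # Pull records from a JSON endpoint and map them the same way.
    g = Graph()
    for record in json.load(urlopen(url)):
        subject = EX[f"order/{record['id']}"]
        g.add((subject, EX.total, Literal(record["total"])))
    return g


# Load both sources concurrently, then merge the results into one
# in-memory graph that is ready to be queried.
graph = Graph()
with ThreadPoolExecutor() as pool:
    futures = [
        pool.submit(load_csv, "customers.csv"),                        # hypothetical file
        pool.submit(load_json_api, "https://api.example.com/orders"),  # hypothetical API
    ]
    for future in futures:
        graph += future.result()

print(f"{len(graph)} triples loaded in memory")
```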
In addition to loading data directly into memory, we can also do pushdown query planning that minimizes data movement: when a user issues a query against AnzoGraph, it can in turn apply the views defined in the query to query that data directly in the source databases in real time. The benefit of all this is that users get the MPP in-memory query capabilities of Anzo and AnzoGraph without having to pre-position the data on disk first. That leads to much faster cycle times, faster deployment, and more flexibility: all the things the data fabric provides are now available even faster.
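Pushdown itself is a general technique. As an analogy only (this is Spark's JDBC pushdown, not AnzoGraph's query planner, and the connection details are hypothetical), the selective part of a query can be handed to the source database so that only the matching rows ever move across the network:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-illustration").getOrCreate()

# The WHERE clause runs inside the source database; only the rows that
# survive the filter are shipped back to the engine for further work.
recent_orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-db:5432/sales")  # hypothetical source
    .option("query", "SELECT id, customer_id, total "
                     "FROM orders WHERE order_date >= DATE '2020-01-01'")
    .option("user", "etl_user")
    .option("password", "change-me")
    .load()
)

recent_orders.groupBy("customer_id").sum("total").show()
```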
To summarize, with these new capabilities Anzo can support a hybrid structure. The reality is that for some use cases you want to pre-position data, for some you want to load the data into memory just before querying it, and other times you'll want queries to go directly against the data source at query time. With Anzo's graphmart capability, you'll be able to apply all three of these approaches to the same use case within a single graphmart: some data that's been pre-positioned and loaded into memory, some data that you load directly into memory from your sources, and other data that you query in real time. Depending on your use case, you can select the approach that works best for the data and the way it's queried, for the ultimate flexibility in data virtualization within the data fabric. For more information, please visit cambridgesemantics.com and have a look at some of our blogs and white papers about these exciting new features. Thank you.
|