Archive for April 2008

Dataupia

I spoke with Jim McManus, VP Channels and Alliances, from Dataupia last week and wanted to share my impressions of the company and its primary offering, the Satori Server.  Dataupia is based in Cambridge, MA, and has received venture funding from Polaris Venture Partners, Fairhaven Capital Partners, and Valhalla Partners, including a $16M infusion last fall.  The company was founded in 2005 and reached general availability with the Satori Server in May 2007.  Foster Hinshaw, the CEO, is known as the “Father of the Data Warehouse Appliance” and founded Netezza before leaving to start Dataupia.  They list several customer success stories on their web site, and currently have 4 customers according to Jim.  They recently issued a press release with Subex announcing the deployment of a 150TB Oracle OSS system on Satori Server.

The Satori Server is their flagship product, and is touted as a true data warehouse appliance (think Netezza and Teradata).  It includes server, storage, and “optimization software” bundled into one package.  The hardware is built from commodity off-the-shelf components, but is engineered to handle data-intensive operations.  The “optimization software” includes a scaled-back Postgres open source relational database management system and a vanilla Red Hat Linux operating system.  The system is configured in a shared-nothing, or MPP, architecture.  They currently store data in a row alignment, but are considering offering column-based storage.  According to Jim, the system is currently targeted at operational reporting, which lends itself to row-based processing.  Satori Server performance enhancement relies heavily on a dynamic aggregation capability that is defined at loading time.

The differentiator for Dataupia is their “Omniversal Transparency”: basically, they support queries in Oracle, DB2, and SQL Server natively.  As an example, take a data warehouse sitting on an Oracle database.  The data warehouse might have a large number of tables, most of them reference or smaller volume tables alongside several large “fact tables”.  Once the Satori Server is connected to the network, the Oracle DBA can map the larger volume table(s) and migrate their data.  After the migration is complete, queries run as normal within Oracle, with processing on the larger tables handled within the Satori Server.  The benefits of this approach are significant, as it requires no change to applications currently built to access the existing RDBMS.  The DBA can handle most of the application monitoring from the existing RDBMS console, although Dataupia does include a console for configuring and monitoring the Satori Server directly.

Dataupia provides a number of benchmark results on their web site, including load times (70 MB/sec), refresh rate, and drill-down capabilities (24 months of detail data accessed in 5 sec).  They have not yet released TPC-H results (planned for this summer), which would allow for a more direct comparison to their competitors in this space.  A potential concern is the reliance on data aggregation to boost performance results, which brings scalability into question.  A heavy reliance on aggregates can result in significantly increased data sizes to support a mixed reporting environment, particularly one with a heavy ad-hoc user base.  The Subex announcement provides one scalability data point, but I’d like to see some additional production implementations before eliminating this concern.
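As a rough illustration of the sizing concern, the sketch below counts how many distinct aggregate groupings would be needed to pre-compute every combination an ad-hoc user might ask for.  The dimension names are hypothetical, and this is not a description of how Satori Server actually builds its aggregates.

```python
from itertools import combinations


def aggregate_groupings(dimensions):
    """Return every non-empty combination of dimensions that could be pre-aggregated."""
    groupings = []
    for size in range(1, len(dimensions) + 1):
        groupings.extend(combinations(dimensions, size))
    return groupings


# Hypothetical dimensions for a fact table; not taken from any Dataupia material.
dimensions = ["customer", "product", "region", "month"]
groupings = aggregate_groupings(dimensions)

# n dimensions yield 2**n - 1 possible groupings, so each added dimension
# roughly doubles the number of aggregates needed to cover ad-hoc queries.
print(len(groupings))
```

With four dimensions there are already 15 possible groupings, which is why a predictable operational reporting workload is a much easier fit for aggregates than a heavy ad-hoc one.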

Dataupia offers consulting services, which consist of a one-week jumpstart engagement that hooks up the Satori Server, does initial configuration, migrates selected data, and runs a number of tests to ensure everything is working properly.  They rely on their system integrator partners to handle any required work outside the scope of this jumpstart effort.

They have a well defined partnership strategy, with three categories:

  • Solution Partners that extend their product in areas such as business intelligence and data mining
  • System Integrators that work with customers to implement the Satori Server within the larger data warehouse initiative
  • Technology Partners that provide complementary hardware or software components that are integrated into the product

In summary, a customer with an existing legacy system should strongly consider this solution, due to the enormous cost savings associated with not having to migrate data integration, reporting, or other existing applications.  This represents not only a labor cost savings, but also an opportunity savings, since it limits the impact on the analytical business processes already in place.  The greenfield situation is not as clear, due to the lack of industry benchmarks and the fact that Satori Server is not a standalone solution.  In either case, scalability needs to be thoroughly tested, as this is a new product with a small installed customer base.

Venture Capital activity in Business Intelligence

I pulled the following statistics from my vendor database 360DegreeVendor:

  • There are 218 Venture Firms with active investments in business intelligence, data warehouse, & data integration vendors
  • Average total investment in each vendor is US$10.3M
  • There are 135 vendors with active venture backing
  • $1.4B total estimated active venture funding in the business intelligence, data warehouse, & data integration space
  • 1.8 average vendor investments per venture firm
  • Business Activity Monitoring product vendors had the highest percentage (56%) of venture funding for offering categories containing 10 or more vendors
  • As expected, vendors with a Reporting offering had the highest total number of venture backed vendors (30)
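The total funding estimate follows directly from the vendor count and the average investment.  A quick check, using only the figures listed above (the variable names are mine):

```python
# Figures copied from the statistics above.
venture_backed_vendors = 135
average_investment_musd = 10.3  # US$ millions per vendor

# Total estimated active venture funding, in US$ billions.
total_funding_busd = venture_backed_vendors * average_investment_musd / 1000

print(f"${total_funding_busd:.2f}B")
```

This comes to roughly $1.39B, which rounds to the $1.4B figure quoted in the list.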

Elixir Technology

I was introduced to Elixir by Brooks Rembert from JNetDirect recently and wanted to share my impressions of the company and their primary offering, Elixir Repertoire 7.  Elixir is a 10-year-old privately held company based in the Republic of Singapore.  They claim over 600 customers worldwide, most in Singapore and the Far East.  They have a very robust partnership program in place, with Channel and OEM partners that resell their product, and service partners that assist clients with implementations.  JNetDirect is the exclusive distribution partner in the United States.

Their flagship product, Repertoire 7, is a Java-based reporting and dashboard tool that has a data integration capability as well.  I was recently given a demo of their product by the CEO, Shih Hor-Lau.  He walked me through the reporting capability, which includes support for writing and distributing standard reports along with an OLAP capability to support power users.  They also have a very flexible dashboard capability, with a number of widgets and gadgets available out of the box.  The product was developed from the ground up with Java, but includes a set of APIs (“REST APIs”) that support integration with Java, .NET, Perl, AJAX, Ruby, and Flex.  The data integration component, “Data ETL”, is actually an Enterprise Information Integration (EII) tool that allows source systems to be mapped, and data to be extracted and integrated at report run-time.  They do not currently have an underlying meta-data repository, but Hor-Lau said they are planning to add that in a future release.

Elixir has a small professional services organization that is capable of handling installation and configuration of the product.  They rely on their service partners or the client to handle the end-to-end implementation activities.

JNetDirect is currently handling Level 1 support in the US, and is ramping up to provide Level 2 support in the near future.  All Level 3 support will be handled by Elixir out of their Singapore offices.

In summary, Repertoire seems to be a viable alternative for someone looking for a low-cost integrated reporting and dashboard development tool.  The data integration component provides the ability to map into a variety of data source types, but is not a heavy-duty ETL tool and should not be used to handle large data volumes, complex transformations, or data cleansing.  The lack of a meta-data repository is a concern, as it limits visibility to what is available in the Elixir tool, and prevents you from integrating other tools such as a third-party ETL or meta-data management product.

Reference:

-        Interview with CEO on SYS-CON TV at JavaOne 2006

Vendor Statistics from 360DegreeVendor database

The following are some statistics from my 360DegreeVendor database, which tracks and catalogs vendors in the business intelligence, data warehouse, and data integration space. I will publish these numbers on a monthly basis.  Detailed reports are currently available on my blog post, and will be available on my corporate web site in the near future.  There are currently 587 vendors in the database.

 

1. Vendor Category Summary

Partnerships and Venture-backed companies are the most prevalent, underscoring the tremendous growth and activity in this space.  I’ve calculated an average venture investment of approximately $10M per company (which I think is very conservative), which produces a current total venture investment amount of $1.35B in this sector.

Vendor Categories

| Category Type | Category Name | Vendor Count |
|---|---|---|
| Large | Listed | 68 |
| Large | Private, $100M+ Revenue | 4 |
| Large | Division of Large Company | 80 |
| Medium | Venture Backed | 135 |
| Medium | Management Team | 45 |
| Small | Partnership | 192 |
| Small | Individual Practitioner | 63 |
| TOTAL | | 587 |

 

2. Vendors by Offering Category

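Vendor Offerings

| Category | Offering Name | Total Vendors | Venture Backed |
|---|---|---|---|
| Product | Analytics Suite | 32 | 9 |
| Product | Appliance | 13 | 5 |
| Product | Business Activity Monitoring | 18 | 10 |
| Product | Business Performance Management | 61 | 13 |
| Product | Customer Data Integration | 7 | 4 |
| Product | Data Cleansing | 25 | 2 |
| Product | Data Mining | 25 | 10 |
| Product | Data Profiling | 1 | 0 |
| Product | Data Services | 3 | 2 |
| Product | Database | 13 | 2 |
| Product | DBA Support | 12 | 4 |
| Product | Digital Asset Management | 13 | 5 |
| Product | Enterprise Application Integration | 2 | 2 |
| Product | Enterprise Information Integration | 55 | 20 |
| Product | Extract, Transform, Load (ETL) | 30 | 3 |
| Product | Geographical Information System (GIS) | 2 | 1 |
| Product | Information Aggregation | 2 | 2 |
| Product | Master Data Management (MDM) | 5 | 1 |
| Product | MetaData | 13 | 5 |
| Product | Online Analytical Processing (OLAP) | 6 | 3 |
| Product | Reporting | 125 | 30 |
| Product | Risk Management & Governance | 13 | 3 |
| Product | Search | 11 | 6 |
| Product | Sorting | 2 | 0 |
| Product | System Management | 1 | 0 |
| Research | Best Practice | 11 | 0 |
| Research | Vendor Evaluation | 0 | 0 |
| Service | Database Marketing | 45 | 15 |
| Service | Education | 2 | 0 |
| Service | Hosted Services | 4 | 1 |
| Service | Outsourcing | 7 | 1 |
| Service | Product Consulting | 158 | 47 |
| Service | Seminars | 2 | 0 |
| Service | Software Resale | 3 | 0 |
| Service | Strategy | 36 | 0 |
| Service | Support and Maintenance | 6 | 1 |
| Service | System Integration | 140 | 3 |
| Service | Training | 104 | 29 |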
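To see which product categories attract the most venture money relative to their size, the share of venture-backed vendors can be computed from the offering counts listed above.  The sketch below is limited to product categories with 10 or more vendors; the counts are copied from the listing, while the function and variable names are mine.

```python
# (offering name, total vendors, venture backed) for product categories
# with 10 or more vendors, copied from the Vendor Offerings listing.
product_offerings = [
    ("Analytics Suite", 32, 9),
    ("Appliance", 13, 5),
    ("Business Activity Monitoring", 18, 10),
    ("Business Performance Management", 61, 13),
    ("Data Cleansing", 25, 2),
    ("Data Mining", 25, 10),
    ("Database", 13, 2),
    ("DBA Support", 12, 4),
    ("Digital Asset Management", 13, 5),
    ("Enterprise Information Integration", 55, 20),
    ("Extract, Transform, Load (ETL)", 30, 3),
    ("MetaData", 13, 5),
    ("Reporting", 125, 30),
    ("Risk Management & Governance", 13, 3),
    ("Search", 11, 6),
]


def venture_share(row):
    """Fraction of vendors in a category that are venture backed."""
    _, total, backed = row
    return backed / total


# Rank categories from highest to lowest venture-backed share.
ranked = sorted(product_offerings, key=venture_share, reverse=True)
for name, total, backed in ranked[:3]:
    print(f"{name}: {backed}/{total} = {backed / total:.0%}")
```

Business Activity Monitoring comes out on top at 10 of 18 vendors (about 56%), narrowly ahead of Search at 6 of 11.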

The “Microtarget”

I read an interesting story in the Washington Post business section today, discussing the use of “microtargeting” in the current presidential election.  The author received input from a guy named Colin Shearer who works for SPSS.  They described the process of identifying “small but crucial groups of voters who might be won over to a given side…”.  This sounds similar to what corporations have been doing (or trying to do) for years, by targeting small groups or even individuals with tailored product or service offerings.  It would be interesting to know how effective this tactic has been in the current primaries, particularly the Democratic run-off.

I’m not sure that putting sophisticated marketing analytic tools in the hands of politicians (or more accurately in the hands of their campaign managers) is a good thing.  As the article points out, this capability increases the likelihood of deception since they presumably know much more about the target audience and can target associated hot buttons.  But I suppose as consumers we’ve all grown accustomed to the daily marketing barrage, and hopefully can see through these attempts.  After all, the stakes are significantly higher than the risk of purchasing the wrong shampoo.

Exasol AG

I’ve spoken with Stu Greenberg (VP, Business Development) from Exasol several times in the past month and wanted to share my impressions of the company and their primary product offerings, EXASolution and EXACluster OS.  Exasol is a privately held corporation based in Nuremberg, Germany.  They are currently opening up operations in the United States, an effort being spearheaded by Stu.

 

EXASolution is a parallel in-memory relational database management system.  Key features include an in-memory clustered RDBMS, in-memory query processing, massively parallel data processing, compression algorithms, and integration via industry standard interfaces (e.g., ODBC, ADO).  They also tout a self-learning and self-optimizing management system that reduces IT maintenance costs by “up to 50%” by reducing or eliminating many of the time-consuming system administration tasks.  EXACluster OS is the underlying operating system, built from scratch starting with the Linux nano-kernel.  The combination of these two components, along with off-the-shelf Intel-based hardware, allows a customer to build a very scalable, low-cost data warehouse appliance.

Their recently posted TPC-H marks are very impressive.  Exasol’s posted score of 580,729 queries per hour (QphH) in the 1,000 GB test was nearly twice that of the next fastest mark, by ParAccel.  Their Price/QphH was about a third of the ParAccel cost, and about 3% of Oracle’s cost.  Exasol posted even greater numbers in the 300 GB test, and was second only to newcomer Kickfire in the 100GB cost category (although with more than four times Kickfire’s QphH).  I realize that these benchmarks are not always indicative of real world processing, and that Exasol has yet to produce these numbers in the larger data size tests, but I’m sure there are a number of companies with 1TB systems out there that should at least add Exasol to their evaluation list.

Exasol offers implementation and technical consulting support, focused on the installation and configuration of the EXASolution.  The company is relying on either the client or systems integrators to handle the bulk of the implementation work, at least in the initial US installations.  They also offer a number of training courses on EXASolution.

Exasol has partnerships established with all of the major BI players such as Business Objects (SAP), Cognos (IBM), MicroStrategy, and SAS. They also have partnerships in place with other key players such as Oracle, Microsoft, and HP.

Exasol has identified two customer references, IMS Health and KarstadtQuelle Group, both European based operations.  They are looking to leverage these long standing relationships to build out their US business.

I think for someone evaluating data warehouse appliances, Exasol certainly deserves to be on the evaluation list.  As an evaluator of their technology, at a minimum, I’d want to understand the following:

  • What is the size and complexity of their current production systems, and how does their real-world performance compare to their TPC-H marks?

  • What happens to performance if my production data set grows larger than my available memory (forcing some queries to disk)?

  • What are the steps associated with implementing EXASolution, and how much support could I expect to get from Exasol throughout this process?

  • Has EXASolution been tested with my chosen toolset (e.g., ETL, BI, Modeling), and are there production systems currently running with those tools in use?

  • I’d like to see a demonstration of their “self learning and self optimizing management” capability.  In my mind, this is as important as (or more important than) the processing speeds and Price/QphH, since hardware is increasingly cheap but your System Administrator/DBA time is not.

I’d be interested in others’ input on this subject, particularly from someone who has implemented this product in a customer environment.

CDI (Customer Data Integration)

CDI is the use of technology and business processes to build a 360 degree view of the customer.  CDI is sometimes considered to be a subset of or related to Master Data Management (MDM). 

From a technical perspective, CDI involves a number of typical data warehouse activities, such as extract, transform, and load (ETL), data cleansing, and meta data management, among others.  These approaches have been discussed in great detail by numerous sources, so I won’t elaborate here.  One aspect of the technical approach is critical to the solution: in-house or outsource?  You need to decide whether to develop this solution in-house, with either custom code or one of the numerous products marketed under the CDI banner, or to outsource the CDI effort.  The pros and cons of each are outside the scope of this posting, but it’s safe to say this is a critical decision point for the project, as it impacts all downstream decisions.

The major difference as I see it is in the business approach.  The typical data warehouse initiative may be very broad in nature (enterprise focus), or it may be targeted at a particular business unit and/or functional group (e.g., marketing).  The CDI initiative is all about integrating and eliminating duplicate customer data, to provide a clean, consistent view of the relationship the company has with each of its customers.  The core data associated with a 360 degree customer view includes:

  • Unique identifiers such as customer id and SSN
  • Transaction data such as orders and payments
  • Account structure including parent child relationships
  • All Interactions between the company and the customer, including e-mail, phone calls, and snail mail
  • Marketing Attribution elements that allow the customer to be tied to specific marketing campaigns
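The elements listed above can be sketched as a simple record, along with a naive routine for merging duplicates.  This is a minimal sketch under stated assumptions: the `CustomerView` class, its field names, and the exact-match rule on a normalized customer id are all hypothetical, and real CDI matching typically needs far more than an exact key comparison.

```python
from dataclasses import dataclass, field


@dataclass
class CustomerView:
    """Hypothetical container for the core data of a 360 degree customer view."""
    customer_id: str
    transactions: list = field(default_factory=list)
    child_accounts: list = field(default_factory=list)
    interactions: list = field(default_factory=list)
    campaigns: list = field(default_factory=list)


def merge_duplicates(records):
    """Merge records that share a normalized customer id into one view."""
    merged = {}
    for rec in records:
        # Assumed matching rule: ids are equal after trimming and upper-casing.
        key = rec.customer_id.strip().upper()
        view = merged.setdefault(key, CustomerView(customer_id=key))
        view.transactions.extend(rec.transactions)
        view.child_accounts.extend(rec.child_accounts)
        view.interactions.extend(rec.interactions)
        view.campaigns.extend(rec.campaigns)
    return list(merged.values())


# Made-up records: the first two describe the same customer.
records = [
    CustomerView("c-100 ", transactions=["order 1"]),
    CustomerView("C-100", interactions=["phone call"]),
    CustomerView("C-200", campaigns=["spring mailer"]),
]
views = merge_duplicates(records)
print(len(views))
```

Even in this toy form, the merge rule encodes a decision about what counts as the same customer, which is a business question before it is a technical one.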

The key issue in a CDI initiative will revolve around the proper definition of a “customer”, and the resulting effort to build a business model of the customer that accurately reflects the views of all business constituents.  Until this issue has been resolved, the technical implementation should not begin.

Please download my white paper 360 Degree Customer View for a more detailed discussion on this topic.

Reference Links:

-        Wikipedia Entry

-        DMReview Article

-        Gartner CDI Magic Quadrant

-        CDI Station

360DegreeVendor

Over the past 2+ years, I have developed a database of vendors (currently 575+) that play in the business intelligence, data warehouse, data integration, and related spaces.  I’ve developed an “offering taxonomy” that defines vendor offerings along three broad categories: product, service, and research.  I’ve further broken those categories down into specific offerings, such as “Extract, Transform, Load” under “Product”, or “System Integration” under “Service”.  The third category, “Research”, has not been fully developed, but I felt it was necessary to break this out separately.  Since the space is so unwieldy (IMHO), a component of this database is a list of “buzzwords” that are commonly used within the industry.

Following is an overview and a list of the reports that I’ve produced to date.  I plan on adding a few more, with the long-term goal of making the information in the database available via an ad-hoc reporting capability.  You’re free to download and use these reports; all I ask is that you shoot me an e-mail letting me know which reports you download, and provide feedback (both good and bad).  Additional reports, as well as updates to these reports, will be available on my web site in the next couple of weeks.

  • 360DegreeVendor© is a database of 575+ vendors with product, service, and research offerings in the business intelligence, data warehouse, and data integration space
  • Contains specific vendor information including leadership, corporate structure,  URL, and parent organizations
  • All vendors are categorized as small, medium, or large, with further breakouts such as listing status, venture backing, and leadership structure
  • Catalog of all offerings, including services, software products, training, and research
  • Vendor offerings are further defined along a taxonomy of over 40 detailed product and service categories
  • Venture backed organizations include venture fund name and total annual investment (where available)
  • Public companies include listing agency, market cap, and other summary financial information 

Vendor Focused Reports:     

  • Vendor Offerings - Alphabetical listing of all vendors, with product offerings and corporate information
  • Corporate Structure – Listing of vendors by their corporate type (e.g., Listed, Venture Backed, Partnership)
  • Vendor Category Summary - Number of vendors by category (small, medium, large) and type (e.g., individual, venture backed, publicly listed) 

Offering Focused Reports:

Venture Capital Focused Reports:

Reference Reports:

Turkey Hunting

I was able to get out for opening day of Maryland wild turkey season this morning.  I didn’t get a turkey, but it was a great morning to be in the woods.  In Maryland I hunt at McKee-Beshers Wildlife Management Area, which is a wonderful area for anyone who enjoys the great outdoors.  It’s nearly 2,000 acres of habitat in Poolesville, MD, along the Potomac River.  If you live in or near Maryland you should check it out sometime.  I saw geese with their goslings, wood and mallard ducks, and a doe who almost walked into my lap trying to figure out what my turkey decoys were doing under “her” oak tree.

VDI (Virtual Data Integration)

What is virtual integration and when is it appropriate to use?  Virtual data integration (VDI) is the use of a software layer that interfaces with the reporting or end-user delivery application, in place of direct access to a data repository.  The software layer provides a mapping between the conceptual data model that the user interfaces with, and the physical model(s) and underlying data stores.  The reason I say “model(s)” is that one of the most powerful aspects of VDI is the ability to make multiple data stores (whether they be transactional, ODS, data warehouse/marts, or others) look like one integrated repository.  Below are the pros and cons of using VDI versus physical data integration.

Pros:

  • Shortens implementation time since data does not have to be physically integrated
  • Provides short term benefit to the business while buying time for technical team to integrate/retire existing redundant and fragmented data architecture
  • Can reduce data storage requirements
  • Source data updates are available immediately to end-users
  • Eliminates cost and time associated with extraction and movement of data (e.g., no batch window)
  • Business rules can be modified “on the fly”

Cons:

  • Data is not accessible if source system is unavailable
  • Complex transformations and aggregations can impact report response time
  • Updates to source systems can result in out of sync data on reports
  • Complex match-merge routines may not be possible outside of batch processing
  • Accessing large data sets can impact report response time
  • Query processing can impact source system performance

In my mind, the key differentiator is the volume of data being analyzed.  If end users are accessing relatively few records, a VDI solution is viable.  But when large data volumes need to be accessed, such as in a data mining or long term trending exercise, performance may become an insurmountable issue, from both the end user’s and the source system owner’s perspective.
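The mapping layer described at the top of this post can be sketched in a few lines.  This is a minimal illustration, not any vendor's implementation: the `VirtualView` class, the store names, and the column names are all hypothetical, and plain dictionaries stand in for the physical data stores.

```python
class VirtualView:
    """Maps logical field names to (store, physical column) pairs."""

    def __init__(self, stores, mapping):
        self.stores = stores    # store name -> {key value -> row dict}
        self.mapping = mapping  # logical field -> (store name, physical column)

    def fetch(self, key_value, fields):
        """Resolve each logical field against its source store at query time."""
        result = {}
        for logical in fields:
            store_name, column = self.mapping[logical]
            row = self.stores[store_name].get(key_value)
            if row is None:
                raise LookupError(f"{store_name} has no row for {key_value!r}")
            result[logical] = row[column]
        return result


# Hypothetical in-memory stand-ins for two physical data stores.
stores = {
    "crm": {42: {"cust_nm": "Acme Corp"}},
    "billing": {42: {"bal_amt": 1250.0}},
}
mapping = {
    "customer_name": ("crm", "cust_nm"),
    "balance": ("billing", "bal_amt"),
}
view = VirtualView(stores, mapping)
print(view.fetch(42, ["customer_name", "balance"]))
```

Because nothing is copied, a change in either store shows up on the next `fetch`, which is the immediacy listed under the pros.  The same property means every report query lands on the source systems, which is where the cons come from.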