Advice on how to scale and improve execution times of a “pivot-based query” on a billion-row table that grows by one million rows a day?

Perhaps this white paper by the SQL Server CAT team on the pitfalls of the Entity-Attribute-Value database model can help: sqlcat.com/whitepapers/archive/2008/09/0....

This white paper almost directly answers the poster's question with best-practice advice. – Philip Rieck Jun 16 '09 at 15:23.

I'd start by posting the exact table metadata (along with indexing details), the exact query text, and the execution plan. With your current table layout, a query similar to this: SELECT * FROM Items WHERE Size = @A AND Version = @B cannot benefit from a composite index on (Size, Version), since it is impossible to build such an index. You cannot even build an indexed view, since it would contain a self-join on the attributes table.
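
To make that concrete, here is a hedged sketch of what such a filter ends up looking like against an EAV layout. The table and column names (Items_Attributes, ItemId, AttributeName, AttributeValue) and the sample values are assumptions, not the poster's actual schema:

```sql
-- Sketch of the Size/Version filter against an assumed EAV layout.
DECLARE @A NVARCHAR(50), @B NVARCHAR(50);
SET @A = N'500mB';
SET @B = N'1.0.0';

SELECT  i.Id, i.Name
FROM    Items AS i
        JOIN Items_Attributes AS size_attr
          ON  size_attr.ItemId = i.Id
         AND size_attr.AttributeName = 'Size'
         AND size_attr.AttributeValue = @A
        JOIN Items_Attributes AS version_attr
          ON  version_attr.ItemId = i.Id
         AND version_attr.AttributeName = 'Version'
         AND version_attr.AttributeValue = @B;
-- Size and Version live in different rows of Items_Attributes, so no single
-- composite index can cover both predicates; each one needs its own self-join.
```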

Probably the best decision would be to denormalize the table like this: (id, name, size, version) and create an index on (size, version).
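
A minimal sketch of that denormalized layout; the table name, column types, and sample values are assumptions:

```sql
-- Denormalized layout: attributes stored as real columns.
CREATE TABLE Items_Denormalized
(
    Id      INT           NOT NULL PRIMARY KEY,
    Name    NVARCHAR(255) NOT NULL,
    Size    NVARCHAR(50)  NULL,
    Version NVARCHAR(50)  NULL
);

-- The composite index that was impossible under EAV becomes possible:
CREATE INDEX IX_Items_Size_Version ON Items_Denormalized (Size, Version);

-- ...and the query can seek on both predicates at once instead of self-joining:
DECLARE @A NVARCHAR(50), @B NVARCHAR(50);
SET @A = N'500mB';
SET @B = N'1.0.0';

SELECT Id, Name
FROM   Items_Denormalized
WHERE  Size = @A
  AND  Version = @B;
```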

Probably best to take the parameters, find out what the id is that you need from the lookup tables, and then use the id in the query – SQLMenace Jun 16 '09 at 15:29

@SQLMenace: if 1,000,000 rows satisfy size, 1,000,000 rows satisfy version, and 1,000 rows satisfy both, the composite index would require 1,000 scans, while scanning the lookup tables twice would require 2,000,000 scans. – Quassnoi Jun 16 '09 at 15:35

That's right, the more filtering criteria you add, the longer it takes to execute – knoopx Jun 16 '09 at 15:47

@knoopx: the impossibility of creating composite indexes is the main drawback of such a design. If you need multiple search criteria, you should keep your attributes as table columns – Quassnoi Jun 16 '09 at 15:54

If the values table is pretty narrow (2 × 4 bytes for the ids and another 20 bytes for the value) and you have an index on value, it might be doable, since you could store a million rows in a little under 3,500 pages, but then again I don't know what the data looks like – SQLMenace Jun 16 '09 at 16:01.

I have worked with such schemas many times. They never perform well. The best thing is to just store the data as you need it, in the form:

| ItemName | Size  | Version |
|----------|-------|---------|
| Sample   | 500mB | 1.0.0   |

Then you don't need to pivot.
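
If you do move to that wide layout, a one-off conversion from the existing name/value rows can be done with conditional aggregation. A hedged sketch, reusing the assumed table and column names from above:

```sql
-- One-off migration sketch: collapse the name/value rows for each item into
-- one wide row per item. All object names are assumptions.
INSERT INTO Items_Denormalized (Id, Name, Size, Version)
SELECT  i.Id,
        i.Name,
        MAX(CASE WHEN a.AttributeName = 'Size'    THEN a.AttributeValue END) AS Size,
        MAX(CASE WHEN a.AttributeName = 'Version' THEN a.AttributeValue END) AS Version
FROM    Items AS i
        JOIN Items_Attributes AS a ON a.ItemId = i.Id
GROUP BY i.Id, i.Name;
```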

And BTW, please do not call your original EAV schema "normalized" - it is not normalized.

This looks to me like issuing OLAP queries against a database optimized for OLTP transactions. Not knowing the details, I'd recommend building a separate "data warehouse" optimized for the kind of queries you are doing. That would involve aggregating data (where possible), denormalizing, and accepting a database that is a day old or so.

You would incrementally update the data each day or at any interval you wish.
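
A hedged sketch of what such an incremental refresh could look like. Every object and column name here (Reporting.ItemFacts, Reporting.LoadLog, the CreatedAt column on Items) is an assumption for illustration:

```sql
-- Nightly incremental load into a reporting copy, driven by a watermark.
DECLARE @LastLoaded DATETIME;
SELECT @LastLoaded = MAX(LoadedUpTo) FROM Reporting.LoadLog;

-- Pull only the items added since the previous load.
INSERT INTO Reporting.ItemFacts (ItemId, AttributeName, AttributeValue)
SELECT  a.ItemId, a.AttributeName, a.AttributeValue
FROM    dbo.Items_Attributes AS a
        JOIN dbo.Items AS i ON i.Id = a.ItemId
WHERE   i.CreatedAt > @LastLoaded;

-- Record the new watermark for the next run.
INSERT INTO Reporting.LoadLog (LoadedUpTo) VALUES (GETDATE());
```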

Please post the exact DDL and indexes. If you have indexes on the ID columns, then your query will result in a scan. Instead of something like this: SELECT * FROM Items WHERE Size = @A AND Version = @B, you need to do this: SELECT * FROM Items WHERE ID = 1. In other words, you need to grab the text values, find the ids that you are indexing on, and then use those ids in your query to return the results. It is probably also a good idea to look at a partitioning function to distribute your data. Clustering is done for availability, not performance: if one node (the active node) dies, the other node (the passive node) becomes active. Of course there is also active-active clustering, but that is another story.
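
A hedged sketch of that two-step approach; the lookup table and column names (Sizes, Versions, SizeId, VersionId) are assumptions about how the values are keyed:

```sql
-- Resolve the text values to their ids first, then filter the big table on
-- the indexed id columns instead of scanning on the text values.
DECLARE @A NVARCHAR(50), @B NVARCHAR(50);
SET @A = N'500mB';
SET @B = N'1.0.0';

DECLARE @SizeId INT, @VersionId INT;
SELECT @SizeId    = Id FROM Sizes    WHERE Size    = @A;
SELECT @VersionId = Id FROM Versions WHERE Version = @B;

SELECT  Id, Name
FROM    Items
WHERE   SizeId    = @SizeId
  AND   VersionId = @VersionId;   -- seeks on the id indexes instead of scanning
```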

A short-term fix may be to use horizontal partitioning. I am assuming your largest table is Items_Attributes. You could horizontally partition this table, putting each partition on a separate filegroup on a separate disk controller.

That's assuming you are not trying to report across all ItemIds at once.
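
A hedged sketch of what that might look like with SQL Server 2005 partitioning; the boundary values and filegroup names (FG1 to FG4) are made up for illustration:

```sql
-- Partition Items_Attributes by ItemId across four filegroups, each of which
-- can sit on its own disk controller.
CREATE PARTITION FUNCTION pf_ItemId (INT)
AS RANGE RIGHT FOR VALUES (250000000, 500000000, 750000000);

CREATE PARTITION SCHEME ps_ItemId
AS PARTITION pf_ItemId TO (FG1, FG2, FG3, FG4);

-- Building the clustered index on the partition scheme moves the data.
-- (If a clustered index already exists, rebuild it with DROP_EXISTING instead.)
CREATE CLUSTERED INDEX CIX_Items_Attributes_ItemId
ON Items_Attributes (ItemId)
ON ps_ItemId (ItemId);
```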

"That's assuming you are not trying to report across all ItemIds at once." In fact, we group by attribute value and sort by the count of repeated values, so I guess this won't work? – knoopx Jun 16 '09 at 16:19.

You mention 50 tables in a single query. Whilst SQL server supports up to 256 tables in a single, monolithic query, taking this approach reduces the chances of the optimiser producing an efficient plan. If you are wedded to the schema as it stands, consider breaking your reporting queries down into a series of steps which materialise their results into temporary (#) tables.
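
A hedged sketch of that staged approach, reusing the assumed EAV table names from the earlier answers:

```sql
-- Materialise the most selective filter into a temp table first, then join the
-- remaining tables against the much smaller intermediate set.
DECLARE @A NVARCHAR(50), @B NVARCHAR(50);
SET @A = N'500mB';
SET @B = N'1.0.0';

SELECT  a.ItemId
INTO    #MatchingItems
FROM    Items_Attributes AS a
WHERE   a.AttributeName = 'Size'
  AND   a.AttributeValue = @A;                 -- the most selective predicate

CREATE CLUSTERED INDEX CIX_MatchingItems ON #MatchingItems (ItemId);

SELECT  i.Id, i.Name
FROM    #MatchingItems AS m
        JOIN Items AS i ON i.Id = m.ItemId
        JOIN Items_Attributes AS v
          ON  v.ItemId = i.Id
         AND v.AttributeName = 'Version'
         AND v.AttributeValue = @B;

DROP TABLE #MatchingItems;
```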

This approach enables you to carry out the most selective parts of the query in isolation, and can, in my experience, offer big performance gains. The queries are generally more maintainable too. Also (a bit of a long shot, this) you don't say which SQL server version you're on; but if you're on SQL 2005, given the number of tables involved in your reports and the volume of data, it's worth checking that your SQL server is patched to at least SP2.

I worked on an ETL project using tables with rowcounts in the hundreds of millions, where we found that the query optimiser in SQL 2005 RTM/SP1 could not consistently produce efficient plans for queries joining more than 5 tables where one or more of the tables was of this scale. This issue was resolved in SP2.

