-->

Mysql stored functions and groupwise min [closed]

2020-07-23 05:04发布

问题:

Schema

Database schema is simplified
Events table
This table stores events.

CREATE TABLE `Events` (
`event_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`isPublic` tinyint(1) NOT NULL DEFAULT '1',
PRIMARY KEY (`event_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Places table
Simple table that stores places. One event can be in more than one place.

CREATE TABLE `Places` (
`place_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`latitude` double NOT NULL,
`longitude` double NOT NULL,
PRIMARY KEY (`place_id`),
KEY `latind` (`latitude`,`longitude`)
) ENGINE=InnoDB CHARSET=latin1;

Rules table
Table that stores schedules of events. One event can have more that one schedule. All dates are in unixtimestamp format. Regular means that this rule has some repeating schedule that is stored in RegularRules table.

CREATE TABLE `Rules` (
`rule_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`start_date` int(11) NOT NULL,
`end_date` int(11) NOT NULL,
`regular` tinyint(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`rule_id`),
KEY `endindx` (`end_date`)
) ENGINE=InnoDB CHARSET=latin1;

RegularRules
Table that stores repeatable schedules in the following format. day_start/end means number of seconds from the beggining of the day (00:00) to the starting of the event. For example, event takes place every monday from 10:00 to 18:00. We will store start_date and end_date in Rules table, these values represent time limits of the event. In the RegularRules table we will have 36000 in mon_start and 64800 in mon_end.

CREATE TABLE `RegularRules` (
`repetition_id` bigint(11) unsigned NOT NULL AUTO_INCREMENT,
`rule_id` bigint(20) unsigned NOT NULL,
`mon_start` int(11) DEFAULT NULL,
`tue_start` int(11) DEFAULT NULL,
`wed_start` int(11) DEFAULT NULL,
`th_start` int(11) DEFAULT NULL,
`fr_start` int(11) DEFAULT NULL,
`sat_start` int(11) DEFAULT NULL,
`sun_start` int(11) DEFAULT NULL,
`mon_end` int(11) DEFAULT NULL,
`tue_end` int(11) DEFAULT NULL,
`wed_end` int(11) DEFAULT NULL,
`th_end` int(11) DEFAULT NULL,
`fr_end` int(11) DEFAULT NULL,
`sat_end` int(11) DEFAULT NULL,
`sun_end` int(11) DEFAULT NULL,
PRIMARY KEY (`repetition_id`),
KEY `fk_rule_id_regularrules_idx` (`rule_id`),
CONSTRAINT `fk_rule_id_regularrules` FOREIGN KEY (`rule_id`) 
REFERENCES `Rules` (`rule_id`) ON DELETE CASCADE ON UPDATE NO ACTION
) ENGINE=InnoDB  CHARSET=latin1;

Events-Places-Rules
Table that connects all of the above tables.

CREATE TABLE EPR (
`holding_id` bigint(30) NOT NULL AUTO_INCREMENT,
`event_id` bigint(20) unsigned NOT NULL,
`place_id` bigint(20) unsigned NOT NULL,
`rule_id` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`holding_id`),
UNIQUE KEY `compound` (`place_id`,`event_id`,`rule_id`),
KEY `FK_Places-Company Events-Rules_Events_event_id` (`event_id`),
KEY `FK_Places-Company Events-Rules_Places_place_id` (`place_id`),
KEY `FK_Places-Company Events-Rules_Rules_rule_id` (`rule_id`),
CONSTRAINT `FK_Places-Company Events-Rules_Events_event_id` 
FOREIGN KEY  (`event_id`) REFERENCES `Events` (`event_id`) ON DELETE CASCADE 
ON UPDATE  CASCADE,
CONSTRAINT `FK_Places-Company Events-Rules_Rules_rule_id` 
FOREIGN KEY  (`rule_id`) REFERENCES `Rules` (`rule_id`) ON DELETE CASCADE ON  
UPDATE CASCADE,
CONSTRAINT `fk_place_id_pcerc` FOREIGN KEY (`place_id`) 
REFERENCES `Places` (`place_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB CHARSET=latin1;

Stored functions

There are two stored functions. GETBEGINS and GETENDS. Parameters: rule_id, timestamp,curtimestamp.Timestamp is unixtimestamp of the day, curtimestamp is the unixtimestamp of the beggining of the current day. These functions work as follows. For each rule they are returning the beggining of the rule(begins) and the ending(ends). If the rule is not repeatable, they return start_date and end_date that are stored in the Rules table. If the rule is repeatable, they construct begins and ends of the closest non-null day_start/day_end of the RegularRules table. For instance, there is an event that has 2 rules. The first one is not repeatable with begins start_timestamp and ends end_timestamp. The second one is repeatable and has just two non-null fields: mon_start = 36000 and mon_end = 64800. GETBEGINS will turn mon_start in unixtimestamp based on current unixtimetamp and current unixtimestamp of the beggining of the day. GETBEGINS works simmilarly. Code of these functions will be provided if necessary.

Problematic query

This query should return ongoing geographically- and chronologically-closest events. Places should be distinct. So query should for each place return chronologically closest event and at the end sort resulting values depending on time and distance with some coefficients (I think sorting part will be moved to the server-side language like PHP. If you have suggestions about this sorting I am open to any solution). For example, there are 5 movies in 10 cinemas near by. Each cinema has 100 schedules. Query should return for each cinema the chronologically closest movie and then sort movies and cinemas depending on two values time and distance.

Intended query
latpoin,longpoint,r - are coordinates and radius that are passed to the script, curstamp - unixtimestamp of the beggining of the day, timestamp - current unixtimestamp

SELECT 
    epr.event_id,
    epr.place_id,
    epr.rule_id,
    (6371 * ACOS(COS(RADIANS(latpoint)) * COS(RADIANS(latitude)) *  
    COS(RADIANS(longitude) - RADIANS(longpoint)) + SIN(RADIANS(latpoint)) * 
    SIN(RADIANS(latitude)))) AS distance,
    p.latitude,
    p.longitude,
    GETBEGINS(r.rule_id, curstamp, timestamp) AS begins,
    GETENDS(r.rule_id, curstamp, timestamp) AS ends,
    MIN(ABS(GETBEGINS(r.rule_id, curstamp, timestamp) - timestamp)) AS   
    time_min
FROM
    Events e
        INNER JOIN
    EPR epr ON e.event_id = epr.event_id
        INNER JOIN
    Places p ON epr.place_id = p.place_id
        INNER JOIN
    Rules r ON epr.rule_id = r.rule_id
WHERE
    r.end_date >= timestamp
        AND latitude BETWEEN latpoint - (r / 111.045) AND latpoint + (r /   
        111.045)
        AND longitude BETWEEN longpoint - (r / (111.045 *  
        COS(RADIANS(latpoint)))) AND longpoint + (r / (111.045 * 
        COS(RADIANS(latpoint))))
        AND e.isPublic = 1
GROUP BY epr.place_id

As stated in the topic this query mixes returning values. To be more specific it matches wrong rule_id,begins,ends to the place_id group. Moreover this query performs quite poorly. Table's size: Events - 3000rows, Places- 8000rows, Rules 18000rows, EPR-15000rows. These query works approximately 1.8 second when using index hint (use index compound) and 1.2 without one. Without using index hint query makes full table scan. I have read official mysql docs regarding this subject. However their solution is not sutable because of user-calculated values (GETBEGINS and GETENDS).

Question

Query provided in the Intended query section has groupwise min problem because of the way mysql handles group by. So possible solution is to make functions GETBEGINS and GETENDS user-defined aggregated functions in this way mysql possibly will return appropiate result? Is this solution logical? Will making functions GETBEGINS and GETENDS aggregated help? Will mysql return appropiate data in that case?

Conclusion

Comments about provided solutions, new solutions, comments about indexing and about database architecture are appreciated and welcomed.

回答1:

The groupwise max is not guaranteed to work. In fact, MariaDB broke it, but provided a setting to get it back. This is what I am referring to:

SELECT  *
    FROM  
      ( SELECT  ...  ORDER BY ... )
    GROUP BY ...

where you want the first (or last) in each group from the inner query. The problem is that SQL is free to optimize away that intent.

The groupwise max code in the docs is terribly inefficient.

To speed up the query, a likely bit of help is to isolate the Rules or Places part of the WHERE clause and make that into a subquery which returns just the PRIMARY KEY of the corresponding table. Then put that into a JOIN with all the tables (including a JOIN back to the same table). You already have a "covering index" for that subquery so that it can be "Using index" (in the jargon used by EXPLAIN).

Is innodb_buffer_pool_size set to about 70% of available RAM?

BIGINT takes 8 bytes; you could probably live with MEDIUMINT UNSIGNED (0..16M). Smaller --> more cacheable --> less I/O --> faster.

The pair of DOUBLEs for lat/lng take 16 bytes. A FLOAT pair would take 8 bytes and have 6-foot / 2m resolution. Or DECIMAL(6,4) for latitude and (7,4) for longitude for 7 bytes and a 52 foot / 16m resolution. Good enough for "stores", especially since you are using a 'square' instead of a 'circle' for distance.

Code for "find the nearest ..." is hard to optimize. Here is the best I have come up with: http://mysql.rjweb.org/doc.php/latlng