Feeds, Cron, Nginx and bad gateways

Recently I noticed that although I have many Feeds nodes set up to import on the hour, only 4 or 5 were actually firing, and they tended to be the same ones every time.

Running cron from the console I realised that the webserver was timing out after a minute and sometimes throwing an error. You can see this by curling the cron URL or putting it in a browser:
http://example.com/admin/config/system/cron

So the timeouts need to go up, I thought. But having worked through all the timeouts I could find I still had the error. My final solution (see the second-to-last edit) was a patch to Feeds to increase how long Drupal is allowed to run a queue.

Testing with curl

time curl --connect-timeout 600 "http://example.com/cron.php?cron_key=xxxxxxxxxxxxxxxx"

Configuring NGINX

Set fastcgi_read_timeout for the cron path to 10 minutes. We need some overhead in here or we'll get collisions with 504 Gateway Timeout errors.

# vi /etc/nginx/sites-enabled/default
location ~ ^/cron.php {
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_intercept_errors on;
    fastcgi_pass 127.0.0.1:9000;
    fastcgi_read_timeout 600s;
}

Reload NGINX

# service nginx reload

Configuring PHP-FPM

Set max_execution_time in PHP to 6 minutes (360 seconds) and check with phpinfo();

# vi /etc/php5/fpm/php.ini
max_execution_time = 360
# service php5-fpm reload

Next, configure the Feeds hidden settings in sites/default/settings.php. The first (http_request_timeout) sets the timeout for curl when connecting to a Feeds source; the default is 30 seconds, and there is a patch to move this into the UI. The second (feeds_process_limit) limits how many nodes will be imported each run; it defaults to 50. For large datasets this should go waaaay up if you're expecting them to come in every run.

$ vi sites/default/settings.php
$conf['http_request_timeout'] = 120;
$conf['feeds_process_limit'] = 150; # how many nodes to import per feed http://drupalcode.org/project/feeds.git/blob/HEAD:/README.txt
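To get a feel for why feeds_process_limit may need to go up, here's a rough sketch of the arithmetic. The item counts are illustrative, and runs_needed is just a helper defined here, not part of Feeds:

```shell
# Cron runs needed to drain a feed of N items when Feeds processes
# at most L items per run. Ceiling division: (N + L - 1) / L.
runs_needed() { echo $(( ($1 + $2 - 1) / $2 )); }

runs_needed 5000 50    # default limit of 50: 100 hourly runs to catch up
runs_needed 5000 500   # limit raised to 500: 10 runs
```

At the default limit a 5000-item feed takes days of hourly cron runs to come in, which is why the limit has to scale with the dataset.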

Override the hardcoded time limit that Feeds sets for the Queue API call (a module hack--this should really be in a patch or a conf variable).

$ vi feeds/feeds.module
function feeds_cron_queue_info() {
  $queues = array();
  $queues['feeds_source_import'] = array(
    'worker callback' => 'feeds_source_import',
    'time' => 360,
  );
  return $queues;
}

And if you haven't already, configure cron itself to run every 10 minutes. If Drupal's cron collides with a run still in progress, it can refuse to execute.

$ crontab -e
*/10 * * * * curl -s "http://example.com/cron.php?cron_key=xxxxxxxx" > /dev/null

In the end, though, all of this wasn't enough. I've had to resort to Drush and have added some work to the Feeds Drush integration effort. This means I can call feeds-import-all from a cron job directly and avoid the queueing and web server timeouts altogether.

http://drupal.org/node/608408#comment-7470782
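For reference, the Drush-based crontab entry might look something like this. The drush path and site root are assumptions for illustration, not from my actual setup:

```shell
# Run all Feeds imports via Drush every 10 minutes, bypassing the
# web server (and its timeouts) entirely. Paths here are illustrative.
*/10 * * * * /usr/bin/drush --root=/var/www/example feeds-import-all > /dev/null 2>&1
```

Because this runs under the PHP CLI rather than PHP-FPM, none of the nginx or FPM timeouts above apply to it.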