Posts Tagged ‘RSS’
RSS to EPUB Converter: Create eBooks from RSS Feeds
Overview
This Python script (rss_to_ebook.py) converts RSS or Atom feeds into EPUB-format eBooks, allowing you to read your favorite blog posts and news articles offline in your preferred e-reader. The script handles both RSS 2.0 and Atom feed formats, preserving HTML formatting while creating a clean, readable eBook.
Key Features
- Dual Format Support: Works with both RSS 2.0 and Atom feeds
- Smart Pagination: Automatically handles paginated feeds using multiple detection methods (see the sketch after this list)
- Date Range Filtering: Select specific date ranges for content inclusion
- Metadata Preservation: Maintains feed metadata including title, author, and description
- HTML Formatting: Preserves original HTML formatting while cleaning unnecessary elements
- Duplicate Prevention: Automatically detects and removes duplicate entries
- Comprehensive Logging: Detailed progress tracking and error reporting
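
As an illustration of the pagination handling, here is a minimal sketch of one of the detection methods the script uses (an Atom-style rel="next" link), assuming the feed has already been parsed with feedparser; the full script further below combines this with several other heuristics.

```python
import feedparser

def find_next_page(parsed_feed):
    """Return the URL advertised by an Atom-style rel="next" link, if any."""
    for link in getattr(parsed_feed.feed, 'links', []):
        if link.get('rel') == 'next':
            return link.get('href')
    return None

feed = feedparser.parse('https://example.com/feed')  # hypothetical feed URL
print(find_next_page(feed))
```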
Technical Details
The script uses several Python libraries:
- feedparser: for parsing RSS and Atom feeds
- ebooklib: for creating EPUB files
- BeautifulSoup: for HTML cleaning and processing
- logging: for detailed operation tracking
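
To see how these libraries fit together, here is a minimal sketch of the parsing and cleaning steps, using a hypothetical feed URL; the full script adds date filtering, pagination, and EPUB assembly on top of this.

```python
import logging

import feedparser
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO)

# Parse the feed (hypothetical URL) and clean each entry's HTML
feed = feedparser.parse('https://example.com/feed')
logging.info("Feed title: %s", feed.feed.get('title', 'Unknown'))

for entry in feed.entries:
    soup = BeautifulSoup(entry.get('description', ''), 'html.parser')
    for tag in soup(['script', 'style']):  # drop scripts and stylesheets
        tag.decompose()
    logging.info("%s -> %d characters of cleaned HTML",
                 entry.get('title', 'Untitled'), len(str(soup)))
```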
Usage
python rss_to_ebook.py <feed_url> [--start-date YYYY-MM-DD] [--end-date YYYY-MM-DD] [--output filename.epub] [--debug]
Parameters:
- feed_url: URL of the RSS or Atom feed (required)
- --start-date: start date for content inclusion (default: 1 year ago)
- --end-date: end date for content inclusion (default: today)
- --output: output EPUB filename (default: rss_feed.epub)
- --debug: enable detailed logging
Example
python rss_to_ebook.py https://example.com/feed --start-date 2024-01-01 --end-date 2024-03-31 --output my_blog.epub
Requirements
- Python 3.x
- Required packages (install via pip):
pip install feedparser ebooklib beautifulsoup4
How It Works
- Feed Detection: Automatically identifies feed format (RSS 2.0 or Atom)
- Content Processing:
- Extracts entries within specified date range
- Preserves HTML formatting while cleaning unnecessary elements
- Handles pagination to get all available content
- EPUB Creation:
- Creates chapters from feed entries
- Maintains original formatting and links
- Includes table of contents and navigation
- Preserves feed metadata
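
As a rough illustration of the EPUB creation step, the sketch below assembles chapters, a table of contents, navigation files, and the spine with ebooklib; the `entries` list is a hypothetical stand-in for the cleaned feed content.

```python
from ebooklib import epub

# Hypothetical input: (title, cleaned_html) pairs already extracted from the feed
entries = [('First post', '<p>Hello</p>'), ('Second post', '<p>World</p>')]

book = epub.EpubBook()
book.set_identifier('https://example.com/feed')
book.set_title('Example feed')
book.set_language('en')
book.add_author('Unknown')

chapters = []
for i, (title, html) in enumerate(entries):
    chapter = epub.EpubHtml(title=title,
                            file_name=f'chapter_{i}.xhtml',
                            content=f'<h1>{title}</h1>{html}')
    book.add_item(chapter)  # add_item also assigns the chapter an id
    chapters.append(chapter)

# Table of contents, navigation files, and reading order
book.toc = [epub.Link(c.file_name, c.title, c.id) for c in chapters]
book.add_item(epub.EpubNcx())
book.add_item(epub.EpubNav())
book.spine = ['nav'] + chapters

epub.write_epub('example.epub', book, {})
```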
Error Handling
- Validates feed format and content
- Handles malformed HTML
- Provides detailed error messages and logging
- Gracefully handles missing or incomplete feed data
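
Feed validation relies mainly on feedparser's bozo flag; a minimal sketch of that check (with a hypothetical URL) looks like this:

```python
import logging

import feedparser

logging.basicConfig(level=logging.INFO)

feed = feedparser.parse('https://example.com/broken-feed')  # hypothetical URL
if feed.bozo:
    # feedparser still returns whatever it could parse, plus the exception it hit
    logging.error("Error parsing feed: %s", feed.bozo_exception)
else:
    logging.info("Parsed %d entries", len(feed.entries))
```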
Use Cases
- Create eBooks from your favorite blogs
- Archive important news articles
- Generate reading material for offline use
- Create compilations of related content
Gist: GitHub
Here is the script:
```python
#!/usr/bin/env python3
import feedparser
import argparse
from datetime import datetime, timedelta
from ebooklib import epub
import re
from bs4 import BeautifulSoup
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'
)


def clean_html(html_content):
    """Clean HTML content while preserving formatting."""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove script and style elements
    for script in soup(["script", "style"]):
        script.decompose()

    # Remove any inline styles
    for tag in soup.find_all(True):
        if 'style' in tag.attrs:
            del tag.attrs['style']

    # Return the cleaned HTML
    return str(soup)


def get_next_feed_page(current_feed, feed_url):
    """Get the next page of the feed using various pagination methods."""
    # Method 1: next_page link in feed
    if hasattr(current_feed, 'next_page'):
        logging.info(f"Found next_page link: {current_feed.next_page}")
        return current_feed.next_page

    # Method 2: Atom-style pagination
    if hasattr(current_feed.feed, 'links'):
        for link in current_feed.feed.links:
            if link.get('rel') == 'next':
                logging.info(f"Found Atom-style next link: {link.href}")
                return link.href

    # Method 3: RSS 2.0 pagination (date-based, using the last entry on this page)
    if hasattr(current_feed.feed, 'lastBuildDate'):
        if current_feed.entries:
            last_entry = current_feed.entries[-1]
            if hasattr(last_entry, 'published_parsed'):
                last_entry_date = datetime(*last_entry.published_parsed[:6])
                # Try to construct next page URL with date parameter
                if '?' in feed_url:
                    next_url = f"{feed_url}&before={last_entry_date.strftime('%Y-%m-%d')}"
                else:
                    next_url = f"{feed_url}?before={last_entry_date.strftime('%Y-%m-%d')}"
                logging.info(f"Constructed date-based next URL: {next_url}")
                return next_url

    # Method 4: Check for pagination in feed description
    if hasattr(current_feed.feed, 'description'):
        desc = current_feed.feed.description
        # Look for common pagination patterns in description
        next_page_patterns = [
            r'next page: (https?://\S+)',
            r'older posts: (https?://\S+)',
            r'page \d+: (https?://\S+)'
        ]
        for pattern in next_page_patterns:
            match = re.search(pattern, desc, re.IGNORECASE)
            if match:
                next_url = match.group(1)
                logging.info(f"Found next page URL in description: {next_url}")
                return next_url

    return None


def get_feed_type(feed):
    """Determine if the feed is RSS 2.0 or Atom format."""
    if hasattr(feed, 'version') and feed.version.startswith('rss'):
        return 'rss'
    elif hasattr(feed, 'version') and feed.version == 'atom10':
        return 'atom'
    # Try to detect by checking for Atom-specific elements
    elif hasattr(feed.feed, 'links') and any(link.get('rel') == 'self' for link in feed.feed.links):
        return 'atom'
    # Default to RSS if no clear indicators
    return 'rss'


def get_entry_content(entry, feed_type):
    """Get the content of an entry based on feed type."""
    if feed_type == 'atom':
        # Atom format
        if hasattr(entry, 'content'):
            return entry.content[0].value if entry.content else ''
        elif hasattr(entry, 'summary'):
            return entry.summary
    else:
        # RSS 2.0 format
        if hasattr(entry, 'content'):
            return entry.content[0].value if entry.content else ''
        elif hasattr(entry, 'description'):
            return entry.description
    return ''


def get_entry_date(entry, feed_type):
    """Get the publication date of an entry based on feed type."""
    if feed_type == 'atom':
        # Atom format uses published or updated
        if hasattr(entry, 'published_parsed'):
            return datetime(*entry.published_parsed[:6])
        elif hasattr(entry, 'updated_parsed'):
            return datetime(*entry.updated_parsed[:6])
    else:
        # RSS 2.0 format uses pubDate
        if hasattr(entry, 'published_parsed'):
            return datetime(*entry.published_parsed[:6])
    return datetime.now()


def get_feed_metadata(feed, feed_type):
    """Extract metadata from feed based on its type."""
    metadata = {
        'title': '',
        'description': '',
        'language': 'en',
        'author': 'Unknown',
        'publisher': '',
        'rights': '',
        'updated': ''
    }

    if feed_type == 'atom':
        # Atom format metadata
        metadata['title'] = feed.feed.get('title', '')
        metadata['description'] = feed.feed.get('subtitle', '')
        metadata['language'] = feed.feed.get('language', 'en')
        metadata['author'] = feed.feed.get('author', 'Unknown')
        metadata['rights'] = feed.feed.get('rights', '')
        metadata['updated'] = feed.feed.get('updated', '')
    else:
        # RSS 2.0 format metadata (mapped onto the common keys used above)
        metadata['title'] = feed.feed.get('title', '')
        metadata['description'] = feed.feed.get('description', '')
        metadata['language'] = feed.feed.get('language', 'en')
        metadata['author'] = feed.feed.get('author', 'Unknown')
        metadata['rights'] = feed.feed.get('copyright', '')
        metadata['updated'] = feed.feed.get('lastBuildDate', '')

    return metadata


def create_ebook(feed_url, start_date, end_date, output_file):
    """Create an ebook from RSS feed entries within the specified date range."""
    logging.info(f"Starting ebook creation from feed: {feed_url}")
    logging.info(f"Date range: {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")

    # Parse the RSS feed
    feed = feedparser.parse(feed_url)
    if feed.bozo:
        logging.error(f"Error parsing feed: {feed.bozo_exception}")
        return False

    # Determine feed type
    feed_type = get_feed_type(feed)
    logging.info(f"Detected feed type: {feed_type}")
    logging.info(f"Successfully parsed feed: {feed.feed.get('title', 'Unknown Feed')}")

    # Create a new EPUB book
    book = epub.EpubBook()

    # Extract metadata based on feed type
    metadata = get_feed_metadata(feed, feed_type)
    logging.info(f"Setting metadata for ebook: {metadata['title']}")

    # Set basic metadata
    book.set_identifier(feed_url)  # Use feed URL as unique identifier
    book.set_title(metadata['title'])
    book.set_language(metadata['language'])
    book.add_author(metadata['author'])

    # Add additional metadata if available
    if metadata['description']:
        book.add_metadata('DC', 'description', metadata['description'])
    if metadata['publisher']:
        book.add_metadata('DC', 'publisher', metadata['publisher'])
    if metadata['rights']:
        book.add_metadata('DC', 'rights', metadata['rights'])
    if metadata['updated']:
        book.add_metadata('DC', 'date', metadata['updated'])

    # Add date range to description
    date_range_desc = f"Content from {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}"
    book.add_metadata('DC', 'description', f"{metadata['description']}\n\n{date_range_desc}")

    # Create table of contents
    chapters = []
    toc = []

    # Process entries within date range
    entries_processed = 0
    entries_in_range = 0
    consecutive_out_of_range = 0
    current_page = 1
    processed_urls = set()  # Track processed URLs to avoid duplicates

    logging.info("Starting to process feed entries...")

    while True:
        logging.info(f"Processing page {current_page} with {len(feed.entries)} entries")

        # Process the entries of the current page; processed_urls guards against
        # entries repeated across pages
        for entry in feed.entries:
            entries_processed += 1

            # Skip if we've already processed this entry
            entry_id = entry.get('id', entry.get('link', ''))
            if entry_id in processed_urls:
                logging.debug(f"Skipping duplicate entry: {entry_id}")
                continue
            processed_urls.add(entry_id)

            # Get entry date based on feed type
            entry_date = get_entry_date(entry, feed_type)

            if entry_date < start_date:
                consecutive_out_of_range += 1
                logging.debug(f"Skipping entry from {entry_date.strftime('%Y-%m-%d')} (before start date)")
                continue
            elif entry_date > end_date:
                consecutive_out_of_range += 1
                logging.debug(f"Skipping entry from {entry_date.strftime('%Y-%m-%d')} (after end date)")
                continue
            else:
                consecutive_out_of_range = 0
                entries_in_range += 1

                # Create chapter
                title = entry.get('title', 'Untitled')
                logging.info(f"Adding chapter: {title} ({entry_date.strftime('%Y-%m-%d')})")

                # Get content based on feed type
                content = get_entry_content(entry, feed_type)

                # Clean the content
                cleaned_content = clean_html(content)

                # Create chapter
                chapter = epub.EpubHtml(
                    title=title,
                    file_name=f'chapter_{len(chapters)}.xhtml',
                    content=f'<h1>{title}</h1>{cleaned_content}'
                )

                # Add chapter to book
                book.add_item(chapter)
                chapters.append(chapter)
                toc.append(epub.Link(chapter.file_name, title, chapter.id))

        # If we have no entries in range or we've seen too many consecutive
        # out-of-range entries, stop
        if entries_in_range == 0 or consecutive_out_of_range >= 10:
            if entries_in_range == 0:
                logging.warning("No entries found within the specified date range")
            else:
                logging.info(f"Stopping after {consecutive_out_of_range} consecutive out-of-range entries")
            break

        # Try to get more entries if available
        next_page_url = get_next_feed_page(feed, feed_url)
        if next_page_url:
            current_page += 1
            logging.info(f"Fetching next page: {next_page_url}")
            feed = feedparser.parse(next_page_url)
            if not feed.entries:
                logging.info("No more entries available")
                break
        else:
            logging.info("No more pages available")
            break

    if entries_in_range == 0:
        logging.error("No entries found within the specified date range")
        return False

    logging.info(f"Processed {entries_processed} total entries, {entries_in_range} within date range")

    # Add table of contents
    book.toc = toc

    # Add navigation files
    book.add_item(epub.EpubNcx())
    book.add_item(epub.EpubNav())

    # Define CSS style
    style = '''
    @namespace epub "http://www.idpf.org/2007/ops";
    body {
        font-family: Cambria, Liberation Serif, serif;
    }
    h1 {
        text-align: left;
        text-transform: uppercase;
        font-weight: 200;
    }
    '''

    # Add CSS file
    nav_css = epub.EpubItem(
        uid="style_nav",
        file_name="style/nav.css",
        media_type="text/css",
        content=style
    )
    book.add_item(nav_css)

    # Create spine
    book.spine = ['nav'] + chapters

    # Write the EPUB file
    logging.info(f"Writing EPUB file: {output_file}")
    epub.write_epub(output_file, book, {})
    logging.info("EPUB file created successfully")

    return True


def main():
    parser = argparse.ArgumentParser(description='Convert RSS feed to EPUB ebook')
    parser.add_argument('feed_url', help='URL of the RSS feed')
    parser.add_argument('--start-date', help='Start date (YYYY-MM-DD)',
                        default=(datetime.now() - timedelta(days=365)).strftime('%Y-%m-%d'))
    parser.add_argument('--end-date', help='End date (YYYY-MM-DD)',
                        default=datetime.now().strftime('%Y-%m-%d'))
    parser.add_argument('--output', help='Output EPUB file name', default='rss_feed.epub')
    parser.add_argument('--debug', action='store_true', help='Enable debug logging')

    args = parser.parse_args()

    if args.debug:
        logging.getLogger().setLevel(logging.DEBUG)

    # Parse dates
    start_date = datetime.strptime(args.start_date, '%Y-%m-%d')
    end_date = datetime.strptime(args.end_date, '%Y-%m-%d')

    # Create ebook
    if create_ebook(args.feed_url, start_date, end_date, args.output):
        logging.info(f"Successfully created ebook: {args.output}")
    else:
        logging.error("Failed to create ebook")


if __name__ == '__main__':
    main()
```
Twitter within GoogleReader
Let’s say it right away: there is no miracle solution, and it requires a bit of work…
In case the title is not explicit enough: the goal of this post is to describe a way to include your Twitter stream in Google Reader. IMHO, the main advantages of Google Reader over regular Twitter are the following:
- posts are stored long term
- posts can be searched or filtered by keywords
- easy integration with Google Plus share button
- ability to assign priorities to feeds (as in Google Plus’s circles)
- a simple interface to monitor all feeds
On the other hand, following Twitter accounts in Google Reader (or any other RSS aggregator) makes you lose the real-time aspect of Twitter.
That said, how do you include your Twitter follow list in Google Reader?
- For each of the accounts you follow on Twitter, get the userId.
- The easiest way is to select a tweet > right click > view source > search for the attribute “data-user-id” and note the associated number, e.g. 813286 for @BarackObama, 50055701 for @MittRomney and 248309482 for my preferred one: @John_the_Cowboy ;-).
- Alternatively, you can request the following service: http://www.idfromuser.com
- In Google Reader, add a feed: https://twitter.com/statuses/user_timeline/XYZ.rss. Replace “XYZ” with the actual number retrieved before.
Of course, this is not easy at all, as stated in the disclaimer above. It is long, takes time, and can become tedious if you follow dozens of people. Anyway, Google Reader users are geeks rather than newbies: you could write a Groovy script or set up a Mule instance to automate the process ;-).
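For instance, a few lines of Python (a Groovy script would do just as well) could turn a hand-made list of user IDs into the feed URLs to add in Google Reader. This is only a sketch: the handles and IDs below are simply the examples mentioned above, and the URL pattern is the one from the last step.

```python
# Hypothetical mapping of Twitter handles to the user IDs found via "data-user-id"
user_ids = {
    'BarackObama': 813286,
    'MittRomney': 50055701,
    'John_the_Cowboy': 248309482,
}

# One feed URL per followed account, using the URL pattern from the last step
for handle, user_id in user_ids.items():
    print(f'{handle}: https://twitter.com/statuses/user_timeline/{user_id}.rss')
```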
I suggest creating one or more folders to group the Twitter feeds.